Notes on CUDA Semantics

A loosely organized collection of notes on various aspects of CUDA and its semantics, meant to assist CIVL developers in properly implementing support for CUDA in CIVL-C.

Cuda Programming Model

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

Kernels

Declaration Syntax:

__global__ void kernel_name(formals) {
...
}

Call Syntax:

kernel_name<<<GridDim, BlockDim, BlockHeapSize, Stream>>>(actuals);

The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). Execution Configuration parameters are evaluated before regular function parameters. I cannot find information about the order in which execution parameters are evaluated.

  • GridDim - a dim3 (or int) specifying the number of blocks along each dimension of the grid
  • BlockDim - a dim3 (or int) specifying the number of threads along each dimension of a block
  • BlockHeapSize (optional) - dynamically allocated memory for each block (default: 0; might be able to be ignored in the verification process)
  • Stream - the Cuda stream on which to enqueue this kernel (default: 0, the null stream)

Maximum block size is 1024 threads on current GPUs.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels

Thread Hierarchy

Inside a kernel, each thread has access to two vectors containing thread id information.

  • threadIdx - 3 component vector specifying the index of the executing thread inside its containing block. threadIdx.d is in the range [0, BlockDim.d - 1] where d is the coordinate of the vector you wish to use (d belongs to the set {x, y, z})
  • blockIdx - 3 component vector specifying the index of the containing block of the executing thread. blockIdx.d is in range [0, GridDim.d - 1]

The total number of threads that execute the kernel will be equal to GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z.

Thread blocks must be able to execute independently. Threads in the same block can be synchronized using __syncthreads() (which acts as a barrier) and can communicate using shared memory.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Memory Hierarchy

Cuda threads have access to a number of different memory spaces while executing.

  • Private - memory accessible only by a single thread
  • Shared - memory accessible by all threads in a block
  • Global - memory accessible by all threads in all blocks, optimized for general purpose usage
    • Constant - read-only global memory
    • Texture - read-only global memory optimized for certain access patterns and equipped with special access capabilities

All globally accessible memory is persistent across multiple kernel invocations by the same program.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

Heterogeneous Programming

Cuda threads are assumed to execute on a device separate from the CPU (e.g. a GPU), so the CPU (host) and the GPU (device) have separate memory spaces, called host memory and device memory respectively. Because the host cannot directly access memory on the device, memory management must be performed with calls to the Cuda Runtime.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming

Cuda-C Language Extensions

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions

Function Type Qualifiers

  • __device__ - Executed on device and callable from device.
  • __global__ - Executed on device and callable from host and some devices. Calls must include execution configuration and execute asynchronously.
  • __host__ - Executed on host and callable from host. If no type qualifier is specified for a function, __host__ is implicitly assumed to be the default.

__global__ and __host__ cannot both qualify the same function, but __host__ and __device__ can.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-type-qualifiers

Variable Type Qualifiers

  • __device__ -
    • Resides in the device's global memory space for the duration of the application
    • Accessible from all threads and from host through the use of runtime calls
    • __managed__ or __managed__ __device__ - Like __device__, but may additionally be directly referenced from host code
  • __constant__ (optionally with __device__) - like __device__, but resides in read-only constant space instead of global space
  • __shared__ (optionally with __device__) -
    • Resides in a block's shared memory space for the duration of the block's life
    • Only accessible by threads within that block.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#variable-type-qualifiers

Built-in Vector Types

Vector types are constructed from basic numeric C types. The 1st, 2nd, 3rd and 4th elements of a vector type can be accessed as v.x, v.y, v.z and v.w respectively. A vector with X elements of type T has type TX (e.g. int3). Vectors of type <type_name> are created using the constructor function make_<type_name>, which takes the elements of the vector as parameters. dim3 is a vector type like uint3 except that fewer than three components may be specified; unspecified components are initialized to 1.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types

Built-in Variables

  • gridDim - a dim3 containing the number of blocks along each dimension of the grid
  • blockIdx - a uint3 containing the index of the block inside the grid
  • blockDim - a dim3 containing the number of threads along each dimension of a block
  • threadIdx - a uint3 containing the index of the thread inside its block
  • warpSize - an int containing the size of a warp; the warp size depends on the hardware (32 on current NVIDIA GPUs)

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-variables

Memory Fence Functions

Cuda's memory model is weakly ordered, meaning that, along with being able to arbitrarily interleave operations from separate threads, the Cuda Runtime may also reorder a thread's independent reads and writes. Because they are independent, this reordering will not affect the behavior of a single thread, but it can affect the way that multiple threads interact.

Example of reordering (corrected from link):

__device__ int X = 1, Y = 2; 
__device__ void writeXY() {
  X = 10; 
  Y = 20; 
} 
__device__ void readXY() { 
  int B = Y; 
  int A = X; 
}

If readXY and writeXY run in separate threads, it is possible to end up with A = 1 and B = 20, even though this violates sequential consistency. For this reason, one should not attempt to communicate between threads in separate blocks in this manner.

There are 3 memory fence functions that one can use to prevent reordering. The scope of each varies but all three ensure that

  • writes made by the calling thread before the fence are observed by other threads before writes made by the calling thread after the fence
  • reads made by the calling thread before the fence are performed before reads made by the calling thread after the fence

Essentially, writes cannot be reordered across the fence, and reads cannot be reordered across the fence.

The three functions:

  • __threadfence_block() - Write observability constraint applies to all threads in the same block as the calling thread
  • __threadfence() - Write observability constraint applies to all threads on the same device as the calling thread
  • __threadfence_system() - Write observability constraint applies to all threads, even those on other devices

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

Synchronization Functions

The main synchronization function is called __syncthreads(). It acts as a barrier for the threads in a block; that is, no thread continues past the call until all threads have reached it. It can be used to prevent race conditions between threads in a block. A call to __syncthreads() can exist in conditionally executed code as long as all threads agree on the execution path. Behavior is undefined if some threads execute the call and some don't.

Three variations of __syncthreads():

  • int __syncthreads_count(int predicate) - also returns the number of threads for which the predicate was non-zero
  • int __syncthreads_and(int predicate) - also returns non-zero iff predicate was non-zero for all threads
  • int __syncthreads_or(int predicate) - also returns non-zero iff predicate was non-zero for any thread

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

Cuda Translation Rules

In progress...

Ideas:

Thread Hierarchy Translation

For each kernel K, create a function __kernel_K that takes the same arguments as K in addition to the execution configuration parameters. In the transformed program, __kernel_K is called wherever K was called in the original.

Each kernel spawns a nested function Block for each block as designated by the execution configuration. Each block spawns a nested function Thread that performs the kernel instructions concurrently with all other threads.

__global__ void dot(float *a, float *b, float *c) {
...
}

becomes

void __kernel_dot(dim3 _gridDim, dim3 _blockDim, size_t _blockHeapSize, cudaStream_t __s, float *a, float *b, float *c) {

  void Block(uint3 blockIdx, float *a, float *b, float *c) {
  
  // shared memory declarations

    void Thread(uint3 threadIdx, float *a, float *b, float *c) {
      ...
    }
    // spawn all threads and wait
  }
  // spawn all blocks and wait
}

Memory Hierarchy Translation

To help enforce the Cuda memory hierarchy, we introduce two root level variables, $host_scope and $device_scope. Here is the proposed layout.

  $scope $host_scope;
  $scope $device_scope = $here;

  // declarations of variables/functions that are accessible (including through the use of runtime calls)
  // from both host and device
  
  // global/device functions from original program
  __global__ void _kernel_f (<params>) {
    ...
  }

  int main () {
    $host_scope = $here;

    // host functions declarations from original program

    int _main () { // old main
      ...
    }
  }

Pointers to device memory cannot be dereferenced in host code in general, though there are some exceptions. Checks must be made to determine whether direct access is allowed. For example, in host code:

  int *p = (int *)malloc(10 * sizeof(int));
  ...
  p[i] = 0;

becomes

  int *p = (int *)$malloc($host_scope, 10 * sizeof(int));
  ...
  $assert($scopeof(p[i]) <= $host_scope);
  p[i] = 0;

and

  int *p;
  cudaMalloc((void **) &p, 10 * sizeof(int));
  ...
  p[i] = 0;

becomes

  int *p;
  p = (int *)$malloc($device_scope, 10 * sizeof(int));
  ...
  $assert($scopeof(p[i]) <= $host_scope);
  p[i] = 0;

Similarly, device-only functions cannot be called from host code. However, I will not be implementing this feature until it becomes clear that correct verification of Cuda programs relies on detecting such memory accesses.

Vector Types

Vector types (including dim3) can be represented as a struct with members x, y, z, and w.

Streams and Events

A Cuda stream is represented as a queue of kernel instances. When a kernel is launched on a stream, a kernel instance is pushed onto the back of the queue and blocks until it reaches the front of the queue, at which point it can begin executing. The null stream, however, must be treated differently: it is the default stream and has different semantics. Kernels enqueued on the null stream cannot execute concurrently with any other kernels, so all kernels launched before a kernel in the null stream must finish before it can execute, and all kernels launched after it must block until it is finished. All non-null streams will be stored in a list in a _cudaContext struct; the null stream is stored separately in the Cuda context because of its special semantics.

Cuda events are a way to save a reference to a particular point in a Cuda stream. This allows events to be compared to perform timing measurements on certain actions, but more importantly, gives the programmer an additional way to synchronize streams with each other and with the host. Events are represented as a reference to the last kernel instance that must complete before the event is said to have occurred.

A kernel instance is a struct containing the $proc that is the kernel itself and a flag indicating whether the kernel is blocked and waiting to execute, executing, or finished executing.

Other Resources

Streams and Queues - http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf

Warp Semantics

Threads within a block are broken down further at the hardware level into groups of 32 threads called "warps" that execute together, usually in lock-step. This means that most of the time the threads in a warp execute the same line of code at the same time (though with the independent thread scheduling introduced on newer architectures, this is no longer guaranteed at all times). Threads in a warp are called "lanes", and a thread's laneID can be calculated as threadID % 32.

Warp Level Primitives Blog Post - https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Incidental Thread Divergence and __activemask() - https://stackoverflow.com/questions/54055195/activemask-vs-ballot-sync/54055576#54055576

__shfl_*_sync Intrinsics

The Cuda __shfl_*_sync intrinsics allow an exchange of either 4 or 8 bytes of data among threads in a warp without requiring the use of shared memory. Each of these four functions requires a set of specified threads within a warp to converge before the function executes. This set is determined by the mask parameter, an unsigned int whose 32 bits each correspond to a single thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread in the warp with the matching laneID is required to call the collective with the same mask parameter as the other threads.

Each intrinsic also takes the data to be exchanged as an argument, called var. The third parameter of each function is unique to it, but all are used to identify a source lane from which the calling thread obtains its new data. Finally, all four functions have an optional width parameter that allows the user to divide 32-thread warps into even smaller sub-warps (which are, for the most part, treated as isolated warps for the purposes of these intrinsics). The width, if specified, must be 2, 4, 8, 16, or 32.

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
  • __shfl_sync - Each thread included in mask converges and returns the value of var obtained from srcLane. If srcLane is outside of the range 0..width-1, a new source lane is calculated with srcLane % width
  • __shfl_up_sync - Each thread included in mask converges and returns the value of var obtained from the lane delta lanes below itself in the warp. If laneID - delta < 0, the thread will return its original var.
  • __shfl_down_sync - Each thread included in mask converges and returns the value of var obtained from the lane delta lanes above itself in the warp. If laneID + delta >= width, the thread will return its original var.
  • __shfl_xor_sync - Each thread included in mask converges and returns the value of var obtained from a source lane with ID calculated from a bitwise exclusive or done on its own laneID and the laneMask parameter. If the source lane determined by laneID ^ laneMask is within the same sub-warp as the calling thread, or in a sub-warp that has lower threadIDs within the same warp, the data can be exchanged. However, if the source lane is in a sub-warp that has higher threadIDs or is out of bounds of the current 32 thread warp, the calling thread's original var is returned.

Accessing data from a thread that is not participating in the call produces undefined results. Behavior is also undefined if an invalid width is passed in, if threads call the intrinsic with inconsistent masks, or if a thread included in mask does not call the collective.

Mask Parameter - https://stackoverflow.com/questions/58833808/insight-into-the-first-argument-mask-in-shfl-sync

Our emulation of the _sync semantics:

  • If a thread t is not in its own mask then it will not synchronize with any other threads
    • The value returned will be its own value if sourceLane = laneID or sourceLane is outside width in the cases of up_sync and down_sync
    • The value returned will be havoced in any other case since there is no guarantee that this thread will synchronize with the requested sourceLane.
  • If a thread t is in its own mask then it will participate in the barrier
    • If sourceLane is in t's mask then t requests a message from sourceLane and returns the value obtained.
    • If sourceLane is not in t's mask then we cannot guarantee that sourceLane will participate with t and so t simply makes no request and just returns a havoced value at the end.
    • Regardless of these two cases, t will always check for requests sent to it after the barrier call and fulfill these requests.

Last modified on 07/27/22 11:43:58