Notes on CUDA Semantics

--

Cuda Programming Model

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

Kernels

Declaration Syntax:

__global__ void kernel_name(formals) {
...
}

Call Syntax:

kernel_name<<<GridDim, BlockDim, BlockHeapSize, Stream>>>(actuals);

The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). The execution configuration arguments are evaluated before the regular function arguments. I cannot find information about the order in which the execution configuration arguments themselves are evaluated.

  • GridDim - a dim3 (or int) specifying the number of blocks along each dimension of the grid
  • BlockDim - a dim3 (or int) specifying the number of threads along each dimension of a block
  • BlockHeapSize (optional) - the amount of dynamically allocated shared memory, in bytes, for each block (default: 0; might be ignorable in the verification process)
  • Stream (optional) - the Cuda stream on which to enqueue this kernel (default: 0, the Null stream)

Maximum block size is 1024 threads on current GPUs.
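
For example, a launch with two-dimensional blocks might look like the following sketch (the kernel name, the sizes, and the assumption that d_data points to device memory and stream is a valid cudaStream_t are all made up for illustration):

__global__ void my_kernel(float *data) { /* ... */ }      // hypothetical kernel

void launch(float *d_data, cudaStream_t stream) {
  dim3 blockDim(16, 16);   // 16 * 16 = 256 threads per block (within the 1024-thread limit)
  dim3 gridDim(64, 64);    // 64 * 64 blocks in the grid
  my_kernel<<<gridDim, blockDim>>>(d_data);               // default BlockHeapSize (0) and default stream (0)
  my_kernel<<<gridDim, blockDim, 0, stream>>>(d_data);    // same launch, enqueued on an explicit stream
}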

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels

Thread Hierarchy

Inside a kernel, each thread has access to two vectors containing thread id information.

  • threadIdx - 3-component vector specifying the index of the executing thread inside its containing block. threadIdx.d is in the range [0, BlockDim.d - 1], where d is the coordinate you wish to use (d is one of {x, y, z})
  • blockIdx - 3-component vector specifying the index of the containing block of the executing thread. blockIdx.d is in the range [0, GridDim.d - 1]

The total number of threads that execute the kernel is GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z.

Thread blocks must be able to execute independently. Threads in the same block can be synchronized using __syncthreads() (which acts as a barrier) and can communicate through shared memory.
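
A minimal sketch of block-local communication (the kernel name and the fixed buffer size of 256 are assumptions; it requires blockDim.x <= 256 and a 1D launch):

__global__ void reverse_in_block(int *data) {
  __shared__ int tmp[256];                          // visible to all threads in this block
  int base = blockIdx.x * blockDim.x;
  tmp[threadIdx.x] = data[base + threadIdx.x];
  __syncthreads();                                  // barrier: all writes to tmp are complete
  data[base + threadIdx.x] = tmp[blockDim.x - 1 - threadIdx.x];
}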

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Memory Hierarchy

Cuda threads have access to a number of different memory spaces while executing.

  • Private - memory accessible only by a single thread
  • Shared - memory accessible by all threads in a block
  • Global - memory accessible by all threads in all blocks, optimized for general purpose usage
    • Constant - read-only global memory
    • Texture - read-only global memory optimized for certain access patterns and equipped with special access capabilities

All globally accessible memory is persistent across multiple kernel invocations by the same program.
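
For example (hypothetical names), a __device__ variable keeps its value from one kernel launch to the next:

__device__ int counter = 0;                        // global memory: persists across launches

__global__ void bump() {
  if (blockIdx.x == 0 && threadIdx.x == 0) counter += 1;
}

int main() {
  bump<<<1, 32>>>();
  bump<<<1, 32>>>();                               // both launches update the same counter
  int h = 0;
  cudaMemcpyFromSymbol(&h, counter, sizeof(int));  // h is now 2
  return 0;
}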

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

Heterogeneous Programming

Cuda threads are assumed to execute on a device separate from the CPU (e.g. a GPU), so the host (CPU) and the device (GPU) have separate memory spaces, called host memory and device memory respectively. Because the host cannot directly access memory on the device, memory management must be performed with calls to the Cuda Runtime.
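
A typical pattern (the kernel, buffer names, and sizes below are made up): allocate device memory, copy inputs over, launch, and copy results back.

#include <stdlib.h>

__global__ void scale(float *a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] *= 2.0f;
}

int main() {
  const int n = 1024;
  float *h_a = (float *)malloc(n * sizeof(float));          // host memory
  for (int i = 0; i < n; i++) h_a[i] = (float)i;
  float *d_a = NULL;
  cudaMalloc((void **)&d_a, n * sizeof(float));             // device memory
  cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
  scale<<<n / 256, 256>>>(d_a, n);
  cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  free(h_a);
  return 0;
}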

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming

Cuda-C Language Extensions

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions

Function Type Qualifiers

  • __device__ - Executed on device and callable from device.
  • __global__ - Executed on device and callable from host and some devices. Calls must include execution configuration and execute asynchronously.
  • __host__ - Executed on host and callable from host. If no type qualifier is specified for a function, __host__ is implicitly assumed to be the default.

__global__ and __host__ cannot both qualify the same function, but __host__ and __device__ can.
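
A brief sketch (function names are made up): a __host__ __device__ helper can be shared between host code and kernels, while a plain __device__ function is device-only.

__host__ __device__ float square(float x) { return x * x; }              // callable from host and device

__device__ float plus_one_squared(float x) { return square(x) + 1.0f; }  // device-only

__global__ void apply(float *v) {
  v[threadIdx.x] = plus_one_squared(v[threadIdx.x]);
}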

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-type-qualifiers

Variable Type Qualifiers

  • __device__ -
    • Resides in the device's global memory space for the duration of the application
    • Accessible from all threads, and from the host through the use of runtime calls
  • __managed__ (or __managed__ __device__) - like __device__, but may additionally be referenced directly from host code
  • __constant__ (or __constant__ __device__) - like __device__, but resides in read-only constant memory instead of global memory
  • __shared__ (or __shared__ __device__) -
    • Resides in a block's shared memory space for the lifetime of the block
    • Only accessible by threads within that block
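
A sketch of the qualifiers in use (all names are made up; the kernel assumes a block of at most 128 threads):

__device__    int d_flag;         // global memory, lives for the whole application
__constant__  float coeffs[16];   // read-only constant memory, set from the host with cudaMemcpyToSymbol
__managed__   int ticks;          // global memory, also directly referenceable from host code

__global__ void k(float *out) {
  __shared__ float tile[128];     // shared memory, lives as long as the block
  tile[threadIdx.x] = coeffs[threadIdx.x % 16];
  out[threadIdx.x] = tile[threadIdx.x];   // each thread reads only its own element, so no barrier is needed
}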

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#variable-type-qualifiers

Built-in Vector Types

Vector types are constructed from the basic numeric C types. The 1st, 2nd, 3rd and 4th elements of a vector type can be accessed as v.x, v.y, v.z and v.w respectively. A vector with X elements of type T has type TX (e.g. int3). Vectors of type <type_name> are created with the constructor function make_<type_name>, which takes the elements of the vector as parameters. dim3 is a vector type like uint3, except that fewer than 3 dimensions can be specified; unspecified elements default to 1.
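
For instance:

int3   a = make_int3(1, 2, 3);        // a.x == 1, a.y == 2, a.z == 3
float2 p = make_float2(0.5f, 1.5f);   // p.x == 0.5f, p.y == 1.5f

dim3 grid(16, 16);                    // unspecified z defaults to 1: (16, 16, 1)
dim3 block(256);                      // (256, 1, 1)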

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types

Built-in Variables

  • gridDim - a dim3 containing the number of blocks along each dimension of the grid
  • blockIdx - a uint3 containing the index of the block inside the grid
  • blockDim - a dim3 containing the number of threads along each dimension of a block
  • threadIdx - a uint3 containing the index of the thread inside its block
  • warpSize - an int containing the size of a warp; the warp size depends on the hardware
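
These are typically combined to compute a unique global index per thread; a minimal sketch (names made up, 1D launch assumed):

__global__ void fill(int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index built from the built-in variables
  if (i < n)                                       // guard: the last block may extend past n
    out[i] = i;
}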

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-variables

Memory Fence Functions

Cuda's memory model is weakly ordered, meaning that, along with arbitrarily interleaving operations from separate threads, the compiler and hardware may also reorder a thread's independent reads and writes. Because they are independent, this reordering will not affect the behavior of a single thread, but it can affect the way that multiple threads interact.

Example of reordering (corrected from link):

__device__ int X = 1, Y = 2; 
__device__ void writeXY() {
  X = 10; 
  Y = 20; 
} 
__device__ void readXY() { 
  int B = Y; 
  int A = X; 
}

If readXY and writeXY run in separate threads, it is possible to end up with A = 1 and B = 20, even though this violates sequential consistency. For this reason, one should not attempt to communicate between threads in separate blocks in this manner.

There are 3 memory fence functions that one can use to prevent reordering. The scope of each varies but all three ensure that

  • writes made by the calling thread before the fence are observable by other threads before writes made by the calling thread after the fence
  • reads made by the calling thread before the fence are performed before reads made by the calling thread after the fence

Essentially, writes cannot be reordered across the fence, and reads cannot be reordered across the fence.

The three functions:

  • __threadfence_block() - Write observability constraint applies to all threads in the same block as the calling thread
  • __threadfence() - Write observability constraint applies to all threads on the same device as the calling thread
  • __threadfence_system() - Write observability constraint applies to all threads, even those on other devices
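
A sketch (hypothetical names, and a deliberately simplified spin loop) of the usual producer/consumer use of __threadfence(): the fences ensure that if the consumer sees the flag set, it also sees the data written before the producer's fence.

__device__ volatile int flag = 0;
__device__ int data = 0;

__device__ void producer() {
  data = 42;
  __threadfence();        // data is observable device-wide before the write to flag below
  flag = 1;
}

__device__ void consumer(int *out) {
  while (flag == 0) { }   // spin until the producer signals (simplified; real code needs more care)
  __threadfence();        // the read of flag above is performed before the read of data below
  *out = data;            // observes 42
}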

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

Synchronization Functions

The main synchronization function is called __syncthreads(). It acts as a barrier for the threads in a block; that is, no thread continues past the call until all threads have reached it. It can be used to prevent race conditions between threads in a block. A call to __syncthreads() can exist in conditionally executed code as long as all threads agree on the execution path. Behavior is undefined if some threads execute the call and some don't.

Three variations of __syncthreads():

  • int __syncthreads_count(int predicate) - also returns the number of threads for which the predicate was non-zero
  • int __syncthreads_and(int predicate) - also returns non-zero iff predicate was non-zero for all threads
  • int __syncthreads_or(int predicate) - also returns non-zero iff predicate was non-zero for any thread
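
A short sketch using the counting variant (names made up; assumes the launch exactly covers the input array):

__global__ void count_positive(const float *v, int *block_counts) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // All threads in the block must reach this call; every thread gets the same count back.
  int n_pos = __syncthreads_count(v[i] > 0.0f);
  if (threadIdx.x == 0)
    block_counts[blockIdx.x] = n_pos;
}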

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions

Cuda Translation Rules

In progress...

Ideas:

For each kernel K, create a function __kernel_K that takes the same arguments as K in addition to the execution configuration parameters. In the transformed program, __kernel_K is called wherever K was called in the original.

Each kernel spawns a nested function Block for each block as designated by the execution configuration. Each block spawns a nested function Thread that performs the kernel instructions concurrently with all other threads.

__global__ void dot(float *a, float *b, float *c) {
...
}

becomes

void __kernel_dot(dim3 _gridDim, dim3 _blockDim, size_t _blockHeapSize, cudaStream_t __s, float *a, float *b, float *c) {

  void Block(uint3 blockIdx, float *a, float *b, float *c) {
  
  // shared memory declarations

    void Thread(uint3 threadIdx, float *a, float *b, float *c) {
      ...
    }
    // spawn all threads and wait
  }
  // spawn all blocks and wait
}

To help enforce the Cuda memory hierarchy, we introduce two root level variables, $host_scope and $device_scope. Here is the proposed layout.

  $scope $host_scope;
  $scope $device_scope = $here;

  // declarations of variables/functions that are accessible (including through the use of runtime calls)
  // from both host and device
  
  // global/device functions from original program
  __global__ void _kernel_f (<params>) {
    ...
  }

  int main () {
    $host_scope = $here;

    // host functions declarations from original program

    int _main () { // old main
      ...
    }
  }

Pointers to device memory cannot be dereferenced in host code in general, though there are some exceptions. Checks must be made to determine whether direct access is allowed. For example, in host code:

  int *p = (int *)malloc(10 * sizeof(int));
  ...
  p[i] = 0;

becomes

  int *p = (int *)$malloc($host_scope, 10 * sizeof(int));
  ...
  $assert($scopeof(p[i]) <= $host_scope);
  p[i] = 0;

and

  int *p;
  cudaMalloc((void **) &p, 10 * sizeof(int));
  ...
  p[i] = 0;

becomes

  int *p;
  p = (int *)$malloc($device_scope, 10 * sizeof(int));
  ...
  $assert($scopeof(p[i]) <= $host_scope);
  p[i] = 0;

Similarly, device-only functions cannot be called from host-code.

Other Resources

Streams and Queues - http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf
