wiki:Notes_on_CUDA_Semantics

Version 3 (modified by andrevm, 12 years ago)

--

CUDA Programming Model

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

Kernels

Declaration Syntax:

__global__ void kernel_name(formals) {
...
}

Call Syntax:

kernel_name<<<GridDim, BlockDim, BlockHeapSize, Stream>>> (actuals);

The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)

  • GridDim - a dim3 (or int) specifying the dimensions of the grid, in blocks
  • BlockDim - a dim3 (or int) specifying the dimensions of each block, in threads
  • BlockHeapSize (optional) - the number of bytes of shared memory dynamically allocated per block (default: 0)
  • Stream (optional) - the CUDA stream on which to enqueue this kernel (default: 0, the null stream)

Maximum block size is 1024 threads on current GPUs.
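As a sketch (assuming a device array d_out of at least n floats has already been allocated; the kernel name and parameters are illustrative), a kernel and its launch might look like:

```cuda
// Hypothetical kernel: writes each thread's global index into d_out.
__global__ void write_indices(float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may cover more threads than n
        d_out[i] = (float)i;
}

// Launch: 4 blocks of 256 threads each; BlockHeapSize and Stream
// take their defaults (0 bytes of dynamic shared memory, null stream).
write_indices<<<4, 256>>>(d_out, n);

// A 2D configuration uses dim3 for GridDim and BlockDim:
dim3 grid(2, 2);    // 2 x 2 = 4 blocks
dim3 block(16, 16); // 16 x 16 = 256 threads per block
write_indices<<<grid, block>>>(d_out, n);
```

Note that a launch is asynchronous with respect to the host: control returns before the kernel finishes.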

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels

Thread Hierarchy

Inside a kernel, each thread has access to built-in vectors containing thread and block index information.

  • threadIdx - 3-component vector specifying the index of the executing thread within its containing block. threadIdx.d is in the range [0, BlockDim.d - 1], where d is the coordinate you wish to use (d belongs to the set {x, y, z})
  • blockIdx - 3-component vector specifying the index of the thread's containing block within the grid. blockIdx.d is in the range [0, GridDim.d - 1]
  • blockDim - 3-component vector equal to the BlockDim passed to the kernel as part of the execution configuration
  • gridDim - 3-component vector equal to the GridDim passed to the kernel as part of the execution configuration

The total number of threads launched (i.e., executions of the kernel body) is GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z.

Thread blocks must be able to execute independently of one another, in any order. Threads within the same block can be synchronized using __syncthreads() (which acts as a barrier) and can communicate through shared memory.

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

Memory Hierarchy

CUDA threads have access to a number of distinct memory spaces while executing.

  • Private - memory accessible only by a single thread (registers and per-thread local memory)
  • Shared - memory accessible by all threads in a block
  • Global - memory accessible by all threads in all blocks, optimized for general-purpose use
    • Constant - read-only global memory
    • Texture - read-only global memory optimized for certain access patterns and equipped with special addressing capabilities

All globally accessible memory (global, constant, and texture) is persistent across multiple kernel invocations by the same program.
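These spaces correspond to declaration qualifiers inside a kernel. A minimal sketch (all names are illustrative):

```cuda
// Each memory space is selected by where and how a variable is declared.
__constant__ float c_scale;      // constant: read-only for all threads

__global__ void spaces_demo(float *g_data) {  // g_data points to global memory
    __shared__ float s_buf[256]; // shared: one copy per block
    float t;                     // private: per-thread registers/local memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s_buf[threadIdx.x] = g_data[i];
    __syncthreads();             // make s_buf visible to the whole block
    t = s_buf[threadIdx.x] * c_scale;
    g_data[i] = t;               // results written back to global memory
}
```

Only the write to g_data survives the kernel: shared and private storage is released when the block and thread finish.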

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

Heterogeneous Programming

CUDA threads are assumed to execute on a device (e.g. a GPU) separate from the CPU, so the CPU (host) and the device have separate memory spaces, called host memory and device memory respectively. Because the host cannot directly access memory on the device, memory management must be performed through calls to the CUDA Runtime.
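The typical allocate/copy/launch/copy-back/free pattern can be sketched as follows (error checking omitted; scale_kernel is a hypothetical kernel assumed to be defined elsewhere):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float h_data[1024];                       // host memory
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data;                            // device memory
    cudaMalloc(&d_data, bytes);               // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    scale_kernel<<<n / 256, 256>>>(d_data, n); // compute on the device

    // cudaMemcpy on the null stream waits for the kernel to finish.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}
```

Every device-side buffer thus has an explicit lifecycle managed from the host; dereferencing d_data directly in host code would be invalid.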

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming

CUDA Translation Rules

In progress
