Version 3 (modified 12 years ago)
Cuda Programming Model
More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model
Kernels
Declaration Syntax:
__global__ void kernel_name(formals) {
...
}
Call Syntax:
kernel_name<<<GridDim, BlockDim, BlockHeapSize, Stream>>> (actuals);
The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
- GridDim - a dim3 (or int) specifying the dimensions of the grid in units of # of blocks
- BlockDim - a dim3 (or int) specifying the dimensions of each block in units of # of threads
- BlockHeapSize (optional) - number of bytes of dynamically allocated shared memory for each block (default: 0)
- Stream - the Cuda stream on which to enqueue this kernel (default: 0/Null stream)
Maximum block size is 1024 threads on current GPUs.
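As a sketch of the two syntaxes together, the following hypothetical kernel `add` (the name, parameters, and the device pointers `d_a`, `d_b`, `d_out` are invented for illustration; the pointers are assumed to have been allocated on the device elsewhere) is launched with enough 256-thread blocks to cover n elements; BlockHeapSize and Stream are omitted, so they default to 0:

```cuda
// Kernel definition: each thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                // guard: the grid may cover more than n threads
        out[i] = a[i] + b[i];
}

// Host-side launch for n elements:
int n = 1 << 20;
int blockDim = 256;
int gridDim = (n + blockDim - 1) / blockDim;  // round up to cover all n
add<<<gridDim, blockDim>>>(d_a, d_b, d_out, n);
```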
More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels
Thread Hierarchy
Inside a kernel, each thread has access to two vectors containing thread id information.
- threadIdx - 3-component vector specifying the index of the executing thread inside its containing block. threadIdx.d is in the range [0, BlockDim.d - 1], where d is the coordinate of the vector you wish to use (d belongs to the set {x, y, z})
- blockIdx - 3-component vector specifying the index of the containing block of the executing thread. blockIdx.d is in the range [0, GridDim.d - 1]
- blockDim - 3-component vector equal to the BlockDim passed to the kernel as part of the execution configuration
The total number of threads that execute the kernel (one kernel execution per thread) is GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z.
Thread blocks must be able to execute independently. Threads in the same block can be synchronized using __syncthreads() (acts as a barrier) and can communicate using shared memory.
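A sketch of how the built-in vectors combine into a per-thread index, and of __syncthreads() acting as a block-wide barrier (the kernel below is a hypothetical example, not from the source):

```cuda
__global__ void reverse_block(float *data) {
    // Reverse each 256-element block of data in place, staging
    // through shared memory so threads can exchange values.
    __shared__ float tmp[256];
    int local  = threadIdx.x;                      // index within this block
    int global = blockIdx.x * blockDim.x + local;  // index within the grid

    tmp[local] = data[global];
    __syncthreads();  // barrier: all writes to tmp finish before any reads
    data[global] = tmp[blockDim.x - 1 - local];
}
```

This assumes a 1D launch with a block size of exactly 256; the barrier is what makes reading another thread's element of tmp safe.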
More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
Memory Hierarchy
Cuda threads have access to a number of different memory spaces while executing.
- Private - memory accessible only by a single thread
- Shared - memory accessible by all threads in a block
- Global - memory accessible by all threads in all blocks, optimized for general purpose usage
- Constant - read-only global memory
- Texture - read-only global memory optimized for certain access patterns and equipped with special access capabilities
All globally accessible memory is persistent across multiple kernel invocations by the same program.
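The spaces above correspond roughly to declaration qualifiers in kernel code; a minimal sketch (all identifiers invented for illustration):

```cuda
__constant__ float coeffs[16];  // constant: read-only from device code
__device__   float result;      // global: persists across kernel launches

__global__ void spaces_demo(const float *in) {
    float x = in[threadIdx.x];      // x lives in private, per-thread memory
    __shared__ float partial[128];  // shared: one copy per block
    partial[threadIdx.x] = x * coeffs[0];
    __syncthreads();
    if (threadIdx.x == 0)
        result = partial[0];        // written to globally accessible memory
}
```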
More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy
Heterogeneous Programming
Cuda threads are assumed to execute on a device separate from the CPU (e.g. a GPU), so the CPU (host) and the GPU (device) have separate memory spaces, called host memory and device memory respectively. Because the host cannot directly access memory on the device, memory management must be performed with calls to the Cuda Runtime.
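A typical host-side pattern using the Cuda Runtime to move data between the two memory spaces (a sketch with error checking omitted; the kernel `scale` is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main() {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));            // allocate device memory
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);             // copy host -> device
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);  // run on the device
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);             // copy device -> host
    cudaFree(dev);
    return 0;
}
```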
More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming
Cuda Translation Rules
In progress
