Changes between Version 2 and Version 3 of Notes_on_CUDA_Semantics


Timestamp: 06/06/14 10:42:10
Author: andrevm

== CUDA Programming Model ==

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

=== Kernels ===

Declaration Syntax:
{{{
__global__ void kernel_name(formals) {
    ...
}
}}}

Call Syntax:
{{{
kernel_name<<<GridDim, BlockDim, BlockHeapSize, Stream>>>(actuals);
}}}

The `<<<...>>>` is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration).

* `GridDim` - a `dim3` (or `int`) specifying the dimensions of the grid, in number of blocks
* `BlockDim` - a `dim3` (or `int`) specifying the dimensions of each block, in number of threads
* `BlockHeapSize` (optional) - the number of bytes of dynamically allocated shared memory per block (default: 0)
* `Stream` (optional) - the CUDA stream on which to enqueue this kernel (default: 0, the null stream)

The maximum block size is 1024 threads on current GPUs.
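
For concreteness, here is a minimal sketch of a kernel declaration and launch; the kernel name `vec_add`, the pointer names (assumed to already be device pointers), and the size `n` are illustrative, not from the original notes:

{{{
#include <cuda_runtime.h>

// Each thread handles one element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // guard: the grid may cover more than n threads
        c[i] = a[i] + b[i];
}

void launch(const float* d_a, const float* d_b, float* d_c, int n) {
    int block = 256;                      // threads per block (<= 1024)
    int grid = (n + block - 1) / block;   // blocks, rounded up to cover n
    // BlockHeapSize and Stream are omitted here, so they take their
    // defaults: 0 bytes and the null stream.
    vec_add<<<grid, block>>>(d_a, d_b, d_c, n);
}
}}}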

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels

=== Thread Hierarchy ===

Inside a kernel, each thread has access to built-in vectors that identify its position in the thread hierarchy.

* `threadIdx` - a 3-component vector specifying the index of the executing thread inside its containing block; `threadIdx.d` is in the range [0, BlockDim.d - 1], where d is the coordinate you wish to use (d belongs to the set {x, y, z})
* `blockIdx` - a 3-component vector specifying the index of the containing block of the executing thread; `blockIdx.d` is in the range [0, GridDim.d - 1]
* `blockDim` - a 3-component vector equal to the `BlockDim` passed to the kernel as part of the execution configuration

The total number of threads launched is `GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z`.
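
As a sketch of how these vectors combine in practice (the kernel and variable names are illustrative, not from the original notes), a 2D kernel typically derives a unique global coordinate for each thread:

{{{
__global__ void index_demo(float* img, int width, int height) {
    // 2D grid of 2D blocks: one (col, row) pair per thread.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        img[row * width + col] = 1.0f;   // row-major global index
}
}}}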

Thread blocks must be able to execute independently. Threads in the same block can be synchronized using `__syncthreads()` (which acts as a barrier) and can communicate using shared memory.
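
A minimal sketch of both mechanisms (the kernel and variable names are illustrative, not from the original notes): each block stages its slice of the input in shared memory, and `__syncthreads()` guarantees every thread's write is visible before any thread reads.

{{{
__global__ void reverse_each_block(float* data) {
    __shared__ float tile[256];          // one copy per block; assumes
    int t = threadIdx.x;                 // a block size of exactly 256

    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                     // barrier: all writes to tile done

    // Reverse the block's slice in place using the staged copy.
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}
}}}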

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy

=== Memory Hierarchy ===

CUDA threads have access to a number of different memory spaces while executing.

* Private - memory accessible only by a single thread
* Shared - memory accessible by all threads in a block
* Global - memory accessible by all threads in all blocks, optimized for general-purpose use
  * Constant - read-only global memory
  * Texture - read-only global memory optimized for certain access patterns and equipped with special access capabilities

All globally accessible memory is persistent across multiple kernel invocations by the same program.
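
As a sketch of how these spaces appear in code (the declaration qualifiers are standard CUDA C; the kernel and variable names are illustrative, not from the original notes):

{{{
__device__   float d_global[256];    // global memory: visible to all threads
__constant__ float d_coeffs[16];     // constant memory: read-only in kernels

__global__ void spaces_demo(void) {
    __shared__ float s_tile[256];    // shared memory: one copy per block
    float r = d_coeffs[0];           // private: per-thread register/local

    s_tile[threadIdx.x] = r;         // assumes a block size of at most 256
    __syncthreads();
    d_global[threadIdx.x] = s_tile[threadIdx.x];
}
}}}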

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

=== Heterogeneous Programming ===

CUDA threads are assumed to execute on a device (e.g. a GPU) that is physically separate from the CPU, so the CPU (host) and the GPU (device) have separate memory spaces, called host memory and device memory respectively. Because the host cannot directly access memory on the device, memory management must be performed with calls to the CUDA Runtime.
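
A minimal sketch of the usual allocate/copy/launch/copy-back pattern with the CUDA Runtime (the buffer names and size are illustrative; the kernel launch itself is elided):

{{{
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);   // host memory
    float* d_data = NULL;                    // device memory

    cudaMalloc((void**)&d_data, bytes);      // allocate on the device
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* ... launch kernels that read and write d_data ... */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);                        // release device memory
    free(h_data);
    return 0;
}
}}}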

More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming

== CUDA Translation Rules ==

In progress