Changes between Version 3 and Version 4 of Notes_on_CUDA_Semantics


Timestamp:
06/12/14 13:59:18 (12 years ago)
Author:
andrevm

  • Notes_on_CUDA_Semantics

    v3 v4  
    1818}}}
    1919
    20 The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
    21 
    22 * `GridDim` - a dim3 (or int) specifying the dimensions of the grid in units of # of blocks
    23 * `BlockDim` - a dim3 (or int) specifying the dimensions of each block in units of # of threads
    24 * `BlockHeapSize`(optional) - dynamically allocated memory for each block (default: 0)
     20The `<<<...>>>` is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). Execution Configuration parameters are evaluated before the regular function parameters. I cannot find documentation on the order in which the execution configuration parameters themselves are evaluated.
     21
     22* `GridDim` - a `dim3` (or `int`) specifying the number of blocks along each dimension of the grid
     23* `BlockDim` - a `dim3` (or `int`) specifying the number of threads along each dimension of a block
     24* `BlockHeapSize` (optional) - dynamically allocated memory for each block, in bytes (default: 0; it may be ignorable in the verification process)
    2525* `Stream` - the Cuda stream on which to enqueue this kernel (default: 0/Null stream)
    2626
     
    3737* `threadIdx` - 3-component vector specifying the index of the executing thread inside its containing block. threadIdx.d is in the range [0, BlockDim.d - 1], where d is the coordinate you wish to use (d belongs to the set {x, y, z})
    3838* `blockIdx` - 3-component vector specifying the index of the containing block of the executing thread. blockIdx.d is in the range [0, GridDim.d - 1]
    39 * `blockDim` - 3 component vector equal to the `BlockDim` passed to the kernel as part of the execution configuration.
    4039
    4140The total number of kernel executions will be equal to `GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z`.
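The formula can be checked with a small host-side sketch in plain C (here `dim3_t` is a hypothetical stand-in for CUDA's `dim3`, which is a CUDA-only type):

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's dim3 (illustration only). */
typedef struct { unsigned x, y, z; } dim3_t;

/* Total number of kernel executions (threads) for one launch:
   blocks in the grid times threads per block. */
unsigned long total_threads(dim3_t grid, dim3_t block) {
    unsigned long blocks    = (unsigned long)grid.x * grid.y * grid.z;
    unsigned long per_block = (unsigned long)block.x * block.y * block.z;
    return blocks * per_block;
}
```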
     
    6564More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming
    6665
     66== Cuda-C Language Extensions ==
     67
     68More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions
     69
     70=== Function Type Qualifiers ===
     71
     72* `__device__` - Executed on device and callable from device.
     73* `__global__` - Executed on device and callable from host and some devices. Calls must include execution configuration and execute asynchronously.
     74* `__host__` - Executed on host and callable from host. If no type qualifier is specified for a function, `__host__` is implicitly assumed to be the default.
     75
     76`__global__` and `__host__` cannot both qualify the same function, but `__host__` and `__device__` can.
     77
     78More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-type-qualifiers
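One way to experiment with the legal qualifier combinations on the host is to stub the qualifiers out as empty macros (a sketch only; under `nvcc` these are real keywords with the semantics described above):

```c
#include <assert.h>

/* The qualifiers are stubbed out as empty macros so this sketch
   compiles as plain C; under nvcc they would be real keywords. */
#define __host__
#define __device__

/* Legal combination: __host__ and __device__ may qualify the same
   function, producing both a host and a device version. */
__host__ __device__ int square(int v) { return v * v; }

/* No qualifier: __host__ is implicitly assumed. */
int cube(int v) { return v * v * v; }
```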
     79
     80=== Variable Type Qualifiers ===
     81
     82* `__device__` -
     83  * Resides in the device's global memory space for the duration of the application
     84  * Accessible from all threads and from host through the use of runtime calls
     85  * `__managed__` or `__managed__ __device__` - Like `__device__`, but may additionally be directly referenced from host code
     86* `__constant__ __device__` - like `__device__`, but located in read-only constant space instead of global space
     87* `__shared__ __device__` -   
     88  * Resides in a block's shared memory space during the duration of the block's life
     89  * Only accessible by threads within that block.
     90
     91More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#variable-type-qualifiers
     92
     93=== Built-in Vector Types ===
     94
     95Vector types are constructed from basic numeric C types. The 1st, 2nd, 3rd and 4th components of a vector type can be accessed as `v.x`, `v.y`, `v.z` and `v.w` respectively. A vector with X components of type T has type TX (e.g. `int3`). Vectors of type <type_name> are created using the constructor function `make_<type_name>`, which takes the components of the vector as parameters. `dim3` is a vector type like `uint3` except that fewer than 3 components may be specified when constructing it; unspecified components are set to 1.
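A minimal sketch of the constructor convention in plain C (here `int3_t`, `make_int3_sim` and `make_dim3_1d` are hypothetical stand-ins for CUDA's `int3`, `make_int3` and the 1-argument `dim3` constructor):

```c
#include <assert.h>

/* Hypothetical stand-ins for CUDA's int3 and make_int3. */
typedef struct { int x, y, z; } int3_t;

int3_t make_int3_sim(int x, int y, int z) {
    int3_t v = {x, y, z};
    return v;
}

/* dim3-like behavior: unspecified components default to 1. */
typedef struct { unsigned x, y, z; } dim3_t;

dim3_t make_dim3_1d(unsigned x) {
    dim3_t d = {x, 1, 1};
    return d;
}
```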
     96
     97More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types
     98
     99=== Built-in Variables ===
     100
     101* `gridDim` - a `dim3` containing the number of blocks along each dimension of the grid
     102* `blockIdx` - a `uint3` containing the index of the block inside the grid
     103* `blockDim` - a `dim3` containing the number of threads along each dimension of a block
     104* `threadIdx` - a `uint3` containing the index of the thread inside its block
     105* `warpSize` - an `int` containing the size of a warp; the warp size depends on the hardware
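The standard way these variables combine into a flat 1-D index can be sketched on the host (the struct and parameters here are hypothetical stand-ins; in a real kernel the built-ins are implicit and need not be passed in):

```c
#include <assert.h>

/* Hypothetical host-side model of the built-in index variables. */
typedef struct { unsigned x, y, z; } uint3_t;

/* Conventional global 1-D thread index: each preceding block
   contributes blockDim.x threads, and threadIdx.x selects the
   thread within the current block. */
unsigned global_index_1d(uint3_t blockIdx, uint3_t blockDim, uint3_t threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}
```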
     106
     107More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-variables
     108
     109=== Memory Fence Functions ===
     110
     111Cuda's memory model is weakly ordered, meaning that, along with being able to arbitrarily interleave operations from separate threads, the Cuda Runtime may also reorder a thread's independent reads and writes. Because they are independent, this reordering will not affect the operation of a single thread, but it can affect the way that multiple threads interact.
     112
     113Example of reordering (corrected from link):
     114
     115{{{
     116__device__ int X = 1, Y = 2;
     117__device__ void writeXY() {
     118  X = 10;
     119  Y = 20;
     120}
     121__device__ void readXY() {
     122  int B = Y;
     123  int A = X;
     124}
     125}}}
     126
     127If `readXY` and `writeXY` run in separate threads, it is possible that A = 1 and B = 20, even though this violates sequential consistency. For this reason, one should not attempt to communicate between threads in separate blocks in this manner.
     128
     129There are 3 memory fence functions that one can use to prevent reordering. The scope of each varies, but all three ensure that
     130
     131* the calling thread's writes before the fence are observable by other threads before the calling thread's writes after the fence
     132* the calling thread's reads before the fence are performed before the calling thread's reads after the fence
     133
     134Essentially, writes cannot be reordered across the fence, and reads cannot be reordered across the fence.
     135
     136The three functions:
     137* `__threadfence_block()` - Write observability constraint applies to all threads in the same block as the calling thread
     138* `__threadfence()` - Write observability constraint applies to all threads on the same device as the calling thread
     139* `__threadfence_system()` - Write observability constraint applies to all threads, even those on other devices
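The write-then-flag pattern these fences support can be sketched in host C, with `__threadfence()` stubbed by the analogous C11 fence (an illustration only; the CUDA fence is a device intrinsic, and a single-threaded test cannot exhibit the reordering being prevented):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for CUDA's __threadfence(): the C11 fence plays the
   analogous ordering role in this host sketch. */
#define __threadfence() atomic_thread_fence(memory_order_seq_cst)

int result;
int ready;

/* Writer: the fence ensures 'result' is observable to other
   threads before the write to 'ready'. */
void publish(int value) {
    result = value;
    __threadfence();
    ready = 1;
}

/* Reader: once 'ready' is seen as set, 'result' is valid. */
int consume(void) {
    return ready ? result : -1;
}
```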
     140
     141More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
     142
     143=== Synchronization Functions ===
     144
     145The main synchronization function is called `__syncthreads()`. It acts as a barrier for the threads in a block; that is, no thread continues past the call until all threads have reached it. It can be used to prevent race conditions between threads in a block. A call to `__syncthreads()` can exist in conditionally executed code as long as all threads agree on the execution path. Behavior is undefined if some threads execute the call and some don't.
     146
     147Three variations of `__syncthreads()`:
     148* `int __syncthreads_count(int predicate)` - also returns the number of threads for which the predicate was non-zero
     149* `int __syncthreads_and(int predicate)` - also returns non-zero iff predicate was non-zero for all threads
     150* `int __syncthreads_or(int predicate)` - also returns non-zero iff predicate was non-zero for any thread
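Single-threaded simulations make the return values concrete (the `*_sim` helpers are hypothetical, not CUDA APIs; in a real kernel every thread of the block receives the same return value):

```c
#include <assert.h>

/* Simulation of __syncthreads_count: given the predicate value each
   thread of a block would pass, return the number of non-zero
   predicates (the value every thread would receive). */
int syncthreads_count_sim(const int *predicates, int num_threads) {
    int count = 0;
    for (int i = 0; i < num_threads; i++)
        if (predicates[i] != 0)
            count++;
    return count;
}

/* Simulation of __syncthreads_and: non-zero iff every predicate
   is non-zero. */
int syncthreads_and_sim(const int *predicates, int num_threads) {
    for (int i = 0; i < num_threads; i++)
        if (predicates[i] == 0)
            return 0;
    return 1;
}
```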
     151
     152More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions
     153
    67154== Cuda Translation Rules ==
    68155
    69 In progress
     156In progress...
     157
     158Ideas:
     159
     160For each kernel `K`, create a function `__kernel_K` that takes the same arguments as `K` in addition to the execution configuration parameters.
     161In the transformed program, `__kernel_K` is called wherever `K` was called in the original.
     162
     163Each kernel spawns a nested function `Block` for each block designated by the execution configuration. Each block spawns a nested function `Thread` that performs the kernel instructions concurrently with all other threads.
     164
     165{{{
     166__global__ void dot(float *a, float *b, float *c) {
     167...
     168}
     169}}}
     170becomes
     171{{{
     172void __kernel_dot(dim3 _gridDim, dim3 _blockDim, size_t _blockHeapSize, cudaStream_t __s, float *a, float *b, float *c) {
     173
     174  void Block(uint3 blockIdx, float *a, float *b, float *c) {
     175 
     176  // shared memory declarations
     177
     178    void Thread(uint3 threadIdx, float *a, float *b, float *c) {
     179      ...
     180    }
     181    // spawn all threads and wait
     182  }
     183  // spawn all blocks and wait
     184}
     185}}}
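The spawn-and-wait structure can be approximated sequentially in plain C, shown here for a hypothetical 1-D element-wise kernel rather than `dot` (real blocks and threads execute concurrently; here each (block, thread) pair is simply visited in a loop):

```c
#include <assert.h>

typedef struct { unsigned x, y, z; } dim3_t;

/* Sequential simulation of the translation scheme for a
   hypothetical kernel computing out[i] = a[i] + b[i]. */
void kernel_add_sim(dim3_t gridDim, dim3_t blockDim,
                    const float *a, const float *b, float *out, unsigned n) {
    for (unsigned blockIdx = 0; blockIdx < gridDim.x; blockIdx++) {         /* "spawn" all blocks  */
        for (unsigned threadIdx = 0; threadIdx < blockDim.x; threadIdx++) { /* "spawn" all threads */
            unsigned i = blockIdx * blockDim.x + threadIdx;
            if (i < n)
                out[i] = a[i] + b[i];
        }
    }
}
```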
     186
     187To help enforce the Cuda memory hierarchy, we introduce two root-level variables, `_host_scope` and `_device_scope`. Here is the proposed layout.
     188
     189{{{
     190  $scope _host_scope;
     191  $scope _device_scope = $here;
     192
     193  // declarations of variables/functions that are accessible (including through the use of runtime calls)
     194  // from both host and device
     195 
     196  // global/device functions from original program
     197  __global__ void _kernel_f (<params>) {
     198    ...
     199  }
     200
     201  int main () {
     202    _host_scope = $here;
     203
     204    // host functions declarations from original program
     205
     206    int _main () { // old main
     207      ...
     208    }
     209  }
     210}}}
     211   
     212
     213Pointers to device memory cannot be dereferenced in host code in general, though there are some exceptions. Checks must be made to determine whether direct access is allowed. For example, in host code:
     214{{{
     215  int *p = (int *)malloc(10 * sizeof(int));
     216  ...
     217  p[i] = 0;
     218}}}
     219becomes
     220{{{
     221  int *p = (int *)$malloc(_host_scope, 10 * sizeof(int));
     222  ...
     223  $assert($scopeof(p[i]) <= _host_scope);
     224  p[i] = 0;
     225}}}
     226and
     227{{{
     228  int *p;
     229  cudaMalloc((void **) &p, 10 * sizeof(int));
     230  ...
     231  p[i] = 0;
     232}}}
     233becomes
     234{{{
     235  int *p;
     236  p = (int *)$malloc(_device_scope, 10 * sizeof(int));
     237  ...
     238  $assert($scopeof(p[i]) <= _host_scope);
     239  p[i] = 0;
     240}}}
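The intent of the inserted `$scopeof` check can be modeled in plain C with scope-tagged allocations (a hypothetical sketch; `$malloc`, `$assert`, and `$scopeof` are verifier primitives, not C functions):

```c
#include <assert.h>
#include <stdlib.h>

enum scope { HOST_SCOPE, DEVICE_SCOPE };

/* A pointer tagged with the scope it was allocated in. */
typedef struct { int *ptr; enum scope scope; } tagged_ptr;

tagged_ptr scoped_malloc(enum scope s, size_t bytes) {
    tagged_ptr t = { malloc(bytes), s };
    return t;
}

/* Model of the inserted check: host code may only dereference
   pointers whose memory lies within the host scope. */
int host_access_allowed(tagged_ptr t) {
    return t.scope == HOST_SCOPE;
}
```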
     241
     242Similarly, device-only functions cannot be called from host code.
     243