Changes between Version 3 and Version 4 of Notes_on_CUDA_Semantics


Timestamp:
06/12/14 13:59:18 (12 years ago)
Author:
andrevm

  • Notes_on_CUDA_Semantics

    v3 v4  
    1818}}}
    1919
    20 The <<<...>>> is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration)
    21 
    22 * `GridDim` - a dim3 (or int) specifying the dimensions of the grid in units of # of blocks
    23 * `BlockDim` - a dim3 (or int) specifying the dimensions of each block in units of # of threads
    24 * `BlockHeapSize`(optional) - dynamically allocated memory for each block (default: 0)
     20The `<<<...>>>` is called the Execution Configuration (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration). Execution Configuration parameters are evaluated before the regular function parameters. I cannot find documentation on the order in which the execution configuration parameters themselves are evaluated.
     21
     22* `GridDim` - a `dim3` (or `int`) specifying the number of blocks along each dimension of the grid
     23* `BlockDim` - a `dim3` (or `int`) specifying the number of threads along each dimension of a block
     24* `BlockHeapSize` (optional) - dynamically allocated memory for each block, in bytes (default: 0; it may be ignorable in the verification process)
    2525* `Stream` - the Cuda stream on which to enqueue this kernel (default: 0/Null stream)
    2626
     
    3737* `threadIdx` - 3-component vector specifying the index of the executing thread inside its containing block. threadIdx.d is in the range [0, BlockDim.d - 1], where d is the coordinate you wish to use (d belongs to the set {x, y, z})
    3838* `blockIdx` - 3-component vector specifying the index of the containing block of the executing thread. blockIdx.d is in the range [0, GridDim.d - 1]
    39 * `blockDim` - 3 component vector equal to the `BlockDim` passed to the kernel as part of the execution configuration.
    4039
    4140The total number of kernel executions will be equal to `GridDim.x * GridDim.y * GridDim.z * BlockDim.x * BlockDim.y * BlockDim.z`.
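The formula can be checked with a small host-side sketch in plain C (here `dim3_t` is a hypothetical stand-in for CUDA's `dim3`, which is a CUDA-only type):

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's dim3 (illustration only). */
typedef struct { unsigned x, y, z; } dim3_t;

/* Total number of kernel executions (threads) for one launch:
   blocks in the grid times threads per block. */
unsigned long total_threads(dim3_t grid, dim3_t block) {
    unsigned long blocks    = (unsigned long)grid.x * grid.y * grid.z;
    unsigned long per_block = (unsigned long)block.x * block.y * block.z;
    return blocks * per_block;
}
```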
     
    6564More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#heterogeneous-programming
    6665
     66== Cuda-C Language Extensions ==
     67
     68More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#c-language-extensions
     69
     70=== Function Type Qualifiers ===
     71
     72* `__device__` - Executed on device and callable from device.
     73* `__global__` - Executed on device and callable from host and some devices. Calls must include execution configuration and execute asynchronously.
     74* `__host__` - Executed on host and callable from host. If no type qualifier is specified for a function, `__host__` is implicitly assumed to be the default.
     75
     76`__global__` and `__host__` cannot both qualify the same function, but `__host__` and `__device__` can.
     77
     78More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-type-qualifiers
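One way to experiment with the legal qualifier combinations on the host is to stub the qualifiers out as empty macros (a sketch only; under `nvcc` these are real keywords with the semantics described above):

```c
#include <assert.h>

/* The qualifiers are stubbed out as empty macros so this sketch
   compiles as plain C; under nvcc they would be real keywords. */
#define __host__
#define __device__

/* Legal combination: __host__ and __device__ may qualify the same
   function, producing both a host and a device version. */
__host__ __device__ int square(int v) { return v * v; }

/* No qualifier: __host__ is implicitly assumed. */
int cube(int v) { return v * v * v; }
```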
     79
     80=== Variable Type Qualifiers ===
     81
     82* `__device__` -
     83  * Resides in the device's global memory space for the duration of the application
     84  * Accessible from all threads and from host through the use of runtime calls
     85  * `__managed__` or `__managed__ __device__` - Like `__device__`, but may additionally be directly referenced from host code
     86* `__constant__ __device__` - like `__device__`, but located in read-only constant space instead of global space
     87* `__shared__ __device__` -   
     88  * Resides in a block's shared memory space during the duration of the block's life
     89  * Only accessible by threads within that block.
     90
     91More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#variable-type-qualifiers
     92
     93=== Built-in Vector Types ===
     94
     95Vector types are constructed from basic numeric C types. The 1st, 2nd, 3rd and 4th components of a vector type can be accessed as `v.x`, `v.y`, `v.z` and `v.w` respectively. A vector with X components of type T has type TX (e.g. `int3`). Vectors of type <type_name> are created using the constructor function `make_<type_name>`, which takes the components of the vector as parameters. `dim3` is a vector type like `uint3` except that fewer than 3 components may be specified when constructing it; unspecified components are set to 1.
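A minimal sketch of the constructor convention in plain C (here `int3_t`, `make_int3_sim` and `make_dim3_1d` are hypothetical stand-ins for CUDA's `int3`, `make_int3` and the 1-argument `dim3` constructor):

```c
#include <assert.h>

/* Hypothetical stand-ins for CUDA's int3 and make_int3. */
typedef struct { int x, y, z; } int3_t;

int3_t make_int3_sim(int x, int y, int z) {
    int3_t v = {x, y, z};
    return v;
}

/* dim3-like behavior: unspecified components default to 1. */
typedef struct { unsigned x, y, z; } dim3_t;

dim3_t make_dim3_1d(unsigned x) {
    dim3_t d = {x, 1, 1};
    return d;
}
```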
     96
     97More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-vector-types
     98
     99=== Built-in Variables ===
     100
     101* `gridDim` - a `dim3` containing the number of blocks along each dimension of the grid
     102* `blockIdx` - a `uint3` containing the index of the block inside the grid
     103* `blockDim` - a `dim3` containing the number of threads along each dimension of a block
     104* `threadIdx` - a `uint3` containing the index of the thread inside its block
     105* `warpSize` - an `int` containing the size of a warp; the warp size depends on the hardware
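The standard way these variables combine into a flat 1-D index can be sketched on the host (the struct and parameters here are hypothetical stand-ins; in a real kernel the built-ins are implicit and need not be passed in):

```c
#include <assert.h>

/* Hypothetical host-side model of the built-in index variables. */
typedef struct { unsigned x, y, z; } uint3_t;

/* Conventional global 1-D thread index: each preceding block
   contributes blockDim.x threads, and threadIdx.x selects the
   thread within the current block. */
unsigned global_index_1d(uint3_t blockIdx, uint3_t blockDim, uint3_t threadIdx) {
    return blockIdx.x * blockDim.x + threadIdx.x;
}
```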
     106
     107More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#built-in-variables
     108
     109=== Memory Fence Functions ===
     110
     111Cuda's memory model is weakly ordered, meaning that, along with being able to arbitrarily interleave operations from separate threads, the Cuda Runtime may also reorder a thread's independent reads and writes. Because they are independent, this reordering will not affect the operation of a single thread, but it can affect the way that multiple threads interact.
     112
     113Example of reordering (corrected from link):
     114
     115{{{
     116__device__ int X = 1, Y = 2;
     117__device__ void writeXY() {
     118  X = 10;
     119  Y = 20;
     120}
     121__device__ void readXY() {
     122  int B = Y;
     123  int A = X;
     124}
     125}}}
     126
     127If `readXY` and `writeXY` run in separate threads, it is possible that A = 1 and B = 20, even though this violates sequential consistency. For this reason, one should not attempt to communicate between threads in separate blocks in this manner.
     128
     129There are 3 memory fence functions that one can use to prevent reordering. The scope of each varies, but all three ensure that
     130
     131* the calling thread's writes before the fence are observable by other threads before the calling thread's writes after the fence
     132* the calling thread's reads before the fence are performed before the calling thread's reads after the fence
     133
     134Essentially, writes cannot be reordered across the fence, and reads cannot be reordered across the fence.
     135
     136The three functions:
     137* `__threadfence_block()` - Write observability constraint applies to all threads in the same block as the calling thread
     138* `__threadfence()` - Write observability constraint applies to all threads on the same device as the calling thread
     139* `__threadfence_system()` - Write observability constraint applies to all threads, even those on other devices
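The write-then-flag pattern these fences support can be sketched in host C, with `__threadfence()` stubbed by the analogous C11 fence (an illustration only; the CUDA fence is a device intrinsic, and a single-threaded test cannot exhibit the reordering being prevented):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for CUDA's __threadfence(): the C11 fence plays the
   analogous ordering role in this host sketch. */
#define __threadfence() atomic_thread_fence(memory_order_seq_cst)

int result;
int ready;

/* Writer: the fence ensures 'result' is observable to other
   threads before the write to 'ready'. */
void publish(int value) {
    result = value;
    __threadfence();
    ready = 1;
}

/* Reader: once 'ready' is seen as set, 'result' is valid. */
int consume(void) {
    return ready ? result : -1;
}
```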
     140
     141More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions
     142
     143=== Synchronization Functions ===
     144
     145The main synchronization function is called `__syncthreads()`. It acts as a barrier for the threads in a block; that is, no thread continues past the call until all threads have reached it. It can be used to prevent race conditions between threads in a block. A call to `__syncthreads()` can exist in conditionally executed code as long as all threads agree on the execution path. Behavior is undefined if some threads execute the call and some don't.
     146
     147Three variations of `__syncthreads()`:
     148* `int __syncthreads_count(int predicate)` - also returns the number of threads for which the predicate was non-zero
     149* `int __syncthreads_and(int predicate)` - also returns non-zero iff predicate was non-zero for all threads
     150* `int __syncthreads_or(int predicate)` - also returns non-zero iff predicate was non-zero for any thread
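Single-threaded simulations make the return values concrete (the `*_sim` helpers are hypothetical, not CUDA APIs; in a real kernel every thread of the block receives the same return value):

```c
#include <assert.h>

/* Simulation of __syncthreads_count: given the predicate value each
   thread of a block would pass, return the number of non-zero
   predicates (the value every thread would receive). */
int syncthreads_count_sim(const int *predicates, int num_threads) {
    int count = 0;
    for (int i = 0; i < num_threads; i++)
        if (predicates[i] != 0)
            count++;
    return count;
}

/* Simulation of __syncthreads_and: non-zero iff every predicate
   is non-zero. */
int syncthreads_and_sim(const int *predicates, int num_threads) {
    for (int i = 0; i < num_threads; i++)
        if (predicates[i] == 0)
            return 0;
    return 1;
}
```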
     151
     152More information at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#synchronization-functions
     153
    67154== Cuda Translation Rules ==
    68155
    69 In progress
     156In progress...
     157
     158Ideas:
     159
     160For each kernel `K`, create a function `__kernel_K` that takes the same arguments as `K` in addition to the execution configuration parameters.
     161In the transformed program, `__kernel_K` is called wherever `K` was called in the original.
     162
     163Each kernel spawns a nested function `Block` for each block designated by the execution configuration. Each block spawns a nested function `Thread` that performs the kernel instructions concurrently with all other threads.
     164
     165{{{
     166__global__ void dot(float *a, float *b, float *c) {
     167...
     168}
     169}}}
     170becomes
     171{{{
     172void __kernel_dot(dim3 _gridDim, dim3 _blockDim, size_t _blockHeapSize, cudaStream_t __s, float *a, float *b, float *c) {
     173
     174  void Block(uint3 blockIdx, float *a, float *b, float *c) {
     175 
     176  // shared memory declarations
     177
     178    void Thread(uint3 threadIdx, float *a, float *b, float *c) {
     179      ...
     180    }
     181    // spawn all threads and wait
     182  }
     183  // spawn all blocks and wait
     184}
     185}}}
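The spawn-and-wait structure can be approximated sequentially in plain C, shown here for a hypothetical 1-D element-wise kernel rather than `dot` (real blocks and threads execute concurrently; here each (block, thread) pair is simply visited in a loop):

```c
#include <assert.h>

typedef struct { unsigned x, y, z; } dim3_t;

/* Sequential simulation of the translation scheme for a
   hypothetical kernel computing out[i] = a[i] + b[i]. */
void kernel_add_sim(dim3_t gridDim, dim3_t blockDim,
                    const float *a, const float *b, float *out, unsigned n) {
    for (unsigned blockIdx = 0; blockIdx < gridDim.x; blockIdx++) {         /* "spawn" all blocks  */
        for (unsigned threadIdx = 0; threadIdx < blockDim.x; threadIdx++) { /* "spawn" all threads */
            unsigned i = blockIdx * blockDim.x + threadIdx;
            if (i < n)
                out[i] = a[i] + b[i];
        }
    }
}
```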
     186
     187To help enforce the Cuda memory hierarchy, we introduce two root-level variables, `_host_scope` and `_device_scope`. Here is the proposed layout.
     188
     189{{{
     190  $scope _host_scope;
     191  $scope _device_scope = $here;
     192
     193  // declarations of variables/functions that are accessible (including through the use of runtime calls)
     194  // from both host and device
     195 
     196  // global/device functions from original program
     197  __global__ void _kernel_f (<params>) {
     198    ...
     199  }
     200
     201  int main () {
     202    _host_scope = $here;
     203
     204    // host functions declarations from original program
     205
     206    int _main () { // old main
     207      ...
     208    }
     209  }
     210}}}
     211   
     212
     213Pointers to device memory cannot be dereferenced in host code in general, though there are some exceptions. Checks must be made to determine whether direct access is allowed. For example, in host code:
     214{{{
     215  int *p = (int *)malloc(10 * sizeof(int));
     216  ...
     217  p[i] = 0;
     218}}}
     219becomes
     220{{{
     221  int *p = (int *)$malloc(_host_scope, 10 * sizeof(int));
     222  ...
     223  $assert($scopeof(p[i]) <= _host_scope);
     224  p[i] = 0;
     225}}}
     226and
     227{{{
     228  int *p;
     229  cudaMalloc((void **) &p, 10 * sizeof(int));
     230  ...
     231  p[i] = 0;
     232}}}
     233becomes
     234{{{
     235  int *p;
     236  p = (int *)$malloc(_device_scope, 10 * sizeof(int));
     237  ...
     238  $assert($scopeof(p[i]) <= _host_scope);
     239  p[i] = 0;
     240}}}
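The intent of the inserted `$scopeof` check can be modeled in plain C with scope-tagged allocations (a hypothetical sketch; `$malloc`, `$assert`, and `$scopeof` are verifier primitives, not C functions):

```c
#include <assert.h>
#include <stdlib.h>

enum scope { HOST_SCOPE, DEVICE_SCOPE };

/* A pointer tagged with the scope it was allocated in. */
typedef struct { int *ptr; enum scope scope; } tagged_ptr;

tagged_ptr scoped_malloc(enum scope s, size_t bytes) {
    tagged_ptr t = { malloc(bytes), s };
    return t;
}

/* Model of the inserted check: host code may only dereference
   pointers whose memory lies within the host scope. */
int host_access_allowed(tagged_ptr t) {
    return t.scope == HOST_SCOPE;
}
```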
     241
     242Similarly, device-only functions cannot be called from host code.
     243