Context Navigation

Changes between Version 6 and Version 7 of Implementation_of_CUDA_in_CIVL

Timestamp:: 03/18/22 07:35:20 (4 years ago)
Author:: Alex Wilton
Comment:: --

Legend:

: Unmodified
: Added
: Removed
: Modified

Implementation_of_CUDA_in_CIVL

-              v6
+              v7
+}
 }}}
+* `$cuda_wait_in_queue` is the call that does precisely what we just described. It waits on all of the events held by `_cuda_event` and then destroys the event. Then it sets the status of _cuda_this to running.
+* `$cuda_run_procs` spawns an instance of `_cuda_block(blockIdx)` for each `blockIdx` possible within the dimensions of `gridDim` and then waits for these processes to complete.
+* `$cuda_kernel_finish` sets the status of `_cuda_this` to finished. This is what flags to other kernels that may be waiting on this one that we are done.
+Now we can investigate what `_cuda_block` really does by looking at the next layer of our kernel:
+{{{
+void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
+  void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
+    void _cuda_block(uint3 blockIdx) {
+      $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
+      void _cuda_thread(uint3 threadIdx) {
+        ...
+      }
+      $cuda_run_procs(blockDim, _cuda_thread);
+      $gbarrier_destroy(_cuda_block_barrier);
+    }
+    $cuda_wait_in_queue(_cuda_this, _cuda_event);
+    $cuda_run_procs(gridDim, _cuda_block);
+    $cuda_kernel_finish(_cuda_this);
+  }
+  $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
+}
+}}}
+Each `_cuda_block` that is spawned thus creates its own global barrier, with a size equal to the number of threads that the block holds. Then it uses the `$cuda_run_procs` function again to spawn an instance of `_cuda_thread(threadIdx)` for each `threadIdx` possible within the dimensions of `blockDim` and wait for their completion. After all threads in the block have finished, it destroys the barrier with `$gbarrier_destroy`.
+We can now see the last layer, finally revealing the full transformation of our kernel:
+{{{
+void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
+  void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
+    void _cuda_block(uint3 blockIdx) {
+      $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
+      void _cuda_thread(uint3 threadIdx) {
+        int _cuda_tid = $cuda_index(blockDim, threadIdx);
+        $barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid);
+        ...Kernel definition of K...
+        $barrier_destroy(_cuda_thread_barrier);
+      }
+      $cuda_run_procs(blockDim, _cuda_thread);
+      $gbarrier_destroy(_cuda_block_barrier);
+    }
+    $cuda_wait_in_queue(_cuda_this, _cuda_event);
+    $cuda_run_procs(gridDim, _cuda_block);
+    $cuda_kernel_finish(_cuda_this);
+  }
+  $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
+}
+}}}
+We can see that each thread simply defines a `$barrier` created from the block-level `$gbarrier` and it positions itself in the barrier based on the value of `$cuda_index(blockDim, threadIdx)` which is just a helper function to calculate a unique position within the block based on the value of `threadIdx`. Then it runs the actual code found in our original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these transformed statements is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)` so that we properly emulate block synchronization at these points. Finally, after executing the kernel code, each thread destroys its local barrier with the call `$barrier_destroy(_cuda_thread_barrier)` and then it is done.