Changes between Version 9 and Version 10 of Implementation_of_CUDA_in_CIVL


Timestamp:
07/26/22 14:40:24
Author:
zgrnhlt
Comment:

Updated block level and thread level transformation to include communicators, warp level barriers, and data race related function calls


{{{
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
  void _cuda_block(uint3 blockIdx) {
    int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
    int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
    $gcomm gComm = $gcomm_create($here, numThreads);
    $gbarrier warpBarriers[numWarps];
    $scope _block_root = $here;
    for (int i = 0; i < numWarps - 1; i++) {
      warpBarriers[i] = $gbarrier_create(_block_root, 32);
    }
    warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
    $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
    void _cuda_thread(uint3 threadIdx) {
      ...
    }
    $cuda_run_procs(blockDim, _cuda_thread);
    $gbarrier_destroy(_cuda_block_barrier);
    for (int i = 0; i < numWarps; i++) {
      $gbarrier_destroy(warpBarriers[i]);
    }
    $gcomm_destroy(gComm, (void*)0);
  }
  $cuda_wait_in_queue(_cuda_this, _cuda_event);
  ...
}
}}}
Each `_cuda_block` that is spawned thus creates its own global barrier and global communicator, each with size equal to the number of threads that the block holds. It also creates warp-level barriers (warps are groups of 32 threads within a block). Then it uses the `$cuda_run_procs` function again to spawn an instance of `_cuda_thread(threadIdx)` for each `threadIdx` possible within the dimensions of `blockDim` and waits for their completion. After all threads in the block have finished, it destroys all global barriers with `$gbarrier_destroy` and destroys the global communicator with `$gcomm_destroy`.

We can now see the last layer, finally revealing the full transformation of our kernel:
     
{{{
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
  void _cuda_block(uint3 blockIdx) {
    int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
    int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
    $gcomm gComm = $gcomm_create($here, numThreads);
    $gbarrier warpBarriers[numWarps];
    $scope _block_root = $here;
    for (int i = 0; i < numWarps - 1; i++) {
      warpBarriers[i] = $gbarrier_create(_block_root, 32);
    }
    warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
    $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
    void _cuda_thread(uint3 threadIdx) {
      $local_start();
      int _cuda_tid = $cuda_index(blockDim, threadIdx);
      int _cuda_kid = $cuda_kernel_index(gridDim, blockDim, blockIdx, threadIdx);
      $comm comm = $comm_create($here, gComm, _cuda_tid);
      $barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid);
      $read_set_push();
      $write_set_push();
      ...Kernel definition of K...
      $check_data_race(_cuda_this, _cuda_kid);
      $read_set_pop();
      $write_set_pop();
      $barrier_destroy(_cuda_thread_barrier);
      $comm_destroy(comm);
      $local_end();
    }
    $cuda_run_procs(blockDim, _cuda_thread);
    $gbarrier_destroy(_cuda_block_barrier);
    for (int i = 0; i < numWarps; i++) {
      $gbarrier_destroy(warpBarriers[i]);
    }
    $gcomm_destroy(gComm, (void*)0);
  }
  $cuda_wait_in_queue(_cuda_this, _cuda_event);
  ...
}
}}}
The thread definition begins with `$local_start`, which treats the entire code section as atomic until a `$yield` or `$local_end`; this improves the efficiency of data-race checking. Next, the thread's block-level ID (`_cuda_tid`) and kernel-level ID (`_cuda_kid`) are calculated using the helper functions `$cuda_index` and `$cuda_kernel_index`, respectively. `_cuda_tid` is the thread's unique position within its block, based on the value of `threadIdx`, and `_cuda_kid` is the thread's unique position within the kernel, based on the values of `threadIdx` and `blockIdx`.
Each thread then defines a `$barrier` created from the block-level `$gbarrier` and a `$comm` created from the block-level `$gcomm`, positioning itself in both using `_cuda_tid`. Next, the `$read_set_push` and `$write_set_push` calls are important for data-race checking: they push new memory sets onto the read and write set stacks held by the kernel. When a call to `$check_data_race` is made, only the reads and writes recorded in the current read and write sets are compared against those of the other threads, so these calls create fresh memory sets for the run of the kernel.

Then the thread runs the actual code found in the original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)`, so that we properly emulate block synchronization at these points. After the original kernel completes, a call to `$check_data_race` searches for any data races detectable from the reads and writes each thread has collected in its memory sets. Next, calls to `$read_set_pop` and `$write_set_pop` pop the previously pushed memory sets off the kernel's stacks. Then each thread destroys its local barrier with `$barrier_destroy(_cuda_thread_barrier)` and its local communicator with `$comm_destroy(comm)`. Finally, a call to `$local_end` ends the atomic section, and the thread is done.