Changes between Version 9 and Version 10 of Implementation_of_CUDA_in_CIVL


Timestamp:
07/26/22 14:40:24
Author:
zgrnhlt
Comment:

Updated block level and thread level transformation to include communicators, warp level barriers, and data race related function calls


{{{
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
  void _cuda_block(uint3 blockIdx) {
    int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
    int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
    $gcomm gComm = $gcomm_create($here, numThreads);
    $gbarrier warpBarriers[numWarps];
    $scope _block_root = $here;
    for (int i = 0; i < numWarps - 1; i++) {
      warpBarriers[i] = $gbarrier_create(_block_root, 32);
    }
    warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
    $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
    void _cuda_thread(uint3 threadIdx) {
      ...
    }
    $cuda_run_procs(blockDim, _cuda_thread);
    $gbarrier_destroy(_cuda_block_barrier);
    for (int i = 0; i < numWarps; i++) {
      $gbarrier_destroy(warpBarriers[i]);
    }
    $gcomm_destroy(gComm, (void*)0);
  }
  $cuda_wait_in_queue(_cuda_this, _cuda_event);
  ...
}
}}}
Each `_cuda_block` that is spawned thus creates its own global barrier and global communicator, each with size equal to the number of threads that the block holds. It also creates warp-level barriers (warps are groups of 32 threads within a block). Then it uses the `$cuda_run_procs` function again to spawn an instance of `_cuda_thread(threadIdx)` for each `threadIdx` possible within the dimensions of `blockDim` and waits for their completion. After all threads in the block have finished, it destroys all global barriers with `$gbarrier_destroy` and destroys the global communicator with `$gcomm_destroy`.

We can now see the last layer, finally revealing the full transformation of our kernel:
     
{{{
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
  void _cuda_block(uint3 blockIdx) {
    int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
    int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
    $gcomm gComm = $gcomm_create($here, numThreads);
    $gbarrier warpBarriers[numWarps];
    $scope _block_root = $here;
    for (int i = 0; i < numWarps - 1; i++) {
      warpBarriers[i] = $gbarrier_create(_block_root, 32);
    }
    warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
    $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
    void _cuda_thread(uint3 threadIdx) {
      $local_start();
      int _cuda_tid = $cuda_index(blockDim, threadIdx);
      int _cuda_kid = $cuda_kernel_index(gridDim, blockDim, blockIdx, threadIdx);
      $comm comm = $comm_create($here, gComm, _cuda_tid);
      $barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid);
      $read_set_push();
      $write_set_push();
      ...Kernel definition of K...
      $check_data_race(_cuda_this, _cuda_kid);
      $read_set_pop();
      $write_set_pop();
      $barrier_destroy(_cuda_thread_barrier);
      $comm_destroy(comm);
      $local_end();
    }
    $cuda_run_procs(blockDim, _cuda_thread);
    $gbarrier_destroy(_cuda_block_barrier);
    for (int i = 0; i < numWarps; i++) {
      $gbarrier_destroy(warpBarriers[i]);
    }
    $gcomm_destroy(gComm, (void*)0);
  }
  $cuda_wait_in_queue(_cuda_this, _cuda_event);
  ...
}
}}}
The thread definition begins with `$local_start`, which treats the entire code section as atomic until a `$yield` or `$local_end`; this improves the efficiency of data-race checking. Next, the thread's block-level ID (`_cuda_tid`) and kernel-level ID (`_cuda_kid`) are calculated using the helper functions `$cuda_index` and `$cuda_kernel_index`, respectively. `_cuda_tid` is the thread's unique position within its block, based on the value of `threadIdx`, and `_cuda_kid` is the thread's unique position within the kernel, based on the values of `threadIdx` and `blockIdx`.
Each thread then defines a `$barrier` created from the block-level `$gbarrier` and a `$comm` created from the block-level `$gcomm`, positioning itself in both using `_cuda_tid`. Next, the `$read_set_push` and `$write_set_push` calls are important for data-race checking: they push new memory sets onto the read and write set stacks held by the kernel. When a call to `$check_data_race` is made, only the reads and writes recorded in the current read and write sets are compared against those of the other threads, so these calls create fresh memory sets for the run of the kernel.

Then the thread runs the actual code found in the original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)`, so that we properly emulate block synchronization at these points. After the original kernel completes, a call to `$check_data_race` searches for any data races detectable from the reads and writes each thread has collected in its memory sets. Next, calls to `$read_set_pop` and `$write_set_pop` pop the previously pushed memory sets off the kernel's stacks. Then each thread destroys its local barrier with `$barrier_destroy(_cuda_thread_barrier)` and its local communicator with `$comm_destroy(comm)`. Finally, a call to `$local_end` ends the atomic section, and the thread is done.