Changes between Version 6 and Version 7 of Implementation_of_CUDA_in_CIVL


Ignore:
Timestamp:
03/18/22 07:35:20 (4 years ago)
Author:
Alex Wilton
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • Implementation_of_CUDA_in_CIVL

    v6 v7  
    115115}
    116116}}}
     117* `$cuda_wait_in_queue` is the call that does precisely what we just described. It waits on all of the events held by `_cuda_event` and then destroys the event. Then it sets the status of _cuda_this to running.
     118* `$cuda_run_procs` spawns an instance of `_cuda_block(blockIdx)` for each `blockIdx` possible within the dimensions of `gridDim` and then waits for these processes to complete.
     119* `$cuda_kernel_finish` sets the status of `_cuda_this` to finished. This is what flags to other kernels that may be waiting on this one that we are done.
     120
     121Now we can investigate what `_cuda_block` really does by looking at the next layer of our kernel:
     122{{{
     123void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
     124  void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
     125    void _cuda_block(uint3 blockIdx) {
     126      $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
     127      void _cuda_thread(uint3 threadIdx) {
     128        ...
     129      }
     130      $cuda_run_procs(blockDim, _cuda_thread);
     131      $gbarrier_destroy(_cuda_block_barrier);
     132    }
     133    $cuda_wait_in_queue(_cuda_this, _cuda_event);
     134    $cuda_run_procs(gridDim, _cuda_block);
     135    $cuda_kernel_finish(_cuda_this);
     136  }
     137  $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
     138}
     139}}}
     140Each `_cuda_block` that is spawned thus creates its own global barrier, with a size equal to the number of threads that the block holds. Then it uses the `$cuda_run_procs` function again to spawn an instance of `_cuda_thread(threadIdx)` for each `threadIdx` possible within the dimensions of `blockDim` and wait for their completion. After all threads in the block have finished, it destroys the barrier with `$gbarrier_destroy`.
     141
     142We can now see the last layer, finally revealing the full transformation of our kernel:
     143{{{
     144void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
     145  void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
     146    void _cuda_block(uint3 blockIdx) {
     147      $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
     148      void _cuda_thread(uint3 threadIdx) {
     149        int _cuda_tid = $cuda_index(blockDim, threadIdx);
     150        $barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid);
     151        ...Kernel definition of K...
     152        $barrier_destroy(_cuda_thread_barrier);
     153      }
     154      $cuda_run_procs(blockDim, _cuda_thread);
     155      $gbarrier_destroy(_cuda_block_barrier);
     156    }
     157    $cuda_wait_in_queue(_cuda_this, _cuda_event);
     158    $cuda_run_procs(gridDim, _cuda_block);
     159    $cuda_kernel_finish(_cuda_this);
     160  }
     161  $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
     162}
     163}}}
     164We can see that each thread simply defines a `$barrier` created from the block-level `$gbarrier` and it positions itself in the barrier based on the value of `$cuda_index(blockDim, threadIdx)` which is just a helper function to calculate a unique position within the block based on the value of `threadIdx`. Then it runs the actual code found in our original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these transformed statements is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)` so that we properly emulate block synchronization at these points. Finally, after executing the kernel code, each thread destroys its local barrier with the call `$barrier_destroy(_cuda_thread_barrier)` and then it is done.