| | 117 | * `$cuda_wait_in_queue` is the call that does precisely what we just described. It waits on all of the events held by `_cuda_event` and then destroys the event. Then it sets the status of _cuda_this to running. |
| | 118 | * `$cuda_run_procs` spawns an instance of `_cuda_block(blockIdx)` for each `blockIdx` possible within the dimensions of `gridDim` and then waits for these processes to complete. |
| | 119 | * `$cuda_kernel_finish` sets the status of `_cuda_this` to finished. This is what flags to other kernels that may be waiting on this one that we are done. |
| | 120 | |
| | 121 | Now we can investigate what `_cuda_block` really does by looking at the next layer of our kernel: |
| | 122 | {{{ |
| | 123 | void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) { |
| | 124 | void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) { |
| | 125 | void _cuda_block(uint3 blockIdx) { |
| | 126 | $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z); |
| | 127 | void _cuda_thread(uint3 threadIdx) { |
| | 128 | ... |
| | 129 | } |
| | 130 | $cuda_run_procs(blockDim, _cuda_thread); |
| | 131 | $gbarrier_destroy(_cuda_block_barrier); |
| | 132 | } |
| | 133 | $cuda_wait_in_queue(_cuda_this, _cuda_event); |
| | 134 | $cuda_run_procs(gridDim, _cuda_block); |
| | 135 | $cuda_kernel_finish(_cuda_this); |
| | 136 | } |
| | 137 | $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel); |
| | 138 | } |
| | 139 | }}} |
| | 140 | Each `_cuda_block` that is spawned thus creates its own global barrier, with a size equal to the number of threads that the block holds. Then it uses the `$cuda_run_procs` function again to spawn an instance of `_cuda_thread(threadIdx)` for each `threadIdx` possible within the dimensions of `blockDim` and wait for their completion. After all threads in the block have finished, it destroys the barrier with `$gbarrier_destroy`. |
| | 141 | |
| | 142 | We can now see the last layer, finally revealing the full transformation of our kernel: |
| | 143 | {{{ |
| | 144 | void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) { |
| | 145 | void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) { |
| | 146 | void _cuda_block(uint3 blockIdx) { |
| | 147 | $gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z); |
| | 148 | void _cuda_thread(uint3 threadIdx) { |
| | 149 | int _cuda_tid = $cuda_index(blockDim, threadIdx); |
| | 150 | $barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid); |
| | 151 | ...Kernel definition of K... |
| | 152 | $barrier_destroy(_cuda_thread_barrier); |
| | 153 | } |
| | 154 | $cuda_run_procs(blockDim, _cuda_thread); |
| | 155 | $gbarrier_destroy(_cuda_block_barrier); |
| | 156 | } |
| | 157 | $cuda_wait_in_queue(_cuda_this, _cuda_event); |
| | 158 | $cuda_run_procs(gridDim, _cuda_block); |
| | 159 | $cuda_kernel_finish(_cuda_this); |
| | 160 | } |
| | 161 | $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel); |
| | 162 | } |
| | 163 | }}} |
| | 164 | We can see that each thread simply defines a `$barrier` created from the block-level `$gbarrier` and it positions itself in the barrier based on the value of `$cuda_index(blockDim, threadIdx)` which is just a helper function to calculate a unique position within the block based on the value of `threadIdx`. Then it runs the actual code found in our original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these transformed statements is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)` so that we properly emulate block synchronization at these points. Finally, after executing the kernel code, each thread destroys its local barrier with the call `$barrier_destroy(_cuda_thread_barrier)` and then it is done. |