| | 126 | int numThreads = (blockDim.x * blockDim.y) * blockDim.z; |
| | 127 | int numWarps = (numThreads / 32) + ((numThreads % 32) != 0); |
| | 128 | $gcomm gComm = $gcomm_create($here, numThreads); |
| | 129 | $gbarrier warpBarriers[numWarps]; |
| | 130 | $scope _block_root = $here; |
| | 131 | for(int i = 0; i < numWarps - 1; i++){ |
| | 132 | warpBarriers[i] = $gbarrier_create(_block_root, 32); |
| | 133 | } |
| | 134 | warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32)); |
| | 160 | int numThreads = (blockDim.x * blockDim.y) * blockDim.z; |
| | 161 | int numWarps = (numThreads / 32) + ((numThreads % 32) != 0); |
| | 162 | $gcomm gComm = $gcomm_create($here, numThreads); |
| | 163 | $gbarrier warpBarriers[numWarps]; |
| | 164 | $scope _block_root = $here; |
| | 165 | for(int i = 0; i < numWarps - 1; i++){ |
| | 166 | warpBarriers[i] = $gbarrier_create(_block_root, 32); |
| | 167 | } |
| | 168 | warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32)); |
| 164 | | We can see that each thread simply defines a `$barrier` created from the block-level `$gbarrier` and it positions itself in the barrier based on the value of `$cuda_index(blockDim, threadIdx)` which is just a helper function to calculate a unique position within the block based on the value of `threadIdx`. Then it runs the actual code found in our original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these transformed statements is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)` so that we properly emulate block synchronization at these points. Finally, after executing the kernel code, each thread destroys its local barrier with the call `$barrier_destroy(_cuda_thread_barrier)` and then it is done. |
| | 200 | The thread definition begins with `$local_start`, which treats the entire code section as atomic until a `$yield` or `$local_end`. This has the purpose of improving the efficiency of data race checking. Next, the thread's thread id (`_cuda_tid`) and kernel id (`_cuda_kid`) are calculated using the helper functions `$cuda_index` and `$cuda_kernel_index` respectively. `_cuda_tid` is a thread's unique position within the block based on the value of `threadIdx`, and `_cuda_kid` is the thread's unique position within the kernel based on the value of `threadIdx` and `blockIdx`. |
| | 201 | Each thread defines a `$barrier` created from the block-level `$gbarrier` and a `$comm` created from the block-level `$gcomm` and positions itself using `_cuda_tid`. Next, the `$read_set_push` and `$write_set_push` function calls are important for data race checking, as they push new memory sets onto the read and write set stacks held by the kernel. When a call to `$check_data_race` is made, only the reads and writes that have been monitored in the current read and write sets will be compared to the other threads. Thus, the calls to `$read_set_push` and `$write_set_push` create fresh memory sets for the purpose of running the kernel. |
| | 202 | |
| | 203 | Then, the thread runs the actual code found in the original kernel `K`, but with some transformations applied to certain types of statements. The most notable of these transformed statements is that every call to `__syncthreads` is simply replaced by a call to `$barrier_call(_cuda_thread_barrier)` so that we properly emulate block synchronization at these points. After the original kernel completes, a call to `$check_data_race` is made to search for any data races that can be detected from the reads and writes that each thread has collected in its memory sets. Next, there are calls to `$read_set_pop` and `$write_set_pop` to pop off the kernel's stacks the memory sets we previously pushed. Then, each thread destroys its local barrier with the call `$barrier_destroy(_cuda_thread_barrier)`, and destroys its local communicator with the call `$comm_destroy(comm)`. Finally, a call to `$local_end` is made to end the atomic section, and then it is done. |