The transformation described above certainly allows us to support the use of the configuration parameters. However, we still have to somehow add code that emulates the true execution of a CUDA kernel with the 4 configuration parameters given. That means we must spawn the appropriate number of threads, each with the local CUDA parameters `blockIdx` and `threadIdx` declared in scope and given a value, and then appropriately enqueue the kernel into the given stream, waiting as necessary on other CUDA kernels in the stream. The kernel is thus transformed to accomplish this.

The transformed kernel is composed of several simple layers that we will discuss one at a time here, revealing more information as we go. The first layer handles creating the kernel instance and enqueuing it onto the appropriate stream.
{{{
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
    void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
        ...
    }
    $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
}}}
For clarity, we will refer to `_cuda_kernel` as the ''inner kernel'' of our original kernel `K`.

`$cuda_enqueue_kernel` does the following:
 1. Creates a new `cudaEvent_t` called `e` based on the stream being used. (see below for further details)
 2. Creates a new `$cuda_kernel_instance_t` and enqueues it onto the stream.
 3. Spawns the inner kernel as a new process, passing in the `$cuda_kernel_instance_t` created in step 2 and the `cudaEvent_t` created in step 1 as its parameters.
 4. Sets the `process` field of the `$cuda_kernel_instance_t` from step 2 to be the spawned process from step 3.

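The four steps above can be sketched in plain C. None of these runtime internals appear in the source, so everything here is an assumption made for illustration: the `$`-prefixed names are spelled with a leading underscore (since `$` is not legal in C identifiers), the stream is modeled as a simple FIFO of kernel instances, and a pthread stands in for the spawned process.

```c
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical shapes for the runtime types described in the text. */
typedef struct _cuda_ki {
    pthread_t process;              /* set in step 4 */
    volatile int done;              /* set by the inner kernel when it finishes */
    struct _cuda_ki *next;          /* next instance in the stream's queue */
} _cuda_kernel_instance_t;

typedef struct _CUstream { _cuda_kernel_instance_t *head, *tail; } *cudaStream_t;

typedef struct _CUevent {
    _cuda_kernel_instance_t **instances;
    int numInstances;
} cudaEvent_t;

typedef void (*_cuda_inner_kernel_t)(_cuda_kernel_instance_t *, cudaEvent_t);

struct _cuda_launch { _cuda_inner_kernel_t fn; _cuda_kernel_instance_t *self; cudaEvent_t ev; };

static void *_cuda_trampoline(void *p) {
    struct _cuda_launch *l = p;
    l->fn(l->self, l->ev);
    free(l);
    return NULL;
}

_cuda_kernel_instance_t *_cuda_enqueue_kernel(cudaStream_t s, _cuda_inner_kernel_t fn) {
    /* 1. build an event holding the most recent kernel in the stream, if any */
    cudaEvent_t ev = { NULL, 0 };
    if (s->tail) {
        ev.instances = malloc(sizeof *ev.instances);
        ev.instances[0] = s->tail;
        ev.numInstances = 1;
    }
    /* 2. create a new instance and append it to the stream's queue */
    _cuda_kernel_instance_t *inst = calloc(1, sizeof *inst);
    if (s->tail) s->tail->next = inst; else s->head = inst;
    s->tail = inst;
    /* 3. spawn the inner kernel, passing the instance and the event */
    struct _cuda_launch *l = malloc(sizeof *l);
    l->fn = fn; l->self = inst; l->ev = ev;
    pthread_t t;
    pthread_create(&t, NULL, _cuda_trampoline, l);
    /* 4. record the spawned "process" in the instance */
    inst->process = t;
    return inst;
}

/* Demo inner kernel: crude spin-wait on its event, then record finish order. */
static _cuda_kernel_instance_t *volatile _finish_order[2];
static volatile int _nfinished = 0;

static void _demo_inner(_cuda_kernel_instance_t *self, cudaEvent_t ev) {
    for (int i = 0; i < ev.numInstances; i++)
        while (!ev.instances[i]->done) ;   /* $cuda_wait_in_queue, crudely */
    _finish_order[_nfinished++] = self;
    self->done = 1;                        /* $cuda_kernel_finish, crudely */
}

/* Enqueue two kernels on one stream; the second must finish after the first. */
int _cuda_enqueue_demo(void) {
    struct _CUstream s = { NULL, NULL };
    _cuda_kernel_instance_t *a = _cuda_enqueue_kernel(&s, _demo_inner);
    _cuda_kernel_instance_t *b = _cuda_enqueue_kernel(&s, _demo_inner);
    pthread_join(a->process, NULL);
    pthread_join(b->process, NULL);
    return _nfinished == 2 && _finish_order[0] == a && _finish_order[1] == b;
}
```

The event is built ''before'' the new instance is enqueued, so it captures only the kernels already in the stream at launch time, which is exactly the waiting rule described below.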
Recall that a CUDA kernel in a non-null stream, call it `s`, must wait for all other kernels that were enqueued in `s` or the null stream at the time that the kernel was launched. Additionally, any kernel launched on the null stream must wait for all kernels enqueued in any stream at the time of launch. A `cudaEvent_t` serves as a structure that stores some set of kernels that we can wait on.
{{{
typedef struct _CUevent cudaEvent_t;
struct _CUevent {
    $cuda_kernel_instance_t** instances;
    int numInstances;
};
}}}
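These waiting rules determine which instances end up in an event. The sketch below is illustrative only: the stream representation, the `make_event` helper, and the convention that index 0 is the null stream are all assumptions made for this example, not part of the transformation.

```c
#include <stdlib.h>

typedef struct { int id; } inst_t;          /* stands in for $cuda_kernel_instance_t */
typedef struct { inst_t *tail; } stream_t;  /* tail = most recent kernel, or NULL */
typedef struct { inst_t **instances; int numInstances; } event_t;

/* Collect the kernels a new launch must wait on:
   - a null-stream launch waits on the most recent kernel of every stream;
   - a launch on stream k waits on the most recent kernel of stream k
     and of the null stream. */
event_t make_event(stream_t **streams, int nstreams, int launch) {
    event_t ev = { malloc(nstreams * sizeof(inst_t *)), 0 };
    for (int i = 0; i < nstreams; i++) {
        int relevant = (launch == 0) || (i == 0) || (i == launch);
        if (relevant && streams[i]->tail)
            ev.instances[ev.numInstances++] = streams[i]->tail;
    }
    return ev;
}

/* Demo: three streams, each with one enqueued kernel. */
int make_event_demo(void) {
    inst_t A = {1}, B = {2}, C = {3};
    stream_t s0 = {&A}, s1 = {&B}, s2 = {&C};  /* s0 plays the null stream */
    stream_t *all[] = { &s0, &s1, &s2 };
    event_t on_s1   = make_event(all, 3, 1);   /* waits on A and B */
    event_t on_null = make_event(all, 3, 0);   /* waits on A, B and C */
    return on_s1.numInstances * 10 + on_null.numInstances;
}
```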
Therefore, when we create the `cudaEvent_t` in step 1, we are simply grabbing the most recent kernel from each stream that we want to wait on, and storing it in this new event. We then pass this event to the inner kernel so that the inner kernel can wait on those other kernels before actually running itself. This can be seen in the next layer of our transformed kernel `K`:
{{{
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
    void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
        void _cuda_block(uint3 blockIdx) {
            ...
        }
        $cuda_wait_in_queue(_cuda_this, _cuda_event);
        $cuda_run_procs(gridDim, _cuda_block);
        $cuda_kernel_finish(_cuda_this);
    }
    $cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
}}}
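Once the wait completes, `$cuda_run_procs` presumably invokes `_cuda_block` once for every block index in the grid. A minimal sequential sketch of that iteration, with stand-in type names (the real runtime would likely spawn a process per block rather than loop):

```c
/* dim3u stands in for CUDA's dim3/uint3; the helper name is hypothetical. */
typedef struct { unsigned x, y, z; } dim3u;

/* Invoke block_fn once per block index in the grid, in row-major order. */
void cuda_run_procs(dim3u gridDim, void (*block_fn)(dim3u)) {
    for (unsigned z = 0; z < gridDim.z; z++)
        for (unsigned y = 0; y < gridDim.y; y++)
            for (unsigned x = 0; x < gridDim.x; x++) {
                dim3u blockIdx = { x, y, z };
                block_fn(blockIdx);   /* one _cuda_block invocation per block */
            }
}

/* Demo: count the invocations for a 2x3x1 grid. */
static int _block_calls = 0;
static void _count_block(dim3u b) { (void)b; _block_calls++; }

int run_procs_demo(void) {
    dim3u grid = { 2, 3, 1 };
    cuda_run_procs(grid, _count_block);
    return _block_calls;
}
```

Within each such block, a further layer would spawn the per-thread instances that give `threadIdx` its value, mirroring how this layer gives `blockIdx` its value.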