= CUDA Overview =

== Introduction ==

This page describes how we translate CUDA programs into CIVL-C code. Primarily, we focus on how the cuda-civl library is organized and used in our final translation of a CUDA program. We assume basic knowledge of CUDA concepts such as streams, kernels, blocks, and threads.

== The CUDA Context and CUDA Streams ==

=== Structural Organization ===

cuda-civl provides a structure called `$cuda_context_t` which is meant to house all CUDA information that pertains globally to a CUDA program. As such, our translation creates only one instance of this structure, as a global variable simply called `$cuda_current_context`. Currently, the only information that `$cuda_context_t` manages is the set of CUDA streams being used in the program (including the null stream, which is present in every program).

{{{
typedef struct $cuda_context $cuda_context_t;

struct $cuda_context {
    $cuda_stream_node_t* headNode;
    cudaStream_t nullStream;
    int numStreams;
};
}}}

`$cuda_stream_node_t` is simply a structure which holds a `cudaStream_t` and a pointer to another `$cuda_stream_node_t`; in other words, the nodes form a linked list of `cudaStream_t`'s. In general, we use the pattern in which types of the form `T_node_t` are structures representing nodes of a linked list containing type `T`. The streams in this list are the "non-default" or "non-null" CUDA streams meant for asynchronous execution of kernels. The integer `numStreams` is the size of this list, and `nullStream` holds the null stream, which is used by default when executing kernels. Thus, the total number of streams at any given time in the program is `$cuda_current_context.numStreams + 1`. `cudaStream_t` is a type defined by the actual CUDA headers; however, cuda-civl gives its own definition of it so that we can actually use it in CIVL code and analyze it.
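The context's stream list behaves like an ordinary singly linked list with an external count. A minimal plain-C sketch of the idea (the names `context_t`, `context_add_stream`, etc. are illustrative stand-ins, not the actual cuda-civl sources):

```c
#include <stdlib.h>

/* Simplified stand-ins for the cuda-civl structures (illustrative only). */
typedef struct civl_stream { int id; } civl_stream_t;

typedef struct stream_node {
    civl_stream_t *stream;
    struct stream_node *next;
} stream_node_t;

typedef struct context {
    stream_node_t *headNode;  /* list of non-default streams */
    civl_stream_t *nullStream; /* the default (null) stream  */
    int numStreams;           /* length of the list          */
} context_t;

/* Prepend a newly created stream to the context's list. */
civl_stream_t *context_add_stream(context_t *ctx) {
    civl_stream_t *s = malloc(sizeof *s);
    stream_node_t *node = malloc(sizeof *node);
    node->stream = s;
    node->next = ctx->headNode;
    ctx->headNode = node;
    ctx->numStreams++;
    return s;
}

/* Total streams = non-default streams + the ever-present null stream. */
int context_total_streams(const context_t *ctx) {
    return ctx->numStreams + 1;
}
```

This mirrors the invariant stated above: the program always has `numStreams + 1` streams in total, because the null stream lives outside the list.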
Because a CUDA stream is essentially just a queue of kernels, our definition of `cudaStream_t` is very simple:

{{{
typedef struct _CUstream _CUstream;
typedef _CUstream* cudaStream_t;

struct _CUstream {
    $cuda_kernel_instance_node_t* mostRecent;
    _Bool usable;
};
}}}

Here `mostRecent` is the front of the queue, which is implemented as a list of `$cuda_kernel_instance_t`'s. `usable` is a boolean that signals whether kernels can still be enqueued onto the stream (when `usable == true`) or whether the stream can be destroyed (when `usable == false`). Defining `cudaStream_t` as a pointer type mirrors the actual CUDA headers, where `cudaStream_t` is likewise an opaque pointer type. One consequence of using a pointer here is that wherever `cudaStream_t` appears in the cuda-civl library, there is logic to interpret the null pointer as the null stream.

The last component of these structures is the `$cuda_kernel_instance_t` datatype. It is a simple bundling of a `$proc` holding a reference to the spawned kernel (details of this are discussed later in section '''REF HERE''') together with an enumerator stating whether the kernel is waiting (for the completion of other kernels in its stream), running, or finished. Here is the definition of the struct along with the definition of the enum it uses for its status:

{{{
typedef enum $cuda_kernel_status {
    $cuda_kernel_status_waiting,
    $cuda_kernel_status_running,
    $cuda_kernel_status_finished
} $cuda_kernel_status;

typedef struct $cuda_kernel_instance $cuda_kernel_instance_t;

struct $cuda_kernel_instance {
    $proc process;
    $cuda_kernel_status status;
};
}}}

Because a kernel instance is simply a `$proc` with a status, the only things one can really do with it are check its status or wait for the proc to finish.
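Since each kernel instance is just a process plus a status, a question like "is this stream finished?" reduces to walking the stream's instance list and inspecting every status. A hypothetical plain-C sketch of that walk (the real library operates on `$proc` handles, which are elided here; all names are illustrative):

```c
#include <stddef.h>

/* Mirrors the $cuda_kernel_status enum from the listing above. */
typedef enum { KERNEL_WAITING, KERNEL_RUNNING, KERNEL_FINISHED } kernel_status;

/* A node of the stream's kernel-instance list. */
typedef struct kernel_node {
    kernel_status status;
    struct kernel_node *next;
} kernel_node_t;

/* A stream is finished when every kernel instance in its list is finished. */
int stream_finished(const kernel_node_t *mostRecent) {
    for (const kernel_node_t *n = mostRecent; n != NULL; n = n->next)
        if (n->status != KERNEL_FINISHED)
            return 0;
    return 1;
}
```

Waiting on a stream is then just a matter of blocking until this predicate becomes true, which is exactly the kind of check CIVL can analyze once the kernels are organized this way.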
Therefore, the entire purpose of all of the structures discussed above is to organize the processes created (from CUDA kernels) in a centralized, globally accessible place so that we can easily check the status of these kernels or select subsets of them to wait on.

=== Management of CUDA Streams and Kernels ===

Before executing any CUDA functions, we need to initialize our `$cuda_current_context`. Additionally, after we are done executing all CUDA functions, we need to destroy all the kernels and streams that were created over the lifetime of the program. We do this by renaming the original `main()` function to `_civl_main()` and then creating a new `main()` function that looks like this:

{{{
int main() {
    $cuda_init();
    _civl_main();
    $cuda_finalize();
}
}}}

`$cuda_init()` creates the null stream of `$cuda_current_context` with the function `$cuda_stream_create()`. `$cuda_finalize()` waits for all kernels in all streams to run to completion, and then frees all memory relating to kernels and streams.

The function `$cuda_stream_create()` mallocs a new stream, sets `usable` to true, and then adds a dummy `$cuda_kernel_instance_t` with a finished status to the stream's list. The purpose of this dummy kernel is to simplify the logic in our library so that we do not need to check for an empty list each time we use a stream. In standard CUDA code, the way one creates new streams is with the CUDA library function `cudaStreamCreate(cudaStream_t* pStream)`. Our implementation of this function also calls `$cuda_stream_create()`, but it then adds the new stream to the non-default stream list of `$cuda_current_context` and returns `cudaSuccess`.

There are two primary actions one can perform on a stream.
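The dummy-kernel trick in `$cuda_stream_create()` is a standard sentinel-node pattern: seeding the list with one already-finished entry means traversal and status checks never have to special-case an empty list. A small plain-C sketch of the idea (names like `sentinel_stream_t` are illustrative, not the library's actual code):

```c
#include <stdlib.h>

typedef enum { ST_WAITING, ST_RUNNING, ST_FINISHED } kstatus;

typedef struct kinstance_node {
    kstatus status;
    struct kinstance_node *next;
} kinstance_node_t;

typedef struct {
    kinstance_node_t *mostRecent;
    int usable;
} sentinel_stream_t;

/* Create a stream whose kernel list starts with a finished "dummy"
 * instance, so callers never need to check for an empty list. */
sentinel_stream_t *sentinel_stream_create(void) {
    sentinel_stream_t *s = malloc(sizeof *s);
    kinstance_node_t *dummy = malloc(sizeof *dummy);
    dummy->status = ST_FINISHED;
    dummy->next = NULL;
    s->mostRecent = dummy;
    s->usable = 1;
    return s;
}
```

With the sentinel in place, a freshly created stream already satisfies "all kernels in this stream are finished", which is exactly the behavior one wants from a stream with no real work enqueued.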
One can wait on a stream (that is, wait until all kernels of the stream are finished) with the function `$cuda_stream_wait(cudaStream_t)`, or one can add a kernel instance to the stream with the function `$cuda_enqueue_kernel(cudaStream_t, (void ($cuda_kernel_instance_t*, cudaEvent_t)))`. However, to explain the details of these functions we now need to learn more about how we emulate CUDA kernels in CIVL.

== CUDA Kernels ==

=== The Signature of a Kernel ===

A typical CUDA kernel call has the form:

{{{
K<<<gridDim, blockDim, memSize, stream>>>(args);
}}}

The (up to) four parameters inside the angle brackets are called the ''execution configuration parameters'' or simply the ''configuration parameters''. The parameters `gridDim` and `blockDim` are then implicitly available from within the kernel under those names. To support these features, we simply transform the definition of the kernel into a regular function with the signature

{{{
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args);
}}}

The function call is then transformed to match this new signature.

=== The Layers of a CUDA Kernel in CIVL-C ===

The transformation described above certainly allows us to support the use of the configuration parameters. However, we still have to somehow add code which will emulate the true execution of a CUDA kernel with the four configuration parameters given. That means we must spawn the appropriate number of threads, each with the appropriate local CUDA parameters `blockIdx` and `threadIdx` in scope and given a value, and then appropriately enqueue the kernel into the given stream, waiting as necessary on other CUDA kernels in the stream.
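To make the call-site transformation concrete: a launch such as `K<<<grid, block>>>(x)` (with the optional shared-memory-size and stream arguments defaulted to `0` and the null stream) becomes an ordinary call `_cuda_K(grid, block, 0, NULL, x)`. A toy plain-C sketch of the transformed shape, with a stand-in `dim3` and a trivial kernel body (the `int` return value is only there to make the example observable; the real transformed kernel returns `void`):

```c
#include <stddef.h>

typedef struct { int x, y, z; } dim3;  /* stand-in for CUDA's dim3 type */
typedef void *cudaStream_t;            /* stand-in opaque stream handle */

/* Transformed kernel: the configuration parameters become leading
 * ordinary arguments, followed by the original kernel arguments. */
int _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size,
            cudaStream_t _cuda_stream, int arg) {
    (void)_cuda_mem_size; (void)_cuda_stream;
    /* total threads to spawn = grid size * block size */
    int nthreads = gridDim.x * gridDim.y * gridDim.z
                 * blockDim.x * blockDim.y * blockDim.z;
    return nthreads + arg;  /* trivial body, purely for illustration */
}
```

The emulation layer described next is then responsible for spawning `nthreads` CIVL-C processes, each with its own `blockIdx` and `threadIdx` values in scope.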