| Version 5 (modified by , 4 years ago) ( diff ) |
|---|
CUDA Overview
Introduction
This page describes how we translate CUDA programs into CIVL-C code. Primarily, we focus on how the cuda-civl library is organized and is used in our final translation of a CUDA program. We assume basic knowledge of CUDA concepts such as steams, kernels, blocks, and threads.
The CUDA Context and CUDA Streams
Structural Organization
cuda-civl provides a structure called $cuda_context_t which is meant to house all CUDA information that pertains globally to a CUDA program. As such, our translation only creates one instance of this structure as a global variable simply called $cuda_current_context. Currently, the only information that $cuda_context_t manages is the set of CUDA streams being used in the program (including the null stream which is present in every program).
typedef struct $cuda_context $cuda_context_t;
struct $cuda_context {
$cuda_stream_node_t* headNode;
cudaStream_t nullStream;
int numStreams;
};
$cuda_stream_node_t is simply a structure which holds a cudaStream_t and a pointer to another $cuda_stream_node_t. In other words it is a linked list of cudaStream_t's. In general, we use the pattern in which types of the form <T>_node_t are structures representing nodes of a linked list containing type T. The streams in this list are the "non-default" or "non-null" CUDA streams meant for asynchronous execution of kernels. The integer numStreams represents the size of this list. nullStream obviously holds the null stream which is used by default when executing kernels. Thus, the number of total streams at any given time of the program is
$cuda_current_context.numStreams + 1.
cudaStream_t is an actual struct in CUDA code, however cuda-civl gives its own definition of it so that we can actually use it in CIVL code and analyze it. Because a CUDA stream is essentially just a queue of kernels, our definition of cudaStream_t is very simple:
typedef struct _CUstream _CUstream;
typedef _CUstream* cudaStream_t;
struct _CUstream {
$cuda_kernel_instance_node_t* mostRecent;
_Bool usable;
};
Here mostRecent is the front of the queue, which is implemented as a list of $cuda_kernel_instance_t's. usable is a boolean meant to signal whether kernels can be enqueued onto it (when usable == true) or whether the stream can be destroyed (when usable == false). The purpose of cudaStream_t being a pointer is a bit unclear to me as of writing this. One thing to note about the use of a pointer here is that in the places in which cudaStream_t is used in the cuda-civl library, there is logic to basically interpret the null pointer as the null stream.
The last component of these structures is the $cuda_kernel_instance_t datatype. It is a simple bundling of a $proc which holds a reference to the spawned kernel (details of this are discussed later in section REF HERE) along with an enumerator that states whether the kernel is waiting (for the completion of other kernels in its stream), running, or finished. Here is the definition of the struct along with the definition of the enum it uses for its status:
typdef enum $cuda_kernel_status {
$cuda_kernel_status_waiting,
$cuda_kernel_status_running,
$cuda_kernel_status_finished
} $cuda_kernel_status;
typedef struct $cuda_kernel_instance $cuda_kernel_instance_t;
struct $cuda_kernel_instance {
$proc process;
$cuda_kernel_status status;
};
Because a kernel is simply a $proc with a status, the only things one can really do with a kernel is check its status or wait for the proc to finish. Therefore, the entire purpose of all of the structures discussed above is to logically organize the processes created (from CUDA kernels) in a centralized, globally accessible place so that we can easily check the status of these kernels or select subsets of these kernels to wait on.
Management of CUDA Streams and Kernels
Before executing any CUDA functions, we need to initialize our $cuda_current_context. Additionally, after we are done executing all CUDA functions, we need to destroy all the kernels and streams that were created over the lifetime of the program. The way we do this is by renaming the original main() function to _civl_main() and then creating a new main() function that looks like this:
int main() {
$cuda_init();
_civl_main();
$cuda_finalize();
}
$cuda_finalize() waits for all kernels in all of its streams to finish completion, and then frees all memory relating to kernels and streams. $cuda_init() creates the null stream of $cuda_current_context with the function $cuda_stream_create(). The method $cuda_stream_create() mallocs a new stream, sets usable to true, and then adds a dummy $cuda_kernel_instance_t to the list with a finished status. The purpose of this dummy kernel is to simplify the logic in our library so that we don't need to check for an empty list each time we use a stream.
In standard CUDA code, the way one creates new streams are with the CUDA library function cudaStreamCreate(cudaStream_t* pStream). Our implementation of this function also calls $cuda_stream_create(), but it then adds this to the non-default stream list of $cuda_current_context and returns cudaSuccess.
There are two primary actions one can do with a stream. One can wait on a stream (that is, wait until all kernels of the stream are finished) with the function $cuda_stream_wait(cudaStream_t), or one can add a kernel instance to the stream with the function $cuda_enqueue_kernel(cudaStream_t, (void ($cuda_kernel_instance_t*, cudaEvent_t))). However, to explain the details of these functions we now need to learn more about how we emulate cuda kernels in CIVL.
CUDA Kernels
The Signature of a Kernel
A typical cuda kernel call has the form:
K<<<gridDim, blockDim, memSize, cudaStream>>>(args);
The first four parameters inside the angle brackets are called the execution configuration parameters or simply the configuration parameters. The parameters gridDim and blockDim are then implicitly available from within the kernel under those names. To support these features, we simply transformation the definition of the kernel to be a regular function with the function signature
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args);
The function call is then transformed to match this new signature.
The Layers of a CUDA Kernel in CIVL-C
The transformation described above certainly allows us to support the use of the configuration parameters. However, we still have to somehow add code which will emulate the true execution of a CUDA kernel with the 4 configuration parameters given. That means we must spawn the appropriate number of threads, each with appropriate local CUDA parameters blockIdx and threadIdx in scope and given a value, and then appropriately enqueue the kernel into the given stream, waiting as necessary on other cuda kernels in the stream.
