Version 10 (modified 4 years ago)
Implementation of CUDA in CIVL
Introduction
This page describes how we translate CUDA programs into CIVL-C code. We focus primarily on how the cuda-civl library is organized and how it is used in the final translation of a CUDA program. We assume basic knowledge of CUDA concepts such as streams, kernels, blocks, and threads.
The CUDA Context and CUDA Streams
Structural Organization
cuda-civl provides a structure called $cuda_context_t which is meant to house all CUDA information that pertains globally to a CUDA program. As such, our translation only creates one instance of this structure as a global variable simply called $cuda_current_context. Currently, the only information that $cuda_context_t manages is the set of CUDA streams being used in the program (including the null stream which is present in every program).
typedef struct $cuda_context $cuda_context_t;
struct $cuda_context {
$cuda_stream_node_t* headNode;
cudaStream_t nullStream;
int numStreams;
};
$cuda_stream_node_t is simply a structure holding a cudaStream_t and a pointer to another $cuda_stream_node_t; in other words, it forms a linked list of cudaStream_t's. In general, we use the convention that types of the form <T>_node_t are structures representing nodes of a linked list containing type T. The streams in this list are the "non-default" or "non-null" CUDA streams meant for asynchronous execution of kernels, and the integer numStreams records the size of this list. nullStream holds the null stream, which is used by default when executing kernels. Thus, the total number of streams at any given time of the program is
$cuda_current_context.numStreams + 1.
cudaStream_t is an actual struct in CUDA code; however, cuda-civl provides its own definition of it so that we can use and analyze it in CIVL code. Because a CUDA stream is essentially just a queue of kernels, our definition of cudaStream_t is very simple:
typedef struct _CUstream _CUstream;
typedef _CUstream* cudaStream_t;
struct _CUstream {
$cuda_kernel_instance_node_t* mostRecent;
_Bool usable;
};
Here mostRecent is the front of the queue, which is implemented as a list of $cuda_kernel_instance_t's. usable is a boolean signaling whether kernels can still be enqueued onto the stream (when usable == true) or whether the stream can be destroyed (when usable == false). The purpose of making cudaStream_t a pointer is unclear as of this writing. One consequence of the pointer representation is that, wherever cudaStream_t is used in the cuda-civl library, there is logic that interprets the null pointer as the null stream.
The last component of these structures is the $cuda_kernel_instance_t datatype. It is a simple bundling of a $proc, which holds a reference to the spawned kernel (details are discussed later in section REF HERE), with an enumerator stating whether the kernel is waiting (for the completion of other kernels in its stream), running, or finished. Here is the definition of the struct along with the definition of the enum it uses for its status:
typedef enum $cuda_kernel_status {
$cuda_kernel_status_waiting,
$cuda_kernel_status_running,
$cuda_kernel_status_finished
} $cuda_kernel_status;
typedef struct $cuda_kernel_instance $cuda_kernel_instance_t;
struct $cuda_kernel_instance {
$proc process;
$cuda_kernel_status status;
};
Because a kernel is simply a $proc with a status, the only things one can really do with a kernel are check its status or wait for the proc to finish. Therefore, the entire purpose of the structures discussed above is to logically organize the processes created (from CUDA kernels) in a centralized, globally accessible place, so that we can easily check the status of these kernels or select subsets of them to wait on.
Management of CUDA Streams and Kernels
Before executing any CUDA functions, we need to initialize our $cuda_current_context. Additionally, after we are done executing all CUDA functions, we need to destroy all the kernels and streams that were created over the lifetime of the program. The way we do this is by renaming the original main() function to _civl_main() and then creating a new main() function that looks like this:
int main() {
$cuda_init();
_civl_main();
$cuda_finalize();
}
$cuda_finalize() waits for all kernels in all of its streams to finish, and then frees all memory relating to kernels and streams. $cuda_init() creates the null stream of $cuda_current_context with the function $cuda_stream_create(). The function $cuda_stream_create() mallocs a new stream, sets usable to true, and then adds a dummy $cuda_kernel_instance_t with a finished status to the stream's list. The purpose of this dummy kernel is to simplify the logic in our library so that we do not need to check for an empty list each time we use a stream.
In standard CUDA code, the way one creates a new stream is with the CUDA library function cudaStreamCreate(cudaStream_t* pStream). Our implementation of this function also calls $cuda_stream_create(), then adds the new stream to the non-default stream list of $cuda_current_context and returns cudaSuccess.
There are two primary actions one can perform on a stream. One can wait on a stream (that is, wait until all kernels of the stream are finished) with the function $cuda_stream_wait(cudaStream_t), or one can add a kernel instance to the stream with the function $cuda_enqueue_kernel(cudaStream_t, (void ($cuda_kernel_instance_t*, cudaEvent_t))). However, to explain the details of these functions, we first need to learn more about how we emulate CUDA kernels in CIVL.
CUDA Kernels
The Signature of a Kernel
A typical CUDA kernel call has the form:
K<<<gridDim, blockDim, memSize, cudaStream>>>(args);
The four parameters inside the angle brackets are called the execution configuration parameters, or simply the configuration parameters. The parameters gridDim and blockDim are then implicitly available from within the kernel under those names. To support these features, we simply transform the definition of the kernel into a regular function with the signature
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args);
The function call is then transformed to match this new signature.
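Schematically, the call-site transformation moves the configuration parameters into the ordinary argument list (shown here with the parameter names from the signature above):

```cuda
// Original CUDA call site:
K<<<gridDim, blockDim, memSize, cudaStream>>>(args);

// After the transformation:
_cuda_K(gridDim, blockDim, memSize, cudaStream, args);
```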
The Layers of a CUDA Kernel in CIVL-C
The transformation described above certainly allows us to support the use of the configuration parameters. However, we still have to add code which will emulate the true execution of a CUDA kernel with the four configuration parameters given. That means we must spawn the appropriate number of threads, each with the local CUDA parameters blockIdx and threadIdx declared in scope and given appropriate values, and then enqueue the kernel into the given stream, waiting as necessary on other CUDA kernels in the stream. The kernel is transformed to accomplish all of this.
The transformed kernel is composed of several simple layers that we will discuss one at a time here, revealing more information as we go. The first layer handles creating the kernel instance and enqueuing it onto the appropriate stream.
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
...
}
$cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
For clarity, we will refer to _cuda_kernel as the inner kernel of our original kernel K.
$cuda_enqueue_kernel does the following:
1. Creates a new cudaEvent_t called e based on the stream being used (see below for further details).
2. Creates a new $cuda_kernel_instance_t and enqueues it onto the stream.
3. Spawns the inner kernel as a new process, passing in the $cuda_kernel_instance_t created in step 2 and the cudaEvent_t created in step 1 as its parameters.
4. Sets the process field of the $cuda_kernel_instance_t from step 2 to be the process spawned in step 3.
Recall that a CUDA kernel in a non-null stream, call it s, must wait for all other kernels that were enqueued in s or the null stream at the time that the kernel was launched. Additionally, any kernel launched on the null stream must wait for all kernels enqueued in any stream at the time of launch. A cudaEvent_t serves as a structure that is meant to store some set of kernels that we can wait on.
typedef struct _CUevent cudaEvent_t;
struct _CUevent{
$cuda_kernel_instance_t** instances;
int numInstances;
};
Therefore, when we create the cudaEvent_t in step 1, we simply grab the most recent kernel from each stream that we want to wait on and store it in this new event. We then pass this event to the inner kernel so that the inner kernel can wait on these other kernels before actually running itself. This can be seen in the next layer of our transformed kernel K:
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
void _cuda_block(uint3 blockIdx) {
...
}
$cuda_wait_in_queue(_cuda_this, _cuda_event);
$cuda_run_procs(gridDim, _cuda_block);
$cuda_kernel_finish(_cuda_this);
}
$cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
- $cuda_wait_in_queue is the call that does precisely what we just described. It waits on all of the kernel instances held by _cuda_event and then destroys the event. Then it sets the status of _cuda_this to running.
- $cuda_run_procs spawns an instance of _cuda_block(blockIdx) for each blockIdx possible within the dimensions of gridDim and then waits for these processes to complete.
- $cuda_kernel_finish sets the status of _cuda_this to finished. This is what flags to other kernels that may be waiting on this one that we are done.
Now we can investigate what _cuda_block really does by looking at the next layer of our kernel:
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
void _cuda_block(uint3 blockIdx) {
int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
$gcomm gComm = $gcomm_create($here, numThreads);
$gbarrier warpBarriers[numWarps];
$scope _block_root = $here;
for(int i = 0; i < numWarps - 1; i++){
warpBarriers[i] = $gbarrier_create(_block_root, 32);
}
warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
$gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
void _cuda_thread(uint3 threadIdx) {
...
}
$cuda_run_procs(blockDim, _cuda_thread);
$gbarrier_destroy(_cuda_block_barrier);
for(int i = 0; i < numWarps; i++){
$gbarrier_destroy(warpBarriers[i]);
}
$gcomm_destroy(gComm, (void*)0);
}
$cuda_wait_in_queue(_cuda_this, _cuda_event);
$cuda_run_procs(gridDim, _cuda_block);
$cuda_kernel_finish(_cuda_this);
}
$cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
Each _cuda_block that is spawned thus creates its own global barrier and global communicator, each with size equal to the number of threads the block holds. It also creates warp-level barriers (warps are groups of 32 threads within a block). It then uses the $cuda_run_procs function again to spawn an instance of _cuda_thread(threadIdx) for each threadIdx possible within the dimensions of blockDim and waits for their completion. After all threads in the block have finished, it destroys all global barriers with $gbarrier_destroy and destroys the global communicator with $gcomm_destroy.
We can now see the last layer, finally revealing the full transformation of our kernel:
void _cuda_K(dim3 gridDim, dim3 blockDim, size_t _cuda_mem_size, cudaStream_t _cuda_stream, args) {
void _cuda_kernel($cuda_kernel_instance_t* _cuda_this, cudaEvent_t _cuda_event) {
void _cuda_block(uint3 blockIdx) {
int numThreads = (blockDim.x * blockDim.y) * blockDim.z;
int numWarps = (numThreads / 32) + ((numThreads % 32) != 0);
$gcomm gComm = $gcomm_create($here, numThreads);
$gbarrier warpBarriers[numWarps];
$scope _block_root = $here;
for(int i = 0; i < numWarps - 1; i++){
warpBarriers[i] = $gbarrier_create(_block_root, 32);
}
warpBarriers[numWarps - 1] = $gbarrier_create(_block_root, numThreads - ((numWarps - 1) * 32));
$gbarrier _cuda_block_barrier = $gbarrier_create($here, blockDim.x * blockDim.y * blockDim.z);
void _cuda_thread(uint3 threadIdx) {
$local_start();
int _cuda_tid = $cuda_index(blockDim, threadIdx);
int _cuda_kid = $cuda_kernel_index(gridDim, blockDim, blockIdx, threadIdx);
$comm comm = $comm_create($here, gComm, _cuda_tid);
$barrier _cuda_thread_barrier = $barrier_create($here, _cuda_block_barrier, _cuda_tid);
$read_set_push();
$write_set_push();
...Kernel definition of K...
$check_data_race(_cuda_this, _cuda_kid);
$read_set_pop();
$write_set_pop();
$barrier_destroy(_cuda_thread_barrier);
$comm_destroy(comm);
$local_end();
}
$cuda_run_procs(blockDim, _cuda_thread);
$gbarrier_destroy(_cuda_block_barrier);
for(int i = 0; i < numWarps; i++){
$gbarrier_destroy(warpBarriers[i]);
}
$gcomm_destroy(gComm, (void*)0);
}
$cuda_wait_in_queue(_cuda_this, _cuda_event);
$cuda_run_procs(gridDim, _cuda_block);
$cuda_kernel_finish(_cuda_this);
}
$cuda_enqueue_kernel(_cuda_stream, _cuda_kernel);
}
The thread definition begins with $local_start, which treats the entire code section as atomic until a $yield or $local_end; this improves the efficiency of data race checking. Next, the thread's thread ID (_cuda_tid) and kernel ID (_cuda_kid) are calculated using the helper functions $cuda_index and $cuda_kernel_index, respectively. _cuda_tid is the thread's unique position within the block, based on the value of threadIdx, and _cuda_kid is the thread's unique position within the kernel, based on the values of threadIdx and blockIdx.
Each thread defines a $barrier created from the block-level $gbarrier and a $comm created from the block-level $gcomm, positioning itself in each using _cuda_tid. Next, the $read_set_push and $write_set_push calls are important for data race checking: they push new memory sets onto the read and write set stacks held by the kernel. When a call to $check_data_race is made, only the reads and writes that have been monitored in the current read and write sets are compared across threads. Thus, the calls to $read_set_push and $write_set_push create fresh memory sets for the run of the kernel.
Then, the thread runs the actual code found in the original kernel K, with some transformations applied to certain kinds of statements. The most notable of these is that every call to __syncthreads is replaced by a call to $barrier_call(_cuda_thread_barrier), so that we properly emulate block synchronization at those points. After the original kernel body completes, a call to $check_data_race is made to search for any data races detectable from the reads and writes each thread has collected in its memory sets. Next, calls to $read_set_pop and $write_set_pop pop the memory sets we previously pushed off the kernel's stacks. Then, each thread destroys its local barrier with $barrier_destroy(_cuda_thread_barrier) and its local communicator with $comm_destroy(comm). Finally, a call to $local_end is made to end the atomic section, completing the thread.
