Changes between Version 16 and Version 17 of Notes_on_CUDA_Semantics


Timestamp:
07/25/22 16:15:15 (4 years ago)
Author:
zgrnhlt
Comment:

Started adding documentation about warp semantics and the shfl_*_sync intrinsics

  • Notes_on_CUDA_Semantics

    v16 v17  
    267267
    268268== Warp Semantics ==
    269 Mask Parameter in shfl_*_sync Functions - https://stackoverflow.com/questions/58833808/insight-into-the-first-argument-mask-in-shfl-sync
     269Warp Level Primitives Blog Post - https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
     270
     271Threads within a block are further divided at the hardware level into groups of 32 threads called "warps". The threads of a warp usually execute in lock-step, so most of the time they execute the same instruction at the same time. This is not guaranteed, however: since the Volta architecture, independent thread scheduling allows the threads of a warp to diverge and make progress separately. A thread's position within its warp is called its "lane", and its `laneID` can be calculated using `threadID % 32` (for a one-dimensional block, `threadIdx.x % warpSize`).
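The lane calculation above can be sketched as a small, self-contained program. This is only an illustration under stated assumptions: the kernel name `laneInfo` and its output arrays are hypothetical, the block is one-dimensional, and error checking of the CUDA calls is omitted for brevity.

{{{
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo kernel: each thread records its warp index and lane index.
__global__ void laneInfo(int *warpIdOut, int *laneIdOut) {
    int tid = threadIdx.x;            // thread index within a 1D block
    warpIdOut[tid] = tid / warpSize;  // which warp of the block the thread is in
    laneIdOut[tid] = tid % warpSize;  // lane within that warp (0..31)
}

int main() {
    const int n = 64;  // one block of 64 threads, i.e. two warps
    int *dWarp, *dLane;
    cudaMalloc(&dWarp, n * sizeof(int));
    cudaMalloc(&dLane, n * sizeof(int));

    laneInfo<<<1, n>>>(dWarp, dLane);

    int hWarp[n], hLane[n];
    cudaMemcpy(hWarp, dWarp, sizeof(hWarp), cudaMemcpyDeviceToHost);
    cudaMemcpy(hLane, dLane, sizeof(hLane), cudaMemcpyDeviceToHost);

    printf("thread 37: warp %d, lane %d\n", hWarp[37], hLane[37]);

    cudaFree(dWarp);
    cudaFree(dLane);
    return 0;
}
}}}

Thread 37 should report warp 1 and lane 5, since 37 / 32 = 1 and 37 % 32 = 5.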
     272
     273=== `__shfl_*_sync` Intrinsics ===
     274Mask Parameter - https://stackoverflow.com/questions/58833808/insight-into-the-first-argument-mask-in-shfl-sync
     275
     276The CUDA `__shfl_*_sync` intrinsics exchange 4 or 8 bytes of data per thread among the threads of a warp without going through shared memory. Each of the four functions requires a specified set of threads in the warp to converge before the function executes. This set is given by the `mask` parameter, an `unsigned int` whose 32 bits each correspond to one lane of a 32-thread warp: if bit `i` is set to 1, the thread with `laneID` `i` must call the intrinsic, with the same `mask` value as the other participating threads. The data to be exchanged is passed as the second argument, `var`. The third parameter differs between the functions, but in each case it determines the source lane from which the calling thread obtains its new value. Finally, all four functions take an optional `width` parameter that divides the 32-thread warp into smaller sub-warps of `width` consecutive lanes, which these intrinsics treat for the most part as isolated warps. If specified, `width` must be a power of two no greater than 32 (1, 2, 4, 8, 16, or 32).
     277{{{
     278T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
     279T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
     280T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
     281T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
     282}}}
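The role of `mask` can be sketched with a case where only part of a warp takes part in the shuffle. The following is a minimal example, not a definitive pattern: the kernel name `shiftLeft` and the host code are hypothetical, a single warp is launched, and error checking is omitted. It builds the mask with `__ballot_sync`, so that exactly the lanes that enter the branch are named in it. A lane may only read from a lane that participates in the call, so the last participating lane keeps its own value.

{{{
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for a single warp: each of the first n lanes fetches
// the value held by the next higher lane.
__global__ void shiftLeft(const int *in, int *out, int n) {
    int lane = threadIdx.x % warpSize;
    // All 32 lanes vote; bit i of mask is 1 exactly when lane i satisfies lane < n.
    unsigned mask = __ballot_sync(0xffffffffu, lane < n);
    if (lane < n) {
        int val = in[lane];
        // Only the lanes named in mask reach this call, all with the same mask.
        int next = __shfl_down_sync(mask, val, 1);
        // Lane n-1 would read from lane n, which is not in mask, so that
        // result is undefined; it keeps its own value instead.
        out[lane] = (lane + 1 < n) ? next : val;
    }
}

int main() {
    const int n = 20;  // lanes 0..19 participate
    int hIn[32], hOut[32] = {0};
    for (int i = 0; i < 32; ++i) hIn[i] = 10 * i;

    int *dIn, *dOut;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    cudaMemcpy(dOut, hOut, sizeof(hOut), cudaMemcpyHostToDevice);

    shiftLeft<<<1, 32>>>(dIn, dOut, n);

    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    printf("out[0] = %d, out[18] = %d, out[19] = %d\n", hOut[0], hOut[18], hOut[19]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
}}}

With `n = 20` the mask is `0x000fffff`, and the program should print `out[0] = 10, out[18] = 190, out[19] = 190`.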
     283* `__shfl_sync` - Each thread included in `mask` converges and returns the value of `var` held by lane `srcLane` of its own sub-warp. If `srcLane` is outside the range `0..width-1`, the source lane is `srcLane % width` instead.
     284* `__shfl_up_sync` - Each thread included in `mask` converges and returns the value of `var` held by the lane `delta` lanes below itself, that is, the lane with ID `laneID - delta`. If that lane would fall below the first lane of the calling thread's sub-warp, i.e. `(laneID % width) - delta < 0`, the thread returns its own `var`.
     285* `__shfl_down_sync` - Each thread included in `mask` converges and returns the value of `var` held by the lane `delta` lanes above itself, that is, the lane with ID `laneID + delta`. If that lane would fall beyond the last lane of the calling thread's sub-warp, i.e. `(laneID % width) + delta >= width`, the thread returns its own `var`.
     286* `__shfl_xor_sync` - Each thread included in `mask` converges and returns the value of `var` held by the source lane whose ID is the bitwise exclusive or of the caller's own `laneID` and the `laneMask` parameter. If the source lane `laneID ^ laneMask` is in the same sub-warp as the calling thread, or in a sub-warp with lower lane IDs within the same warp, the data is exchanged. If the source lane is in a sub-warp with higher lane IDs, or is out of bounds of the current 32-thread warp, the calling thread's own `var` is returned.
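The source-lane rules in the list above can be checked with a short experiment. In this sketch, which is an illustration rather than a reference implementation, every lane shuffles its own lane ID, so each returned value names the lane it was read from. The kernel name `shuffleDemo`, the output layout, and the choice of `width = 8` with arguments 10, 2, 2, and 8 are all hypothetical; error checking is omitted.

{{{
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical demo kernel (one warp): every lane shuffles its own lane ID,
// so each result directly names the lane the value was read from.
__global__ void shuffleDemo(int *out) {
    const unsigned full = 0xffffffffu;  // all 32 lanes participate
    const int width = 8;                // four sub-warps of 8 lanes each
    int lane = threadIdx.x % warpSize;
    int var = lane;
    out[0 * 32 + lane] = __shfl_sync(full, var, 10, width);      // srcLane 10 wraps to 10 % 8 = 2
    out[1 * 32 + lane] = __shfl_up_sync(full, var, 2, width);    // read from 2 lanes below
    out[2 * 32 + lane] = __shfl_down_sync(full, var, 2, width);  // read from 2 lanes above
    out[3 * 32 + lane] = __shfl_xor_sync(full, var, 8, width);   // lane ^ 8 lies in a neighbouring sub-warp
}

int main() {
    int h[4 * 32];
    int *d;
    cudaMalloc(&d, sizeof(h));

    shuffleDemo<<<1, 32>>>(d);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    const int lanes[] = {5, 9, 13, 14};
    for (int lane : lanes)
        printf("lane %2d: shfl %2d  up %2d  down %2d  xor %2d\n",
               lane, h[lane], h[32 + lane], h[64 + lane], h[96 + lane]);

    cudaFree(d);
    return 0;
}
}}}

Following the rules above, lane 5 should report 2, 3, 7, 5 (its xor source, lane 13, is in a higher sub-warp, so it keeps its own value); lane 9 should report 10, 9, 11, 1 (shuffling up would leave its sub-warp); lane 13 should report 10, 11, 15, 5; and lane 14 should report 10, 12, 14, 6 (shuffling down would leave its sub-warp).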
     287
     288 
     289