Changes between Version 19 and Version 20 of Notes_on_CUDA_Semantics


Timestamp:
07/27/22 10:22:27 (4 years ago)
Author:
zgrnhlt
Comment:

Added link about incidental thread divergence within a warp

  • Notes_on_CUDA_Semantics

Warp Level Primitives Blog Post - https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Incidental Thread Divergence and `__activemask()` - https://stackoverflow.com/questions/54055195/activemask-vs-ballot-sync/54055576#54055576

=== `__shfl_*_sync` Intrinsics ===
The CUDA `__shfl_*_sync` intrinsics allow an exchange of either 4 or 8 bytes of data among the threads in a warp without the use of shared memory. Each of these 4 functions requires a specified set of threads within a warp to converge before the function is executed. This set is determined by the `mask` parameter, an `unsigned int` whose 32 bits each correspond to a single thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread in the warp with the matching `laneID` must call the collective with the same `mask` parameter as the other threads. Each intrinsic also takes the data to be exchanged as an argument named `var`. The third parameter differs among the 4 functions, but in each case it is used to identify the source lane from which the calling thread obtains its new data. Finally, all 4 functions have an optional `width` parameter that lets the user divide a 32-thread warp into smaller sub-warps (which are, for the most part, treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be 2, 4, 8, 16, or 32.
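As a concrete illustration, a common use of these intrinsics is a warp-level sum reduction via `__shfl_down_sync`. The sketch below assumes a single full warp (all 32 mask bits set, so every lane participates) and a hypothetical kernel name; it is not code from this page:

```cuda
// Minimal sketch (hypothetical kernel name): sum 32 floats held by one warp.
__global__ void warpSum(const float *in, float *out) {
    float val = in[threadIdx.x];  // each lane loads one element (var argument)

    // All 32 lanes participate, so every bit of the mask is set.
    // Each iteration, lane i adds the value held by lane i + offset.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);

    if (threadIdx.x == 0)  // after log2(32) = 5 steps, lane 0 holds the sum
        *out = val;
}
```

Note that because the mask names all 32 lanes, every thread in the warp must reach each `__shfl_down_sync` call with the same `mask`; launching this with fewer than 32 active threads would be undefined.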