Changes between Version 17 and Version 18 of Notes_on_CUDA_Semantics


Timestamp:
07/25/22 16:16:44 (4 years ago)
Author:
zgrnhlt
Comment:

--

  • Notes_on_CUDA_Semantics

    v17 v18  
    274 274 Mask Parameter - https://stackoverflow.com/questions/58833808/insight-into-the-first-argument-mask-in-shfl-sync
    275 275
    276 The Cuda shfl_*_sync intrinsics allow for an exchange of either 4 or 8 bytes of data among threads in a warp without requiring the use of shared memory. Each of these 4 functions requires a set of specified threads within a warp to converge before the function is executed. This set is determined by the `mask` parameter, which is an `unsigned int`, with each of its 32 bits corresponding to a single thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread in the warp with matching `laneID` is required to call the collective with the same `mask` parameter as the other threads. Each of the intrinsics also requires the data that is to be exchanged to be passed in as an argument, which is called `var`. The third parameters of the functions are unique, but all are used to identify a source lane for the calling thread to obtain its new data from. Finally, all 4 functions have an optional `width` parameter that allows the user to divide 32 thread warps into even smaller sub-warps (which are for the most part treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be 2, 4, 8, 16, or 32.
     276 The CUDA `__shfl_*_sync` intrinsics exchange either 4 or 8 bytes of data among the threads of a warp without using shared memory. Each of these 4 functions requires a specified set of threads within the warp to converge before the function executes. This set is determined by the `mask` parameter, an `unsigned int` in which each of the 32 bits corresponds to one thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread whose `laneID` matches that bit position is required to call the collective with the same `mask` as the other participating threads. Each intrinsic also takes the data to be exchanged as an argument, called `var`. The third parameter differs between the functions, but in every case it identifies the source lane from which the calling thread obtains its new data. Finally, all 4 functions have an optional `width` parameter that lets the user divide a 32-thread warp into smaller sub-warps (which are for the most part treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be a power of two no greater than 32, such as 2, 4, 8, 16, or 32.
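The roles of `mask`, `var`, the source-lane argument, and `width` described above can be illustrated with a minimal sketch. The kernel name `warp_demo`, the host helper `run_warp_demo`, the constant `FULL_MASK`, and the output buffers are hypothetical names chosen for this example; it assumes a launch of exactly one full warp (32 threads), so the full mask is valid for every lane. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Assumption: all 32 lanes participate, so every bit of the mask is set.
constexpr unsigned FULL_MASK = 0xffffffffu;

// Hypothetical kernel; expects to be launched with one block of 32 threads.
__global__ void warp_demo(int *broadcast, int *segment_head, int *sum_out) {
    int lane = threadIdx.x % warpSize;  // lane ID within the warp
    int var = lane;                     // the data each lane contributes

    // Broadcast: every lane reads the value of var held by lane 3.
    broadcast[lane] = __shfl_sync(FULL_MASK, var, 3);

    // width = 8 splits the warp into four sub-warps of 8 lanes each;
    // srcLane 0 then refers to the first lane of each sub-warp.
    segment_head[lane] = __shfl_sync(FULL_MASK, var, 0, 8);

    // Tree reduction with __shfl_down_sync: lane 0 ends up with the
    // sum of var over all 32 lanes.
    int sum = var;
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        sum += __shfl_down_sync(FULL_MASK, sum, offset);
    if (lane == 0) *sum_out = sum;
}

// Hypothetical host helper: runs the kernel on one warp, copies the two
// 32-element result arrays back, and returns the warp-wide sum.
int run_warp_demo(int *broadcast_host, int *segment_head_host) {
    int *d_broadcast, *d_segment, *d_sum;
    int sum = -1;
    cudaMalloc(&d_broadcast, 32 * sizeof(int));
    cudaMalloc(&d_segment, 32 * sizeof(int));
    cudaMalloc(&d_sum, sizeof(int));

    warp_demo<<<1, 32>>>(d_broadcast, d_segment, d_sum);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(broadcast_host, d_broadcast, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(segment_head_host, d_segment, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_broadcast);
    cudaFree(d_segment);
    cudaFree(d_sum);
    return sum;
}
```

Calling `run_warp_demo` from a `main` function with two 32-element `int` arrays should fill the first with 3 in every slot, fill the second with 0, 8, 16, or 24 depending on the lane's sub-warp, and return 496 (the sum 0 + 1 + ... + 31). If only some lanes were to call the intrinsics, the mask would need to name exactly those lanes instead of using the full mask.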
    277 277 {{{
    278 278 T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);