Changes between Version 17 and Version 18 of Notes_on_CUDA_Semantics

Timestamp: 07/25/22 16:16:44
Author: zgrnhlt
Comment: --

Notes_on_CUDA_Semantics

v17	v18
274	274	Mask Parameter - https://stackoverflow.com/questions/58833808/insight-into-the-first-argument-mask-in-shfl-sync
275	275
276		The Cuda ~~shfl_*_sync~~ intrinsics allow for an exchange of either 4 or 8 bytes of data among threads in a warp without requiring the use of shared memory. Each of these 4 functions requires a set of specified threads within a warp to converge before the function is executed. This set is determined by the `mask` parameter, which is an `unsigned int`, with each of its 32 bits corresponding to a single thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread in the warp with matching `laneID` is required to call the collective with the same `mask` parameter as the other threads. Each of the intrinsics also requires the data that is to be exchanged to be passed in as an argument, which is called `var`. The third parameters of the functions are unique, but all are used to identify a source lane for the calling thread to obtain its new data from. Finally, all 4 functions have an optional `width` parameter that allows the user to divide 32 thread warps into even smaller sub-warps (which are for the most part treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be 2, 4, 8, 16, or 32.
	276	The CUDA `__shfl_*_sync` intrinsics exchange either 4 or 8 bytes of data among the threads of a warp without using shared memory. Each of the four functions requires a specified set of threads within the warp to converge before the function executes. This set is determined by the `mask` parameter, an `unsigned int` in which each of the 32 bits corresponds to one thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread whose `laneID` matches that bit position is required to call the collective with the same `mask` value as the other participating threads. Each intrinsic also takes the data to be exchanged as an argument, called `var`. The third parameter differs from function to function, but in every case it identifies the source lane from which the calling thread obtains its new data. Finally, all four functions have an optional `width` parameter that lets the user divide a 32-thread warp into smaller sub-warps (which are, for the most part, treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be 2, 4, 8, 16, or 32.
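The relationship between `mask`, `laneID`, and `width` described above can be sketched on the host, without a GPU. The following is a minimal model in plain C++, not the intrinsic itself: the helper names `lane_in_mask`, `lane_range_mask`, and `shfl_source_lane` are hypothetical and exist only for illustration, a warp size of 32 is assumed, and the wrap-around of an out-of-range `srcLane` (taken modulo `width`) reflects the behavior documented for `__shfl_sync` as understood here.

```cpp
#include <cstdint>

// Assumption: warps have 32 lanes, as the text above states.
constexpr int kWarpSize = 32;

// Hypothetical helper: true if bit `lane` of `mask` is set, i.e. the thread
// with that laneID is one of the threads named by the mask.
constexpr bool lane_in_mask(std::uint32_t mask, int lane) {
    return ((mask >> lane) & 1u) != 0u;
}

// Hypothetical helper: build a mask naming `count` consecutive lanes
// starting at lane `first` (0 <= first < 32, first + count <= 32).
constexpr std::uint32_t lane_range_mask(int first, int count) {
    return (count >= kWarpSize ? 0xffffffffu : ((1u << count) - 1u)) << first;
}

// Hypothetical model of which lane a call such as
//   __shfl_sync(mask, var, src_lane, width)
// reads from when issued by `lane`. The warp is split into segments of
// `width` lanes; the source is looked up inside the caller's own segment,
// and an out-of-range src_lane is assumed to wrap modulo `width`.
constexpr int shfl_source_lane(int lane, int src_lane, int width = kWarpSize) {
    const int segment_start = (lane / width) * width;
    return segment_start + (src_lane % width);
}

// Compile-time usage examples.
static_assert(lane_range_mask(8, 8) == 0x0000ff00u, "lanes 8..15");
static_assert(lane_in_mask(0x0000ff00u, 8), "lane 8 is named by the mask");
static_assert(!lane_in_mask(0x0000ff00u, 7), "lane 7 is not named");
static_assert(shfl_source_lane(13, 3) == 3, "full warp: read lane 3");
static_assert(shfl_source_lane(13, 0, 8) == 8, "width 8: read segment start");
```

With the default `width` of 32, every lane that passes `srcLane = 3` reads from lane 3. With `width = 8`, lane 13 belongs to the sub-warp covering lanes 8 to 15, so `srcLane = 0` resolves to lane 8 in this model.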
277	277	{{{
278	278	T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);