Changes between Version 19 and Version 20 of Notes_on_CUDA_Semantics


Timestamp:
07/27/22 10:22:27 (4 years ago)
Author:
zgrnhlt
Comment:

Added link about incidental thread divergence within a warp

  • Notes_on_CUDA_Semantics

Warp Level Primitives Blog Post - https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Incidental Thread Divergence and `__activemask()` - https://stackoverflow.com/questions/54055195/activemask-vs-ballot-sync/54055576#54055576

=== `__shfl_*_sync` Intrinsics ===
The CUDA `__shfl_*_sync` intrinsics allow an exchange of either 4 or 8 bytes of data among the threads in a warp without the use of shared memory. Each of these 4 functions requires a specified set of threads within a warp to converge before the function is executed. This set is determined by the `mask` parameter, an `unsigned int` whose 32 bits each correspond to a single thread in a warp of 32 threads. If a bit in the mask is set to 1, the thread in the warp with the matching `laneID` must call the collective with the same `mask` parameter as the other threads. Each intrinsic also takes the data to be exchanged as an argument named `var`. The third parameter differs among the 4 functions, but in each case it is used to identify the source lane from which the calling thread obtains its new data. Finally, all 4 functions have an optional `width` parameter that lets the user divide a 32-thread warp into smaller sub-warps (which are, for the most part, treated as isolated warps for the purposes of these intrinsics). The `width`, if specified, must be 2, 4, 8, 16, or 32.
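As a concrete illustration, a common use of these intrinsics is a warp-level sum reduction via `__shfl_down_sync`. The sketch below assumes a single full warp (all 32 mask bits set, so every lane participates) and a hypothetical kernel name; it is not code from this page:

```cuda
// Minimal sketch (hypothetical kernel name): sum 32 floats held by one warp.
__global__ void warpSum(const float *in, float *out) {
    float val = in[threadIdx.x];  // each lane loads one element (var argument)

    // All 32 lanes participate, so every bit of the mask is set.
    // Each iteration, lane i adds the value held by lane i + offset.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);

    if (threadIdx.x == 0)  // after log2(32) = 5 steps, lane 0 holds the sum
        *out = val;
}
```

Note that because the mask names all 32 lanes, every thread in the warp must reach each `__shfl_down_sync` call with the same `mask`; launching this with fewer than 32 active threads would be undefined.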