Hello, I am reading the mapReduce.jl file in CUDA.jl, and there are two things I do not understand.
First, I see that shfl_down_sync is avoided when the element type is anything other than Bool, Int32, Int64, Float32, Float64, ComplexF32, or ComplexF64. Yet the CUDA documentation says:
"With the cuda_fp16.h header included, T can also be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. "
- Can these headers be included in CUDA.jl?
Secondly, the documentation (Programming Guide :: CUDA Toolkit Documentation) suggests __shfl_xor_sync() for reductions. If I understand that correctly, why is shfl_down_sync the better choice in this use case?
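To make the second question concrete, here is a minimal host-side sketch in Python of how I understand the two shuffle patterns. It is not GPU code and not what CUDA.jl does internally. The names `shfl_down`, `shfl_xor`, and `reduce_with` are hypothetical helpers that only model the documented lane indexing for a full 32-lane warp with all lanes active.

```python
WARP_SIZE = 32  # assumption: a full warp with every lane active


def shfl_down(vals, delta):
    """Model of __shfl_down_sync: lane i reads lane i + delta,
    or keeps its own value when that lane lies outside the warp."""
    return [vals[i + delta] if i + delta < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]


def shfl_xor(vals, lane_mask):
    """Model of __shfl_xor_sync: lane i reads lane i XOR lane_mask."""
    return [vals[i ^ lane_mask] for i in range(WARP_SIZE)]


def reduce_with(shuffle, vals):
    """Sum-reduce one warp by halving the shuffle distance each step."""
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        other = shuffle(vals, offset)          # value fetched from the partner lane
        vals = [a + b for a, b in zip(vals, other)]
        offset //= 2
    return vals


lanes = list(range(1, WARP_SIZE + 1))          # lane i holds the value i + 1
down = reduce_with(shfl_down, lanes)
xor = reduce_with(shfl_xor, lanes)

print("down: lane 0 =", down[0], " lane 31 =", down[31])
print("xor:  lane 0 =", xor[0], " lane 31 =", xor[31])
```

In this model the down variant leaves the full sum only in lane 0, while the xor (butterfly) variant leaves it in every lane. Both take the same number of shuffle steps, so I do not see what makes one preferable here.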