CUB wrapper ccall overhead on Windows

I had not profiled the code this way before. Thanks for the suggestion.

I’m not sure how to interpret the output, though. With the above code in cub_ccall_test.jl, and additionally

using BenchmarkTools
using Profile, PProf

Profile.clear()
@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)
pprof(from_c=true)

on my Windows machine I find the following Flat% values:

65.94% in cuProfilerStop
10.56% in RtlQueryPerformanceCounter
10.24% in NtGdiDdDDIDestroyAllocation2
 8.06% in [unknown function]
 4.64% in NtGdiDdDDICreateAllocation

On my Linux machine I get

92.28% in [unknown function]
 7.35% in ioctl

Based on the Linux results, I would assume that [unknown function] refers to cub::DeviceRadixSort::SortPairs itself. That seems inconsistent with the Windows results, though, where [unknown function] accounts for only 8.06% and most of the time is attributed to cuProfilerStop.


In the meantime I was also able to test the code on another Windows machine (Windows 11, RTX 2080, Julia 1.10.4, nvcc 11.8, cl 19.20.27508.1) and found

sort_pairs, time including ccall:
        19.622600000000002 ms

sort_pairs, time excluding ccall:
        19.670223 ms

i.e. there is no measurable overhead on this machine. So the issue may be specific to my own Windows PC in this context, rather than to Windows in general. I’ll try to find a third Windows machine for further testing.
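As a sanity check on any machine, it may help to measure what a bare ccall costs on its own, independent of CUB. The following is only a minimal sketch: the helper name mean_ccall_time is made up for illustration, and it calls the C standard library function clock (the same function the Julia manual uses as its ccall example) rather than anything from the wrapper in this thread. If the time per call comes out tiny compared to the sort timings, the extra time seen on the first Windows machine is probably spent inside the called library or the driver, not in the ccall mechanism itself.

```julia
# Hypothetical helper (not part of the wrapper discussed above):
# average wall-clock time per call of a trivial C function.
function mean_ccall_time(n::Integer)
    t = @elapsed for _ in 1:n
        ccall(:clock, Int32, ())   # trivial libc call, result discarded
    end
    return t / n                   # seconds per call
end

mean_ccall_time(10)                        # first call triggers compilation
per_call = mean_ccall_time(1_000_000)      # timed run on compiled code
println("mean time per trivial ccall: ", per_call * 1e9, " ns")
```

This measures the loop and the cost of clock itself along with the call, so it is an upper bound on the ccall overhead rather than an exact figure.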