I had not profiled the code before in this manner. Thanks for the suggestion.
I’m not sure how to interpret the output, though. With the above code in cub_ccall_test.jl, and additionally
using BenchmarkTools
using Profile, PProf
Profile.clear()
@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)
pprof(from_c=true)
on my Windows machine I get the following Flat% values:
65.94% in cuProfilerStop
10.56% in RtlQueryPerformanceCounter
10.24% in NtGdiDdDDIDestroyAllocation2
 8.06% in [unknown function]
 4.64% in NtGdiDdDDICreateAllocation
On my Linux machine I get:
92.28% in [unknown function]
 7.35% in ioctl
Based on the Linux results, I would assume that [unknown function] refers to cub::DeviceRadixSort::SortPairs itself. That seems inconsistent with the Windows results, though, where [unknown function] accounts for only 8.06% and most of the time is attributed to cuProfilerStop?
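As a cross-check that does not depend on the pprof UI, the same samples could probably also be inspected with the text report from the Profile standard library. The following is only a minimal sketch: `busy_work` is a hypothetical CPU-bound stand-in for the real `sort_pairs!` call, so the numbers it prints say nothing about the GPU case. `C = true` includes C frames (similar in spirit to `from_c=true`), and `sortedby = :overhead` orders the flat listing by the samples a function incurs by itself, which is the closest match to Flat% that I am aware of.

```julia
using Profile

# Hypothetical CPU-bound stand-in for the real GPU call (not from the original code).
function busy_work(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

busy_work(10)                     # warm-up, so compilation is not profiled
Profile.clear()
@profile busy_work(50_000_000)    # collect samples for the workload

# Flat listing including C frames, sorted by self-time ("overhead") samples.
Profile.print(C = true, format = :flat, sortedby = :overhead, mincount = 5)
```

If the frames that dominate this listing differ from what pprof shows, that would at least hint at whether the [unknown function] entries are a symbol-resolution issue rather than real time spent elsewhere.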
In the meantime I was also able to test the code on another Windows machine (Windows 11, RTX 2080, Julia 1.10.4, nvcc 11.8, cl 19.20.27508.1) and found:
sort_pairs, time including ccall:
        19.622600000000002 ms
sort_pairs, time excluding ccall:
        19.670223 ms
i.e. there is no measurable overhead on this machine. So the issue may be specific to my Windows PC rather than to Windows in general. I’ll try to find a third Windows machine for further testing.
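To put a number on "no overhead", the relative difference between the two timings above can be computed as follows. The variable names are mine; the values are the ones printed on the second Windows machine.

```julia
# Timings in milliseconds, copied from the output above.
t_incl = 19.622600000000002   # sort_pairs, time including ccall
t_excl = 19.670223            # sort_pairs, time excluding ccall

# Relative overhead of the ccall path, in percent of the time excluding ccall.
rel = (t_incl - t_excl) / t_excl * 100
println("relative overhead: ", round(rel, digits = 2), " %")
```

This gives about -0.24 %, i.e. the version including the ccall was marginally faster in this run, which I would read as measurement noise rather than a real difference.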