CUB wrapper ccall overhead on Windows

eldee · June 13, 2024, 2:57pm

I had not profiled the code before in this manner. Thanks for the suggestion.

I’m not sure how to interpret the output, though. With the above code in cub_ccall_test.jl, and additionally

using BenchmarkTools
using Profile, PProf

Profile.clear()
@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)
pprof(from_c=true)

on my Windows machine I find the following Flat% values:

65.94% in cuProfilerStop
10.56% in RtlQueryPerformanceCounter
10.24% in NtGdiDdDDIDestroyAllocation2
 8.06% in [unknown function]
 4.64% in NtGdiDdDDICreateAllocation

On my Linux machine I get

92.28% in [unknown function]
 7.35% in ioctl

Based on the Linux results, I would assume that [unknown function] then refers to cub::DeviceRadixSort::SortPairs itself, though that seems incompatible with the Windows results?

In the meantime I was also able to test the code on another Windows machine (Windows 11, RTX 2080, Julia 1.10.4, nvcc 11.8, cl 19.20.27508.1) and found

sort_pairs, time including ccall:
        19.622600000000002 ms

sort_pairs, time excluding ccall:
        19.670223 ms

i.e. there is now no measurable overhead. So the issue may be specific to my Windows PC in this context, rather than to Windows in general. I’ll try to find a third Windows machine for further testing.
