CUB wrapper ccall overhead on Windows

Hi,

To get a faster GPU sort + sortperm for a specific use case, I successfully wrapped CUB’s SortPairs method for sorting 64-bit integer keys (sort) and 32-bit integer values (sortperm). However, when I compare the execution time of the CUDA code alone against that of the full ccall, I notice a significant overhead, at least on my Windows machine. This overhead does not appear on my Linux machine. Curiously, the overhead is also absent on Windows when I ccall a simple addition-kernel C wrapper.

Here is the code I’m using:

cub_wrapper.cu:

#ifdef _WIN32
    #define EXPORT_SYMBOL __declspec(dllexport)
#else
    #define EXPORT_SYMBOL __attribute__((visibility("default")))
#endif

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>

// Basically cub::DeviceRadixSort::SortPairs, instantiated for 64-bit integer keys, 32-bit integer values
void sort_pairs_internal(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items)
{
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // Determine temporary device storage requirements (this call performs no actual sorting)
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
    // Allocate necessary temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Run actual sorting operation
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);

    // Free our temporary storage
    cudaFree(d_temp_storage);
}

// C wrapper
extern "C" EXPORT_SYMBOL void sort_pairs(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items) {
    sort_pairs_internal(d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
}

// C wrapper with timer
extern "C" EXPORT_SYMBOL float timed_sort_pairs(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sort_pairs_internal(d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return milliseconds;
}

// Create shared library using
//      nvcc -Xcompiler -fPIC -shared -arch=sm_86 -o cub_wrapper.so cub_wrapper.cu
// (The -Xcompiler -fPIC flag will be ignored by cl.exe on Windows.)

cub_ccall_test.jl:

using Statistics
using Test
using CUDA

const lib = "./cub_wrapper.so"

N::Int32 = 2^24
keys_in = CUDA.rand(Int64, N)
keys_out = similar(keys_in)
values_in = CuArray{Int32}(undef, N)
values_in .= 1:N
values_out = similar(values_in)


function sort_pairs!(keys_in::CuArray{Int64}, keys_out::CuArray{Int64}, values_in::CuArray{Int32}, values_out::CuArray{Int32}, N::Int32)
    ccall((:sort_pairs, lib), Cvoid, 
          (CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
          keys_in, keys_out, values_in, values_out, N
     )
end

function sort_pairs_time!(keys_in::CuArray{Int64}, keys_out::CuArray{Int64}, values_in::CuArray{Int32}, values_out::CuArray{Int32}, N::Int32)
    ccall((:timed_sort_pairs, lib), Cfloat, 
          (CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
          keys_in, keys_out, values_in, values_out, N
     )
end

CUDA.@sync sort_pairs!(keys_in, keys_out, values_in, values_out, N)
@test all(keys_out .== sort(keys_in))
@test all(values_out .== sortperm(keys_in))


function benchmark_sort_pairs_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N)
    times = Array{Float64}(undef, nb_runs)
    for i = 1:nb_runs
        stats = @timed CUDA.@sync sort_pairs!(keys_in, keys_out, values_in, values_out, N)
        times[i] = stats.time
    end
    return median(times)
end
# Could also use BenchmarkTools here, but use same approach as for ..._cutime below, for consistency.

function benchmark_sort_pairs_cutime(nb_runs, keys_in, keys_out, values_in, values_out, N)
    times = Array{Float32}(undef, nb_runs)
    for i = 1:nb_runs
        times[i] = sort_pairs_time!(keys_in, keys_out, values_in, values_out, N)
    end
    return median(times)
end

nb_runs = 100

println("GPU add, time including ccall:")
println("\t$(benchmark_sort_pairs_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N) * 1000) ms")
println()
println("GPU add, time excluding ccall:")
println("\t$(benchmark_sort_pairs_cutime(nb_runs, keys_in, keys_out, values_in, values_out, N)) ms")

with outputs:

Windows (RTX 3070, Windows 10 22H2, Julia 1.10.0+0.x64.w64.mingw32, nvcc 11.8, cl 19.39.33522):

sort_pairs, time including ccall:
        17.63445 ms

sort_pairs, time excluding ccall:
        11.05888 ms

Linux (RTX 3080 Ti, Ubuntu 22.04.4 LTS, Julia 1.10.0+0.x64.linux.gnu, nvcc 11.8, g++ 11.4.0):

sort_pairs, time including ccall:
        7.131116499999999 ms

sort_pairs, time excluding ccall:
        7.1138716 ms

Does anyone know why there is this difference between my Windows and Linux machines concerning the ccall overhead, and how I can get rid of this overhead on the former?

Conversion of a CuArray to a CuPtr actually involves a lot of code (see CUDA.jl/src/memory.jl at e1e5be2b6bf17f03a367cebeb18c4645e593f80d in JuliaGPU/CUDA.jl on GitHub), in order to ensure stream synchronization, memory pool visibility, etc. This has been carefully optimized, but it’s possible that on Windows some of the operations that are cheap on Linux perform badly. I would try to benchmark, or profile, those operations in isolation to find out what’s up.
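
For instance, something along these lines could show how expensive that conversion is on its own. This is only a rough sketch (it assumes BenchmarkTools is loaded; pointer and the cconvert/unsafe_convert pair below are, as far as I understand, what the ccall argument conversion ends up going through):

using CUDA, BenchmarkTools

keys_in = CUDA.rand(Int64, 2^24)

# Time only the CuArray -> device pointer conversion that ccall performs implicitly
# for each argument (stream synchronization, memory pool bookkeeping, ...).
@btime pointer($keys_in)

# The full conversion ccall performs for a CuRef{Int64} argument:
@btime Base.unsafe_convert(CuRef{Int64}, Base.cconvert(CuRef{Int64}, $keys_in))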

Thanks for the response and suggestion!

If I understand correctly, the overhead should then appear anytime you pass CuArrays in ccall, because of the automatic conversion to CuPtrs. But when I try ccalling a simple addition kernel, I observe very little overhead:

cuda_add.cu:

#ifdef _WIN32
    #define EXPORT_SYMBOL __declspec(dllexport)
#else
    #define EXPORT_SYMBOL __attribute__((visibility("default")))
#endif

#include <cstdint>


__global__ void add_kernel(int32_t N, int32_t* a, int32_t* b, int32_t* c)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    while (i < N)
    {
        c[i] = a[i] + b[i];
        i += blockDim.x * gridDim.x;
    }
}


extern "C" EXPORT_SYMBOL void add(int32_t N, int32_t* a, int32_t* b, int32_t* c)
{
    int threads = 1024;
    int blocks = (N + threads - 1) / threads;  // equivalent to ceil(N / threads)
    add_kernel<<<blocks, threads>>>(N, a, b, c);
    cudaDeviceSynchronize();  // Relying on Julia's CUDA.@sync did not work, so we synchronize here instead (which is fine for our test)
}

extern "C" EXPORT_SYMBOL float time_add(int32_t N, int32_t* a, int32_t* b, int32_t* c)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    int threads = 1024;
    int blocks = (N + threads - 1) / threads;
    add_kernel<<<blocks, threads>>>(N, a, b, c);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return milliseconds;
}

// Create shared library using
//      nvcc -Xcompiler -fPIC -shared -o cuda_add.so cuda_add.cu
// (The -Xcompiler -fPIC will be ignored on Windows.)

ccall_cuda_add_test.jl:

using Test
using Statistics
using CUDA

const lib = "./cuda_add.so"


N::Int32 = 2^29
a = CUDA.rand(Int32, N)
b = CUDA.rand(Int32, N)
c = similar(a)


function cu_gpu_add(N::Int32, a::CuArray{Int32}, b::CuArray{Int32}, c::CuArray{Int32})
    ccall((:add, lib), Cvoid, (Int32, CuRef{Int32}, CuRef{Int32}, CuRef{Int32}), N, a, b, c)
    # Includes synchronization in the CUDA code.
end

function cu_gpu_add_cutime(N::Int32, a::CuArray{Int32}, b::CuArray{Int32}, c::CuArray{Int32})
    return ccall((:time_add, lib), Cfloat, (Int32, CuRef{Int32}, CuRef{Int32}, CuRef{Int32}), N, a, b, c)
end


cu_gpu_add(N, a, b, c)  
@test all(c .== a .+ b)

function benchmark_jl_gpu_add_jltime(nb_runs, N, a, b, c)
    times = Array{Float32}(undef, nb_runs)
    for i = 1:nb_runs
        stats = @timed CUDA.@sync c .= a .+ b
        times[i] = stats.time
    end
    return median(times)
end

function benchmark_cu_gpu_add_jltime(nb_runs, N, a, b, c)
    times = Array{Float32}(undef, nb_runs)
    for i = 1:nb_runs
        stats = @timed cu_gpu_add(N, a, b, c)
        times[i] = stats.time
    end
    return median(times)
end

function benchmark_cu_gpu_add_cutime(nb_runs, N, a, b, c)
    times = Array{Float32}(undef, nb_runs)
    for i = 1:nb_runs
        times[i] = cu_gpu_add_cutime(N, a, b, c)
    end
    return median(times)
end


nb_runs = 500

println("GPU add, Julia native time:")
println("\t$(benchmark_jl_gpu_add_jltime(nb_runs, N, a, b, c) * 1000) ms")
println()
println("GPU add, time including ccall:")
println("\t$(benchmark_cu_gpu_add_jltime(nb_runs, N, a, b, c) * 1000) ms")
println()
println("GPU add, time excluding ccall:")
println("\t$(benchmark_cu_gpu_add_cutime(nb_runs, N, a, b, c)) ms")

yielding as output on my Windows machine

GPU add, Julia native time:
        17.371851 ms

GPU add, time including ccall:
        15.970249 ms

GPU add, time excluding ccall:
        15.964304 ms

i.e. the ccall overhead is tiny in this case.

Therefore, I suspect there is something else going on, which is somehow specific to calling more complex code (like CUB).

Have you tried profiling? You can pass C=true to Profile.print to also report samples that ended up in C. Also note that you can use BenchmarkTools.@bprofile, or increase the sampling rate, to ensure you have enough data.
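
Roughly something like this, as a sketch (it assumes the sort_pairs! function and the arrays from your script above are already defined):

using CUDA, Profile, BenchmarkTools

# Optionally increase the sampling rate before collecting data:
# Profile.init(n = 10^7, delay = 1e-4)

Profile.clear()
@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)

# Flat report, including samples that ended up in C code.
Profile.print(C = true, format = :flat, sortedby = :count)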

I had not profiled the code before in this manner. Thanks for the suggestion.

I’m not sure how to interpret the output, though. With the above code in cub_ccall_test.jl, and additionally

using BenchmarkTools
using Profile, PProf

Profile.clear()
@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)
pprof(from_c=true)

on my Windows machine the flat profile (Flat%) shows

65.94% in cuProfilerStop
10.56% in RtlQueryPerformanceCounter
10.24% in NtGdiDdDDIDestroyAllocation2
 8.06% in [unknown function]
 4.64% in NtGdiDdDDICreateAllocation

On my Linux machine I get

92.28% in [unknown function]
 7.35% in ioctl

Based on the Linux results, I would assume that [unknown function] then refers to cub::DeviceRadixSort::SortPairs itself, though that seems incompatible with the Windows results?


In the meantime I was also able to test the code on another Windows machine (Windows 11, RTX 2080, Julia 1.10.4, nvcc 11.8, cl 19.20.27508.1) and found

sort_pairs, time including ccall:
        19.622600000000002 ms

sort_pairs, time excluding ccall:
        19.670223 ms

i.e. there is no overhead on that machine. Possibly the issue is then specific to my Windows PC, rather than to Windows in general. I’ll try to find a third Windows machine for further testing.

I’m not sure why cuProfilerStop is being called here. You’re not accidentally invoking CUDA.@profile somewhere, are you?

I’m definitely not invoking CUDA.@profile on purpose, and fail to see how I would invoke it accidentally. The only piece of code I can think of that gets ‘sneakily’ executed is that from my startup.jl, but that simply contains using OhMyREPL. With julia --startup-file=no I get the same timing discrepancy.

The problem should also not lie with other Julia packages (somehow interfering, although not being used), as in a clean environment with only CUDA and Statistics I encounter the problem.

I’ve also tried to use a different version of the NVIDIA CUDA toolkit (12.4), but still no dice.


Looking at the pprof output in more detail I notice that there is consistently one sample in timed_sort_pairs, though it is not on the code path of

@bprofile CUDA.@sync sort_pairs!($keys_in, $keys_out, $values_in, $values_out, $N)

It’s not a significant time loss or anything, but still strange that it gets called at all…

On a third Windows machine I also cannot replicate the issue:

RTX 4070 Ti Super, Windows 11 pro 23H2, Julia 1.10.4+0.x64.w64.mingw32, nvcc 11.8, cl 19.39.33523:

sort_pairs, time including ccall:
        10.668149999999999 ms

sort_pairs, time excluding ccall:
        10.674544 ms


Most likely there is then something wrong with the configuration of my Windows PC, but a clean reinstall of everything NVIDIA and Julia still did not resolve the issue.

I’m not sure why exactly this solves the problem, but moving the memory allocation out of the CUDA code into Julia gets rid of the overhead.

(updated) cub_wrapper.cu:

#ifdef _WIN32
    #define EXPORT_SYMBOL __declspec(dllexport)
#else
    #define EXPORT_SYMBOL __attribute__((visibility("default")))
#endif

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>

// Basically cub::DeviceRadixSort::SortPairs, instantiated for 64-bit integer keys (int64_t) and 32-bit integer values (int32_t)
void sort_pairs_internal(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items)
{
    void* d_temp_storage = nullptr;
    size_t temp_storage_bytes = 0;

    // Determine temporary device storage requirements (this call performs no actual sorting)
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
    // Allocate necessary temporary storage
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Run actual sorting operation
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);

    // Free our temporary storage
    cudaFree(d_temp_storage);
}

// C wrapper
extern "C" EXPORT_SYMBOL void sort_pairs(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items) {
    sort_pairs_internal(d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
}

// C wrapper with timer
extern "C" EXPORT_SYMBOL float timed_sort_pairs(int64_t* d_keys_in, int64_t* d_keys_out, int32_t* d_values_in, int32_t* d_values_out, int32_t num_items) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    sort_pairs_internal(d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return milliseconds;
}


// Determines the size in bytes of the temporary storage we'll need for the actual SortPairs call.
extern "C" EXPORT_SYMBOL size_t determine_sort_pairs_temp_storage(std::int64_t* d_keys_in, std::int64_t* d_keys_out, std::int32_t* d_values_in, std::int32_t* d_values_out, std::int32_t num_items)
{
    size_t temp_storage_bytes = 0;
    cub::DeviceRadixSort::SortPairs(nullptr, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
    return temp_storage_bytes;
}

// C wrapper for the actual sorting in SortPairs, instantiated for 64-bit integer keys (int64_t) and 32-bit integer values (int32_t)
extern "C" EXPORT_SYMBOL void sort_pairs_external_temp_storage(void* d_temp_storage, size_t temp_storage_bytes, std::int64_t* d_keys_in, std::int64_t* d_keys_out, std::int32_t* d_values_in, std::int32_t* d_values_out, std::int32_t num_items) 
{
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items);
}

// Create shared library using
//      nvcc -Xcompiler -fPIC -shared -arch=sm_86 -o cub_wrapper.so cub_wrapper.cu
// (The -Xcompiler -fPIC will be ignored on Windows.)

(updated) cub_ccall_test.jl:

using Test
using CUDA
using Statistics

const lib = "./cub_wrapper.so"

N::Int32 = 2^24
keys_in = CUDA.rand(Int64, N)
keys_out = similar(keys_in)
values_in = CuArray{Int32}(undef, N)
values_in .= 1:N
values_out = similar(values_in)


function sort_pairs_cuda_malloc!(keys_in::CuArray{Int64}, keys_out::CuArray{Int64}, values_in::CuArray{Int32}, values_out::CuArray{Int32}, N::Int32)
    ccall((:sort_pairs, lib), Cvoid, 
          (CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
          keys_in, keys_out, values_in, values_out, N
     )
end

function sort_pairs_julia_alloc!(keys_in::CuArray{Int64}, keys_out::CuArray{Int64}, values_in::CuArray{Int32}, values_out::CuArray{Int32}, N::Int32)
    temp_storage_bytes = ccall((:determine_sort_pairs_temp_storage, lib), Csize_t, 
        (CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
        keys_in, keys_out, values_in, values_out, N)
    temp_storage = CuArray{UInt8}(undef, temp_storage_bytes)

    ccall((:sort_pairs_external_temp_storage, lib), Cvoid, 
        (Ptr{Nothing}, Csize_t, CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
        reinterpret(Ptr{Nothing}, pointer(temp_storage)), temp_storage_bytes, keys_in, keys_out, values_in, values_out, N)

    CUDA.unsafe_free!(temp_storage)
end

function sort_pairs_cuda_malloc_time!(keys_in::CuArray{Int64}, keys_out::CuArray{Int64}, values_in::CuArray{Int32}, values_out::CuArray{Int32}, N::Int32)
    ccall((:timed_sort_pairs, lib), Cfloat, 
          (CuRef{Int64}, CuRef{Int64}, CuRef{Int32}, CuRef{Int32}, Int32),
          keys_in, keys_out, values_in, values_out, N
     )
end

CUDA.@sync sort_pairs_cuda_malloc!(keys_in, keys_out, values_in, values_out, N)
@test all(keys_out .== sort(keys_in))
@test all(values_out .== sortperm(keys_in))

keys_out .= 0  # Reset to something wrong, just in case the function call below does nothing
values_out .= 0
CUDA.@sync sort_pairs_julia_alloc!(keys_in, keys_out, values_in, values_out, N)
@test all(keys_out .== sort(keys_in))
@test all(values_out .== sortperm(keys_in))


function benchmark_sort_pairs_cuda_malloc_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N)
    times = Array{Float64}(undef, nb_runs)
    for i = 1:nb_runs
        stats = @timed CUDA.@sync sort_pairs_cuda_malloc!(keys_in, keys_out, values_in, values_out, N)
        times[i] = stats.time
    end
    return median(times)
end

function benchmark_sort_pairs_julia_alloc_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N)
    times = Array{Float64}(undef, nb_runs)
    for i = 1:nb_runs
        stats = @timed CUDA.@sync sort_pairs_julia_alloc!(keys_in, keys_out, values_in, values_out, N)
        times[i] = stats.time
    end
    return median(times)
end

function benchmark_sort_pairs_cuda_malloc_cutime(nb_runs, keys_in, keys_out, values_in, values_out, N)
    times = Array{Float32}(undef, nb_runs)
    for i = 1:nb_runs
        times[i] = sort_pairs_cuda_malloc_time!(keys_in, keys_out, values_in, values_out, N)
    end
    return median(times)
end

nb_runs = 500
println("sort_pairs, CUDA malloc, time including ccall:")
println("\t$(benchmark_sort_pairs_cuda_malloc_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N) * 1000) ms")
println()
println("sort_pairs, CUDA malloc, time excluding ccall:")
println("\t$(benchmark_sort_pairs_cuda_malloc_cutime(nb_runs, keys_in, keys_out, values_in, values_out, N)) ms")
println()
println("sort_pairs, Julia alloc, time including ccall:")
println("\t$(benchmark_sort_pairs_julia_alloc_jltime(nb_runs, keys_in, keys_out, values_in, values_out, N) * 1000) ms")

with results

  • Windows 10 22H2, RTX 3070, Julia 1.10.4+0.x64.w64.mingw32, nvcc 12.5, cl 19.39.33522:
sort_pairs, CUDA malloc, time including ccall:
        16.796150000000004 ms

sort_pairs, CUDA malloc, time excluding ccall:
        10.902704 ms

sort_pairs, Julia alloc, time including ccall:
        9.198350000000001 ms
  • Ubuntu 22.04.4 LTS, RTX 3080 Ti, Julia 1.10.0+0.x64.linux.gnu, nvcc 11.8, g++ 11.4.0:
sort_pairs, CUDA malloc, time including ccall:
        11.450226500000001 ms

sort_pairs, CUDA malloc, time excluding ccall:
        11.395584 ms

sort_pairs, Julia alloc, time including ccall:
        9.152368000000001 ms
  • Windows 11 23H2, RTX 2080, Julia 1.10.4+0.x64.w64.mingw32, nvcc 11.8, cl 19.20.27508.1:
sort_pairs, CUDA malloc, time including ccall:
        19.5213 ms

sort_pairs, CUDA malloc, time excluding ccall:
        19.657488 ms

sort_pairs, Julia alloc, time including ccall:
        13.2453 ms
  • Windows 11 pro 23H2, RTX 4070 Ti Super, Julia 1.10.4+0.x64.w64.mingw32, nvcc 11.8, cl 19.39.33523:
sort_pairs, CUDA malloc, time including ccall:
        11.28335 ms

sort_pairs, CUDA malloc, time excluding ccall:
        11.235408 ms

sort_pairs, Julia alloc, time including ccall:
        5.9287 ms

So not only does this resolve the overhead issue, the version that allocates the temporary storage from Julia is also consistently faster than the cudaMalloc-based one, sometimes significantly so!

FWIW, you were using a different allocator: cudaMalloc in the C++ wrapper vs. cuMemAllocAsync (i.e. the stream-ordered allocator) used by CUDA.jl.
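
If you want to get a feel for that difference from the Julia side, here is a rough sketch (timing an allocate-and-free round trip of a sizeable buffer through CUDA.jl; the exact numbers will of course depend on the state of the pool):

using CUDA, BenchmarkTools

# Allocate and free through CUDA.jl's stream-ordered pool; after warm-up this
# typically recycles memory from the pool instead of hitting the driver each time.
@btime CUDA.unsafe_free!(CuArray{UInt8}(undef, 256 * 2^20))

# For comparison, starting Julia with JULIA_CUDA_MEMORY_POOL=none disables the pool,
# so allocations and frees go straight to the driver (much closer to raw cudaMalloc/cudaFree).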

Also keep an eye on CppInterOp.jl (https://github.com/Gnimuc/CppInterOp.jl), a Julia interface to CppInterOp (https://github.com/compiler-research/CppInterOp), a Clang-based C++ interoperability library. Hopefully that makes it possible, in the future, to call C++ code directly from Julia without the need for C shims.


Interesting, thanks!