Call libcuda cuLaunchKernel from Julia

I am trying to analyze the overheads of launching a CUDA kernel from Julia and from C++. I see some gaps in the Julia timeline after successive launches of a “saxpy_cuda” kernel onto non-blocking streams, and I am wondering whether they are caused by runtime overheads. To investigate, I have split the code into optimized code generation and launch: the code generation is done with “@cuda launch=false”, followed by a launch through libcuda.so’s cuLaunchKernel API directly. However, I am getting a segmentation fault at runtime. The MWE below illustrates the issue.

Code: sample.jl

using CUDA
using Libdl

# Define a simple CUDA kernel for element-wise addition
function saxpy_cuda(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds C[i] = A[i] + B[i]
    return
end

# Array sizes and initialization
n = 1024
A = CUDA.fill(1.0f0, n)  # Fill A with 1.0
B = CUDA.fill(2.0f0, n)  # Fill B with 2.0
C = CUDA.zeros(Float32, n)  # Allocate C as zeros

# Grid and block dimensions
blocks = 32
threads = 32

gridDimX, gridDimY, gridDimZ = blocks, 1, 1
blockDimX, blockDimY, blockDimZ = threads, 1, 1

# Prepare the kernel without launching
kernel_func = CUDA.@cuda launch=false saxpy_cuda(C, A, B)

# Get the kernel function handle as a Cvoid pointer
c_void_ptr = Base.unsafe_convert(Ptr{Cvoid}, kernel_func.fun.handle)

# Convert pointers for kernel parameters
C_ptr = pointer(C)
A_ptr = pointer(A)
B_ptr = pointer(B)

# Prepare the kernel parameters
kernel_params = Ref((reinterpret(Ptr{Cvoid}, C_ptr),
                     reinterpret(Ptr{Cvoid}, A_ptr),
                     reinterpret(Ptr{Cvoid}, B_ptr)))
kernel_params_ptr = Base.unsafe_convert(Ptr{Ptr{Cvoid}}, kernel_params)

# Load the CUDA driver library
lib = Libdl.dlopen("libcuda.so")

# Launch the kernel using cuLaunchKernel
result = ccall(Libdl.dlsym(lib, :cuLaunchKernel), Int32,
               (Ptr{Cvoid}, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, Ptr{Cvoid}, Ptr{Ptr{Cvoid}}, Ptr{Cvoid}),
               c_void_ptr, UInt32(gridDimX), UInt32(gridDimY), UInt32(gridDimZ),
               UInt32(blockDimX), UInt32(blockDimY), UInt32(blockDimZ), UInt32(0),
               C_NULL, kernel_params_ptr, C_NULL)

# Check for errors
if result != 0
    error("CUDA kernel launch failed with error code $result")
end

# Synchronize the GPU to wait for kernel completion
CUDA.synchronize()

# Transfer data back to the host and verify
C_host = Array(C)
println("Result: ", C_host)
# Verify correctness
expected = Array(A) .+ Array(B)
@assert C_host == expected
println("Test passed!")

Run:

$ julia sample.jl

Output:

[3713566] signal 11 (2): Segmentation fault
in expression starting at /noback/nqx/Ranger/tmp/iris.dev.prof/apps/saxpy/sample.jl:47
memcpy at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7b10ccbd8f34) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccad8de6) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccebbeb3) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb044a4) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb04d67) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb0a1c6) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10cccdfaef) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
top-level scope at /noback/nqx/Ranger/tmp/iris.dev.prof/apps/saxpy/sample.jl:47
jl_toplevel_eval_flex at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1059
jl_toplevel_eval_flex at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1010
ijl_toplevel_eval at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1079
ijl_toplevel_eval_in at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1124
eval at ./boot.jl:461
include_string at ./loading.jl:2846
_include at ./loading.jl:2906

I may be wrong, but I think you’re opening the wrong libcuda.so? If I remember correctly, the one in the artifact is a stub, and you should instead be using the system libcuda.so.1?

The one in .artifacts can be the correct one, when the forwards-compatible driver is being used. That said, you can simply use CUDA.cuLaunchKernel, as we provide wrapper functions for the entire driver API, without having to dlopen/ccall yourself.
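
For reference, here is a rough sketch of such a launch (untested), reusing the variables from your MWE. The key differences from your code: the compiled kernel expects its converted device-side argument types (CuDeviceArray structs rather than raw device pointers, hence cudaconvert), and kernelParams must be an array of pointers to each argument value, kept alive across the call:

using CUDA

kernel = CUDA.@cuda launch=false saxpy_cuda(C, A, B)

# The compiled kernel expects the converted (device-side) argument types,
# e.g. CuDeviceArray structs, not raw device pointers.
args = map(cudaconvert, (C, A, B))
arg_refs = map(Ref, args)  # host boxes whose addresses go into kernelParams

# Address of the data stored in an isbits Ref, as an untyped pointer.
param_ptr(r::Base.RefValue{T}) where {T} =
    convert(Ptr{Cvoid}, Base.unsafe_convert(Ptr{T}, r))

GC.@preserve arg_refs begin
    kernel_params = Ptr{Cvoid}[param_ptr(r) for r in arg_refs]
    CUDA.cuLaunchKernel(kernel.fun,
                        blocks, 1, 1,    # grid dimensions
                        threads, 1, 1,   # block dimensions
                        0,               # dynamic shared memory, in bytes
                        CUDA.stream(),   # current stream
                        kernel_params,   # pointers to each argument value
                        C_NULL)          # extra (unused)
end
synchronize()

Note that the compiled kernel object is also callable, so kernel(C, A, B; threads, blocks) performs this packing and GC rooting for you.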

I haven’t executed your example, but from a quick glance I can spot some issues. For example, you cannot create kernel_params and derive a pointer from it using unsafe_convert without also using GC.@preserve. It’s recommended to rely on ccall to do these conversions automatically. The Julia manual explains this in much more detail: Calling C and Fortran Code · The Julia Language
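
To illustrate the lifetime issue with a plain libc call (assuming a Unix system where time is available): a pointer obtained through unsafe_convert is only valid while the Ref it was derived from is kept alive, whereas passing the Ref to ccall directly handles both the conversion and the rooting:

# Manual conversion: the pointer is only valid while `t` is preserved.
t = Ref{Clong}(0)
GC.@preserve t begin
    p = Base.unsafe_convert(Ptr{Clong}, t)
    ccall(:time, Clong, (Ptr{Clong},), p)
end

# Preferred: pass the Ref itself; ccall converts it to a pointer and
# keeps it rooted for the duration of the call.
ccall(:time, Clong, (Ptr{Clong},), t)
println(t[])  # seconds since the Unix epoch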