I am trying to analyze the overhead of launching a CUDA kernel from Julia and, for comparison, from C++. After successive launches of a saxpy_cuda kernel onto non-blocking streams, I see gaps in the Julia timeline, and I am wondering whether they are caused by runtime overheads. To investigate, I split the launch path into two steps: optimized code generation using @cuda launch=false, followed by a launch through libcuda.so's cuLaunchKernel API directly. However, this segfaults at run time. The MWE below illustrates the issue.
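For context, the gaps show up in a loop like the following (a minimal sketch; the stream setup and iteration count are illustrative, and saxpy_cuda is the kernel defined in the MWE below):

using CUDA
# Repeated launches onto a non-blocking stream, where I observe the gaps
s = CuStream(; flags=CUDA.STREAM_NON_BLOCKING)
CUDA.stream!(s)  # make the non-blocking stream the task-local default
for _ in 1:1000
    @cuda threads=32 blocks=32 saxpy_cuda(C, A, B)
end
synchronize(s)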
Code: sample.jl
using CUDA
using Libdl
# Define a simple CUDA kernel for element-wise addition
function saxpy_cuda(C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    @inbounds C[i] = A[i] + B[i]
    return
end
# Array sizes and initialization
n = 1024
A = CUDA.fill(1.0f0, n) # Fill A with 1.0
B = CUDA.fill(2.0f0, n) # Fill B with 2.0
C = CUDA.zeros(Float32, n) # Allocate C as zeros
# Grid and block dimensions
blocks = 32
threads = 32
gridDimX, gridDimY, gridDimZ = blocks, 1, 1
blockDimX, blockDimY, blockDimZ = threads, 1, 1
# Prepare the kernel without launching
kernel_func = CUDA.@cuda launch=false saxpy_cuda(C, A, B)
# Get the kernel function handle as a Cvoid pointer
c_void_ptr = Base.unsafe_convert(Ptr{Cvoid}, kernel_func.fun.handle)
# Convert pointers for kernel parameters
C_ptr = pointer(C)
A_ptr = pointer(A)
B_ptr = pointer(B)
# Prepare the kernel parameters
kernel_params = Ref((reinterpret(Ptr{Cvoid}, C_ptr),
                     reinterpret(Ptr{Cvoid}, A_ptr),
                     reinterpret(Ptr{Cvoid}, B_ptr)))
kernel_params_ptr = Base.unsafe_convert(Ptr{Ptr{Cvoid}}, kernel_params)
# Load the CUDA driver library
lib = Libdl.dlopen("libcuda.so")
# Launch the kernel using cuLaunchKernel
result = ccall(Libdl.dlsym(lib, :cuLaunchKernel), Int32,
               (Ptr{Cvoid}, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32,
                UInt32, Ptr{Cvoid}, Ptr{Ptr{Cvoid}}, Ptr{Cvoid}),
               c_void_ptr, UInt32(gridDimX), UInt32(gridDimY), UInt32(gridDimZ),
               UInt32(blockDimX), UInt32(blockDimY), UInt32(blockDimZ), UInt32(0),
               C_NULL, kernel_params_ptr, C_NULL)
# Check for errors
if result != 0
    error("CUDA kernel launch failed with error code $result")
end
# Synchronize the GPU to wait for kernel completion
CUDA.synchronize()
# Transfer data back to the host and verify
C_host = Array(C)
println("Result: ", C_host)
# Verify correctness
expected = Array(A) .+ Array(B)
@assert C_host == expected
println("Test passed!")
Run:
$ julia sample.jl
Output:
[3713566] signal (11.2): Segmentation fault
in expression starting at /noback/nqx/Ranger/tmp/iris.dev.prof/apps/saxpy/sample.jl:47
memcpy at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7b10ccbd8f34) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccad8de6) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccebbeb3) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb044a4) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb04d67) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10ccb0a1c6) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
unknown function (ip: 0x7b10cccdfaef) at /home/nqx/.julia/artifacts/8efb75e3f6d9e0f824050d4a0a524e58ce4e9fe3/lib/libcuda.so
top-level scope at /noback/nqx/Ranger/tmp/iris.dev.prof/apps/saxpy/sample.jl:47
jl_toplevel_eval_flex at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1059
jl_toplevel_eval_flex at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1010
ijl_toplevel_eval at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1079
ijl_toplevel_eval_in at /noback/nqx/FPGA/packages/julia.git.v0/src/toplevel.c:1124
eval at ./boot.jl:461
include_string at ./loading.jl:2846
_include at ./loading.jl:2906
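From reading the cuLaunchKernel documentation, I now suspect my kernelParams is wrong on two counts: each element of kernelParams must point at the host-side value of one kernel parameter (the driver copies the parameter from that address, which would explain the crash inside memcpy when it dereferences my device pointers on the host), and the kernel that @cuda compiles for CuArray arguments takes CUDA.jl's CuDeviceArray structs (what cudaconvert produces), not raw device pointers. Below is an untested sketch of the marshalling I think is required; am I on the right track?

# Convert each CuArray to the isbits CuDeviceArray struct that the
# compiled kernel actually receives as its parameter.
dC = cudaconvert(C)
dA = cudaconvert(A)
dB = cudaconvert(B)
arg_refs = (Ref(dC), Ref(dA), Ref(dB))
GC.@preserve arg_refs begin
    # kernelParams: one pointer per parameter, each pointing at the
    # host copy of that parameter's value, not at device memory.
    param_ptrs = Ptr{Cvoid}[Base.unsafe_convert(Ptr{Cvoid}, r) for r in arg_refs]
    result = ccall(Libdl.dlsym(lib, :cuLaunchKernel), Int32,
                   (Ptr{Cvoid}, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32,
                    UInt32, Ptr{Cvoid}, Ptr{Ptr{Cvoid}}, Ptr{Cvoid}),
                   c_void_ptr, UInt32(gridDimX), UInt32(gridDimY), UInt32(gridDimZ),
                   UInt32(blockDimX), UInt32(blockDimY), UInt32(blockDimZ),
                   UInt32(0), C_NULL, param_ptrs, C_NULL)
end

For reference, launching the precompiled kernel object through CUDA.jl itself, which performs this conversion internally, would be kernel_func(C, A, B; threads=threads, blocks=blocks); the point of the raw ccall is to time the driver-level launch in isolation.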