Calling a CUDA kernel in an external library

Is it possible to call a precompiled CUDA kernel from Julia?

I am writing CUDA kernels in C++ and compiling with nvcc, for use in a C++ application. I use Julia to test the code. Currently I write host-side wrapping functions that I call from Julia using ccall. I would like to avoid the wrapping functions and call the kernels directly from Julia.

Is this possible?


If you can compile to a PTX file (nvcc -ptx), you can load the file with CuModule, look up the kernel, and launch it with cudacall. So pretty low level. It should also be possible to extract the PTX code from the fatbin, but we don’t have any functionality for that.
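To make that concrete, here is a rough sketch of what the load-and-launch could look like with CUDA.jl's low-level API, assuming the kernel from the example below was compiled with nvcc -ptx and declared extern "C" __global__ so its symbol name is not mangled (untested here, needs a GPU):

```julia
using CUDA

# Load the PTX module produced by `nvcc -ptx example_kernel.cu`
md = CuModuleFile("example_kernel.ptx")

# Look up the kernel by its (unmangled) symbol name
fun = CuFunction(md, "example_kernel")

# Device buffer matching the kernel's int* argument
data = CuArray(Cint[1, 2, 3, 0])

# cudacall converts the arguments and launches the kernel
cudacall(fun, Tuple{CuPtr{Cint}}, pointer(data); threads=1, blocks=1)

synchronize()
Array(data)  # data[4] should now hold 1 + 2 * 3 = 7
```

If the kernel is not declared extern "C", you would need to look it up by its C++-mangled name instead (e.g. via nm or cuobjdump on the PTX).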

Thanks, I might try that. Do you know if the low-level code is generated using Nvidia's ptxas in that case, or with some other technique?

I am also considering using cudaLaunchKernel(…) from the CUDA runtime, which might allow me to use an already compiled kernel. Do you see any pitfalls with this?

That won’t work, it expects a certain layout of the binary in order to look up the compiled function. The Julia binary doesn’t have that, and we don’t emit our code like that. Unless you want to call cudaLaunchKernel from a pre-compiled binary, but then you’re in the same boat (having to ccall an external library).

Hmm, I’m confused.

A small example:

__global__
void example_kernel(int *data)
{
    data[3] = data[0] + data[1] * data[2];
}

extern "C"
void example_wrapper(int *data)
{
    int *data_cuda;
    size_t size = 4 * sizeof(int);
    cudaMalloc(&data_cuda, size);
    cudaMemcpy(data_cuda, data, size, cudaMemcpyHostToDevice);
    example_kernel<<<1,1>>>(data_cuda);
    cudaDeviceSynchronize();
    cudaMemcpy(data, data_cuda, size, cudaMemcpyDeviceToHost);
    cudaFree(data_cuda);
}

I compile this using nvcc -Xcompiler -fPIC --shared example_kernel.cu -o example_kernel.so
And then in Julia:

julia> data=Cint[1,2,3,0]
4-element Array{Int32,1}:
 1
 2
 3
 0

julia> ccall((:example_wrapper,"./example_kernel.so"),Cvoid,(Ptr{Cint},),pointer(data))

julia> data
4-element Array{Int32,1}:
 1
 2
 3
 7

The shared object example_kernel.so should now contain the kernel code in the format that the CUDA runtime expects. So I would guess that cudaLaunchKernel(...) should be able to launch it, if I’m able to invoke it from within Julia.

Am I wrong?

With what arguments? cudaLaunchKernel takes a function pointer, which is resolved within the executing application, and AFAIK that depends on the executable having specific symbols and state set up.

Fair point, I don’t know how to get that function pointer. Maybe I can create a single C function that does it for me. Will investigate and come back. Thanks for the feedback.
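For reference, one shape such a helper could take (a hedged sketch, not a tested solution; the file name example_launch.cu and the helper names are made up): within a translation unit compiled by nvcc, the host-side symbol of a __global__ function is a valid handle for cudaLaunchKernel, so a single extern "C" function can expose it or launch it generically:

```cpp
// example_launch.cu -- compile with:
//   nvcc -Xcompiler -fPIC --shared example_launch.cu -o example_launch.so
#include <cuda_runtime.h>

__global__ void example_kernel(int *data)
{
    data[3] = data[0] + data[1] * data[2];
}

extern "C" {

// Hand the host-side kernel handle back to the caller. This pointer is
// only meaningful inside nvcc-produced binaries, which embed the fatbin
// registration code the runtime needs to resolve it.
void *example_kernel_ptr(void)
{
    return (void *)example_kernel;
}

// Minimal launcher: ccall this from Julia with a device pointer.
cudaError_t example_launch(int *data_cuda)
{
    void *args[] = { &data_cuda };
    return cudaLaunchKernel((void *)example_kernel,
                            dim3(1), dim3(1), args, 0, nullptr);
}

}
```

This still means ccall-ing into an nvcc-compiled library, so as noted above it does not really escape the wrapper; it just shrinks the wrapper down to one generic entry point per kernel.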