CUDAdrv cannot find __host__ __device__ functions

The example vadd.jl in CUDAdrv/examples fails if one changes the qualifier from __global__ to __host__ __device__ in the kernel's signature:

extern "C" {
__host__ __device__ void kernel_vadd(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}
}
with error message:

ERROR: LoadError: CUDA error: named symbol not found (code #500, ERROR_NOT_FOUND)

Why can’t I call a __host__ __device__ function through CUDAdrv, when I can call it from main() in C/C++? Is there a way around this problem?

FYI, I am trying to wrap the CUB library in Julia. They have many interesting routines like RadixSort. The roadblock right now is that their interface functions are mostly __host__ __device__, which I can’t call without writing another __global__ function.

__device__ functions aren’t kernels; they are functions you can call on the device, from within a kernel, but not launch directly from the host. That’s fine for CUB, which contains reusable functionality (often templated, at that). So I don’t think it makes much sense to wrap CUB, unless it also contains reusable entry-point functions/kernels.
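To illustrate the distinction, here's a minimal sketch (the function names `add` and `kernel_vadd` are just for illustration): only the __global__ symbol is a launchable entry point that the driver API's cuModuleGetFunction (and thus CUDAdrv's cufunction) can look up.

```cuda
// A __device__ function: callable from device code, not an entry point.
__device__ float add(float x, float y)
{
    return x + y;
}

// A __global__ function (kernel): the entry point you launch from the host,
// and the kind of symbol cufunction can resolve in a module.
extern "C" __global__ void kernel_vadd(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = add(a[i], b[i]);  // calling the __device__ function from the kernel
}
```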

If you really want to wrap CUB, you’ll need to create kernels that call into CUB’s device-side functions, much like the BlockSortKernel example on their home page.
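For reference, such a wrapper kernel looks roughly like the sketch below, adapted from CUB's BlockSortKernel example (details hedged from memory; each block of 128 threads sorts 512 ints, 4 per thread, using the device-side cub::BlockRadixSort primitive):

```cuda
#include <cub/cub.cuh>

extern "C" __global__ void BlockSortKernel(int *d_in, int *d_out)
{
    // Specialize CUB's collective primitives for 128 threads owning 4 ints each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    typedef cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    typedef cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE> BlockStore;

    // Shared memory workspace, reused across the three collectives
    __shared__ union {
        typename BlockRadixSort::TempStorage sort;
        typename BlockLoad::TempStorage      load;
        typename BlockStore::TempStorage     store;
    } temp_storage;

    int block_offset = blockIdx.x * (128 * 4);
    int thread_keys[4];

    BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
    __syncthreads();

    BlockRadixSort(temp_storage.sort).Sort(thread_keys);
    __syncthreads();

    BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);
}
```

The extern "C" __global__ wrapper is what you would then load and launch through cufunction.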


Thank you for the clarification. I am very new to CUDA, and after some reading I realize my question was quite uninformed: cufunction wraps a function that is to be launched with <<<...>>>, or called from another kernel. So there is no point in wrapping a host function here.

On the other hand, CUB does provide some entry points: SortPairs and friends, in the device-wide function category. Those functions adapt to a wide range of devices through predefined policy classes. I guess they put a lot of effort into tuning the block size, batch size, etc., so I would be happy to just reuse those. Is there a way we can use those functions from Julia?

Yes, but since they all look templated, you’ll have to go through the extra build step of creating some instances of the templates, after which you can call these functions using cufunction. To avoid a package build step, you could theoretically also use Cxx.jl, or roll your own solution by calling out to nvcc from within Julia. Either way, there’s always some complexity involved.

Also note that CUB’s device functions are in principle usable through CUDAnative and Cxx.jl; see e.g. GitHub - JuliaGPU/CUDAnativelib.jl as a very rough proof of concept.

After some experimenting I managed to call SortPairs and friends from Julia. It is surprisingly simple; one has to:

  1. Provide __host__ wrappers that call the __host__ __device__ functions. Export them with extern "C".
  2. Compile the wrapper code to a shared library, rather than to .ptx/cubin/fatbin.
  3. Call the exported functions from Julia with ccall rather than cudacall. The functions typically expect device memory pointers as parameters. You can allocate device memory with d_x = Mem.alloc(n) and pass in d_x.ptr. I am not sure this is the intended way of using CUDAdrv, but it worked for me.
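A wrapper along those lines might look like the following sketch (function and file names are made up; error handling kept minimal). It wraps cub::DeviceRadixSort::SortPairs, which uses CUB's two-phase calling convention: a first call with a NULL workspace pointer only queries the required temporary storage size.

```cuda
// sortpairs_wrapper.cu
// compile with e.g.:  nvcc -shared -Xcompiler -fPIC -o libsortpairs.so sortpairs_wrapper.cu
#include <cub/cub.cuh>

// Plain __host__ entry point with C linkage, so Julia's ccall can find it by
// name. All array arguments are device pointers (e.g. from Mem.alloc in CUDAdrv).
extern "C" int sort_pairs_f32_i32(const float *d_keys_in, float *d_keys_out,
                                  const int *d_vals_in, int *d_vals_out,
                                  int num_items)
{
    // Phase 1: query the temporary storage size
    void  *d_temp    = NULL;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);

    // Phase 2: the actual sort
    cudaError_t err = cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                                      d_keys_in, d_keys_out,
                                                      d_vals_in, d_vals_out,
                                                      num_items);
    cudaFree(d_temp);
    return (int)err;
}
```

On the Julia side this would then be reachable with an ordinary ccall into libsortpairs, passing the .ptr fields of the Mem.alloc buffers.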

I ran a few tests. It seems that SortKeys is two orders of magnitude faster than Base.sort! for 1e8 Float32s (GTX 1080 vs. i7-6800K). Really impressive for sorting.

After skimming through more CUB code I totally agree with you. Just wrapping a few device functions certainly misses the most “juicy” parts of the library. As the library is being developed by NVIDIA, we may well expect it to grow into something like cuBLAS/cuFFT, so I won’t bother writing a full-fledged wrapper now.

Oh yeah, instantiating __host__ functions and using ccall is an option too, pretty similar to instantiating __global__ functions and calling them with cufunction. One advantage of the latter is that you can more easily ship PTX code, whereas shipping shared libraries is very platform-dependent.