Best Practice for Type Declarations in CUDA Kernels

I am writing a number of CUDA kernels for my optimization project. In the rest of my code I use parametric type declarations so that the code can be run with different types. An example of an existing function is below:

function myFunc(a::Vector{Ti}, b::Matrix{Tf}) where {Ti<:Integer, Tf<:AbstractFloat}
#Some code
end

I would like to extend this style to the gpu kernels. I have attempted to define the kernels similar to the following

function myGPUFunc(a::CuArray{Ti, 1}, b::CuArray{Tf, 2}) where {Ti<:Integer, Tf<:AbstractFloat}

This does not work; I get an error saying that no matching method is available. The error indicates the kernel is being called with a CuDeviceVector rather than a CuArray.

I think this is due to the cudaconvert process that happens when I launch the kernel, but I cannot find many details on that function in the docs. Can anyone provide a general recommendation for how to write type-stable GPU kernels? Additionally, what are the differences between the different array types: CuArray{Int, 2}, CuArray{Int, 2, CUDA.DeviceMemory}, and CuDeviceVector?

Thanks

EDIT: It looks like CuArray{Int64, 2} is not valid syntax; I can only see the three-argument version, CuArray{T, N, M}, in the help. The question still stands though.

CuArray represents memory allocated on the GPU in the context of a normal Julia program executing on the host. The third parameter is the underlying buffer type and that is needed for things like unified memory.
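As a rough sketch (assuming a recent CUDA.jl where the buffer types are called CUDA.DeviceMemory and CUDA.UnifiedMemory; names may differ between versions):

using CUDA

a = CuArray{Float32, 1, CUDA.DeviceMemory}(undef, 4)   # ordinary device memory
b = CuArray{Float32, 1, CUDA.UnifiedMemory}(undef, 4)  # unified (managed) memory, accessible from host and device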

Now, when we execute code on the GPU we are in a more limited programming environment, and it is often useful to simplify the types being passed to the GPU. CUDA.jl uses Adapt.jl for this; one of these translations at the device barrier is going from CuArray to CuDeviceArray. This is needed since, as an example, getindex needs to be implemented very differently on the GPU than on the host.
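You can see this conversion directly with cudaconvert (a small sketch; the exact printed types may differ between CUDA.jl versions):

using CUDA

a = CUDA.zeros(Float32, 4)
typeof(a)               # CuArray{Float32, 1, CUDA.DeviceMemory} -- the host-side handle
typeof(cudaconvert(a))  # CuDeviceVector{Float32, 1} -- the device-side view the kernel actually receives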

Now for your original question, that is a matter of style. I personally avoid type annotations that are not needed for dispatch behavior.

If you are interested in the number of dimensions or the element type, you could simply dispatch on AbstractArray{T, N}, but then again eltype and ndims give you the same information.
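For example, a kernel written without any argument annotations still gets fully specialized on the concrete device types at launch time (a minimal sketch; the kernel name and launch configuration are just placeholders):

using CUDA

function add_kernel!(out, a, b)
    # global thread index
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
out = similar(a)
@cuda threads=256 blocks=cld(length(out), 256) add_kernel!(out, a, b)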


CuArray represents memory allocated on the GPU in the context of a normal Julia program executing on the host. The third parameter is the underlying buffer type and that is needed for things like unified memory.

Now, when we execute code on the GPU we are in a more limited programming environment, and it is often useful to simplify the types being passed to the GPU. CUDA.jl uses Adapt.jl for this; one of these translations at the device barrier is going from CuArray to CuDeviceArray. This is needed since, as an example, getindex needs to be implemented very differently on the GPU than on the host.

This makes sense.

Now for your original question, that is a matter of style. I personally avoid type annotations that are not needed for dispatch behavior.

Personally I like to strongly type things to avoid type-instability issues. It is not clear to me when Julia moves from type-stable mode to type-unstable mode. I think that by declaring types I am guaranteed to stay in type-stable mode, but I'm not sure.

Back to my typing question: I could use CuDeviceArray as the input type to the function, however this is not the same type as the variable I will pass into the function when I launch the kernel. This does not sit right with me. I would pass in a CuArray, but the function is typed as a CuDeviceArray.

I know it is possible to modify the conversion settings as part of the kernel launch, but I do not understand how, or to what effect. On a similar note, what happens when I incorporate some constants in the kernel call? When using C++ I would expect these constants to be bundled into the kernel launch and broadcast to the threads. Does this happen using CUDA.jl? Are they converted by Adapt.jl and sent to the GPU with a memory transfer? This is not clear to me. I will look into Adapt.jl though.


Not really. Julia automatically specializes on the concrete argument types at every function call, and adding type annotations to the arguments does not improve inference precision.
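You can check this with @code_warntype: the annotated and unannotated versions infer identically for the same concrete input (a toy example, not from your code):

using InteractiveUtils

f(x) = x + 1          # no annotation
g(x::Int) = x + 1     # annotated; this only restricts which arguments dispatch here

@code_warntype f(1)   # return type inferred as Int
@code_warntype g(1)   # identical inference result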

Pretty much the same happens in Julia. CUDA has a dedicated kernel argument buffer, and CUDA.jl sets that buffer up, including the constants.
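In other words, a plain isbits scalar passed to @cuda ends up in the kernel's argument buffer; there is no separate memory transfer for it (a sketch with made-up names):

using CUDA

# 'alpha' is a plain Float32, passed by value in the argument buffer
function scale_kernel!(out, a, alpha)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = alpha * a[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
out = similar(a)
@cuda threads=256 blocks=4 scale_kernel!(out, a, 2.0f0)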

Now, in C++ you can use templated functions; in Julia you can use type parameters to achieve the same effect if you want to specialize your kernel on some constants.
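If you want the C++-template-like behavior, where the constant is baked into the compiled kernel, you can pass it as a type parameter via Val (again just a sketch with hypothetical names):

using CUDA

# N is a compile-time constant; one kernel is compiled per distinct N,
# much like a C++ non-type template parameter
function fill_kernel!(out, x, ::Val{N}) where {N}
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= N
        @inbounds out[i] = x
    end
    return nothing
end

out = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 fill_kernel!(out, 1.0f0, Val(1024))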