Broadcasting a function on GPU


I am looking for some advice on finding the most efficient way to parallelize my code on GPU. Let’s assume the simplified situation:

using CUDA

struct MyOtherStruct1 end
struct MyOtherStruct2 end

struct MyStructType1 end
struct MyStructType2 end

struct MyStruct{T}
    type  :: T
    data1 :: CuArray{Float64}
    data2 :: Vector{MyOtherStruct1}
end

function foo(arg1::MyStruct{MyStructType1}, arg2::Float64, arg3::MyOtherStruct2)
    # do something specific for type 1
end

function foo(arg1::MyStruct{MyStructType2}, arg2::Float64, arg3::MyOtherStruct2)
    # do something specific for type 2
end

s1 = MyOtherStruct1()
s2 = MyOtherStruct2()
# Pass instances (MyStructType1()), not the types themselves, so that foo dispatches on T
v1 = [MyStruct(MyStructType1(), CUDA.zeros(Float64, 100), [s1, s1]),
      MyStruct(MyStructType2(), CUDA.zeros(Float64, 100), [s1, s1]),
      MyStruct(MyStructType1(), CUDA.zeros(Float64, 100), [s1, s1])]
v2 = CUDA.rand(Float64, 3)

Now I want to broadcast the function foo over the vectors v1 and v2, with s2 treated as a scalar:

Base.broadcastable(s::MyOtherStruct2) = Ref(s)
foo.(v1, v2, s2)

Since v1 and v2 have the largest dimension in the whole problem, I would like to parallelize over them on the GPU. Can I achieve this without writing a custom kernel? For now I assume that the broadcast runs on the CPU. My understanding is that broadcasting runs on the GPU automatically when all the arguments passed to the function are CuArrays, but I have no idea how to convert v1 and s2 to a GPU-friendly form. Also, I have multiple definitions of foo that work well with CuArrays, but rewriting them for CuDeviceArray so that I could call them from a kernel would be a real pain: many operations inside them are fairly high-level, and I expect that at least some of them would not work on CuDeviceArrays. Nevertheless, I am looking for performance here, so if sticking to CuArrays would be much less efficient, I will switch to kernels.

So that’s it. Any help or advice is much appreciated.

You shouldn’t need to use CuDeviceArray directly. Instead, make your structures parametric, and write an Adapt.jl rule (adapt_structure) that converts the contained CuArrays to CuDeviceArrays automatically. You’ll still have to convert the CPU vectors to GPU ones, though; that never happens automatically.

Thank you for the answer.
I have never used Adapt.jl before, so I started searching and found this tutorial. My understanding is that the Adapt.jl rule there handles the conversion from CuArray to CuDeviceArray (inside the Interpolate struct) when the kernel is executed, am I right?

Also, if I want to change the type of the data2 field in MyStruct from Vector{MyOtherStruct1} to CuArray{MyOtherStruct1}, do I have to write an Adapt.jl rule for MyOtherStruct1 as well, to ensure it is a bits type that CuArray will accept?

That’s right. Adapt.jl just provides a way to convert the leaf nodes of complex structures. For example, if you invoke a kernel with a MyStruct{CuArray} argument, you’d write a rule that reconstructs the MyStruct while forwarding the conversion of the CuArray field, which results in a GPU-compatible MyStruct{CuDeviceArray}.
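A rule of the kind described above might look roughly like the following. This is only a sketch: the extra type parameters A and B are an assumption about how MyStruct could be made parametric, not the original definition.

```julia
using Adapt

struct MyStructType1 end

# Parametric in the array types (A and B are assumptions for this sketch), so the
# same struct can hold CuArrays on the host and CuDeviceArrays inside a kernel.
struct MyStruct{T,A,B}
    type  :: T
    data1 :: A
    data2 :: B
end

# Rebuild the struct, forwarding the conversion to each array field.
# The `type` field is passed through unchanged.
Adapt.adapt_structure(to, s::MyStruct) =
    MyStruct(s.type, adapt(to, s.data1), adapt(to, s.data2))
```

With CUDA.jl loaded, arguments passed to @cuda go through cudaconvert, which relies on Adapt, so a MyStruct holding CuArrays should reach the kernel with CuDeviceArray fields once this rule is defined.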

No, you have to do that conversion yourself. You could use Adapt for that. But CuArray doesn’t convert arguments using cudaconvert/Adapt, that would be unsafe because of what I’ve described above. So you need to make sure MyOtherStruct1 is a bits type.
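Whether an element type is a bits type can be checked with Base's isbitstype before trying to put it in a CuArray. The two structs below are hypothetical examples, not types from the question:

```julia
# Hypothetical element types, for illustration only.
struct PlainElem        # immutable, only plain-data fields
    a :: Float64
    b :: Int32
end

struct HeapElem         # holds a reference to a heap-allocated Vector
    v :: Vector{Float64}
end

struct EmptyElem end    # field-less immutable struct, like MyOtherStruct1 in the question

ok_plain = isbitstype(PlainElem)   # true: can be stored inline in a GPU array
ok_heap  = isbitstype(HeapElem)    # false: contains a pointer to CPU memory
ok_empty = isbitstype(EmptyElem)   # true
```

A field-less struct such as MyOtherStruct1 is therefore already a bits type; a conversion only becomes necessary once it gains fields like CPU arrays.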