Hi,
I am looking for some advice on finding the most efficient way to parallelize my code on the GPU. Let’s assume this simplified situation:
using CUDA

struct MyOtherStruct1 end
struct MyOtherStruct2 end
struct MyStructType1 end
struct MyStructType2 end

struct MyStruct{T}
    type::T
    data1::CuArray{Float64}
    data2::Vector{MyOtherStruct1}
end
function foo(arg1::MyStruct{MyStructType1}, arg2::Float64, arg3::MyOtherStruct2)
    # do something specific for type 1
end

function foo(arg1::MyStruct{MyStructType2}, arg2::Float64, arg3::MyOtherStruct2)
    # do something specific for type 2
end
s1 = MyOtherStruct1()
s2 = MyOtherStruct2()

# note: instances (MyStructType1()), not the types themselves, so that T
# matches the MyStruct{MyStructType1} / MyStruct{MyStructType2} signatures of foo
v1 = [MyStruct(MyStructType1(), CUDA.zeros(Float64, 100), [s1, s1]),
      MyStruct(MyStructType2(), CUDA.zeros(Float64, 100), [s1, s1]),
      MyStruct(MyStructType1(), CUDA.zeros(Float64, 100), [s1, s1])]

v2 = CUDA.rand(Float64, 3)
Now I want to broadcast the function foo over v1 and v2, treating s2 as a scalar:
Base.broadcastable(s::MyOtherStruct2) = Ref(s)
foo.(v1, v2, s2)
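To spell out the semantics I’m after: s2 is reused for every element, so the broadcast should behave like this plain loop (v2_host is just a hypothetical CPU copy for illustration, since scalar indexing into a CuArray is disallowed by default):

# Element-wise meaning of foo.(v1, v2, s2): foo dispatches per element of v1,
# and s2 is passed unchanged to every call.
v2_host = Array(v2)
result = [foo(v1[i], v2_host[i], s2) for i in eachindex(v1)]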
Since the length of v1 and v2 is the largest dimension in the whole problem, I would like to parallelize over them on the GPU. Can I achieve this without writing a custom kernel? For now I assume the broadcast above is executed on the CPU. My understanding is that broadcasting runs automatically on the GPU when all the arguments passed to the function are CuArrays (see the sketch below), but I have no idea how to convert v1 and s2 to a GPU-friendly form.

I also have multiple definitions of foo that work well with CuArrays, but rewriting them for CuDeviceArray so I could call them from a kernel would be a real pain, as many operations inside them are fairly high-level and I expect that at least some of them would not work on CuDeviceArrays. Nevertheless, I am looking for performance here, so if sticking to CuArrays turns out to be much less efficient, I will switch to kernels.
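For example, if I understand correctly, something like the following runs entirely on the GPU because every container taking part in the broadcast is a CuArray (a, b, c are throwaway names for this sketch):

# All array arguments are CuArrays, so CUDA.jl fuses the whole expression
# into a single kernel that executes on the device.
a = CUDA.rand(Float64, 100)
b = CUDA.rand(Float64, 100)
c = a .+ 2.0 .* b   # no custom kernel needed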
So that’s it. Any help or advice much appreciated.