Functions of GPUArrays in ODEs and scheduling on GPU

I have a totally noob question here.

I know that there are GPUArrays, for which broadcasting and various linear algebra operations are overloaded so that they get executed on the GPU.

Suppose that I define a function which works with GPUArrays and maybe with some constants. When this function is compiled, does it get compiled for the GPU as a whole, so that it can be executed on the GPU without data exchange with the CPU? (Provided that I transferred the data to the GPU beforehand.)

If I plug such a function into DifferentialEquations.jl as the right-hand side of an ODE and integrate it with, for example, a 4th-order Runge-Kutta method, will the whole Runge-Kutta scheme together with the right-hand side be compiled for the GPU, so that it can be executed as a whole without data exchange with the CPU?

Yes. Try it. And if we need to make this more clear in the documentation, let me know where.
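For example, something along these lines (a minimal sketch with a made-up in-place linear right-hand side, just for illustration) keeps all array data on the GPU throughout the integration, with only scalar bookkeeping left on the CPU:

using OrdinaryDiffEq, CUDA, LinearAlgebra

# Hypothetical linear ODE du/dt = A*u, with all arrays resident on the GPU.
A  = CuArray(randn(Float32, 100, 100))
u0 = CuArray(randn(Float32, 100))

f!(du, u, p, t) = mul!(du, A, u)      # in-place RHS; mul! dispatches to CUBLAS
prob = ODEProblem(f!, u0, (0f0, 1f0))
sol = solve(prob, RK4())              # the solver's array operations run on the GPU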

Cool.

Can you explain how it actually works?
I understand how dispatch works when you call, for example, a matrix-vector product function which is overloaded with a GPU implementation.
But how does Julia figure out that it needs to compile the whole function for the GPU when it compiles a generic function applied to GPUArrays?

It just does this recursively.

I am not sure, but it seems that such a feat requires the compiler to

  • determine that there are GPUArrays present,
  • determine the part of the syntax tree which should be compiled for the GPU,
  • transform that part of the syntax tree into something suitable for the GPU,
  • actually compile it.

Basically, it seems to require transformations of the parsed syntax tree of the code. For me, this is on the level of wizardry.

When you use a GPUArrays implementation such as CuArray, various array operations are defined that ensure the operation dispatches onto the GPU. Necessary conversions and kernel compilations happen automatically as part of the support code in GPUArrays. Many semi-generic kernels are also available within GPUArrays for common operations; these are used by GPUArrays and CUDA.jl to implement efficient GPU variants of well-known operations from Base, the stdlibs, and external packages that would otherwise only run on the CPU.
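As a minimal illustration of that dispatch (just a sketch, not specific to this thread):

using CUDA

x = CUDA.rand(1024)           # data lives on the GPU
y = 2f0 .* sin.(x) .+ 1f0     # the fused broadcast compiles to and runs as a single GPU kernel

Because y is again a CuArray, subsequent operations stay on the GPU as well.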

Of course, not everything works seamlessly with the CuArray; pretty much only the dispatches defined in GPUArrays and CUDA (and also other packages) will work, and many other dispatches will not (often resulting in an error). It’s not magic, it’s just liberal use of multiple dispatch and a few hand-rolled routines. The most magical thing in the stack is probably GPUCompiler.jl, which compiles a given kernel (basically, a function with a given set of argument types) down to native GPU assembly by leveraging LLVM and Julia’s compiler.
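For a sense of what GPUCompiler.jl does under the hood, here is a minimal hand-written kernel (an illustrative example, not taken from this thread); the plain Julia function is compiled down to native GPU code the first time it is launched:

using CUDA

# A plain Julia function written against the CUDA programming model;
# GPUCompiler.jl compiles it to PTX on first launch.
function axpy_kernel!(y, a, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] += a * x[i]
    end
    return nothing
end

x = CUDA.rand(1024); y = CUDA.rand(1024)
@cuda threads=256 blocks=cld(length(y), 256) axpy_kernel!(y, 2f0, x)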

Also, regarding doing AST transformations: CUDA.jl and GPUCompiler.jl mostly do not do AST transforms (although CUDA and GPUArrays do have custom broadcasting implementations which do transform a broadcast tree to a certain depth, such as replacing Base.sin with CUDA.sin automatically), but KernelAbstractions.jl (a package for writing generic kernels that run on the CPU, CUDA GPUs, and soon AMD and Intel GPUs) does use Cassette.jl to do transformations that are akin to AST transforms.
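For completeness, a KernelAbstractions.jl kernel looks roughly like this (a sketch; the launch API has changed between versions, so treat the launch lines as illustrative):

using KernelAbstractions

# The kernel body is backend-agnostic; the device is chosen when the kernel is instantiated.
@kernel function scale!(A, s)
    i = @index(Global)
    @inbounds A[i] *= s
end

A = rand(Float32, 1024)
kernel! = scale!(CPU(), 64)                  # swap CPU() for a GPU device to run on the GPU
event = kernel!(A, 2f0, ndrange=length(A))   # older versions return an event...
wait(event)                                  # ...which must be waited on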

Will it work if I have a callable struct which wraps CuArrays (some of these CuArrays are constant precomputed matrices and vectors, and some of them are used as caches for the computation)?

It doesn’t currently seem to work (see below), but generally we do now cudaconvert the kernel function in an attempt to convert closures to something GPU-compatible (see https://github.com/JuliaGPU/CUDA.jl/pull/625). I’d personally say it’s probably something worth supporting if possible.

julia> using CUDA

julia> struct MyStruct
       x
       end

julia> m = MyStruct(CuArray(rand(1)))
MyStruct([0.6895091553888575])

julia> function (m::MyStruct)(y)
       m.x[1] = y
       nothing
       end

julia> @cuda m(3.0)
ERROR: MethodError: no method matching cufunction(::MyStruct, ::Type{Tuple{Float64}})
Closest candidates are:
  cufunction(::F, ::TT; name, kwargs...) where {F<:Function, TT<:Type} at /home/jpsamaroo/.julia/packages/CUDA/UeDEJ/src/compiler/execution.jl:285
Stacktrace:
 [1] top-level scope
   @ ~/.julia/packages/CUDA/UeDEJ/src/compiler/execution.jl:101

@maleadt is this something that should work?

I don’t think we can get this to work generally, because the typevars won’t necessarily match the fields as we currently assume they do: https://github.com/JuliaGPU/Adapt.jl/blob/c7dad7faed4cf95aaab2ff986cbcb90cecf7f453/src/base.jl#L19-L20
Here the struct isn’t parameterized, so it wouldn’t be GPU-compatible anyway.


I suppose users can write adapters for their custom types.

Is there a guide on how to do this?

The Adapt.jl README should get you started. You can find a bunch of adapters in there, or in packages like CUDA.jl.
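For reference, a minimal adapter (for a hypothetical struct, following the pattern from the Adapt.jl README) might look like this; note that the struct has to be parameterized over its field types so it can hold the device-side array types after conversion:

using Adapt, CUDA

# Hypothetical callable wrapper: the type parameters let it hold CuArrays on the host
# and the corresponding CuDeviceArrays once converted for use inside a kernel.
struct MyOp{A, C}
    coeffs::A   # precomputed constants
    cache::C    # scratch space for the computation
end

# Tell Adapt how to rebuild the struct from its adapted (converted) fields.
Adapt.adapt_structure(to, op::MyOp) = MyOp(adapt(to, op.coeffs), adapt(to, op.cache))

op = MyOp(CuArray(rand(Float32, 4)), CuArray(zeros(Float32, 4)))
cudaconvert(op)   # now returns a GPU-compatible MyOp holding device arrays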