I know that there are GPUArrays, for which broadcasting and various linear algebra operations are overloaded so that they execute on the GPU.
Suppose that I define a function which works with GPUArrays and maybe with some constants. When this function is compiled, does it get compiled for the GPU as a whole, so that it can be executed on the GPU without data exchange with the CPU? (Provided that I transferred the data to the GPU beforehand.)
If I plug such a function into DifferentialEquations.jl as the right-hand side of an ODE and integrate it with, for example, a 4th-order Runge-Kutta method, will the whole Runge-Kutta scheme together with the right-hand side be compiled for the GPU, so that it can be executed as a whole without data exchange with the CPU?
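For concreteness, here is a rough sketch of the kind of setup I have in mind (the matrix, function names, and sizes are just placeholders):

using CUDA, DifferentialEquations, LinearAlgebra

A  = cu(rand(Float32, 100, 100))   # constant matrix, already on the GPU
u0 = cu(rand(Float32, 100))        # initial condition on the GPU

# right-hand side built only from operations overloaded for CuArrays
function f!(du, u, p, t)
    mul!(du, A, u)                 # GPU matrix-vector product
    du .*= -1f0                    # GPU broadcast
    return nothing
end

prob = ODEProblem(f!, u0, (0f0, 1f0))
sol  = solve(prob, RK4())          # does the whole RK4 loop stay on the GPU?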
Can you explain how it actually works?
I understand how dispatch works when you call, for example, a matrix-vector product function that is overloaded with a GPU implementation.
But how does Julia understand that it needs to compile the whole function for the GPU when it compiles a generic function that uses GPUArrays?
When you use a GPUArrays implementation such as CuArray, various array operations are defined that ensure the operation dispatches onto the GPU. Necessary conversions and kernel compilations happen automatically as part of the support code in GPUArrays. Many semi-generic kernels for common operations are also available within GPUArrays; GPUArrays and CUDA.jl use them to implement efficient GPU variants of well-known CPU-only operations from Base, the stdlibs, and external packages.
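For example (just a minimal sketch, with arbitrary arrays and operations), ordinary-looking array code picks the GPU methods purely based on the argument types:

using CUDA, LinearAlgebra

A = CUDA.rand(4, 4)        # CuArray{Float32, 2}
x = CUDA.rand(4)

y = A * x                  # dispatches to the CUBLAS-backed method for CuArrays
z = sin.(x) .+ 2f0 .* y    # the fused broadcast is compiled into a single GPU kernel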
Of course, not everything works seamlessly with CuArray; pretty much only the dispatches defined in GPUArrays and CUDA (and in other packages) will work, and many other dispatches will not (often resulting in an error). It’s not magic, just liberal use of multiple dispatch and a few hand-rolled routines. The most magical thing in the stack is probably GPUCompiler.jl, which compiles a given kernel (basically, a function with a given set of argument types) down to native GPU assembly by leveraging LLVM and Julia’s compiler.
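As a rough illustration of the level GPUCompiler.jl operates at (the kernel and launch configuration here are made up for the example): a plain Julia function plus a set of concrete argument types gets compiled down to GPU code and launched.

using CUDA

function axpy_kernel!(y, a, x)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

x = CUDA.rand(1024)
y = CUDA.rand(1024)
@cuda threads=256 blocks=4 axpy_kernel!(y, 2f0, x)

# To inspect the native code generated for this (function, argument types) pair:
# @device_code_ptx @cuda threads=256 blocks=4 axpy_kernel!(y, 2f0, x)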
Also, regarding AST transformations: CUDA.jl and GPUCompiler.jl mostly do not do AST transforms (although CUDA and GPUArrays do have custom broadcasting implementations which transform a broadcast tree to a certain depth, e.g. replacing Base.sin with CUDA.sin automatically), but KernelAbstractions.jl (a package for writing generic kernels that run on the CPU, CUDA GPUs, and soon AMD and Intel GPUs) does use Cassette.jl to do transformations that are akin to AST transforms.
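A minimal KernelAbstractions.jl sketch (the kernel is made up, and the launch API has changed between versions, so treat the commented launch line as approximate):

using KernelAbstractions

@kernel function ka_saxpy!(y, a, x)
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

# The same kernel definition can be instantiated for different backends, e.g.
#   ka_saxpy!(CPU())(y, a, x; ndrange = length(y))
# on the CPU, or with a CUDA backend to run on CuArrays.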
Will it work if I have a callable struct which wraps CuArrays (some of these CuArrays are constant precomputed matrices and vectors, and some of them are used as caches for the computation)?
It doesn’t currently seem to work (see below), but generally we do now cudaconvert the kernel function in an attempt to convert closures to something GPU-compatible (see https://github.com/JuliaGPU/CUDA.jl/pull/625). I’d personally say it’s probably something worth supporting if possible. A possible workaround via Adapt.jl is sketched after the error output below.
julia> using CUDA
julia> struct MyStruct
           x
       end
julia> m = MyStruct(CuArray(rand(1)))
MyStruct([0.6895091553888575])
julia> function (m::MyStruct)(y)
           m.x[1] = y
           nothing
       end
julia> @cuda m(3.0)
ERROR: MethodError: no method matching cufunction(::MyStruct, ::Type{Tuple{Float64}})
Closest candidates are:
cufunction(::F, ::TT; name, kwargs...) where {F<:Function, TT<:Type} at /home/jpsamaroo/.julia/packages/CUDA/UeDEJ/src/compiler/execution.jl:285
Stacktrace:
[1] top-level scope
@ ~/.julia/packages/CUDA/UeDEJ/src/compiler/execution.jl:101
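As a possible workaround (a sketch only, with made-up names; it assumes an Adapt.jl rule and passes the wrapper as a kernel argument instead of calling it as the kernel itself):

using CUDA, Adapt

struct MyWrapper{T}
    x::T
end
Adapt.@adapt_structure MyWrapper   # adapt each field (CuArray -> device array) on conversion

# plain kernel function; the wrapper arrives as an argument, so it gets cudaconvert-ed
function set_first!(m, y)
    m.x[1] = y
    return nothing
end

m = MyWrapper(CuArray(rand(1)))
@cuda set_first!(m, 3.0)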