I would like to pass a pointer to a CuArray (or one of its sections) as a parameter to a DifferentialEquations right-hand-side function which is defined as a GPU kernel function.
function func(u, p, t)
    tt, ff = p
    f = linterp(t, tt, ff) # some linear interpolation
    return f
end
t = range(0.0, 10.0, length=101)
f = cos.(t)
tspan = (t[1], t[end])
u0 = 0.0
p = (t, f)
prob = ODEProblem(func, u0, tspan, p)
Since the array f in the parameters tuple p is passed by reference, I can modify it at any later stage and func will see the change without the need to remake prob. The same works if f is a multidimensional array and I pass one of its 1D sections to p using the @view macro.
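For illustration, a minimal sketch of this update pattern (it assumes using DifferentialEquations and the same user-defined linterp helper as above; Tsit5 is just an example solver):
using DifferentialEquations
sol1 = solve(prob, Tsit5())
f .= sin.(t)                # in-place update of the array referenced by p
sol2 = solve(prob, Tsit5()) # func already sees the new f, no remake needed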
Now I would like to solve this kind of equation on the GPU with f being a CuArray. In this situation func will be compiled into a GPU kernel function (though I do not know how it is actually done in DifferentialEquations, I am trying to write my own implementation where it happens). The first problem is that the CuArray f is not isbits and therefore cannot be accessed inside func. I can solve this issue by converting f into a StaticArray. However, in this case I lose the advantage of changing f without remaking prob.
Thus, I need some kind of pointer to the CuArray f (or to one of its sections if it has multiple dimensions). As far as I understand, in the CUDA package this functionality is somehow realized through the CuDeviceArray type. Therefore, I guess, I have to manually convert my CuArray into a CuDeviceArray. How can I do it? Or can you please suggest any other solution? Thank you.
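To make the isbits constraint concrete, a small sketch (the StaticArrays route is shown only to illustrate why it loses the reference):
import CUDA
using StaticArrays
f_gpu = CUDA.CuArray(ones(Float32, 8))
isbits(f_gpu)                   # false: a CuArray is a host-side handle
f_sa = SVector{8}(Array(f_gpu)) # isbits, but a frozen copy: later updates of f_gpu are invisible through it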
Thank you for the suggestion. It seems KernelAbstractions allows me to do what I want:
import CUDA
using KernelAbstractions
CUDA.allowscalar(false)
@kernel function func(u, p)
    f, = p
    I = @index(Global)
    u[I] = f[I]
end
N = 2048
f = ones(Float32, N)
u_gpu = CUDA.zeros(N)
f_gpu = CUDA.CuArray(f)
p = (f_gpu, )
kernel = func(CUDADevice(), 32)
# launch kernel with original f ------------------------------------------------
event = kernel(u_gpu, p; ndrange=size(u_gpu))
wait(event)
u = CUDA.collect(u_gpu)
@show isequal(u, f) # -> isequal(u, f) = true
# change f outside of kernel and launch the kernel once more -------------------
@. f_gpu = 2 * f_gpu
ev = Event(CUDADevice()) # needed for synchronization
event = kernel(u_gpu, p; ndrange=size(u_gpu), dependencies=ev)
wait(event)
u = CUDA.collect(u_gpu)
@show isequal(u, 2 .* f) # -> isequal(u, 2 .* f) = true
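The same approach appears to work for a 1d section of a multidimensional CuArray; a sketch of what I mean (same kernel and sizes as above, untested beyond this pattern):
F_gpu = CUDA.CuArray(ones(Float32, (4, N)))
p2 = (view(F_gpu, 2, :), ) # 1d section; the launch machinery converts the wrapper for the device
event = kernel(u_gpu, p2; ndrange=size(u_gpu))
wait(event)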
As far as I understand, somewhere under the hood KernelAbstractions converts f into a CuDeviceArray. I want to understand how I can do it by myself. If someone can help me with that, e.g. give a link to the proper line in the sources, I would very much appreciate it (of course, I will try to find it by myself, though I failed after a brief look). A direct conversion, f_cda = CUDA.CuDeviceArray(f_gpu.dims, CUDA.DevicePtr(pointer(f_gpu))), does not work: I obtain a ReadOnlyMemoryError() when I try to access f_cda.
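For reference, a sketch of the conversion I was after, using CUDA.cudaconvert as documented in CUDA.jl (a CuDeviceArray wraps a raw device pointer, so touching it from host code faults, which would explain the ReadOnlyMemoryError above):
import CUDA
f_gpu = CUDA.CuArray(ones(Float32, 8))
f_dev = CUDA.cudaconvert(f_gpu) # CuDeviceArray sharing the same device buffer
isbits(f_dev)                   # true: it can live in a kernel parameter tuple
# f_dev[1] on the host would fault; it is valid only inside device code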
It seems I found the solution.
In my original problem, I have an iterative process where on each step I update a function f(x,t) defined on a fixed grid (x,t). Also, on each step I have a differential equation du/dt = f(x,t) where u = u(t), that is, x plays the role of a parameter. In terms of DifferentialEquations this equation can be considered as an ensemble problem where I solve multiple differential equations independently for each point x. In this case, for each equation i in the ensemble I have to pass a slice of the array f, that is f[i,:], as the parameter p. The solution is performed on the GPU, which imposes a limitation on the way I pass f[i,:] into the kernel function. After reading the sources of DiffEqGPU.jl and the CUDA.jl manual, I came to a solution which can be illustrated by the following code:
import CUDA
struct Integrator{P}
    p :: P
end
function func(u::CUDA.CuArray, integs)
    N = length(u)
    function get_config(kernel)
        fun = kernel.fun
        config = CUDA.launch_configuration(fun) # occupancy-based launch configuration
        blocks = cld(N, config.threads)
        return (threads=config.threads, blocks=blocks)
    end
    CUDA.@cuda config=get_config kernel(u, integs)
    return nothing
end
function kernel(u, integs)
    N1, N2 = size(u)
    id = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for i=id:stride:N1 # grid-stride loop over the ensemble members
        p = integs[i].p # parameter tuple of the i-th integrator
        f, = p
        for j=1:N2
            u[i, j] = f[j]
        end
    end
    return nothing
end
println()
N1 = 100
N2 = 200
f = rand(Float32, (N1, N2))
f_gpu = CUDA.CuArray(f)
u_gpu = CUDA.zeros((N1, N2))
integs = Array{Integrator}(undef, N1)
for i=1:N1
    fi = CUDA.cudaconvert(@view f_gpu[i, :]) # !!! Converting the array view
    pi = (fi, )
    integs[i] = Integrator(pi)
end
integs = CUDA.CuArray(hcat([integs[i] for i in 1:N1])) # !!! Preparing the array of integrators.
func(u_gpu, integs)
@show isequal(CUDA.collect(u_gpu), CUDA.collect(f_gpu)) # -> true
@. f_gpu = f_gpu * 2 # change f outside of the kernel
func(u_gpu, integs)
@show isequal(CUDA.collect(u_gpu), CUDA.collect(f_gpu)) # -> true
The solution allows me to modify the driving function f without remaking the problem. This is realized through the array view, which has to be converted by the CUDA.cudaconvert function (otherwise, after passing through the parameter tuple, it cannot be accessed inside the GPU kernel since it is not isbits). Another tricky point is how to pass the resulting array of integrators (structures which contain the parameter tuples for each of the equations) into the GPU kernel. The solution was found in DiffEqGPU.jl (https://github.com/SciML/DiffEqGPU.jl/blob/67f4878f4c64e5df75106f382feb3caf4d7d96d2/src/DiffEqGPU.jl#L169). For some reason, you cannot simply convert the array of structs into a CuArray. First you need to stack the structs together using the hcat function, and only after that can the resulting 2D array (with the length along the second dimension equal to 1) be used in the GPU kernel.
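If I understand the sources correctly, the key point is the element type rather than hcat itself: the host array was created as Array{Integrator}(undef, N1), whose abstract eltype cannot be stored in a CuArray, while the comprehension inside the hcat call re-infers a concrete, isbits element type. A small check of this guess (integs_cpu is a hypothetical name for the host-side array built in the loop above, before the CuArray conversion):
narrowed = [integs_cpu[i] for i in 1:N1] # comprehension infers the concrete struct type
isbitstype(eltype(narrowed))             # true -> acceptable element type for CuArray
isbitstype(eltype(integs_cpu))           # false: the eltype is the abstract Integrator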