Pointer to CuArray as a parameter for DifferentialEquations right-hand-side function

I would like to pass a pointer to CuArray (or one of its sections) as a parameter for DifferentialEquations right-hand-side function which is defined as a GPU kernel function.

Why do I need it? I have a differential equation whose driving function is defined on a fixed grid (see DIfferentialEquations with the driving function defined on a fixed grid). The corresponding problem can be defined as

function func(u, p, t)
    tt, ff = p
    f = linterp(t, tt, ff)   # some linear interpolation
    return f
end

t = range(0.0, 10.0, length=101)
f = cos.(t)

tspan = (t[1], t[end])
u0 = 0.0
p = (t, f)

prob = ODEProblem(func, u0, tspan, p)

Since array f in parameters tuple p is passed by reference I can redefine it at any later stage and the function func will know about this change without need to remake the prob. The same will work if f is a multidimensional array and I pass to p one of its 1d sections using @view macro.

Now I would like to solve this kind of equation on GPU with f being a CuArray. In this situation func will be converted to a GPU kernel function (though I do not know how it actually done in DifferentialEquations, but I try to write my own implementation where it happens). The first problem is the fact that CuArray f is not isbits and therefore it can not be accessed inside func. I can solve this issue by converting f into StaticArray. However, in this case I will loose the advantage of changing f without remaking the prob.

Thus, I need some kind of pointer to CuArray f (or one of its sections if it has multiple dimensions). As far as I understand, in CUDA package this functionality is somehow realized through CuDeviceArray type. Therefore, I guess, I have to manually convert my CuArray into CudeviceArray. How can I do it? Or can you please suggest me any other solutions. Thank you.

You might want to look at DiffEqGPU.jl where this is done by KernelAbstractions.jl

Thank you for the suggestion. It seems KernelAbstractions allow me to do what I want:

import CUDA
using KernelAbstractions

CUDA.allowscalar(false)


@kernel function func(u, p)
  f, = p
  I = @index(Global)
  u[I] = f[I]
end


N = 2048
f = ones(Float32, N)

u_gpu = CUDA.zeros(N)
f_gpu = CUDA.CuArray(f)
p = (f_gpu, )

kernel = func(CUDADevice(), 32)

# launch kernel with original f ------------------------------------------------
event = kernel(u_gpu, p; ndrange=size(u_gpu))
wait(event)

u = CUDA.collect(u_gpu)
@show isequal(u, f)   # -> isequal(u, f) = true

# change f outside of kernel and launch the kernel once more -------------------
@. f_gpu = 2 * f_gpu
ev = Event(CUDADevice())   # needed for synchronization
event = kernel(u_gpu, p; ndrange=size(u_gpu), dependencies=ev)
wait(event)

u = CUDA.collect(u_gpu)
@show isequal(u, 2 .* f)   # -> isequal(u, 2 .* f) = true

As far as I understand, somewhere under the hood KernelAbstractions convert f into CuDeviceArray. I want to understand how I can do it by myself. If someone can help me with that, e.g. give a link to a proper line in sources I would be very appreciated (of cause, I will try to find it by myself, though I failed after a brief look). A direct conversion, f_cda = CUDA.CuDeviceArray(f_gpu.dims, CUDA.DevicePtr(pointer(f_gpu))), does not work — I obtain the ReadOnlyMemoryError() when I try to access f_cda.

Also I have a questions about the choice of the number of cuda threads for KernelAbstractions (equal to 32 in the above code). For example, in DiffEqGPU.jl the amount of cuda threads is hard coded and equal to 256 (https://github.com/SciML/DiffEqGPU.jl/blob/7c4398df1d069dea17c8292ae067329670a6e4fc/src/DiffEqGPU.jl#L40). Can I choose it more wisely, e.g. like in CUDA.jl where I can use get_config function (see The most general way to estimate the optimal arguments for @cuda macro)?

All good questions for @vchuravy but he’s on vacation for a bit.

1 Like

Thank you. Let’s hope he will find time to answer after his return.

You’ll need to set a reminder.

Sorry, but what is it? Is it some forum functionality? :flushed:

Oh yeah, you can bookmark a reminder on the website, but I just mean you’ll want to send a message in like a week or so.

1 Like

Ah, ok. Of cause. Thank you.

It seems I found the solution.
In my original problem, I have an iterative process where on each step I update a function f(x,t) defined on a fixed grid (x,t). Also, on each step I have a differential equation du/dt=f(x,t) where u=u(t), that is, x plays a role of the parameter. In terms of DifferentialEquations this equation can be considered as an ensemble problem where I solve multiple differential equations independently for each point x. In this case for each equation i in the ensemble I have to pass a slice of the array f, that is f[i,:], as a parameter p. The solution is performed on GPU which imposes a limitation on the way how I pass f[i,:] into the kernel function. After reading the sources of DiffEqGPU.jl and manual of CUDA.jl I came to the solution which can be illustrated by the following code:

import CUDA


struct Integrator{P}
    p :: P
end


function func(u::CUDA.CuArray, integs)
    N = length(u)

    function get_config(kernel)
        fun = kernel.fun
        config = CUDA.launch_configuration(fun)
        blocks = cld(N, config.threads)
        return (threads=config.threads, blocks=blocks)
    end

    CUDA.@cuda config=get_config kernel(u, integs)
    return nothing
end


function kernel(u, integs)
    N1, N2 = size(u)
    id = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for i=id:stride:N1
        p = integs[i].p
        f, = p
        for j=1:N2
            u[i, j] = f[j]
        end
    end
    return nothing
end

println()

N1 = 100
N2 = 200
f = rand(Float32, (N1, N2))
f_gpu = CUDA.CuArray(f)

u_gpu = CUDA.zeros((N1, N2))
integs = Array{Integrator}(undef, N1)
for i=1:N1
    fi = CUDA.cudaconvert(@view f_gpu[i, :])   # !!! Converting the array view
    pi = (fi, )
    integs[i] = Integrator(pi)
end
integs = CUDA.CuArray(hcat([integs[i] for i in 1:N1]))   # !!! Preparing the array of integrators.

func(u_gpu, integs)
@show isequal(CUDA.collect(u_gpu), CUDA.collect(f_gpu))   # -> true

@. f_gpu = f_gpu * 2   # change f outside of the kernel
func(u_gpu, integs)
@show isequal(CUDA.collect(u_gpu), CUDA.collect(f_gpu))   # -> true

The solution allows to modify the driving function f without remaking of the problem. This is realized through the array view which should be converted by the CUDA.cudaconvert function (otherwise, after passing through the parameter tuple, it can not be accessed inside the GPU kernel since it is not isbits). Another tricky point is how to pass the resulting array of integrators (structures which contain the parameter tuples for each of the equation) into the GPU kernel. The solution was found in DiffEqGPU.jl (https://github.com/SciML/DiffEqGPU.jl/blob/67f4878f4c64e5df75106f382feb3caf4d7d96d2/src/DiffEqGPU.jl#L169). For some reason, you can not simply convert the array of structs into CuArray. First you need to stack the structs together using hcat function, and only after the resulting 2D array (with the length along the second dimension equal to 1) can be used in GPU kernel.