Differential Equations running out of GPU memory

Hi all, I posted earlier about minimising memory usage when modelling a system on the GPU here. Since then I've tidied up my code a bit and managed to reduce memory usage, but it still runs out of memory when I try to scan over a range of parameters.

I'm setting save_on = false, which according to the documentation should stop it from saving intermediate states, and using a callback to save the solution to the CPU at certain points in time. My understanding is that if it has enough memory to get through a few timesteps, it shouldn't need any extra memory to run for an arbitrarily long time, since it isn't saving intermediate timesteps.

What I actually find is that, for large grid sizes, it runs fine for a long time (100-200 seconds of simulated time) and then runs out of memory and throws an error.

I'm basically wondering if there is another way to ensure the solver isn't saving any intermediate information, or if there's a way to see what is taking up all of this memory, so I can potentially use a callback to manually clear the GPU memory during solving.

MWE:

using FFTW, CUDA, DifferentialEquations, LinearAlgebra, Plots

function kfunc_opt!(dψ,ψ)   # kinetic term: dψ = F⁻¹[k² F[ψ]] (i.e. -∇²ψ)
    mul!(dψ,Pf,ψ)           # forward FFT into dψ
    dψ .*= k2               # multiply by k² in Fourier space
    Pi!*dψ                  # in-place inverse FFT
    return nothing
end

function GPE!(dψ,ψ,var,t) # GPE Equation 
    kfunc_opt!(dψ,ψ)
    @. dψ = -(im + γ)*(0.5*dψ + (V_0 + abs2(ψ) - 1)*ψ)
end

function GPU_Solve(save_array, EQ!, ψ, tspan)

    savepoints = tspan[2:end]
    condition(u, t, integrator) = t ∈ savepoints

    function affect!(integrator)                 # copy the current state back to the CPU
        push!(save_array, Array(integrator.u))
    end

    push!(save_array, Array(ψ))                  # save the initial state
    cb = DiscreteCallback(condition, affect!)

    prob = ODEProblem(EQ!, ψ, (tspan[1], tspan[end]))
    solve(prob, callback=cb, tstops=savepoints, save_on=false)

end

L = 8
M = 60

x = LinRange(-L,L,M)                    # spatial grid (built on the CPU to avoid scalar indexing)
dx = x[2] - x[1]
kx = fftfreq(M,2π/dx) |> collect        # wavenumber grid
dkx = kx[2] - kx[1]

k2 = cu(kx.^2 .+ kx'.^2 .+ reshape(kx,(1,1,M)).^2);            # k² array on the GPU
V_0 = cu(0.3*[i^2 + j^2 + k^2 for i in x, j in x, k in x]);    # harmonic trap potential on the GPU

const Pf = Float32(dx^3/(2π)^1.5)*plan_fft(cu(rand(M,M,M) + im*rand(M,M,M)));
const Pi! = Float32(M^3*dkx^3/(2π)^1.5)*plan_ifft!(cu(rand(M,M,M) + im*rand(M,M,M)));

γ = 0.05
tspan = LinRange(0.0,10,50); 

CUDA.memory_status()
res_GS = []
GPU_Solve(res_GS,GPE!,(cu(randn(M,M,M) + im*randn(M,M,M))),tspan);

begin
    t = 3 # Change this to look at different times
    heatmap(abs2.(res_GS[t][:,:,30])) |> display
end

What solver do you need here? If you need something for stiff ODEs, do you have a sparse Jacobian? This is the exact same conclusion as Minimising DifferentialEquations GPU memory allocation


I typically use Tsit5() or Vern6(), but have also looked into using some of the low-memory, fixed-timestep algorithms. I believe Vern6 doesn't use a Jacobian, and if I did need one it would be dense anyway. I'm asking here more about how to stop it from accumulating GPU memory while solving, rather than about reducing the memory used by each timestep.

If save_on = false, then it won’t save anything, so the only memory is from the caches for time stepping.

Should that be allocating much memory? I was running it on a machine with 40 GB of GPU memory and it was eventually running out. And if so, is there any way to reduce the memory used by the caches?

How big is the state vector? Tsit5, for example, needs 12 concurrent copies of it: OrdinaryDiffEq.jl/src/caches/low_order_rk_caches.jl at master · SciML/OrdinaryDiffEq.jl · GitHub. Pen-and-paper math: what does that come out to in memory?


The vector is 256^3 elements of ComplexF32, so 8 bytes each. So the solver caches would come to about 8 * 256^3 * 12 ≈ 1.61 GB? I've got some states saved and this lines up with how much space they take up.
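(A quick back-of-the-envelope check of that number, assuming the 12 state-sized caches from the linked source file:)

N = 256^3                      # number of grid points
bytes = sizeof(ComplexF32)     # 8 bytes per element
n_caches = 12                  # state-sized caches held by Tsit5
N * bytes * n_caches / 1e9     # ≈ 1.61 GB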

Have you checked that your f calls are fully non-allocating? Try adding a CUDA.gc call inside of the f.

I'm pretty sure my f call is non-allocating; using BenchmarkTools and @btime it was something like 14 KB of allocations. Just to check, did you mean GC.gc()? I can't find any documentation for CUDA.gc(). So should I just put a GC.gc(true) in the function to make sure the garbage collector is clearing memory?
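(For reference, this is roughly how I'm checking the RHS for allocations; a minimal sketch that reuses M and GPE! from the MWE above and assumes the input arrays are preallocated CuArrays:)

using BenchmarkTools, CUDA

ψ0 = cu(randn(ComplexF32, M, M, M))
dψ = similar(ψ0)

CUDA.@time GPE!(dψ, ψ0, nothing, 0.0)            # reports CPU and GPU allocations
@btime CUDA.@sync GPE!($dψ, $ψ0, nothing, 0.0)   # CPU-side allocations only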

@maleadt, is throwing GC.gc(true) calls around the best thing here?

That shouldn’t be required, as we try to collect memory before going OOM. But it may help, as would calling CUDA.reclaim(). But again, both shouldn’t be required, so an MWE that goes OOM without these calls would make a good issue on the CUDA.jl repository.
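(For completeness, a sketch of how one might wire that into the solve via a periodic callback; DiffEqCallbacks.jl is an extra dependency not used in the MWE above, and as noted this shouldn't be required in principle:)

using DiffEqCallbacks, CUDA

# every 1.0 units of simulated time, run a full GC pass and return cached GPU memory
cleanup_cb = PeriodicCallback(integrator -> (GC.gc(true); CUDA.reclaim()), 1.0)

# combined with the saving callback from GPU_Solve in the MWE:
# solve(prob, callback = CallbackSet(cb, cleanup_cb), tstops = savepoints, save_on = false)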

How about RK4? Will it use fewer copies than Tsit5? I mean with the adaptive time step.

If what you need is low memory requirements, don't use RK4. Use one of the optimized low-storage methods documented here: ODE Solvers · DifferentialEquations.jl
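(For instance, a sketch using one of the 2N-storage explicit Runge-Kutta methods, reusing prob, cb, and savepoints as constructed inside GPU_Solve in the MWE; the method and the fixed dt here are illustrative choices, not specific recommendations:)

using OrdinaryDiffEq   # the low-storage solvers are also exported by DifferentialEquations

# low-storage explicit RK with a fixed time step
sol = solve(prob, CarpenterKennedy2N54(), dt = 1e-3,
            callback = cb, tstops = savepoints, save_on = false)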


But it seems none of the low-storage methods can use an adaptive time step?

Yeah. If you need that, then BS3 might be a better option


BS3 may use less memory. I am also playing with the GPE with a 256^3 state vector. On my machine with 6 GB of GPU memory, RK4 works fine, but Tsit5 does not, unfortunately.

It seems that Tsit5 can also work if save_start=false is used in addition to save_on=false. Weird, @ChrisRackauckas
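(That is, something along these lines, with everything else from the MWE unchanged:)

solve(prob, Tsit5(), callback = cb, tstops = savepoints,
      save_on = false, save_start = false)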