Why does my devectorized code keep allocating memory?

WouterJRV · June 11, 2021, 10:19am

I am trying to solve a complex Ginzburg-Landau equation in Julia, defined by

function CGLfun!(du,u,p,t)
NBmat,gridsize,η,g,V,α,σ=p    
du.=1im*(1 .-1im*η).*(NBmat*u.-4*u)/gridsize^2 +(.-(1im).*V+α.-(σ.+1im*g).*abs2.(u)).*u; 
end

this was the first vectorized implementation, strongly similar to how I initially programmed it in MATLAB. Evolving it with Vern7() is maybe 20% faster than MATLAB’s ode113 result and takes about 600 sec on my desktop to evolve t from 0 until 100.
In this PDE, du is actually defined on a grid (of N=Nx^2=101^2 datapoints) as a complex variable, but I’ve rewritten it all in a column matrix for convenience. NBmat is an NxN sparse matrix that couples neighbouring sites, so the first term corresponds to the discretized laplacian. \eta,V,\alpha,\sigma are all spatially dependent (N entries) and real-valued (incidentally they are proportional to one another so perhaps there’s a better way to import them, but for the current sake I don’t think that should be the issue), while g is a constant.

The official benchmarks of JulliaDiffEq. mention an about 100x speedup compared to MATLAB, and in this sense I was a bit disappointed by the only very modest gain, even when already using the in-place version.

From @ChrisRackauckas recent FAQ, I found this interesting reference for optimization Optimizing DiffEq Code . I saw from the example of the Gierer-Meinhardt Reaction-Diffusion PDE that most of the improvement can be obtained by devectorizing. So I checked the same example on my machine.

This is my output when evolving from 0 to 0.1 with save_every_timestep=false

BenchmarkTools.Trial: 
  memory estimate:  2.90 MiB
  allocs estimate:  62
  --------------
  minimum time:     4.835 ms (0.00% GC)
  median time:      4.917 ms (0.00% GC)
  mean time:        4.987 ms (1.13% GC)
  maximum time:     7.015 ms (26.40% GC)
  --------------
  samples:          1002
  evals/sample:     1

and from 0 to 1.0

BenchmarkTools.Trial: 
  memory estimate:  2.90 MiB
  allocs estimate:  62
  --------------
  minimum time:     44.611 ms (0.00% GC)
  median time:      45.434 ms (0.00% GC)
  mean time:        45.610 ms (0.11% GC)
  maximum time:     47.961 ms (3.81% GC)
  --------------
  samples:          110
  evals/sample:     1

so we see that the amount of memory assigned is constant.

In a very analogous way to this example, I also tried to rewrite my PDE in a devectorized way. Note that I’ve modified the original complex vector U of length N into a real Nx times Nx times 2 array, where the real and imaginary parts are separated. This gives in full

function CGLfunrealdevect!(du,u,p3,t)
Nx,D,η,g,V,α,σ=p3 
 
    X= @view u[:,:,1]
    P= @view u[:,:,2]
    dX= @view du[:,:,1]
    dP= @view du[:,:,2]
    
@inbounds for kk in 2:Nx-1, jj in 2:Nx-1  
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for kk in 2:Nx-1,jj in 2:Nx-1  
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
    kk=1
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for jj in 2:Nx-1
    kk=1
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
    kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for jj in 2:Nx-1
    kk=Nx
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for kk in 2:Nx-1 
    jj=1
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=Nx
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds begin
    jj=1; kk=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

    jj=1; kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

    jj=Nx; kk=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
    
    jj=Nx; kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end
        
end

Gives for (0,0.1) evolution

BenchmarkTools.Trial: 
  memory estimate:  46.61 GiB
  allocs estimate:  2857792769
  --------------
  minimum time:     41.576 s (7.85% GC)
  median time:      41.576 s (7.85% GC)
  mean time:        41.576 s (7.85% GC)
  maximum time:     41.576 s (7.85% GC)
  --------------
  samples:          1
  evals/sample:     1

And for (0,1) evolution

BenchmarkTools.Trial: 
  memory estimate:  450.45 GiB
  allocs estimate:  27623039639
  --------------
  minimum time:     387.538 s (8.10% GC)
  median time:      387.538 s (8.10% GC)
  mean time:        387.538 s (8.10% GC)
  maximum time:     387.538 s (8.10% GC)
  --------------
  samples:          1
  evals/sample:     1

So in this case, there is clearly a huge amount of allocations, and in contrast to the example PDE, the memory is extensive with evolution length.

Any idea what I missed and how I can achieve a similar speedup as the example?

What I already tried in attempt to reproduce the ‘slow’ behavior with the Gierer-Meinhardt example:
*change the solver algorithm from Tsit5() to Vern7() as well
*make \alpha as there as well a spatially dependent grid
*put in some @view to rename (:,:,1) and (:,:,2)
*Putting a ; at the end of expressions
*Implicitly writing the multiplication signs between numbers and constants
*putting random pieces of the expression in parentheses
inserting a perturbation in the form of 1e-10(u[i,j,1]^2+u[i,j,2]^2)u[i,j,1]
*changing intendation
but the memory-efficiency of that example has seemed robust against all these things…

Sukera · June 11, 2021, 10:27am

Your code is very long and verbose, so it’s kind of hard to give a concrete suggestion. As far as I can tell, you only ever write to dP, dX and \alpha, and all other variables are only read from, correct? As a first step to make this more debuggable, I’d suggest eliminating common subexpressions that you recalculate over and over again and put them into their own variables. If the results are single integers or floats, those don’t allocate if the code is type stable! Only arrays and other mutable objects cause allocations in type stable code (there are some exceptions, but that’s much too detailed and doesn’t apply here).

Is your function type stable? What are the types of e.g. η and D, what about p3? If they’re 3D arrays, accessing only two indices may allocate (since you’d try to retrieve an array slice and slicing allocates).

; is only used in the REPL to suppress the output from being printed - it has no effect in code that’s not run at the top level.

Indentation is not relevant when discussing performance, it has no effect here. In general, whitespace is almost never significant in julia (and if it is, it changes behavior, not purely performance. This is the case when writing e.g. array literals).

ChrisRackauckas · June 11, 2021, 10:31am

Why not just @. du = ...? You’re not fusing the whole operation: you forgot a few dots so it’s allocating a few steps. That RHS function is not written in a very fast way, so that will impact the speed you see.

Edit: I see, matmul. Everything else should be dotted those, and it’s not.

Is that the right solver? Sounds like more of a problem for VCABM, QNDF, CVODE_BDF, LSODA, ROCK2, or ROCK4. Did you do a comparative study of methods?

Not sharing code makes it impossible to really debug, but from experience, I would guess that your function is not type-stable because you made p3 a Vector when it should be a Tuple since it’s filled with different typed objects. Did you look at @code_warntype to see?

WouterJRV · June 11, 2021, 10:58am

thank you for your answer. I’ve been checking some of these things kind of from desperation, I saw for example people reporting a problem where Juno skips ill-intended lines (although i use VSC now). And I remember a case where it didn’t work because I called the array of variables \psi instead of u.

Being able to add intermediate values would indeed be something useful, if that works with (pointwise) matrix operations as well. I thought I couldn’t, so good to know

WouterJRV · June 11, 2021, 11:09am

p3 being a vector was indeed the big problem. New output from simulation until 0.1

BenchmarkTools.Trial: 
  memory estimate:  12.95 MiB
  allocs estimate:  71
  --------------
  minimum time:     265.388 ms (0.00% GC)
  median time:      269.040 ms (0.00% GC)
  mean time:        270.049 ms (0.07% GC)
  maximum time:     283.038 ms (0.00% GC)
  --------------
  samples:          19
  evals/sample:     1

And evolution until 100 takes 160sec, a few times better than before

My not .@ was partiallly because NBmat needs a normal multiplication. But I wrote that before I properly saw the optimization tutorial.

Likely, there are better solvers. It’s pretty much the first one I tried. I saw that Vern7() was the goto solver from matlab’s ode113, and for now I just wanted to check the advantage of Julia itself for a similar solver (the 100x improvement for the Lotke-Volterra benchmarks intrigued me)

I have no objections with sharing code if you’d like to play with it, though? (I was assuming most people here to be too busy to do that)

ChrisRackauckas · June 11, 2021, 11:17am

If you share it some other people might take a look. I would bookmark it to look like a week from now and see if no one else has. High order RK methods are just not stable and not a great idea for PDEs, so Vern7 would be one of the last methods I would check here. CVODE_BDF(linear_solver = :GMRES) might be a very easy way to take a chunk off, or ROCK4().

WouterJRV · June 11, 2021, 2:07pm

For those interested in the full code:

using LinearAlgebra, DifferentialEquations, SparseArrays, Statistics,Plots

tstart =0.
tstep=1.
tfinal=100.
tarray=tstart:tstep:tfinal;

xvec=-50.:0.5:50.;yvec=-50.:0.5:50.; 
Nx=length(xvec);Ny=length(yvec);N=Nx*Ny;
X=xvec'.*ones(Ny); Y=ones(Nx)'.*yvec;


ħ=0.6591; 
c = 300.;  
m0 = 511. * 1e6/c^2; 

m=1e-5*m0;
g=1e-4;
t0=7.;
l0=sqrt(ħ*t0/(2*m));

xvectilde=xvec/l0; yvectilde=yvec/l0;
gridsize=(xvectilde[2]-xvectilde[1]); 
@assert ((yvectilde[2]-yvectilde[1])==gridsize);#Have to use a uniform grid for the derivative in the kinetic E
Xtilde=xvectilde'.*ones(Ny); Ytilde=ones(Nx)'.*yvectilde;

V0orig=2.; 
Vtilde0=V0orig*t0/ħ;
gtilde=g*2*m/ħ^2;

dist=10.;
Nspots=4;
L=zeros(Nspots,2)
for ss=1:Nspots
    L[ss,:]=dist*[cos(2*pi/Nspots*ss+pi/2),sin(2*pi/Nspots*ss+pi/2)];
end

Ltilde=L/l0;
pumpwidth=4;pumpwidthtilde=pumpwidth/l0; 
Pfun=zeros(Nx,Nx); 
pumpidx=zeros(Int64,Nspots,1);
pumpstrength=1;
for pp=1:size(L,1)
global Pfun
thispump=pumpstrength*exp.(-((Xtilde .- Ltilde[pp,1]).^2 .+ (Ytilde.- Ltilde[pp,2]).^2)./(2*pumpwidthtilde.^2)); #The exponent can remain dimensionful (micron) without problems (I think)
Pfun = Pfun +thispump;
~,thispumpidx=findmax(thispump[:]);
pumpidx[pp]=thispumpidx
end

middle=mean(L,dims=1);
~,centreidx=findmin((middle[1].-X[:]).^2+(middle[2].-Y[:]).^2);


V=Vtilde0*Pfun
η=0.1*Pfun
σ=0.3*Pfun
α=Pfun

function CGLfunrealdevect!(du,u,p3,t)
Nx,D,η,g,V,α,σ=p3 
 
    X= @view u[:,:,1]
    P= @view u[:,:,2]
    dX= @view du[:,:,1]
    dP= @view du[:,:,2]
    
@inbounds for kk in 2:Nx-1, jj in 2:Nx-1  
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for kk in 2:Nx-1,jj in 2:Nx-1  
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
    kk=1
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for jj in 2:Nx-1
    kk=1
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
    kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for jj in 2:Nx-1
    kk=Nx
    dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end


@inbounds for kk in 2:Nx-1 
    jj=1
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1 
    jj=Nx
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds begin
    jj=1; kk=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

    jj=1; kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

    jj=Nx; kk=1
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
    
    jj=Nx; kk=Nx
    dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
    α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
    dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
    α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end
        
end

u0=1e-3*randn(Nx,Nx,2);

D=1/gridsize^2;
p3=(Nx,D,η,g,V,α,σ)

prob=ODEProblem(CGLfunrealdevect!,u0,(0,100.),p3)
@benchmark sol=solve(prob,Vern7(),saveat=tarray);

I did a quick check with ROCK4() as well and CVODE_BDF(linear_solver = :GMRES) as suggested by, but they were not super good, memory does increase with time and they are slower especially for short evolution times. For a time evolution of 100, ROCK4 seems to have catched up with Vern7 in computation time (but at much larger memory)

ROCK4, (0,100.)

BenchmarkTools.Trial: 
  memory estimate:  101.78 GiB
  allocs estimate:  339782
  --------------
  minimum time:     157.428 s (1.22% GC)
  median time:      157.428 s (1.22% GC)
  mean time:        157.428 s (1.22% GC)
  maximum time:     157.428 s (1.22% GC)
  --------------
  samples:          1
  evals/sample:     1

CVODE_BDF(linear_solver = :GMRES) ,(0,10.)

BenchmarkTools.Trial: 
  memory estimate:  42.00 MiB
  allocs estimate:  729536
  --------------
  minimum time:     127.524 s (0.00% GC)
  median time:      127.524 s (0.00% GC)
  mean time:        127.524 s (0.00% GC)
  maximum time:     127.524 s (0.00% GC)
  --------------
  samples:          1
  evals/sample:     1

WouterJRV · July 19, 2021, 7:58am

I have refined the code (parameters have changed after the previous version) and the determisctic case works fine. I’m now also studyng a stochastic version by including noise, that is much slower. I’m a bit surprised that the adaptive weak-solvers like DRI give out-of-memory errors, especially since system monitor shows that plenty of RAM is still free (see screenshot).
SOSRA works, but is a bit slower then I’d like. Apart from the lower order of convergence, I think that demanding that the tolerances are met by strong rather than weak deviations is also unnecessarily strict

The code:

using LinearAlgebra, DifferentialEquations, SparseArrays, Statistics,Plots, BenchmarkTools, Sundials

tstart =0.
tstep=1.;
tfinal=10;
tarray=tstart:tstep:tfinal;
T_size=length(tarray);

xvec=-20.:0.1:20.;yvec=-20.:0.1:20.; 
Nx=length(xvec);Ny=length(yvec);N=Nx*Ny;
X=xvec'.*ones(Ny); Y=ones(Nx)'.*yvec;


ħ=0.6591; 
c = 300.;  
m0 = 511. * 1e6/c^2; 

m=3.7*1e-5*m0;
t0=0.069; 
l0=sqrt(ħ*t0/(2*m));

cavityfreq=2.84*1e3;

xvectilde=xvec/l0; yvectilde=yvec/l0;
gridsize=(xvectilde[2]-xvectilde[1]); 
@assert ((yvectilde[2]-yvectilde[1])==gridsize);
Xtilde=xvectilde'.*ones(Ny); Ytilde=ones(Nx)'.*yvectilde;


V0orig=63; 
Vtilde0=V0orig*t0/ħ;

dist=4.1;

L=[[0 -dist/2];[0 dist/2]];

Ltilde=L/l0;
pumpwidth=1;
pumpwidthtilde=pumpwidth/l0; 
Pfun=zeros(Ny,Nx);

pumpidx=Vector{Float64}(undef,size(L,1));
allpumpstrength=1;
for pp=1:size(L,1)
global Pfun
thispump=allpumpstrength*exp.(-((Xtilde .- Ltilde[pp,1]).^2 .+ (Ytilde.- Ltilde[pp,2]).^2)./(2*pumpwidthtilde.^2)); 
Pfun = Pfun +thispump;
~,thispumpidx=findmax(thispump[:]);
pumpidx[pp]=thispumpidx
end

middle=mean(L,dims=1);
~,centreidx=findmin((middle[1].-X[:]).^2+(middle[2].-Y[:]).^2);


Ω=3.0;
V=Vtilde0*Pfun; 
η=10/Ω*Pfun;
P=10*2*Pfun;
g=1e-3; 
γ=1;
σ=0.3*Pfun;

function f!(du,u,p,t)
    Nx,Ny,D,η,g,V,P,γ,σ=p 
     
        X= @view u[:,:,1]
        Y= @view u[:,:,2]
        dX= @view du[:,:,1]
        dY= @view du[:,:,2]

        @inbounds for xx in 1:Nx, yy in 1:Ny  
        xxprev=mod(xx-2,Nx)+1
        xxnext=mod(xx,Nx)+1
        yyprev=mod(yy-2,Ny)+1
        yynext=mod(yy,Ny)+1
        dens=X[yy,xx]^2+Y[yy,xx]^2
        localcoh=V[yy,xx]+g*dens+4*D
        localincoh=-η[yy,xx]*(V[yy,xx]+g*dens+4*D)+0.5*(P[yy,xx]-γ[yy,xx])-σ[yy,xx]*dens

        dX[yy,xx]=localincoh*X[yy,xx]+localcoh*Y[yy,xx]+η[yy,xx]*D*(X[yyprev,xx]+X[yynext,xx]+X[yy,xxprev]+X[yy,xxnext])-D*(Y[yyprev,xx]+Y[yynext,xx]+Y[yy,xxprev]+Y[yy,xxnext])
        dY[yy,xx]=localincoh*Y[yy,xx]-localcoh*X[yy,xx]+η[yy,xx]*D*(Y[yyprev,xx]+Y[yynext,xx]+Y[yy,xxprev]+Y[yy,xxnext])+D*(X[yyprev,xx]+X[yynext,xx]+X[yy,xxprev]+X[yy,xxnext])   
        end
        
end

function g!(du,u,p,t)
    Nx,Ny,D,η,g,V,P,γ,σ=p
 @inbounds for dd in 1:2, xx in 1:Nx ,yy in 1:Ny
 du[yy,xx,dd]=√(0.5*P[yy,xx]*D)
 end
end

u0=1e-3*randn(Ny,Nx,2);

D=1/gridsize^2;
p=(Nx,Ny,D,η,g,V,P,γ,σ)

probstoch=SDEProblem(f!,g!,u0,(tstart,tfinal),p)
@time stochsol=solve(probstoch,DRI1NM(),saveat=tarray)

the error:

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
 [1] Array at ./boot.jl:408 [inlined]
 [2] Array at ./boot.jl:416 [inlined]
 [3] zeros at ./array.jl:525 [inlined]
 [4] zeros at ./array.jl:521 [inlined]
 [5] alg_cache(::DRI1NM, ::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::Array{Float64,3}, ::Array{Float64,3}, ::Nothing, ::Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}}, ::Array{Float64,3}, ::Array{Float64,3}, ::Nothing, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Array{Float64,3}, ::SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}, ::Float64, ::Float64, ::Type{Val{true}}) at /home/wouter/.julia/packages/StochasticDiffEq/MzqBB/src/caches/srk_weak_caches.jl:642
 [6] __init(::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::DRI1NM, ::Array{Any,1}, ::Array{Any,1}, ::Type{T} where T, ::Type{Val{true}}; saveat::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_noise::Bool, save_on::Bool, save_start::Bool, save_end::Bool, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmax::Rational{Int64}, qmin::Rational{Int64}, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta2::Rational{Int64}, beta1::Rational{Int64}, delta::Rational{Int64}, maxiters::Int64, dtmax::Float64, dtmin::Float64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, force_dtmin::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, initialize_integrator::Bool, seed::UInt64, alias_u0::Bool, alias_jumps::Bool, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .../.julia/packages/StochasticDiffEq/MzqBB/src/solve.jl:453
 [7] #__solve#99 at .../.julia/packages/StochasticDiffEq/MzqBB/src/solve.jl:6 [inlined]
 [8] #solve_call#56 at .../.julia/packages/DiffEqBase/FMY9y/src/solve.jl:61 [inlined]
 [9] solve_up(::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::Nothing, ::Array{Float64,3}, ::Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}}, ::DRI1NM; kwargs::Base.Iterators.Pairs{Symbol,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},Tuple{Symbol},NamedTuple{(:saveat,),Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}) at /home/wouter/.julia/packages/DiffEqBase/FMY9y/src/solve.jl:82
 [10] #solve#57 at .../.julia/packages/DiffEqBase/FMY9y/src/solve.jl:70 [inlined]
 [11] top-level scope at ./timing.jl:174
in expression starting at .../...:113

ChrisRackauckas · July 19, 2021, 10:33am

You should use the adaptive weak integrators with EnsembleCPUArray (or EnsembleGPUArray) then.

WouterJRV · July 19, 2021, 12:51pm

Is it expected that such a weak Silver crashes on the single shot (outside of an ensembleproblem) with out-of-memory error? I was planning to replace the ensemble average by a time average.

ChrisRackauckas · July 19, 2021, 1:26pm

Hmm, it’s because Ihat2 uses length(u)^2 memory for the pairwise 3-point random numbers. Can you open an issue on this? We can fix it.

WouterJRV · July 27, 2021, 2:11am

Just a question, but I just found out about NeuralPDE.jl . I’m wondering if there is something like a rule of thumb on when these methods are expected to outperform the standardi ones of working on a fixed discrete grid like I was doing here?

ChrisRackauckas · July 27, 2021, 10:50am

Essentially never. I am writing a blog post on that soon. Essentially, every result where people have said they have done better has been compromised by using DiffEq out of the box and performing about 10,000x faster than their classical method implementation, so it always wins out of the box by more than 100x. The advantage of NeuralPDE.jl is that it’s easy to apply it to any possible problem. So if you have GPUs and just want the answer, it’s a good strategy for any possible PDE, including integro-differential equations, fractional derivatives, etc. I’m not sure of another method that has that kind of ubiquitousness.

stevengj · July 27, 2021, 3:09pm

If almost every operation needs to be dotted, with only a few exceptions, you can opt out of @. for an individual function call by escaping it with $. Unfortunately, Julia can’t parse x $* y as a binary operation, so you have to write it as an ordinary function call ($*)(x, y) in order to escape the $.

That is, you could write:

@. du.= 1im * (1 - 1im*η) * (($*)(NBmat,u) - 4*u)/gridsize^2 +((-1im)*V + α - (σ+1im*g) * abs2(u)) * u

to “dot” everything but the matrix-vector multiplication.

(There are probably other ways to speed this up. e.g. if η is real, you could write complex(η, 1) instead of 1im * (1 - 1im*η). You could also pull the division /gridsize^2 out of the loop by writing * $(1/gridsize^2) instead. But maybe the dominant cost is the matrix–vector multiplication?)

ChrisRackauckas · July 27, 2021, 6:50pm

BTW, this fix was merged and tagged:

https://github.com/SciML/StochasticDiffEq.jl/pull/428

Topic		Replies	Views
Confusing memory allocations when using the integrator of DifferentialEquations.jl Performance question , diffeq , memory-allocation	10	445	January 12, 2022
Using views causes significant slowdown in simple PDE Performance diffeq , performance , memory-allocation , pde	2	510	February 7, 2021
Autodifferencing and type stability in DifferentialEquations.jl New to Julia diffeq	5	566	January 18, 2022
Memory allocation with Julia 1.8.5 release Modelling & Simulations	0	181	April 11, 2023
Advanced tricks to reduce memory allocations of ODE Performance diffeq	7	510	June 27, 2023

Why does my devectorized code keep allocating memory?

Related topics