Why does my devectorized code keep allocating memory?

I am trying to solve a complex Ginzburg-Landau equation in Julia, defined by

function CGLfun!(du,u,p,t)
NBmat,gridsize,η,g,V,α,σ=p
du.=1im*(1 .-1im*η).*(NBmat*u.-4*u)/gridsize^2 +(.-(1im).*V+α.-(σ.+1im*g).*abs2.(u)).*u;
end


this was the first vectorized implementation, strongly similar to how I initially programmed it in MATLAB. Evolving it with Vern7() is maybe 20% faster than MATLAB’s ode113 result and takes about 600 sec on my desktop to evolve t from 0 until 100.
In this PDE, du is actually defined on a grid (of N=Nx^2=101^2 datapoints) as a complex variable, but I’ve rewritten it all in a column matrix for convenience. NBmat is an NxN sparse matrix that couples neighbouring sites, so the first term corresponds to the discretized laplacian. \eta,V,\alpha,\sigma are all spatially dependent (N entries) and real-valued (incidentally they are proportional to one another so perhaps there’s a better way to import them, but for the current sake I don’t think that should be the issue), while g is a constant.

The official benchmarks of JulliaDiffEq. mention an about 100x speedup compared to MATLAB, and in this sense I was a bit disappointed by the only very modest gain, even when already using the in-place version.

From @ChrisRackauckas recent FAQ, I found this interesting reference for optimization Optimizing DiffEq Code . I saw from the example of the Gierer-Meinhardt Reaction-Diffusion PDE that most of the improvement can be obtained by devectorizing. So I checked the same example on my machine.

This is my output when evolving from 0 to 0.1 with save_every_timestep=false

BenchmarkTools.Trial:
memory estimate:  2.90 MiB
allocs estimate:  62
--------------
minimum time:     4.835 ms (0.00% GC)
median time:      4.917 ms (0.00% GC)
mean time:        4.987 ms (1.13% GC)
maximum time:     7.015 ms (26.40% GC)
--------------
samples:          1002
evals/sample:     1



and from 0 to 1.0

BenchmarkTools.Trial:
memory estimate:  2.90 MiB
allocs estimate:  62
--------------
minimum time:     44.611 ms (0.00% GC)
median time:      45.434 ms (0.00% GC)
mean time:        45.610 ms (0.11% GC)
maximum time:     47.961 ms (3.81% GC)
--------------
samples:          110
evals/sample:     1


so we see that the amount of memory assigned is constant.

In a very analogous way to this example, I also tried to rewrite my PDE in a devectorized way. Note that I’ve modified the original complex vector U of length N into a real Nx times Nx times 2 array, where the real and imaginary parts are separated. This gives in full

function CGLfunrealdevect!(du,u,p3,t)
Nx,D,η,g,V,α,σ=p3

X= @view u[:,:,1]
P= @view u[:,:,2]
dX= @view du[:,:,1]
dP= @view du[:,:,2]

@inbounds for kk in 2:Nx-1, jj in 2:Nx-1
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1,jj in 2:Nx-1
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=1
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=1
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=Nx
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=Nx
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=1
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=Nx
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds begin
jj=1; kk=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=1; kk=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=Nx; kk=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=Nx; kk=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

end


Gives for (0,0.1) evolution

BenchmarkTools.Trial:
memory estimate:  46.61 GiB
allocs estimate:  2857792769
--------------
minimum time:     41.576 s (7.85% GC)
median time:      41.576 s (7.85% GC)
mean time:        41.576 s (7.85% GC)
maximum time:     41.576 s (7.85% GC)
--------------
samples:          1
evals/sample:     1


And for (0,1) evolution

BenchmarkTools.Trial:
memory estimate:  450.45 GiB
allocs estimate:  27623039639
--------------
minimum time:     387.538 s (8.10% GC)
median time:      387.538 s (8.10% GC)
mean time:        387.538 s (8.10% GC)
maximum time:     387.538 s (8.10% GC)
--------------
samples:          1
evals/sample:     1


So in this case, there is clearly a huge amount of allocations, and in contrast to the example PDE, the memory is extensive with evolution length.

Any idea what I missed and how I can achieve a similar speedup as the example?

What I already tried in attempt to reproduce the ‘slow’ behavior with the Gierer-Meinhardt example:
*change the solver algorithm from Tsit5() to Vern7() as well
*make \alpha as there as well a spatially dependent grid
*put in some @view to rename (:,:,1) and (:,:,2)
*Putting a ; at the end of expressions
*Implicitly writing the multiplication signs between numbers and constants
*putting random pieces of the expression in parentheses
inserting a perturbation in the form of 1e-10(u[i,j,1]^2+u[i,j,2]^2)u[i,j,1]
*changing intendation
but the memory-efficiency of that example has seemed robust against all these things…

Your code is very long and verbose, so it’s kind of hard to give a concrete suggestion. As far as I can tell, you only ever write to dP, dX and \alpha, and all other variables are only read from, correct? As a first step to make this more debuggable, I’d suggest eliminating common subexpressions that you recalculate over and over again and put them into their own variables. If the results are single integers or floats, those don’t allocate if the code is type stable! Only arrays and other mutable objects cause allocations in type stable code (there are some exceptions, but that’s much too detailed and doesn’t apply here).

Is your function type stable? What are the types of e.g. η and D, what about p3? If they’re 3D arrays, accessing only two indices may allocate (since you’d try to retrieve an array slice and slicing allocates).

; is only used in the REPL to suppress the output from being printed - it has no effect in code that’s not run at the top level.

Indentation is not relevant when discussing performance, it has no effect here. In general, whitespace is almost never significant in julia (and if it is, it changes behavior, not purely performance. This is the case when writing e.g. array literals).

3 Likes

Why not just @. du = ...? You’re not fusing the whole operation: you forgot a few dots so it’s allocating a few steps. That RHS function is not written in a very fast way, so that will impact the speed you see.

Edit: I see, matmul. Everything else should be dotted those, and it’s not.

Is that the right solver? Sounds like more of a problem for VCABM, QNDF, CVODE_BDF, LSODA, ROCK2, or ROCK4. Did you do a comparative study of methods?

Not sharing code makes it impossible to really debug, but from experience, I would guess that your function is not type-stable because you made p3 a Vector when it should be a Tuple since it’s filled with different typed objects. Did you look at @code_warntype to see?

3 Likes

thank you for your answer. I’ve been checking some of these things kind of from desperation, I saw for example people reporting a problem where Juno skips ill-intended lines (although i use VSC now). And I remember a case where it didn’t work because I called the array of variables \psi instead of u.

Being able to add intermediate values would indeed be something useful, if that works with (pointwise) matrix operations as well. I thought I couldn’t, so good to know

p3 being a vector was indeed the big problem. New output from simulation until 0.1

BenchmarkTools.Trial:
memory estimate:  12.95 MiB
allocs estimate:  71
--------------
minimum time:     265.388 ms (0.00% GC)
median time:      269.040 ms (0.00% GC)
mean time:        270.049 ms (0.07% GC)
maximum time:     283.038 ms (0.00% GC)
--------------
samples:          19
evals/sample:     1


And evolution until 100 takes 160sec, a few times better than before

My not .@ was partiallly because NBmat needs a normal multiplication. But I wrote that before I properly saw the optimization tutorial.

Likely, there are better solvers. It’s pretty much the first one I tried. I saw that Vern7() was the goto solver from matlab’s ode113, and for now I just wanted to check the advantage of Julia itself for a similar solver (the 100x improvement for the Lotke-Volterra benchmarks intrigued me)

I have no objections with sharing code if you’d like to play with it, though? (I was assuming most people here to be too busy to do that)

If you share it some other people might take a look. I would bookmark it to look like a week from now and see if no one else has. High order RK methods are just not stable and not a great idea for PDEs, so Vern7 would be one of the last methods I would check here. CVODE_BDF(linear_solver = :GMRES) might be a very easy way to take a chunk off, or ROCK4().

2 Likes

For those interested in the full code:

using LinearAlgebra, DifferentialEquations, SparseArrays, Statistics,Plots

tstart =0.
tstep=1.
tfinal=100.
tarray=tstart:tstep:tfinal;

xvec=-50.:0.5:50.;yvec=-50.:0.5:50.;
Nx=length(xvec);Ny=length(yvec);N=Nx*Ny;
X=xvec'.*ones(Ny); Y=ones(Nx)'.*yvec;

ħ=0.6591;
c = 300.;
m0 = 511. * 1e6/c^2;

m=1e-5*m0;
g=1e-4;
t0=7.;
l0=sqrt(ħ*t0/(2*m));

xvectilde=xvec/l0; yvectilde=yvec/l0;
gridsize=(xvectilde[2]-xvectilde[1]);
@assert ((yvectilde[2]-yvectilde[1])==gridsize);#Have to use a uniform grid for the derivative in the kinetic E
Xtilde=xvectilde'.*ones(Ny); Ytilde=ones(Nx)'.*yvectilde;

V0orig=2.;
Vtilde0=V0orig*t0/ħ;
gtilde=g*2*m/ħ^2;

dist=10.;
Nspots=4;
L=zeros(Nspots,2)
for ss=1:Nspots
L[ss,:]=dist*[cos(2*pi/Nspots*ss+pi/2),sin(2*pi/Nspots*ss+pi/2)];
end

Ltilde=L/l0;
pumpwidth=4;pumpwidthtilde=pumpwidth/l0;
Pfun=zeros(Nx,Nx);
pumpidx=zeros(Int64,Nspots,1);
pumpstrength=1;
for pp=1:size(L,1)
global Pfun
thispump=pumpstrength*exp.(-((Xtilde .- Ltilde[pp,1]).^2 .+ (Ytilde.- Ltilde[pp,2]).^2)./(2*pumpwidthtilde.^2)); #The exponent can remain dimensionful (micron) without problems (I think)
Pfun = Pfun +thispump;
~,thispumpidx=findmax(thispump[:]);
pumpidx[pp]=thispumpidx
end

middle=mean(L,dims=1);
~,centreidx=findmin((middle[1].-X[:]).^2+(middle[2].-Y[:]).^2);

V=Vtilde0*Pfun
η=0.1*Pfun
σ=0.3*Pfun
α=Pfun

function CGLfunrealdevect!(du,u,p3,t)
Nx,D,η,g,V,α,σ=p3

X= @view u[:,:,1]
P= @view u[:,:,2]
dX= @view du[:,:,1]
dP= @view du[:,:,2]

@inbounds for kk in 2:Nx-1, jj in 2:Nx-1
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1,jj in 2:Nx-1
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=1
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=1
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=Nx
dX[jj,kk]=D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for jj in 2:Nx-1
kk=Nx
dP[jj,kk]=D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=1
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
end

@inbounds for kk in 2:Nx-1
jj=Nx
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

@inbounds begin
jj=1; kk=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=1; kk=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk])+
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=Nx; kk=1
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);

jj=Nx; kk=Nx
dX[jj,kk]=D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk])-D*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+
α[jj,kk]*X[jj,kk]+V[jj,kk]*P[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk]);
dP[jj,kk]=D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk])+D*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
α[jj,kk]*P[jj,kk]-V[jj,kk]*X[jj,kk]+(X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk]);
end

end

u0=1e-3*randn(Nx,Nx,2);

D=1/gridsize^2;
p3=(Nx,D,η,g,V,α,σ)

prob=ODEProblem(CGLfunrealdevect!,u0,(0,100.),p3)
@benchmark sol=solve(prob,Vern7(),saveat=tarray);



I did a quick check with ROCK4() as well and CVODE_BDF(linear_solver = :GMRES) as suggested by, but they were not super good, memory does increase with time and they are slower especially for short evolution times. For a time evolution of 100, ROCK4 seems to have catched up with Vern7 in computation time (but at much larger memory)

ROCK4, (0,100.)

BenchmarkTools.Trial:
memory estimate:  101.78 GiB
allocs estimate:  339782
--------------
minimum time:     157.428 s (1.22% GC)
median time:      157.428 s (1.22% GC)
mean time:        157.428 s (1.22% GC)
maximum time:     157.428 s (1.22% GC)
--------------
samples:          1
evals/sample:     1


CVODE_BDF(linear_solver = :GMRES) ,(0,10.)

BenchmarkTools.Trial:
memory estimate:  42.00 MiB
allocs estimate:  729536
--------------
minimum time:     127.524 s (0.00% GC)
median time:      127.524 s (0.00% GC)
mean time:        127.524 s (0.00% GC)
maximum time:     127.524 s (0.00% GC)
--------------
samples:          1
evals/sample:     1

1 Like

I have refined the code (parameters have changed after the previous version) and the determisctic case works fine. I’m now also studyng a stochastic version by including noise, that is much slower. I’m a bit surprised that the adaptive weak-solvers like DRI give out-of-memory errors, especially since system monitor shows that plenty of RAM is still free (see screenshot).
SOSRA works, but is a bit slower then I’d like. Apart from the lower order of convergence, I think that demanding that the tolerances are met by strong rather than weak deviations is also unnecessarily strict

The code:

using LinearAlgebra, DifferentialEquations, SparseArrays, Statistics,Plots, BenchmarkTools, Sundials

tstart =0.
tstep=1.;
tfinal=10;
tarray=tstart:tstep:tfinal;
T_size=length(tarray);

xvec=-20.:0.1:20.;yvec=-20.:0.1:20.;
Nx=length(xvec);Ny=length(yvec);N=Nx*Ny;
X=xvec'.*ones(Ny); Y=ones(Nx)'.*yvec;

ħ=0.6591;
c = 300.;
m0 = 511. * 1e6/c^2;

m=3.7*1e-5*m0;
t0=0.069;
l0=sqrt(ħ*t0/(2*m));

cavityfreq=2.84*1e3;

xvectilde=xvec/l0; yvectilde=yvec/l0;
gridsize=(xvectilde[2]-xvectilde[1]);
@assert ((yvectilde[2]-yvectilde[1])==gridsize);
Xtilde=xvectilde'.*ones(Ny); Ytilde=ones(Nx)'.*yvectilde;

V0orig=63;
Vtilde0=V0orig*t0/ħ;

dist=4.1;

L=[[0 -dist/2];[0 dist/2]];

Ltilde=L/l0;
pumpwidth=1;
pumpwidthtilde=pumpwidth/l0;
Pfun=zeros(Ny,Nx);

pumpidx=Vector{Float64}(undef,size(L,1));
allpumpstrength=1;
for pp=1:size(L,1)
global Pfun
thispump=allpumpstrength*exp.(-((Xtilde .- Ltilde[pp,1]).^2 .+ (Ytilde.- Ltilde[pp,2]).^2)./(2*pumpwidthtilde.^2));
Pfun = Pfun +thispump;
~,thispumpidx=findmax(thispump[:]);
pumpidx[pp]=thispumpidx
end

middle=mean(L,dims=1);
~,centreidx=findmin((middle[1].-X[:]).^2+(middle[2].-Y[:]).^2);

Ω=3.0;
V=Vtilde0*Pfun;
η=10/Ω*Pfun;
P=10*2*Pfun;
g=1e-3;
γ=1;
σ=0.3*Pfun;

function f!(du,u,p,t)
Nx,Ny,D,η,g,V,P,γ,σ=p

X= @view u[:,:,1]
Y= @view u[:,:,2]
dX= @view du[:,:,1]
dY= @view du[:,:,2]

@inbounds for xx in 1:Nx, yy in 1:Ny
xxprev=mod(xx-2,Nx)+1
xxnext=mod(xx,Nx)+1
yyprev=mod(yy-2,Ny)+1
yynext=mod(yy,Ny)+1
dens=X[yy,xx]^2+Y[yy,xx]^2
localcoh=V[yy,xx]+g*dens+4*D
localincoh=-η[yy,xx]*(V[yy,xx]+g*dens+4*D)+0.5*(P[yy,xx]-γ[yy,xx])-σ[yy,xx]*dens

dX[yy,xx]=localincoh*X[yy,xx]+localcoh*Y[yy,xx]+η[yy,xx]*D*(X[yyprev,xx]+X[yynext,xx]+X[yy,xxprev]+X[yy,xxnext])-D*(Y[yyprev,xx]+Y[yynext,xx]+Y[yy,xxprev]+Y[yy,xxnext])
dY[yy,xx]=localincoh*Y[yy,xx]-localcoh*X[yy,xx]+η[yy,xx]*D*(Y[yyprev,xx]+Y[yynext,xx]+Y[yy,xxprev]+Y[yy,xxnext])+D*(X[yyprev,xx]+X[yynext,xx]+X[yy,xxprev]+X[yy,xxnext])
end

end

function g!(du,u,p,t)
Nx,Ny,D,η,g,V,P,γ,σ=p
@inbounds for dd in 1:2, xx in 1:Nx ,yy in 1:Ny
du[yy,xx,dd]=√(0.5*P[yy,xx]*D)
end
end

u0=1e-3*randn(Ny,Nx,2);

D=1/gridsize^2;
p=(Nx,Ny,D,η,g,V,P,γ,σ)

probstoch=SDEProblem(f!,g!,u0,(tstart,tfinal),p)
@time stochsol=solve(probstoch,DRI1NM(),saveat=tarray)


the error:

ERROR: LoadError: OutOfMemoryError()
Stacktrace:
[1] Array at ./boot.jl:408 [inlined]
[2] Array at ./boot.jl:416 [inlined]
[3] zeros at ./array.jl:525 [inlined]
[4] zeros at ./array.jl:521 [inlined]
[5] alg_cache(::DRI1NM, ::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::Array{Float64,3}, ::Array{Float64,3}, ::Nothing, ::Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}}, ::Array{Float64,3}, ::Array{Float64,3}, ::Nothing, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Array{Float64,3}, ::SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing}, ::Float64, ::Float64, ::Type{Val{true}}) at /home/wouter/.julia/packages/StochasticDiffEq/MzqBB/src/caches/srk_weak_caches.jl:642
[6] __init(::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::DRI1NM, ::Array{Any,1}, ::Array{Any,1}, ::Type{T} where T, ::Type{Val{true}}; saveat::StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_noise::Bool, save_on::Bool, save_start::Bool, save_end::Bool, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmax::Rational{Int64}, qmin::Rational{Int64}, qoldinit::Rational{Int64}, fullnormalize::Bool, failfactor::Int64, beta2::Rational{Int64}, beta1::Rational{Int64}, delta::Rational{Int64}, maxiters::Int64, dtmax::Float64, dtmin::Float64, internalnorm::typeof(DiffEqBase.ODE_DEFAULT_NORM), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::typeof(DiffEqBase.ODE_DEFAULT_UNSTABLE_CHECK), verbose::Bool, force_dtmin::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, initialize_integrator::Bool, seed::UInt64, alias_u0::Bool, alias_jumps::Bool, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .../.julia/packages/StochasticDiffEq/MzqBB/src/solve.jl:453
[7] #__solve#99 at .../.julia/packages/StochasticDiffEq/MzqBB/src/solve.jl:6 [inlined]
[8] #solve_call#56 at .../.julia/packages/DiffEqBase/FMY9y/src/solve.jl:61 [inlined]
[9] solve_up(::SDEProblem{Array{Float64,3},Tuple{Float64,Float64},true,Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}},Nothing,SDEFunction{true,typeof(f!),typeof(g!),UniformScaling{Bool},Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing,Nothing},typeof(g!),Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}},Nothing}, ::Nothing, ::Array{Float64,3}, ::Tuple{Int64,Int64,Float64,Array{Float64,2},Float64,Array{Float64,2},Array{Float64,2},Int64,Array{Float64,2}}, ::DRI1NM; kwargs::Base.Iterators.Pairs{Symbol,StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},Tuple{Symbol},NamedTuple{(:saveat,),Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}}}) at /home/wouter/.julia/packages/DiffEqBase/FMY9y/src/solve.jl:82
[10] #solve#57 at .../.julia/packages/DiffEqBase/FMY9y/src/solve.jl:70 [inlined]
[11] top-level scope at ./timing.jl:174
in expression starting at .../...:113


You should use the adaptive weak integrators with EnsembleCPUArray (or EnsembleGPUArray) then.

Is it expected that such a weak Silver crashes on the single shot (outside of an ensembleproblem) with out-of-memory error? I was planning to replace the ensemble average by a time average.

Hmm, it’s because Ihat2 uses length(u)^2 memory for the pairwise 3-point random numbers. Can you open an issue on this? We can fix it.

2 Likes

Just a question, but I just found out about NeuralPDE.jl . I’m wondering if there is something like a rule of thumb on when these methods are expected to outperform the standardi ones of working on a fixed discrete grid like I was doing here?

Essentially never. I am writing a blog post on that soon. Essentially, every result where people have said they have done better has been compromised by using DiffEq out of the box and performing about 10,000x faster than their classical method implementation, so it always wins out of the box by more than 100x. The advantage of NeuralPDE.jl is that it’s easy to apply it to any possible problem. So if you have GPUs and just want the answer, it’s a good strategy for any possible PDE, including integro-differential equations, fractional derivatives, etc. I’m not sure of another method that has that kind of ubiquitousness.

1 Like

If almost every operation needs to be dotted, with only a few exceptions, you can opt out of @. for an individual function call by escaping it with $. Unfortunately, Julia can’t parse x$* y as a binary operation, so you have to write it as an ordinary function call ($*)(x, y) in order to escape the $.

That is, you could write:

@. du.= 1im * (1 - 1im*η) * (($*)(NBmat,u) - 4*u)/gridsize^2 +((-1im)*V + α - (σ+1im*g) * abs2(u)) * u  to “dot” everything but the matrix-vector multiplication. (There are probably other ways to speed this up. e.g. if η is real, you could write complex(η, 1) instead of 1im * (1 - 1im*η). You could also pull the division /gridsize^2 out of the loop by writing *$(1/gridsize^2) instead. But maybe the dominant cost is the matrix–vector multiplication?)

3 Likes

BTW, this fix was merged and tagged:

1 Like