I am trying to solve a complex Ginzburg-Landau equation in Julia, defined by
function CGLfun!(du, u, p, t)
    NBmat, gridsize, η, g, V, α, σ = p
    du .= 1im * (1 .- 1im * η) .* (NBmat * u .- 4 * u) / gridsize^2 +
          (-(1im) .* V + α .- (σ .+ 1im * g) .* abs2.(u)) .* u
end
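Written out, this is the discretized form of

$$\partial_t u = i(1 - i\eta)\,\nabla^2 u + \left(-iV + \alpha - (\sigma + i g)\,|u|^2\right) u,$$

with the Laplacian approximated by the usual 5-point stencil, (NBmat*u .- 4*u)/gridsize^2.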
This was my first, vectorized implementation, closely mirroring how I initially programmed it in MATLAB. Evolving it with Vern7() is maybe 20% faster than MATLAB's ode113 and takes about 600 s on my desktop to evolve t from 0 to 100.
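For completeness, the setup and solve call look roughly like this (the values below are placeholders standing in for my actual parameters and initial condition):

using DifferentialEquations, SparseArrays

Nx = 101; N = Nx^2
gridsize = 1.0                          # placeholder grid spacing
NBmat = neighbour_matrix(Nx)            # sparse neighbour coupling; one construction is sketched below
η = fill(0.1, N); V = zeros(N); α = fill(0.5, N); σ = fill(0.3, N)   # placeholder profiles
g = 1.0
u0 = 1e-2 .* randn(ComplexF64, N)       # placeholder initial condition
p = (NBmat, gridsize, η, g, V, α, σ)

prob = ODEProblem(CGLfun!, u0, (0.0, 100.0), p)
sol = solve(prob, Vern7())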
In this PDE, u actually lives on a grid (of N = Nx^2 = 101^2 points) as a complex field, but I have flattened it into a column vector for convenience. NBmat is an N×N sparse matrix that couples neighbouring sites, so the first term corresponds to the discretized Laplacian. η, V, α, σ are all spatially dependent (N entries each) and real-valued (incidentally, they are proportional to one another, so perhaps there is a better way to pass them in, but I don't think that is the issue here), while g is a constant.
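For concreteness, one way to build a matrix with this stencil (with the same reflecting boundaries as the devectorized version further below) is:

using SparseArrays, LinearAlgebra

# NBmat*u sums the four neighbours of every site, so
# (NBmat*u .- 4*u)/gridsize^2 is the 5-point discrete Laplacian.
function neighbour_matrix(Nx)
    L1 = spdiagm(-1 => ones(Nx-1), 1 => ones(Nx-1))   # 1D neighbour sums
    L1[1, 2] = 2.0       # mirror the missing neighbour at the boundary
    L1[Nx, Nx-1] = 2.0
    Id = sparse(I, Nx, Nx)
    return kron(Id, L1) + kron(L1, Id)   # N×N with N = Nx^2
end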
The official benchmarks of JuliaDiffEq mention roughly a 100x speedup compared to MATLAB, so in this sense I was a bit disappointed by the very modest gain, even though I am already using the in-place version.
From @ChrisRackauckas' recent FAQ, I found this interesting reference on optimization: Optimizing DiffEq Code. I saw from the Gierer-Meinhardt reaction-diffusion PDE example there that most of the improvement can be obtained by devectorizing, so I checked that same example on my machine.
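(The timings below come from a call along these lines, with prob the tutorial's in-place devectorized problem:)

using BenchmarkTools, DifferentialEquations

# prob: the devectorized Gierer-Meinhardt ODEProblem from the tutorial,
# here with tspan = (0.0, 0.1)
@benchmark solve($prob, Tsit5(), save_everystep=false)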
This is my output when evolving from 0 to 0.1 with save_everystep=false:
BenchmarkTools.Trial:
memory estimate: 2.90 MiB
allocs estimate: 62
--------------
minimum time: 4.835 ms (0.00% GC)
median time: 4.917 ms (0.00% GC)
mean time: 4.987 ms (1.13% GC)
maximum time: 7.015 ms (26.40% GC)
--------------
samples: 1002
evals/sample: 1
and from 0 to 1.0:
BenchmarkTools.Trial:
memory estimate: 2.90 MiB
allocs estimate: 62
--------------
minimum time: 44.611 ms (0.00% GC)
median time: 45.434 ms (0.00% GC)
mean time: 45.610 ms (0.11% GC)
maximum time: 47.961 ms (3.81% GC)
--------------
samples: 110
evals/sample: 1
so we see that the amount of memory allocated stays constant, independent of the integration time.
In close analogy to that example, I also rewrote my PDE in devectorized form. Note that I replaced the original complex vector u of length N with a real Nx × Nx × 2 array in which the real and imaginary parts are stored as separate slices (a conversion sketch follows the function below). In full, this gives
function CGLfunrealdevect!(du, u, p3, t)
    Nx, D, η, g, V, α, σ = p3
    X  = @view u[:, :, 1]    # real part
    P  = @view u[:, :, 2]    # imaginary part
    dX = @view du[:, :, 1]
    dP = @view du[:, :, 2]

    # interior points
    @inbounds for kk in 2:Nx-1, jj in 2:Nx-1
        dX[jj,kk] = D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) - D*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) + D*(X[jj-1,kk]+X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end

    # kk = 1 edge (reflecting boundary: the missing neighbour is mirrored)
    @inbounds for jj in 2:Nx-1
        kk = 1
        dX[jj,kk] = D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) - D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) + D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end

    # kk = Nx edge
    @inbounds for jj in 2:Nx-1
        kk = Nx
        dX[jj,kk] = D*η[jj,kk]*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) - D*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(P[jj-1,kk]+P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) + D*(X[jj-1,kk]+X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end

    # jj = 1 edge
    @inbounds for kk in 2:Nx-1
        jj = 1
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) - D*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj+1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) + D*(2*X[jj+1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end

    # jj = Nx edge
    @inbounds for kk in 2:Nx-1
        jj = Nx
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) - D*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj-1,kk]+P[jj,kk-1]+P[jj,kk+1]-4*P[jj,kk]) + D*(2*X[jj-1,kk]+X[jj,kk-1]+X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end

    # the four corners (both missing neighbours mirrored)
    @inbounds begin
        jj = 1; kk = 1
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) - D*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) + D*(2*X[jj+1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
        jj = 1; kk = Nx
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) - D*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj+1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) + D*(2*X[jj+1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
        jj = Nx; kk = 1
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) - D*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk+1]-4*P[jj,kk]) + D*(2*X[jj-1,kk]+2*X[jj,kk+1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
        jj = Nx; kk = Nx
        dX[jj,kk] = D*η[jj,kk]*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) - D*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) +
            α[jj,kk]*X[jj,kk] + V[jj,kk]*P[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*X[jj,kk]+g*P[jj,kk])
        dP[jj,kk] = D*η[jj,kk]*(2*P[jj-1,kk]+2*P[jj,kk-1]-4*P[jj,kk]) + D*(2*X[jj-1,kk]+2*X[jj,kk-1]-4*X[jj,kk]) +
            α[jj,kk]*P[jj,kk] - V[jj,kk]*X[jj,kk] + (X[jj,kk]^2+P[jj,kk]^2)*(-σ[jj,kk]*P[jj,kk]-g*X[jj,kk])
    end
    return nothing
end
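For reference, the conversion between the complex vector and this real layout goes along these lines:

# complex length-N state u0  ->  real Nx×Nx×2 state
umat   = reshape(u0, Nx, Nx)
u0real = cat(real.(umat), imag.(umat); dims=3)   # [:,:,1] = X, [:,:,2] = P
η2     = reshape(η, Nx, Nx)                      # η, V, α, σ likewise become Nx×Nx grids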
Benchmarking this version gives, for evolution over (0, 0.1):
BenchmarkTools.Trial:
memory estimate: 46.61 GiB
allocs estimate: 2857792769
--------------
minimum time: 41.576 s (7.85% GC)
median time: 41.576 s (7.85% GC)
mean time: 41.576 s (7.85% GC)
maximum time: 41.576 s (7.85% GC)
--------------
samples: 1
evals/sample: 1
and for evolution over (0, 1.0):
BenchmarkTools.Trial:
memory estimate: 450.45 GiB
allocs estimate: 27623039639
--------------
minimum time: 387.538 s (8.10% GC)
median time: 387.538 s (8.10% GC)
mean time: 387.538 s (8.10% GC)
maximum time: 387.538 s (8.10% GC)
--------------
samples: 1
evals/sample: 1
So in this case there is clearly a huge number of allocations, and, in contrast to the example PDE, the memory allocated grows with the integration time.
Any idea what I missed and how I can achieve a speedup similar to the example's?
What I already tried, in an attempt to reproduce the 'slow' behaviour in the Gierer-Meinhardt example:
* changing the solver algorithm from Tsit5() to Vern7() as well
* making α there a spatially dependent grid as well
* putting in some @view to rename the (:,:,1) and (:,:,2) slices
* putting a ; at the end of expressions
* writing the multiplication signs between numbers and constants implicitly (e.g. 2X instead of 2*X)
* putting random pieces of the expressions in parentheses
* inserting a perturbation of the form 1e-10*(u[i,j,1]^2+u[i,j,2]^2)*u[i,j,1]
* changing indentation

but the memory efficiency of that example seemed robust against all of these things... (one more diagnostic I can run is sketched below)
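In case it is useful, the right-hand side can be benchmarked in isolation along these lines (u0real and p3 being my actual state and parameters):

using BenchmarkTools

du0 = similar(u0real)
# a single in-place RHS evaluation should report (close to) zero allocations;
# if it does not, something inside the function itself is allocating
@btime CGLfunrealdevect!($du0, $u0real, $p3, 0.0)
@code_warntype CGLfunrealdevect!(du0, u0real, p3, 0.0)   # check for type instabilities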