Speed up the solution of a system of ODEs

I’m working on a mathematical model developed in Julia. I have the same model implemented in Mathematica and, strangely, solving the system of ODEs takes more time in Julia than in Mathematica (around 60 s vs 24 s).

As it’s my first model in Julia, I think I might have built it using some inefficient implementation patterns.

Here is a sample of my code:

#Some data from spreadsheet
tabs=(Dates.value.(data[:,1])-Dates.value(data[1,1])*ones(size(data)[1]))*24*60*60; #Time (s)
P=Interpolations.LinearInterpolation(tabs,data[:,13]*0.001/(24*60*60)); #Rainfall (m/s)
Qin=Interpolations.LinearInterpolation(tabs,data[:,2]*1/(24*60*60)); #Influent water flow rate (m3/s)
ET0=Interpolations.LinearInterpolation(tabs,data[:,14]*0.001/(24*60*60)); #Reference evapotranspiration (m/s)
T=Interpolations.LinearInterpolation(tabs,data[:,3]+273.15*ones(size(data)[1])); #Average temperature (K)
fp=Interpolations.LinearInterpolation(tabs,data[:,12]/24); #Photoperiod
I0=Interpolations.LinearInterpolation(tabs,data[:,9]*1000000/(24*60*60)); #Average incident daylight intensity (W/m2)
w=Interpolations.LinearInterpolation(tabs,float.(data[:,10])); #Average wind speed (m/s)
wmax=Interpolations.LinearInterpolation(tabs,float.(data[:,11])); #Maximum wind speed (m/s)

#Some numerical parameters
g=9.8; #Standard gravity of Earth (N/kg)
hs=0.45; #Piezometric head at the outlet (m)
d=(200-2*4.9)*10^(-3); #Outlet's pipe diameter (m)
e=0.0015*10^(-3); #Pipe's absolute roughness (m)
f=0.015; #Outlet pipe's friction factor
L=6.00; #Outlet's pipe length (m)
KL=0.75; #Outlet's pipe kinetic energy factor due to minor losses at the elbow
n=3; #Number of pipes at the outlet
S0=100000; #Reference wetland area (m2)
h0=0.50; #Reference wetland depth (m)
p=4.55; #Shape parameter
Kg=10.0*0.001/(24*60*60); #Soil's hydraulic conductivity constant (m/s)
Kc=1.2; #Vegetation coefficient
F=80; #Fetch (m)
RF=1:10; #Resuspension factors for different groups of birds
Ne=[15,50,0,300,0,40+50,40,0,0,0]; #Number of birds of each type
Xin=[1,0,3.1,3.1,12,12,0.007,0.001,0.001,12,12,0,0,18.3,1,4.5]*1000/1000000; #Input concentrations (kg/m3)
X0=[1,1800,3.1,3.1,12,12,0.007,0.001,0.001,12,12,4505.4,0.005,18.3,1,4.5]*1000/1000000; #Initial conditions

#Some functions
w0(t)=117.1267*h(t)/(pi*tanh(pi))*(g/F)^0.5; #Wind speed limit for resuspension (m/s)
Fd(X)=1/(1+Kd*X[14]); #DIP/TIP
FPNH4(X)=X[3]*X[4]/((KmN+X[3])*(KmN+X[4]))+X[3]*KmN/((X[3]+X[4])*(KmN+X[4])); #Preference between N compounds
fphs(X)=1/12*(32+48*(1-FPNH4(X))); #Photosynthesis stoichiometry coefficient
VC(X)=(X[12]+X[13])/(KVC+X[12]+X[13]); #Vegetation covering
GL(X,t)=2.718*fp(t)*(1-VC(X))/((Ke0+aXp*X[7])*h(t))*(exp(-I0(t)/Is*exp(-(Ke0+aXp*X[7])*h(t)))-exp(-I0(t)/Is));
IGS(t)=max(sin(2*pi*t/tabs[end])*(sin(2*pi*t/tabs[end])>0),0.2); #Vegetation growth function
IGD(t)=max(-sin(2*pi*t/tabs[end])*(-sin(2*pi*t/tabs[end])>0),0.2); #Vegetation degradation function
DOsat(t)=0.475/(T(t)-239.65); #Dissolved oxygen saturation concentration (kg/m3)
Rw(t)=Ew*(tauw(t)/tauwcr(t)-1)*S(h(t))/V(h(t))*(tauw(t)>tauwcr(t)); #Resuspension by wind (kg/m3/s)
Xg(X)=X.*[Fd(X),0,1,1,1,1,0,1,1,0,0,0,0,0,0,1]; #Output concentrations to groundwater (kg/m3)
Xw(X)=X.*[1,0,1,1,1,1,1,1,1,1,1,0,0,1,1,1]; #Output concentrations (kg/m3)
psi1(X,t)=Kh*thetakh^(T(t)-293.15)*X[10]/(Kx*X[8]+X[10])*X[16]/(KhydDO+X[16])*X[8]; #Hydrolysis
psi2(X,t)=Kh*Oh*thetakh^(T(t)-293.15)*X[10]/(Kx*X[8]+X[10])*KhydDO/(KhydDO+X[16])*X[4]/(KNOh+X[4])*X[8]; #Anoxic hydrolysis
psi3(X,t)=muH*thetamuH^(T(t)-293.15)*X[5]/(Ks+X[5])*X[16]/(KOH+X[16])*X[3]/(KNHH+X[3])*Fd(X)*X[1]/(KPH+Fd(X)*X[1])*(1-(X[8]+X[9])/Kbiomax)*X[8]; #Aerobic growth of XH
psi4(X,t)=etaNO3*muH*thetamuH^(T(t)-293.15)*X[5]/(Ks+X[5])*KOH/(KOH+X[16])*X[4]/(KOH+X[4])*X[3]/(KNHH+X[3])*Fd(X)*X[1]/(KPH+Fd(X)*X[1])*(1-(X[8]+X[9])/Kbiomax)*X[8]; #Anoxic growth of XH
psi5(X,t)=bH*thetabH^(T(t)-293.15)*X[8]; #Lysis of XH
psi6(X,t)=muA*thetamuA^(T(t)-293.15)*X[3]/(KNHA+X[3])*X[16]/(KOA+X[16])*Fd(X)*X[1]/(KPA+Fd(X)*X[1])*(1-(X[8]+X[9])/Kbiomax)*X[9]; #Growth of XA
psi7(X,t)=bA*thetabA^(T(t)-293.15)*X[9]; #Lysis of XA
psi8(X,t)=1/h(t)*Kpl*thetaupl^(T(t)-293.15)*X[3]/(KNHP+X[3])*Fd(X)*X[1]/(KPP+Fd(X)*X[1])*(1-X[12]/KXml)*X[12]*IGS(t); #NH4 uptake by Xml
psi9(X,t)=1/h(t)*Kpl*thetaupl^(T(t)-293.15)*X[4]/(KNOP+X[4])*KNHP/(KNHP+X[3])*Fd(X)*X[1]/(KPP+Fd(X)*X[1] )*(1-X[12]/KXml)*X[12]*IGS(t); #NO3 uptake by Xml
psi10(X,t)=1/h(t)*Kdeg*thetadeg^(T(t)-293.15)*X[13]; #Xmd degradation
psi11(X,t)=S(h(t))/(0.1*V(h(t)))*DNH4*thetaNH4^(T(t)-293.15)*(NH4sed-X[3]); #Diffusion of NH4
psi12(X,t)=S(h(t))/(0.1*V(h(t)))*DNO3*thetaNO3^(T(t)-293.15)*(NO3sed-X[4]); #Diffusion of NO3
psi13(X,t)=vsx/h(t)*(1+Kvegsed*VC(X))*X[10]; #Sedimentation of Xs
psi14(X,t)=vsx/h(t)*(1+Kvegsed*VC(X))*X[11]; #Sedimentation of Xi
psi15(X,t)=Pmaxuptake*(Pmax-X[2])/(Pmax-Pmin)*(Fd(X)*X[1])/(kDIPup+Fd(X)*X[1]); #Phosphorus uptake by Xp
psi16(X,t)=Gmax*thetaG^(T(t)-293.15)*(X[2]-Pmin)/(Pmax-Pmin)*X[7]*GL(X,t); #Growth of Xp
psi17(X,t)=Kr*thetar^(T(t)-293.15)*X[7]; #Decay of Xp
psi18(X,t)=Kr*thetar^(T(t)-293.15)*X[2]; #Lysis of Pint
psi19(X,t)=Kresp*thetaresp^(T(t)-293.15)*X[16]/(KDOXp+X[16])*X[7]; #Respiration of Xp
psi20(X,t)=vsXp/h(t)*X[7]; #Sedimentation of Xp
psi21(X,t)=vsTIP/h(t)*(1-Fd(X))*(1+Kvegsed*VC(X))*X[1]; #Sedimentation of PIP
psi22(X,t)=DDIP*Kdif*thetadif^(T(t)-293.15)*(Psed-X[1]*Fd(X))*1/(0.1*h(t)); #Diffusion of DIP
psi23(X,t)=vsXTSS/h(t)*(1+Kvegsed*VC(X))*X[14]; #Sedimentation of vsXTSS
psi24(X,t)=Kavi*LinearAlgebra.dot(Ne,RF)/S(h(t))*(1-Kvegres*VC(X)); #Resuspension by avifauna
psi25(X,t)=KminOP*thetaminOP^(T(t)-293.15)*X[16]/(X[16]+kDO)*X[15]; #Mineralization of OP
psi26(X,t)=vsOP/h(t)*fPOP*(1+Kvegsed*VC(X))*X[15]; #Sedimentation of OP
psi27(X,t)=Rw(t)*(1-Kvegres*VC(X)); #Resuspension of TSS
psi28(X,t)=Rw(t)*(1-Kvegres*VC(X))*iPsed*Fpr; #Resuspension of OP
psi29(X,t)=Rw(t)*(1-Kvegres*VC(X))*iPsed*(1-Fpr); #Resuspension of PIP
psi30(X,t)=krea(t)*(DOsat(t)-X[16]); #Reaeration
psi31(X,t)=KphXp*thetaG^(T(t)-293.15)*X[7]; #Photosynthesis of Xp
psi32(X,t)=KphXml*thetaupl^(T(t)-293.15)*X[12]; #Photosynthesis of Xml
psi33(X,t)=Kmlmd*thetamlmd^(T(t)-293.15)*X[12]; #Degradation of Xml

#Processes
psi(X,t)=[psi1(X,t),psi2(X,t),psi3(X,t),psi4(X,t),psi5(X,t),psi6(X,t),psi7(X,t),psi8(X,t),psi9(X,t),psi10(X,t),psi11(X,t),psi12(X,t),psi13(X,t),psi14(X,t),psi15(X,t),psi16(X,t),psi17(X,t),psi18(X,t),psi19(X,t),psi20(X,t),psi21(X,t),psi22(X,t),psi23(X,t),psi24(X,t),psi25(X,t),psi26(X,t),psi27(X,t),psi28(X,t),psi29(X,t),psi30(X,t),psi31(X,t),psi32(X,t),psi33(X,t)];

#Stoichiometry matrix
SM(X)=LinearAlgebra.Transpose([0 0 U1NH4 0 1-fhydSi fhydSi 0 0 0 -1 0 0 0 0 U1TIP -1;
                                    U2TIP 0 U2NH4 0 1-fhydSi fhydSi 0 0 0 -1 0 0 0 0 0 0;
                                    -iPbm+iPSs/YH 0 -iNbm+iNSs/YH 0 -1/YH 0 0 1 0 0 0 0 0 0 0 -1/YH+1;
                                    -iPbm+iPSs/YH 0 -iNbm+iNSs/YH -(1-YH)/(2.86*YH) -1/YH 0 0 1 0 0 0 0 0 0 0 0;
                                    U5TIP 0 U5NH4 0 fbmSs 0 0 -1 0 U5Xs fbmXi 0 0 0 0 0;
                                    -iPbm 0 -iNbm-1/YA 1/YA 0 0 0 0 1 0 0 0 0 0 0 -4.57/YA+1;
                                    U7TIP 0 U7NH4 0 fbmSs 0 0 0 -1 U7Xs fbmXi 0 0 0 0 0;
                                    -iPXm 0 -iNXm 0 0 0 0 0 0 0 0 1 0 0 0 0;
                                    -iPXm 0 0 -iNXm 0 0 0 0 0 0 0 1 0 0 0 0;
                                    U10TIP 0 U10NH4 0 0 0 0 0 0 1-fpl fpl 0 -1 0 0 0;
                                    0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0;
                                    0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0;
                                    -X[7] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 -iPXp/X[7] -iNXp*FPNH4(X) -iNXp*(1-FPNH4(X)) 0 0 1 0 0 0 0 0 0 iTSSXp 0 0;
                                    U17TIP 0 U17NH4 0 fXpSs*iCODXp 0 -1 0 0 U17Xs fXpXi*iCODXp 0 0 -iTSSXp iPXp*Fop 0;
                                    X[7] -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 -iCODXp;
                                    0 0 0 0 0 0 -1 0 0 0 0 0 0 -iTSSXp 0 0;
                                    -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0;
                                    0 0 0 0 0 0 0 0 0 iCODsed 0 0 0 1 iPsed*Fpr 0;
                                    1 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 -fPDO;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0;
                                    1-Fd(X) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1;
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fphs(X);
                                    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 fphs(X);
                                    0 0 0 0 0 0 0 0 0 0 0 -1 1 0 0 0]);

#Solver (BDF)
dX(X,p,t)=Qin(t)/V(h(t))*(Xin-Xw(X))+QET(h(t),t)/V(h(t))*Xw(X)-QP(t)/V(h(t))*Xw(X)-Qg(h(t))/V(h(t))*(Xg(X)-Xw(X))+SM(X)*psi(X,t);
@time X=DifferentialEquations.solve(DifferentialEquations.ODEProblem(dX,X0,(0,tabs[end])),DifferentialEquations.TRBDF2());

Any suggestions on how I could optimize my code to make it faster?

Thanks in advance for any advice!

Let me start by saying, you haven’t followed any of the performance guidelines, so there are fantastic speedups ahead! That said, oh boy, there’s a lot to do here. I would suggest reading the “Optimizing DiffEq Code” tutorial:

http://juliadiffeq.org/DiffEqTutorials.jl/html/introduction/optimizing_diffeq_code.html

Notice that your main issues are:

  1. Excessive use of globals
  2. Allocations
  3. Not using the in-place (du,u,p,t) format
  4. Why the choice of TRBDF2? Maybe Rodas5() would be a better choice here.

I can help you out more directly when you get closer, but right now take a stab at it with these thoughts in mind, and I’ll help you get the last bits of performance out.

Note that one very simple thing to do would be to declare all of those parameters (tabs through Xin) as const. That would give you type stability. Quite honestly I’d re-write a lot more for clarity as well, but at least that would make it a ton faster.
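For illustration, a minimal sketch of what that looks like (just two of the parameters; the same applies to the rest):

const g=9.8; #With const, the compiler knows g is always a Float64 at every use site
const hs=0.45; #A non-constant global forces a type-unstable lookup inside dX on every call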


Thank you for your answer!

I’ve been carefully reading your tutorial, especially the part dedicated to small systems, as my model has 16 ODEs. I’ve also paid attention to your 4 main points. However, I’m struggling to make the code run faster.

  1. First of all, I’ve tried declaring all my global variables as const at the beginning of every line, but this hasn’t had any noticeable impact on performance.
  2. About allocations, I’m not sure whether they come from the way I’ve defined the functions. It would probably be faster to define them in other ways.
  3. I’ve also tried using the in-place format for the definition of the final function to be integrated. This has been the most promising change so far.
  4. I made the choice of TRBDF2 following the documentation of the package you developed, DifferentialEquations.jl, combined with a “trial and error” approach, and it has been the fastest given the problem conditions: a stiff and nonlinear system.

Going back to the in-place format, I add the code that I used:

function balance(dX,X,p,t)
    dX=Qin(t)/V(h(t))*(Xin-Xw(X))+QET(h(t),t)/V(h(t))*Xw(X)-QP(t)/V(h(t))*Xw(X)-Qg(h(t))/V(h(t))*(Xg(X)-Xw(X))+SM(X)*psi(X,t);
end

@time X=DifferentialEquations.solve(DifferentialEquations.ODEProblem(balance,X0,(0,tabs[end])),DifferentialEquations.TRBDF2());

The execution time has dramatically decreased after this change, from around 60 s to around 0.5 s, which is stunning. However, I think I’ve made a mistake and the solver is not reading the function correctly, as it now returns a constant solution for all variables.

Finally, I also tried StaticArrays, but I haven’t been successful so far.

So all in all, I’d say that using the in-place format, as you suggest in the 3rd point, can be the first important change to optimize my code without changing the main structure too much, at least for the moment.

Thank you again @ChrisRackauckas! The tutorial, the documentation and the package as a whole that you have developed are incredibly useful!


That’s not in-place since it’s creating a new array instead of changing the values in dX. You’ll want to do .=, but here, you’re already creating the new arrays before changing the values so you won’t get the speedup. You’ll want to re-work the implementation a bit more.
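To illustrate the pattern (a toy sketch, not your model; `a` here is a made-up parameter):

function rhs!(dX,X,p,t)
    a=p;
    dX[1]=-a*X[1]; #Write into the preallocated dX instead of building a new array
    dX[2]=a*X[1]-X[2];
    return nothing; #The solver only looks at the mutated dX, not the return value
end

Inside the function, dX = ... merely rebinds the local name to a freshly allocated array, while dX .= ... (or indexed assignment as above) writes into the array the solver passed in.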

Do you have a minimal working example (MWE) that doesn’t include data that isn’t shared? I can’t help more directly if I can’t run it.


Ok, I’ve been developing a working example, based on the complete model. The code is below:

import Interpolations
import LinearAlgebra
import DifferentialEquations

#Numerical parameters
g=9.8;
S0=100000;
h0=0.5;
p=4.55;
hs=0.45;
d=(200-2*4.9)*10^(-3);
f=0.015;
L=6;
KL=0.75;
n=3;
Kg=10*0.001/(24*60*60);
Kc=1.2;
iPXp=1.04;
Fop=0.8;
iTSSXp=68.49;
iPsed=8.337*10^(-4);
Fpr=0.5;
KminOP=0.22*1/(24*60*60);
thetaminOP=1.08;
kDO=2*1000/1000000;
Pmaxuptake=1.28*1/(24*60*60);
Pmax=2.08;
Pmin=0.52;
kDIPup=0.0005*1000/1000000;
Kd=(0.07+0.19)/2*1000000/1000;
Gmax=2*1/(24*60*60);
thetaG=1.068;
Kr=0.1*1/(24*60*60);
thetar=1.02;
Kresp=0.1*1/(24*60*60);
thetaresp=1.045;
KDOXp=0.2*1000/1000000;
vsXp=0.07*1/(24*60*60);
vsTIP=0.04*1/(24*60*60);
Kvegsed=1;
vsOP=0.07*1/(24*60*60);
fPOP=0.34;
vsXTSS=0.07*1/(24*60*60);
Kavi=1081*1000/(1000000*24*60*60);
Kvegres=1;
RF=1:10;
Ne=[15,50,0,300,0,40+50,40,0,0,0];
DDIP=6.34*10^(-5)*1/(24*60*60);
Kdif=0.4;
thetadif=1.0234;
Psed=2.96*1000/1000000;
Ke0=0.1;
aXp=8.8*1000000/1000;
Is=150*41840/(24*60*60);
DO=4.5*1000/1000000;
VC=0.9;
Xin=[0.001,0.,0.001,7*10^-6,0.0183];
X0=[0.001,1.8,0.001,7*10^-6,0.0183];

#Interpolation functions
tabs=(0:52)*7*24*60*60;
Pdata=[0.2,0.1,0.2,1.6,1.3,2.5,0.2,1.3,0.1,0.2,0.,0.5,5.5,7.8,4.8,0.1,4.2,1.9,0.,0.9,3.9,0.,0.,2.3,0.,0.,0.2,0.,0.,0.,0.9,0.,0.,0.,0.,4.4,2.2,0.9,1.5,0.,0.,0.9,0.,0.,0.,6.2,0.1,0.1,0.1,2.5,0.2,2.6,0.1]*0.001/(24*60*60);
Qindata=[4964,5239,5370,5679,5148,5125,4400,4365,4757,4965,4923,4982,5760,5519,5194,5069,5067,5323,5374,5450,5377,5097,5595,5304,5433,4841,5284,5397,5343,5313,5557,5528,5292,5203,5162,4401,5257,5418,5393,5409,5220,5461,5725,5618,5507,5379,5312,5246,5222,5187,5280,5207,5084]*1/(24*60*60);
ET0data=[0.8,0.6,0.8,0.8,0.8,1.5,2.3,1.6,2.9,3.2,3.7,3.,2.2,1.9,2.9,4.7,3.5,4.,5.9,5.3,4.3,6.1,6.3,4.9,7.1,7.5,7.6,8.,7.6,8.1,6.5,7.,6.5,7.,5.8,4.6,3.8,3.7,3.,3.,2.7,2.4,2.1,1.9,1.5,1.5,1.,0.8,0.9,0.8,0.8,0.6,0.7]*0.001/(24*60*60);
Tdata=[278.,277.,277.5,274.3,278.5,281.7,282.6,281.2,283.5,282.3,282.8,280.,281.3,280.1,284.7,287.7,286.,285.5,291.2,292.1,287.9,292.,294.8,292.,294.9,297.5,298.2,299.4,296.6,301.5,298.9,297.2,295.9,297.8,296.3,294.8,293.1,291.1,289.9,289.2,287.8,286.8,283.,282.8,281.,279.9,278.2,278.8,277.6,275.6,279.2,276.3,280.5];
fpdata=[0.3,0.29,0.32,0.31,0.3,0.31,0.38,0.35,0.42,0.42,0.45,0.45,0.42,0.41,0.44,0.49,0.46,0.5,0.54,0.52,0.51,0.56,0.55,0.53,0.57,0.57,0.55,0.56,0.55,0.54,0.52,0.53,0.52,0.48,0.49,0.45,0.44,0.45,0.41,0.43,0.4,0.36,0.39,0.38,0.38,0.3,0.35,0.34,0.33,0.25,0.31,0.31,0.27];
I0data=[96.38,75.74,88.14,82.61,85.98,99.6,141.98,109.85,206.,205.04,239.22,228.06,152.1,149.69,205.32,270.14,232.66,278.14,348.2,302.56,266.62,356.8,327.33,254.88,376.49,364.34,361.56,357.11,361.62,349.65,309.31,327.12,325.43,264.96,290.66,226.31,232.21,236.33,168.8,219.73,178.54,161.64,173.63,164.96,143.72,109.89,132.06,115.71,95.59,64.48,78.39,83.48,57.89];
P=Interpolations.LinearInterpolation(tabs,Pdata);
Qin=Interpolations.LinearInterpolation(tabs,Qindata);
ET0=Interpolations.LinearInterpolation(tabs,ET0data);
fp=Interpolations.LinearInterpolation(tabs,fpdata);
T=Interpolations.LinearInterpolation(tabs,Tdata);
I0=Interpolations.LinearInterpolation(tabs,I0data);


#Functions
S(h)=S0*(h/h0*(h>hs))^(2/p);
V(h)=p/(2+p)*S0*h*(h/h0*(h>hs))^(2/p);
Qout(h)=n*pi*d^2*(g*(h-hs)*(h>hs)/(8*(1+KL+f*L/d)))^0.5;
Qg(h)=Kg*S(h)*(h>0);
QP(t)=P(t)*S0;
QET(h,t)=Kc*ET0(t)*S(h)*(h>0);
Fd(X)=1/(1+Kd*X[5]);
GL(X,t)=2.718*fp(t)*(1-VC)/((Ke0+aXp*X[4])*h(t))*(exp(-I0(t)/Is*exp(-(Ke0+aXp*X[4])*h(t)))-exp(-I0(t)/Is));
SM(X)=LinearAlgebra.Transpose([1 0 -1 0 0 ; -X[4] 1 0 0 0 ; 0 -iPXp/X[4] 0 1 iTSSXp ; iPXp*(1-Fop) 0 iPXp*Fop -1 -iTSSXp ; X[4] -1 0 0 0 ; iPXp*(1-Fop) 0 iPXp*Fop -1 -iTSSXp ; 0 0 0 -1 -iTSSXp ; -1 0 0 0 0 ; 0 0 -1 0 0 ; 0 0 0 0 -1 ; iPsed*(1-Fpr) 0 iPsed*Fpr 0 1 ; 1 0 0 0 0]);
psi1(X,t)=KminOP*thetaminOP^(T(t)-293.15)*DO/(DO+kDO)*X[3];
psi2(X,t)=Pmaxuptake*(Pmax-X[2])/(Pmax-Pmin)*(Fd(X)*X[1])/(kDIPup+Fd(X)*X[1]);
psi3(X,t)=Gmax*thetaG^(T(t)-293.15)*(X[2]-Pmin)/(Pmax-Pmin)*X[4]*GL(X,t);
psi4(X,t)=Kr*thetar^(T(t)-293.15)*X[4];
psi5(X,t)=Kr*thetar^(T(t)-293.15)*X[2];
psi6(X,t)=Kresp*thetaresp^(T(t)-293.15)*DO/(KDOXp+DO)*X[4];
psi7(X,t)=vsXp/h(t)*X[4];
psi8(X,t)=vsTIP/h(t)*(1-Fd(X))*(1+Kvegsed*VC)*X[1];
psi9(X,t)=vsOP/h(t)*fPOP*(1+Kvegsed*VC)*X[3];
psi10(X,t)=vsXTSS/h(t)*(1+Kvegsed*VC)*X[5];
psi11(X,t)=Kavi*LinearAlgebra.dot(Ne,RF)/S(h(t))*(1-Kvegres*VC);
psi12(X,t)=DDIP*Kdif*thetadif^(T(t)-293.15)*(Psed-X[1]*Fd(X))*1/(0.1*h(t));
psi(X,t)=[psi1(X,t),psi2(X,t),psi3(X,t),psi4(X,t),psi5(X,t),psi6(X,t),psi7(X,t),psi8(X,t),psi9(X,t),psi10(X,t),psi11(X,t),psi12(X,t)];
Xg(X)=[Fd(X)*X[1],0.0,0.0,0.0,0.0];
Xw(X)=X.*[1,0,1,1,1];
dh(h,p,t)=1/S(h)*(Qin(t)+QP(t)-Qg(h)-Qout(h)-QET(h,t));
dX(X,p,t)=Qin(t)/V(h(t))*(Xin-Xw(X))+QET(h(t),t)/V(h(t))*Xw(X)-QP(t)/V(h(t))*Xw(X)-Qg(h(t))/V(h(t))*(Xg(X)-Xw(X))+SM(X)*psi(X,t);

#Solver (Adams)
@time h=DifferentialEquations.solve(DifferentialEquations.ODEProblem(dh,h0,(0.0,tabs[end])),DifferentialEquations.VCABM3());
@time X=DifferentialEquations.solve(DifferentialEquations.ODEProblem(dX,X0,(0.0,tabs[end])),DifferentialEquations.VCABM3());

As you can see, it’s in a MATLAB-like style. The first differential equation takes around 2 ms to solve and the second system around 120 ms. Could you help me more directly with this example?

Thank you very much in advance!

Here is a faster version of the code; the key changes are to take everything out of the global scope (increases the speed by a factor of two) and to use StaticArrays for the linear algebra (increases the speed by a further factor of 12).

On my laptop the original code takes 25ms whereas the new code takes 1ms.

You should also hoist some/all of the parameters out into a separate struct for flexibility, and possibly a little extra speed. (Though you might need to be careful with parameterising the struct appropriately, particularly with the interpolation functions - it might be easier to use a named tuple for the parameters.)

Note that I haven’t checked the correctness of the results!

using OrdinaryDiffEq
using StaticArrays
using Interpolations
using LinearAlgebra
using BenchmarkTools


function dU(U, p, t)

	h = U[1]
	X = U[2:end]

	#Numerical parameters
	g=9.8;
	S0=100000;
	h0=0.5;
	p=4.55;
	hs=0.45;
	d=(200-2*4.9)*10^(-3);
	f=0.015;
	L=6;
	KL=0.75;
	n=3;
	Kg=10*0.001/(24*60*60);
	Kc=1.2;
	iPXp=1.04;
	Fop=0.8;
	iTSSXp=68.49;
	iPsed=8.337*10^(-4);
	Fpr=0.5;
	KminOP=0.22*1/(24*60*60);
	thetaminOP=1.08;
	kDO=2*1000/1000000;
	Pmaxuptake=1.28*1/(24*60*60);
	Pmax=2.08;
	Pmin=0.52;
	kDIPup=0.0005*1000/1000000;
	Kd=(0.07+0.19)/2*1000000/1000;
	Gmax=2*1/(24*60*60);
	thetaG=1.068;
	Kr=0.1*1/(24*60*60);
	thetar=1.02;
	Kresp=0.1*1/(24*60*60);
	thetaresp=1.045;
	KDOXp=0.2*1000/1000000;
	vsXp=0.07*1/(24*60*60);
	vsTIP=0.04*1/(24*60*60);
	Kvegsed=1;
	vsOP=0.07*1/(24*60*60);
	fPOP=0.34;
	vsXTSS=0.07*1/(24*60*60);
	Kavi=1081*1000/(1000000*24*60*60);
	Kvegres=1;
	RF=1:10;
	Ne=@SVector [15,50,0,300,0,40+50,40,0,0,0];
	DDIP=6.34*10^(-5)*1/(24*60*60);
	Kdif=0.4;
	thetadif=1.0234;
	Psed=2.96*1000/1000000;
	Ke0=0.1;
	aXp=8.8*1000000/1000;
	Is=150*41840/(24*60*60);
	DO=4.5*1000/1000000;
	VC=0.9;
	Xin=@SVector [0.001,0.,0.001,7*10^-6,0.0183];
	X0=@SVector [0.001,1.8,0.001,7*10^-6,0.0183];

	#Interpolation functions
	tabs=(0:52)*7*24*60*60;
	Pdata=[0.2,0.1,0.2,1.6,1.3,2.5,0.2,1.3,0.1,0.2,0.,0.5,5.5,7.8,4.8,0.1,4.2,1.9,0.,0.9,3.9,0.,0.,2.3,0.,0.,0.2,0.,0.,0.,0.9,0.,0.,0.,0.,4.4,2.2,0.9,1.5,0.,0.,0.9,0.,0.,0.,6.2,0.1,0.1,0.1,2.5,0.2,2.6,0.1]*0.001/(24*60*60);
	Qindata=[4964,5239,5370,5679,5148,5125,4400,4365,4757,4965,4923,4982,5760,5519,5194,5069,5067,5323,5374,5450,5377,5097,5595,5304,5433,4841,5284,5397,5343,5313,5557,5528,5292,5203,5162,4401,5257,5418,5393,5409,5220,5461,5725,5618,5507,5379,5312,5246,5222,5187,5280,5207,5084]*1/(24*60*60);
	ET0data=[0.8,0.6,0.8,0.8,0.8,1.5,2.3,1.6,2.9,3.2,3.7,3.,2.2,1.9,2.9,4.7,3.5,4.,5.9,5.3,4.3,6.1,6.3,4.9,7.1,7.5,7.6,8.,7.6,8.1,6.5,7.,6.5,7.,5.8,4.6,3.8,3.7,3.,3.,2.7,2.4,2.1,1.9,1.5,1.5,1.,0.8,0.9,0.8,0.8,0.6,0.7]*0.001/(24*60*60);
	Tdata=[278.,277.,277.5,274.3,278.5,281.7,282.6,281.2,283.5,282.3,282.8,280.,281.3,280.1,284.7,287.7,286.,285.5,291.2,292.1,287.9,292.,294.8,292.,294.9,297.5,298.2,299.4,296.6,301.5,298.9,297.2,295.9,297.8,296.3,294.8,293.1,291.1,289.9,289.2,287.8,286.8,283.,282.8,281.,279.9,278.2,278.8,277.6,275.6,279.2,276.3,280.5];
	fpdata=[0.3,0.29,0.32,0.31,0.3,0.31,0.38,0.35,0.42,0.42,0.45,0.45,0.42,0.41,0.44,0.49,0.46,0.5,0.54,0.52,0.51,0.56,0.55,0.53,0.57,0.57,0.55,0.56,0.55,0.54,0.52,0.53,0.52,0.48,0.49,0.45,0.44,0.45,0.41,0.43,0.4,0.36,0.39,0.38,0.38,0.3,0.35,0.34,0.33,0.25,0.31,0.31,0.27];
	I0data=[96.38,75.74,88.14,82.61,85.98,99.6,141.98,109.85,206.,205.04,239.22,228.06,152.1,149.69,205.32,270.14,232.66,278.14,348.2,302.56,266.62,356.8,327.33,254.88,376.49,364.34,361.56,357.11,361.62,349.65,309.31,327.12,325.43,264.96,290.66,226.31,232.21,236.33,168.8,219.73,178.54,161.64,173.63,164.96,143.72,109.89,132.06,115.71,95.59,64.48,78.39,83.48,57.89];
	P=Interpolations.LinearInterpolation(tabs,Pdata);
	Qin=Interpolations.LinearInterpolation(tabs,Qindata);
	ET0=Interpolations.LinearInterpolation(tabs,ET0data);
	fp=Interpolations.LinearInterpolation(tabs,fpdata);
	T=Interpolations.LinearInterpolation(tabs,Tdata);
	I0=Interpolations.LinearInterpolation(tabs,I0data);

	# Everything before here should be wrapped up inside a parameter structure

	Pₜ = P(t)
	Qinₜ = Qin(t)
	ET0ₜ = ET0(t)
	fpₜ = fp(t)
	Tₜ = T(t)
	I0ₜ = I0(t)

	S = S0*(h/h0*(h>hs))^(2/p);
	V = p/(2+p)*S0*h*(h/h0*(h>hs))^(2/p);
	Qout = n*pi*d^2*(g*(h-hs)*(h>hs)/(8*(1+KL+f*L/d)))^0.5;
	Qg = Kg*S*(h>0);
	QP = Pₜ*S0;
	QET = Kc*ET0ₜ*S*(h>0);
	Fd = 1/(1+Kd*X[5]);
	GL = 2.718*fpₜ*(1-VC)/((Ke0+aXp*X[4])*h)*(exp(-I0ₜ/Is*exp(-(Ke0+aXp*X[4])*h))-exp(-I0ₜ/Is));
	SM = (@SMatrix [1 0 -1 0 0 ; -X[4] 1 0 0 0 ; 0 -iPXp/X[4] 0 1 iTSSXp ; iPXp*(1-Fop) 0 iPXp*Fop -1 -iTSSXp ; X[4] -1 0 0 0 ; iPXp*(1-Fop) 0 iPXp*Fop -1 -iTSSXp ; 0 0 0 -1 -iTSSXp ; -1 0 0 0 0 ; 0 0 -1 0 0 ; 0 0 0 0 -1 ; iPsed*(1-Fpr) 0 iPsed*Fpr 0 1 ; 1 0 0 0 0])'
	psi1 = KminOP*thetaminOP^(Tₜ-293.15)*DO/(DO+kDO)*X[3];
	psi2 = Pmaxuptake*(Pmax-X[2])/(Pmax-Pmin)*(Fd*X[1])/(kDIPup+Fd*X[1]);
	psi3 = Gmax*thetaG^(Tₜ-293.15)*(X[2]-Pmin)/(Pmax-Pmin)*X[4]*GL;
	psi4 = Kr*thetar^(Tₜ-293.15)*X[4];
	psi5 = Kr*thetar^(Tₜ-293.15)*X[2];
	psi6 = Kresp*thetaresp^(Tₜ-293.15)*DO/(KDOXp+DO)*X[4];
	psi7 = vsXp/h*X[4];
	psi8 = vsTIP/h*(1-Fd)*(1+Kvegsed*VC)*X[1];
	psi9 = vsOP/h*fPOP*(1+Kvegsed*VC)*X[3];
	psi10 = vsXTSS/h*(1+Kvegsed*VC)*X[5];
	psi11 = Kavi*LinearAlgebra.dot(Ne,RF)/S*(1-Kvegres*VC);
	psi12 = DDIP*Kdif*thetadif^(Tₜ-293.15)*(Psed-X[1]*Fd)*1/(0.1*h);
	psi = @SVector [psi1,psi2,psi3,psi4,psi5,psi6,psi7,psi8,psi9,psi10,psi11,psi12];
	Xg = @SVector [Fd*X[1],0.0,0.0,0.0,0.0];
	Xw = X.*(@SVector [1,0,1,1,1])
	dh = 1/S*(Qinₜ+QP-Qg-Qout-QET);
	dX = Qinₜ/V*(Xin-Xw)+QET/V*Xw-QP/V*Xw-Qg/V*(Xg-Xw)+SM*psi;

	return [SVector(dh); dX]
end

@btime sol=solve(ODEProblem(dU, SVector{6, Float64}([h0; X0]), (0.0,tabs[end])), OrdinaryDiffEq.VCABM3());
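For what it’s worth, the parameter hoisting could be wired through the solver’s p argument like this (a sketch with made-up parameter names, not the actual model; assumes OrdinaryDiffEq is loaded as above):

params = (decay = 0.1, inflow = 2.0)   # named tuple standing in for the real parameter set
rhs(u, p, t) = -p.decay*u + p.inflow   # the solver passes params through as p on every call
prob = ODEProblem(rhs, 1.0, (0.0, 10.0), params)
sol = solve(prob, Tsit5())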

Thank you both @ChrisRackauckas and @dawbarton for your useful answers! I’ve followed your suggestions and your examples to speed up my code. Now it runs in around 6.5 s vs 24 s in Mathematica, coming from 60 s in my previous Julia implementation.

The main changes have been:

  1. Using StaticArrays as much as possible.
  2. Trying different solvers.
  3. Avoiding globals as much as possible by moving variables and functions into the main function.

I hope this advice is also useful for other people starting with Julia!


I’m trying to grasp the ideas here and understand efficient coding in Julia. Related to dawbarton’s suggestion for improved efficiency, I have written 6 different ways to use parameters in a function and benchmarked them using BenchmarkTools. I don’t understand why, but all 6 ways give identical memory and allocs estimates… so here are the 6 functions:

using BenchmarkTools
a = 10
function f1()
    return "a=$a"
end
@benchmark f1()

For f1(), a is defined in the global space, and returned from the function. There is no variable allocation inside of the function. (?)

function f2()
    global a
    return "a = $a"
end
@benchmark f2()

For f2(), because of the global statement, the a inside the function is the same as the a in the global space, and nothing is allocated inside the function. (?)

function f3()
    global b = 10
    return "a = $b"
end
@benchmark f3()

For f3(), variable b is introduced inside the function – but as a global variable (b doesn’t exist the first time the function is called), and memory is allocated only the first time the function is called. (???)

function f4()
    c = 10
    return "a=$c"
end
@benchmark f4()

For f4(), parameter c is created inside the function, and new memory is allocated every time the function is called. (??)

function f5(p)
    return "a=$p"
end
@benchmark f5(10)

For f5(), argument p is transferred into the function as it is called, but no memory is allocated inside the function. (??)

function f6(p)
    d = p
    return "a=$d"
end
@benchmark f6(10)

For f6(), argument p is transferred into the function and does not allocate memory. But variable d is created and allocates memory every time the function is called. (??)

OK… I’m trying to grasp when memory is allocated by function calls, and my “guesses” above are probably not correct… It would certainly aid in my understanding if someone knowledgeable would correct my “guesses”…

[I should perhaps have said that wrt. execution speed, functions `f1()`–`f3()` are more or less similar, while functions `f4()`–`f6()` are more or less similar and faster.]

I imagine (but haven’t checked) that all the allocations come from the string that you are returning, hence all are the same in terms of memory and allocations. The first three are effectively identical - interpolating a global variable into a string. This will be slow(er) since Julia doesn’t know what type the variable is. For the latter three, you are interpolating a local variable, the type of which will be known at compile time (that’s not always the case but here it should be trivial for the compiler to work out what the types are) and so it can be compiled to efficient (type-specific) code.

Note that introducing new variables doesn’t necessarily allocate memory. If the variable is “simple” (e.g., isbits returns true) then the variable is put on the stack and doesn’t allocate, otherwise it is allocated on the heap. (That’s a gross simplification but hopefully you get the idea.)

That’s my guess, though I haven’t run any of the code!
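If you want to check the string hypothesis directly, here is a quick sketch:

using BenchmarkTools

f_str(x) = "a=$x"  # builds a new String => heap allocation on every call
f_num(x) = x + 1   # pure scalar arithmetic => no heap allocation

@btime f_str(10)   # reports allocations
@btime f_num(10)   # reports 0 allocations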


Thanks for the answer. Allow me to extend my question, because I want to understand this…:

  • In your rewrite of the model, you moved parameters from the global scope and into the function.
  • Wouldn’t that cause allocation of memory for parameters within the function every time the function is called [and the function is called many times in DiffEq solvers]?
  • While if the parameters are defined outside of the function (in the global scope), they are only allocating memory once??

So… obviously, I’m struggling to clarify this for myself, but I would have guessed that your rewrite allocates more memory?? Or do I totally misunderstand things?

Of course, your test indicates that your code is faster. So what is the reason for that? Is it that the type of the variable is known when the parameter is defined inside of the function, or does that only relate to the string interpolation?

I think @dawbarton or @ChrisRackauckas will clarify this better, but just to add what I’ve experienced: I noticed that in my previous implementation (using mainly globals) the code took 60 s from the first time I ran it. However, with the new implementation (moving parameters into the function), it took around 10 min the first time I ran it. After that only 6.5 s…

Strings always allocate. They are heap-allocated objects. But there shouldn’t be a reason to solve a differential equation of strings. If you want control flow markers, you can use a :symbol instead.

Parameters defined in the global scope only allocate once, but that only helps if they are set as const so the type is known.

The interpolations only allocate a small wrapper struct, which is removed by the compiler. You can also just allocate them outside as const. Numbers and small immutables do not heap allocate.

The big static arrays might be hurting compile times here. You can do this without static arrays by just using array mutation, though that would take more of a rewrite given the way you first wrote it down.
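For example (a sketch with placeholder data standing in for the spreadsheet columns):

using Interpolations

const tabs = (0:52)*7*24*60*60
const Pdata = rand(53)  # placeholder for the real rainfall series
const P = LinearInterpolation(tabs, Pdata)  # P's concrete type is now known at every call site

A call like P(t) inside the right-hand side is then type stable even though P lives in the global scope.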

OK, maybe my examples were not clear… The string itself was not the point of my question; I only returned it to check that the function operated on the same value every time.

Instead, my point was:

  • If I introduce a single temporary assignment, say, a = 10, or ten thousand assignments a = 10; b = 5; ... inside a function call — will that cause the allocation of memory for a single variable or ten thousand variables every time I call the function?
  • In other words: is it bad practice to introduce temporary assignments inside of a function definition?

Say I want to solve a differential equation (I use a scalar equation here just to save space) \frac{d T}{dt} = -aT + bu where, to me, all symbols have a physical (or whatever) meaning. So I want to use these symbols in my implementation of the model. So I can do either…
Method 1:

a = 2
b = 1
u = 5
#
function dT(T)
    return -a*T + b*u
end

or Method 2:

const a = 2
const b = 1
const u = 5
#
function dT(T)
    return -a*T + b*u
end

or I can assign parameter values inside of the function… Method 3:

function dT(T)
    a = 2
    b = 1
    u = 5
    return -a*T + b*u
end

If I don’t care about preserving my names in the model, I can instead go for the more abstract formulation and transfer the parameters as arguments as in Method 4:

function dT(T,p)
    return -p[1]*T + p[2]*p[3]
end

Which formulation is best from an efficiency point of view? The fact that I’m using a scalar ODE is not the point: I could just as well have had hundreds or thousands of parameter assignments.

I’m just trying to understand the origin of memory allocation that you (Chris) refer to in your guideline for efficient formulation of Differential Equations:

  • If the important point is to avoid allocating variables inside of the function, I would assume that Method 3 is the worst one.
  • If, in addition, Julia does not transfer information about the type of the variable in Method 1 while type information is transferred in Method 2, then I can understand that Method 1 is worse than Method 2.
  • And perhaps then Method 2 and Method 4 are similarly efficient??

Again, my point is not that I happened to use a scalar differential equation as an example.

Or have I completely misunderstood? Is it only when assigning arrays that there is this allocation of memory every time the function is called?

In this particular case there will be no allocations for any of the functions - operations with scalars don’t (generally) allocate. This is also the case for StaticVectors, which is why I used them in the example above. Regular vectors do usually allocate.

Note that your first example would be bad because it uses non-constant globals, which brings in the idea of type stability, but that is a different concept. (The idea being that the compiler can only generate efficient code if it can guarantee what the types are going to be.)

Personally I’d use your method 4, though with a struct or a named tuple so that I can still use the parameter names. E.g.

myp = (a = 1, b = 2, u = 3)

function dT(T,p)
    return -p.a*T + p.b*p.u
end

dT(0.1, myp)

This then allows for easy parameter studies. (Note that tuples are immutable but structs can be mutable or immutable.)


I think maybe a helpful clarification is that “allocations” refers to heap allocations instead of all memory allocations (which also include stack allocations). My rough understanding is if the compiler can know how much memory something will take and can prove that won’t change, then it can allocate it on the stack which is very cheap and efficient. Otherwise it has to allocate it on the heap. So any structure that can grow in size (like a vector) has to be heap allocated, whereas scalars and StaticVectors can be stack allocated. Strings are heap allocated as well, which is why your earlier examples were allocating.

So if you were making a new vector each time the function was called, that would cause a lot of (heap) allocations, which can be fixed by switching to use StaticVectors instead, or by mutating an input vector instead of making a new one.
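To make that concrete, a small sketch comparing the three approaches (toy functions, not from the model above):

using StaticArrays, BenchmarkTools

f_vec(x) = [x[1]+1, x[2]+1]            # builds a new Vector => heap allocation every call
f_svec(x) = SA[x[1]+1, x[2]+1]         # SVector => stack allocated, no heap allocation
f_mut!(out, x) = (out .= x .+ 1; out)  # mutates a preallocated buffer => no new allocation

@btime f_vec($([1.0, 2.0]))
@btime f_svec($(SA[1.0, 2.0]))
@btime f_mut!($(zeros(2)), $([1.0, 2.0]))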


In my last implementation I used some non-constant globals and it took 6.5 s to run. I’ve just tried also using constant globals for type stability, as @dawbarton suggested, and, weirdly, it took 7.6 s. Does anyone know why?


My main “problem” with my “method 4” is that it seems like the parameter argument of the differential equation is meant for those parameters that one wants to estimate. I may be wrong on that, though.

Thanks – this clarifies things to some degree.

  • Arrays are mutable, and may change size, etc., so I can understand that it is difficult to guarantee a certain location for them (unless one uses static arrays).
  • But strings are immutable – so why are they allocated on the heap?
  • Tuples are immutable – would they avoid being allocated on the heap? Or… since a tuple may hold an array as one element, they are perhaps semi-immutable… or perhaps it is only possible to replace elements in a tuple with elements of the same type?
t = (1,2,[1,2])
t[3][1] = 2

is allowed.

Is it possible to say something general about which variables are allocated on the stack, and which on the heap? Like…

  • stack: scalars, tuples (?), stack arrays, etc.
  • heap: mutable collections (?), strings (?), etc.?

I think this thread asks the same question: How to know if object memory resides on stack or heap?. It seems that it’s not always easy to say, because it depends on what the compiler can currently prove, not on what is actually true.

I’m not sure why strings are heap allocated, but you can probably find some discussion on discourse or the issue tracker.

Maybe something that helps regarding the tuple example is Julia’s isbitstype. The tuple which contains a Vector will have isbits(t) == false. You can read more about that in the manual. It sounds like there isn’t a 1-1 correspondence between isbits and stack allocation, but I think it’s easier to stack allocate such objects.

At this point it’s pretty far from things I know well, so maybe someone else would have a more correct or more helpful answer.
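For the tuple example above, the check looks like:

isbits((1, 2))           # true: all elements are plain bits types
isbits((1, 2, [1, 2]))   # false: the Vector element lives on the heap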


The parameters are any parameters, not just those used for estimation purposes. If you have a large parameter structure then it’s easy to restrict it down if need be; for example, continuing the previous example, if you wanted to do parameter estimation on parameter a then I’d do something like

myp = (a = 1, b = 2, u = 3)
estimate((T, p) -> dT(T, merge(myp, (a = p,))), initialguessfora)

This way, you retain full flexibility over your function (you don’t need to edit/recompile it to change any of the constants) but you can restrict attention to certain parameters if you want.
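Here merge overrides just the named field:

merge((a = 1, b = 2, u = 3), (a = 5,))  # returns (a = 5, b = 2, u = 3)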
