Problem with huge number of memory allocations in for loops

spenam · February 21, 2021, 12:09pm

Hello, I am just trying to migrate from Python/C++ to Julia, so I am just starting to get familiar with the Julia syntax. I wrote a code for doing some numerical integrations in nested for loops, the very same code performs very fast on C++, but when trying to do it with Julia I am not fully writing it in the most optimal way. My Julia code performs a very high number of memory allocations on each loop (I just replace the values of multiple variables), as they are nested loops the overall code performs very badly. I would be pleased if someone has some suggestions on how to improve it.

My script goes like this:

using DelimitedFiles

function searchsortednearest(a,x)
   idx = searchsortedfirst(a,x)
   if (idx==1); return idx; end
   if (idx>length(a)); return length(a); end
   if (a[idx]==x); return idx; end
   if (abs(a[idx]-x) < abs(a[idx-1]-x))
      return idx
   else
      return idx-1
   end
end

ang = readdlm("../pmt_ang_interp.txt",skipstart=1); # angular acceptance data
qe = readdlm("../pmt_qe_interp.txt",skipstart=1); # quantum efficiency data
qe[:,2]=qe[:,2]*0.01;
wabs = readdlm("../water_abs_interp.txt",skipstart=1); # absorption length data
wscatt = readdlm("../water_scatt_rayleight.txt",skipstart=1); # scattering length data
b=Float64(0.835); # parameter b of the scattering
lam = Float64(446); #Wavelength (nm)
xa = wabs[searchsortednearest(wabs[1:end,1],lam),2]; # absorption length
xs = wscatt[searchsortednearest(wscatt[1:end,1],lam),2]; # scattering length
Pq = qe[searchsortednearest(qe[1:end,1],lam),2]; # quantum efficiency
f=Float64((1. /(4. *pi))*(3. /(3. +b)))
human = Float64(1); # Human eye conversion factor
lum = Float64(1); # Brightness (lm)
dist = Float64(400); # Distance

Aeff3inch = Float64(0.004536); # Effective area 3inch PMT [m^2]
ratdef = Float64(4.11e15); #photon rate 1 lumen
#ratdef = 8e17; # photon rate for the ROV
#ratdef = 9*10^18; # photon rate for the ROV

ctmin = Float64(-1.);
ctmax = Float64(-cos(pi/6));#0;
Nbct = Int32(100);
delct=Float64((ctmax-ctmin)/Nbct);

#light source azimuth
Nazim = Int32(100);
AzimStep =Float64(2.0 *pi/Float32(Nazim));
circ = Float64(2.0 *pi);

ndist = Float64(1000); # steps over 500m
nmet  = Float64(5); # step width (m)
rate = Float64(0);

ratesum = Float64(0); # sum of individual rates 
#loop over distance bins
ndist = Float64(10); # steps over 500m
nmet = Float64(1); #step width
pmtsteps=Int32(1);#100 #number of steps to average the value
sector=[0.,1.,2.,3.,4.,5.] # what part of the hemisphere you are looking
AngPMTStep = Float64(pi/(pmtsteps*1.0));
@time for kk = 1:6
	ratesum = 0; # sum of individual rates 
	for ii =1:pmtsteps
	#angPMT = (sector[kk]+ii/pmtsteps)*pi/6;
	angPMT = (sector[kk]+0.5)*pi/6;
	azimPMT = 0.5;
	angPMTDeg = angPMT*180.0/pi;

	xPMT = cos(azimPMT)*sin(angPMT);
	yPMT = sin(azimPMT)*sin(angPMT);
	zPMT = cos(angPMT);
		rate = 0;
		#loop over light source azimuth
		for m = 0:Nazim-1
			azim=(m+0.5)/AzimStep;
		for j = 0:ndist-1
			xm = (j+0.05)*nmet; # mean distance of bin
		Psca = nmet/xs*exp(-xm/xs); # probability of scattering inside the bin
		for l=0:Nbct-1
			ct= ctmin + (l+0.5)*delct;
			y=sqrt(dist*dist+xm*xm-2. *dist*xm*ct); # distance from scattering to PMT
			Pabs = exp(-(y+xm)/xa); # probability of non-absorption on full photon path
			cosa = (-dist*dist + xm*xm + y*y)/(2. *xm*y); # scattering angle
			cosb = (-xm*xm + y*y + dist*dist)/(2. *y*dist); # PMT-photon angle
			sinb = sqrt(1.0 -cosb*cosb);
			ihx = dist*sinb/cosb
			xphot = sqrt(1.0-cosb*cosb)*cos(azim);
			yphot = sqrt(1.0-cosb*cosb)*sin(azim);
			zphot = cosb;
			cosx = xPMT*xphot+yPMT*yphot+zPMT*zphot;
			# included sanity check in angles
			if (abs((acos(ct)+acos(cosa)+acos(cosb))/pi-1)>1e-6) 
			println(ct, " ", cosa, " ", cosb)
			end
			PSAng = f*(1. +b*(cosa)^2.) # probability for angular scattering
			PxAng = ang[searchsortednearest(ang[1:end,1],cosx),2] # probability for photon on PMT
			rate += ratdef * human * lum * Aeff3inch/y/y * Pabs * Psca * PSAng * PxAng * Pq / Nbct / Nazim;
			#part += 0;
			end
			end
		end
		#println(rate,"\t",pmtcosb[jj])
		ratesum += rate
	end
	println("For the sector= ", sector[kk], ", Rate ", ratesum/pmtsteps, "Hz")  
	ratesum = 0
end

The @time macro output gives:
5.000072 seconds (42.35 M allocations: 9.703 GiB, 7.20% gc time)

Thank you in advance.

Henrique_Becker · February 21, 2021, 12:29pm

First of all, welcome to our community!

Some thoughts:

Your code is not reproducible without the txt files you read. If it is not reproducible, then we cannot make changes and test the time difference before telling you to try it.
Put all code since ang = ... until the last line inside a function, it can be called main and have no arguments. Then just after the function definition do:

@time main()
@time main()

The reason for doing this are twofold:

The first of the performance tips is to avoid global variables. You declare a lot of variables that could be local to a function in global scope there, and this hinders a lot of compiler optimizations.
This allows you to call your code two times, getting the timing for the first and the second call. This is important because the first time you call any function in Julia it will be compiled, and calling it twice here allows us to see the difference between a call including compilation and a call without any time spent on compilation.

spenam · February 21, 2021, 1:14pm

Thank you @Henrique_Becker for your hints and your kind welcome :).

The text files are in this compressed file: Filebin | 3dxy94b7emb507ob

Now here is the new code putting all together into a main function:

using DelimitedFiles

function searchsortednearest(a,x)
   idx = searchsortedfirst(a,x)
   if (idx==1); return idx; end
   if (idx>length(a)); return length(a); end
   if (a[idx]==x); return idx; end
   if (abs(a[idx]-x) < abs(a[idx-1]-x))
      return idx
   else
      return idx-1
   end
end
function main()

ang = readdlm("../pmt_ang_interp.txt",skipstart=1); # angular acceptance data
qe = readdlm("../pmt_qe_interp.txt",skipstart=1); # quantum efficiency data
qe[:,2]=qe[:,2]*0.01;
wabs = readdlm("../water_abs_interp.txt",skipstart=1); # absorption length data
wscatt = readdlm("../water_scatt_rayleight.txt",skipstart=1); # scattering length data
b=Float64(0.835); # parameter b of the scattering
lam = Float64(446); #Wavelength (nm)
xa = wabs[searchsortednearest(wabs[1:end,1],lam),2]; # absorption length
xs = wscatt[searchsortednearest(wscatt[1:end,1],lam),2]; # scattering length
Pq = qe[searchsortednearest(qe[1:end,1],lam),2]; # quantum efficiency
f=Float64((1. /(4. *pi))*(3. /(3. +b)))
human = Float64(1); # Human eye conversion factor
lum = Float64(1); # Brightness (lm)
dist = Float64(400); # Distance

Aeff3inch = Float64(0.004536); # Effective area 3inch PMT [m^2]
ratdef = Float64(4.11e15); #photon rate 1 lumen
#ratdef = 8e17; # photon rate for the ROV
#ratdef = 9*10^18; # photon rate for the ROV

ctmin = Float64(-1.);
ctmax = Float64(-cos(pi/6));#0;
Nbct = Int32(100);
delct=Float64((ctmax-ctmin)/Nbct);

#light source azimuth
Nazim = Int32(100);
AzimStep =Float64(2.0 *pi/Float32(Nazim));
circ = Float64(2.0 *pi);

ndist = Float64(1000); # steps over 500m
nmet  = Float64(5); # step width (m)
rate = Float64(0);

ratesum = Float64(0); # sum of individual rates 
#loop over distance bins
ndist = Float64(10); # steps over 500m
nmet = Float64(1); #step width
pmtsteps=Int32(1);#100 #number of steps to average the value
sector=[0.,1.,2.,3.,4.,5.] # what part of the hemisphere you are looking
AngPMTStep = Float64(pi/(pmtsteps*1.0));
for kk = 1:6
	ratesum = 0; # sum of individual rates 
	for ii =1:pmtsteps
	#angPMT = (sector[kk]+ii/pmtsteps)*pi/6;
	angPMT = (sector[kk]+0.5)*pi/6;
	azimPMT = 0.5;
	angPMTDeg = angPMT*180.0/pi;

	xPMT = cos(azimPMT)*sin(angPMT);
	yPMT = sin(azimPMT)*sin(angPMT);
	zPMT = cos(angPMT);
		rate = 0;
		#loop over light source azimuth
		for m = 0:Nazim-1
			azim=(m+0.5)/AzimStep;
		for j = 0:ndist-1
			xm = (j+0.05)*nmet; # mean distance of bin
		Psca = nmet/xs*exp(-xm/xs); # probability of scattering inside the bin
		for l=0:Nbct-1
			ct= ctmin + (l+0.5)*delct;
			y=sqrt(dist*dist+xm*xm-2. *dist*xm*ct); # distance from scattering to PMT
			Pabs = exp(-(y+xm)/xa); # probability of non-absorption on full photon path
			cosa = (-dist*dist + xm*xm + y*y)/(2. *xm*y); # scattering angle
			cosb = (-xm*xm + y*y + dist*dist)/(2. *y*dist); # PMT-photon angle
			sinb = sqrt(1.0 -cosb*cosb);
			ihx = dist*sinb/cosb
			xphot = sqrt(1.0-cosb*cosb)*cos(azim);
			yphot = sqrt(1.0-cosb*cosb)*sin(azim);
			zphot = cosb;
			cosx = xPMT*xphot+yPMT*yphot+zPMT*zphot;
			# included sanity check in angles
			if (abs((acos(ct)+acos(cosa)+acos(cosb))/pi-1)>1e-6) 
			println(ct, " ", cosa, " ", cosb)
			end
			PSAng = f*(1. +b*(cosa)^2.) # probability for angular scattering
			PxAng = ang[searchsortednearest(ang[1:end,1],cosx),2] # probability for photon on PMT
			rate += ratdef * human * lum * Aeff3inch/y/y * Pabs * Psca * PSAng * PxAng * Pq / Nbct / Nazim;
			#part += 0;
			end
			end
		end
		#println(rate,"\t",pmtcosb[jj])
		ratesum += rate
	end
	println("For the sector= ", sector[kk], ", Rate ", ratesum/pmtsteps, "Hz")  
	ratesum = 0
end
end
@time main()
@time main()

And the output of the @time macro is:

2.010064 seconds (10.06 M allocations: 9.199 GiB, 8.88% gc time)
2.027421 seconds (10.06 M allocations: 9.199 GiB, 8.58% gc time)

lmiq · February 21, 2021, 1:21pm

There is too much in the code (a minimal example would help), but very likely you are allocating new arrays inadvertently somewhere in the loop. It is a good idea to track those allocations. I think one easy way to do it is to add:

a = @allocated begin
   # block of code
end; a > 0 && println(a) # prints a if the blocks allocates

Move the test around different blocks and lines of code and see where the allocations are occurring. Probably some assigment there would be better written as a broadcasted operation (a .= b .+ c instead of a = b + c).

spenam · February 21, 2021, 1:39pm

Thank you @lmiq I now have located the exact steps were the memory allocation is happening.

On this step Psca = nmet/xs*exp(-xm/xs); # probability of scattering inside the bin with 80 allocations every step, Pabs = exp(-(y+xm)/xa); # probability of non-absorption on full photon path with 48 allocations every step and PxAng = ang[searchsortednearest(ang[1:end,1],cosx),2] # probability fo r photon on PMT with 16288 allocations every step.

So the most time allocating step is the one that looks for the index in a preloaded array.

I really thought that was not going to be a problem.

Henrique_Becker · February 21, 2021, 1:53pm

The allocations probably come from ang[1:end,1] because here you are creating a new array. You can check if wrapping it with @view does helps you (i.e., @view(ang[1:end,1])) or not. With @view it will not allocate a new array but make a reference to the memory from the already existing one.

spenam · February 21, 2021, 2:03pm

@Henrique_Becker , your advice really is a huge improvement.
Now the second @time macro reads:
0.306110 seconds (10.06 M allocations: 190.971 MiB, 2.97% gc time)

lmiq · February 21, 2021, 2:09pm

Most of what remains comes from the fact that some type instabilities remain because you are reading data inside the function (the compiler cannot know for sure what kind of data the arrays will be). Doing this improves things very much:

ang = readdlm("./pmt_ang_interp.txt",skipstart=1); # angular acceptance data
qe = readdlm("./pmt_qe_interp.txt",skipstart=1); # quantum efficiency data
wabs = readdlm("./water_abs_interp.txt",skipstart=1); # absorption length data
wscatt = readdlm("./water_scatt_rayleight.txt",skipstart=1); # scattering length data

function main(ang,qe,wabs,wscatt)
...

Meaning, read the files outside the function and pass the resulting arrays to the function. Then the function can specialize for those types of arrays.

I get:

julia> @time main(ang,qe,wabs,wscatt)
For the sector= 0.0, Rate 0.03250691697268659Hz
For the sector= 1.0, Rate 0.018013885487274464Hz
For the sector= 2.0, Rate 0.006830213305980414Hz
For the sector= 3.0, Rate 4.562522206759234e-9Hz
For the sector= 4.0, Rate 0.0Hz
For the sector= 5.0, Rate 0.0Hz
  0.074513 seconds (169 allocations: 48.609 KiB)

The allocations remaining in the loop come from the println at the end, which I guess is fine.

spenam · February 21, 2021, 2:43pm

Thank you both @Henrique_Becker and @lmiq for all your help, I felt very supported by this community!

@lmiq Thanks for that last improvement, definetely it gives some final touches.

One final question, would it be possible to improve the Psca = nmet/xs*exp(-xm/xs); # probability of scattering inside the bin with 80 allocations every step and Pabs = exp(-(y+xm)/xa); # probability of non-absorption on full photon path with 48 allocations lines?

lmiq · February 21, 2021, 2:44pm

I think those lines are not allocating anymore after putting the file reading outside the function. It was a side effect of that.

Oscar_Smith · February 21, 2021, 5:04pm

Also, if you care about numerical integration, you should definitely check out quadgk.jl

stevengj · February 21, 2021, 5:52pm

There are probably a lot of things to clean up first.

In particular, PxAng = ang[searchsortednearest(ang[1:end,1],cosx),2] looks like a very crude piecewise-constant method for interpolation of tabulated data. If the function you are integrating is piecewise-constant, then it’s going to be very hard to use any sophisticated integration method. At the very least you might look into a better interpolation scheme, e.g. using Interpolations.jl or Dierckx.jl

One option beyond that might be to fit the data to a smooth function, e.g. with ApproxFun.jl, at which point you have a smooth interpolant and can then use smooth quadrature techniques like those in QuadGK.jl. Another possibility (if you are computing many integrals against the same tabulated data) is to construct a specialized quadrature scheme from your tabulated data — see here (Julia notebook) and here (notes) on how I did that for integrating solar spectra.

But yes, as a general rule I think many people don’t realize how many orders of magnitude they are sacrificing in performance and/or accuracy by using quadrature schemes based on Euler or trapezoidal-rule integration. It’s really stunning how much better (exponentially better) high-order schemes can be.

Topic		Replies	Views
Reducing allocations to optimize (to match Java speed) Performance	22	1341	December 10, 2018
Poor performance due to memory allocations? Performance memory-allocation	17	2854	January 15, 2019
Improving performance of Integral equations (3-Nested loops) Performance performance , loops	4	1094	May 13, 2021
Memory allocations in for loop variable Performance question	28	3639	August 31, 2018
Undesired memory allocation during interpolation Performance memory-allocation	33	1378	December 8, 2021

Problem with huge number of memory allocations in for loops

Related topics