The following example outline would have helped me when I was trying to learn just enough of Julia’s parallel processing capabilities to speed up the generation of about 6000 frames for a 3D plot animation, my first (and so far only) Julia programming task.
# file: msld.jl (My Slow Loop Distributed)
# (Distributed is a standard library, so no Pkg.add is needed)
using Distributed
nProc = 8 # loop partitioned over this number of CPU cores
addprocs(nProc, exeflags="--project=.")
@everywhere begin
    include("msl.jl") # initialize each worker's instance of msl
end
pmap(1:nProc) do i
    msl(0, i, nProc)
end
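One detail worth noting: addprocs spawns a fresh set of worker processes every time this script runs, so re-running include("msld.jl") in the same REPL session accumulates workers. A simple remedy (my addition, not part of the original outline) is to remove the workers once the run finishes:

# optional: after the pmap call, tear the workers down again so
# msld.jl can be re-included in the same REPL session without
# accumulating worker processes
rmprocs(workers())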
# file: msl.jl (My Slow Loop)
# some initialization so msl() can be run from the REPL
# (executed when running include("msl.jl"))
# an example of such initialization:
# using Plots, Printf
# pyplot()
# msl (My Slow Loop)
# - myArgs: whatever
# - iProc: 1-based index of CPU core
# - nProc: number of CPU cores used to partition slow loop
function msl(myArgs=0, iProc=1, nProc=1)
    # some (possibly lengthy) initial calculations
    # ...
    nLoop = 20
    # ...
    # partition my slow loop among nProc CPU cores
    # NOTE: These start and stop indexes default to the full,
    # non-partitioned loop (iProc = nProc = 1) and also correctly
    # handle the unlikely case nProc >> nLoop: a core with nothing
    # to do gets iLoop2 < iLoop1 and skips the loop entirely.
    iLoop1 = round(Int, (iProc - 1) * nLoop / nProc) + 1
    iLoop2 = round(Int, iProc * nLoop / nProc)
    # execute loop (partitioned or not)
    # - the default values of iProc and nProc give the unpartitioned loop
    # - loop iterations must be safe to perform in any order
    for iLoop in iLoop1:iLoop2
        # some probably lengthy calculation
    end
end
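To see that the partitioning behaves as the NOTE claims, here is a quick sanity check (a throwaway snippet of my own, not part of the application) that prints the chunk each core gets and asserts that every iteration is covered exactly once, including in the nProc >> nLoop case:

# throwaway check of the start/stop index formulas
nLoop = 20
for nProc in (1, 3, 8, 64) # 64 exercises the nProc >> nLoop case
    chunks = [(round(Int, (iProc - 1) * nLoop / nProc) + 1):round(Int, iProc * nLoop / nProc)
              for iProc in 1:nProc]
    # concatenating the chunks must reproduce 1:nLoop exactly
    @assert reduce(vcat, collect.(chunks)) == collect(1:nLoop)
    println("nProc = $nProc: ", chunks)
end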
I was not able to use multi-threading, described in Julia’s parallel computing documentation as “usually the easiest way to get parallelism”, because I used PyPlot for its 3D plotting capabilities, and a PyPlot instance is not reentrant: apparently its implementation stores some state in global variables.
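For contrast, the multi-threaded version would have needed little more than a macro on the loop. This sketch is exactly the pattern that fails with PyPlot, because every thread would mutate the same global plotting state:

# sketch of the multi-threaded alternative (start Julia with,
# e.g., julia -t 8 so that Threads.nthreads() > 1)
using Base.Threads
nLoop = 20
Threads.@threads for iLoop in 1:nLoop
    # some probably lengthy calculation; it must not touch shared
    # mutable state such as PyPlot's globals
end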
However, using Distributed as in the above example outline has been a breeze to implement, and I suspect a particularly slow loop that benefits from being distributed over several CPU cores is a common pattern. On the ease of implementation specifically: being able to swap back and forth between the non-distributed application (for debugging and optimizing, by running include("msl.jl"); msl()) and the distributed application (for faster production runs, by running include("msld.jl")), without any changes to the code, is a great convenience.
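Concretely, the swap looks like this at the REPL:

# debugging / optimizing: ordinary single-process run
julia> include("msl.jl"); msl()

# production run: the same loop distributed over 8 cores
julia> include("msld.jl")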
For those interested in seeing more than an example outline, the full Julia code is in the appendix of the (still evolving) essay Zeno’s Enduring Example, and the resulting 3D plot animation of a complex Fourier series tracing the letter ‘e’ is at https://www.youtube.com/watch?v=pTajZtdz5ns
Concerning timing: porting the application from GNU Octave to Julia 1.7.0 gave a 5x speedup (23.3 minutes versus 120 minutes), and using Distributed in Julia 1.7.0 gave a further 2x speedup (11.1 minutes versus 23.3 minutes), for a total 10x speedup. Thank you, Julia designers and developers!
I do have a question, though: the Caveats section of Julia’s multi-threading documentation warns, “Avoid running top-level operations, e.g. include, or eval of type, method, and module definitions in parallel.” As far as I can tell, my application is working. What problem is caused by running a top-level include in parallel?