First call latency in distributed code

Hello!

I am facing an issue I haven’t encountered before with distributed code.

I have code that optimizes some function f for several initial conditions. The function f can be very complicated. To avoid long first-call times, I first call the optimizer on a simple function fsimple, and only then on the more complicated and expensive objective function fcomplicated.
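
Roughly, the pattern looks like this (a minimal sketch with made-up stand-in objectives, using Optim only to have something concrete; my real functions and optimizer calls are much more involved):

using Distributed
addprocs(2)

@everywhere begin
    using Optim
    # Stand-ins for the real objectives; the actual f is far more complicated.
    fsimple(x)      = sum(abs2, x)
    fcomplicated(x) = sum(abs2, x) + sum(sin, x)
end

initial_conditions = [rand(3) for _ in 1:4]

# Warm-up pass: run the optimizer on the cheap objective and discard the
# result, hoping to pay the first-call latency here...
pmap(x0 -> optimize(fsimple, x0, NelderMead()), initial_conditions[1:1])

# ...then the real pass with the expensive objective.
results = pmap(x0 -> optimize(fcomplicated, x0, NelderMead()), initial_conditions)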

This strategy worked fine for a piece of code I ran over the previous months. Now I want to replicate the same pipeline in another project. However, I am finding that the simple function fsimple has to be compiled and executed on each worker. Is this the expected behavior? I think it is not, since I didn’t have to do it in my previous project.

As an example, the timings for fsimple are:

nworkers = 2
# First call to fsimple (1 initial condition)
Optimizing 1 initial conditions...
Time distributing: 1.5502729415893555
      From worker 2:	Optimizing initial condition 1
      From worker 2:	Finished optimization of initial condition 1. Time: 34.1743049621582
Time whole optimization: 66.75040698051453
# Second call to fsimple (2 initial conditions: the same as before twice)
Optimizing 2 initial conditions...
Time distributing: 0.32300615310668945
      From worker 2:	Optimizing initial condition 1
      From worker 2:	Finished optimization of initial condition 1. Time: 18.69347906112671
      From worker 3:	Optimizing initial condition 2
      From worker 3:	Finished optimization of initial condition 2. Time: 36.666008949279785
Time whole optimization: 70.04228496551514

From the first timing we can deduce that the overhead due to the first call is approximately 32 seconds (66 − 34 = 32). As expected, in the second call to fsimple the optimization on worker 2 is fast, since the code was already executed there (34 seconds vs 18 seconds). However, I do not understand why worker 3 has to compile the code again as if it were the first call (36 + 32 ≈ 70).

The problem may seem minor in this example, but as I increase the number of CPUs it takes more and more time.

Thanks in advance,

SHCE

PS: I would like to provide a simple MWE, but the compiler is too smart and I cannot reproduce the issue with a simple explicit function.

I don’t understand your logic here. You are going to pay the cost of compilation on the first call either way, so what is the point of trying to split that time out into a separate call with fsimple?

I see your point. Since I am interested in the timings themselves, I prefer to do a very simple execution of the whole code and discard the results, and then execute the actual example. More precisely, in sequential code in an already-started Julia session, the simple example takes seconds to complete, while the complicated example takes hours (even days). Thus, I would rather pay the cost of compilation on an example that takes a few seconds (and will end up taking some minutes including compilation) than on the actual example that would take days to complete.

Each worker is a different process and doesn’t share compilation of your own code, so each worker will have to compile the code. You can alleviate this by building a sysimage with a typical use of your code precompiled, or by putting your code into a package and adding precompile statements with SnoopCompile.jl (I think you need Julia 1.8+), which lets your package be precompiled and removes a lot of the latency.
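
For the package route, the idea is roughly the following (a sketch only, assuming the SnoopPrecompile.jl companion package on Julia 1.8+; the package name and function are placeholders):

module MyOptimCode

using SnoopPrecompile  # companion package to SnoopCompile.jl, provides @precompile_all_calls

function run_optimization(x0)
    # ... your actual optimization code goes here ...
    return x0
end

# A small, representative workload executed while the package is precompiled,
# so much of the inference work is cached once and reused by every process.
@precompile_all_calls begin
    run_optimization(zeros(3))
end

end # module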


Then I don’t understand why I didn’t experience this issue in my previous project.

I thought about this alternative, but since this is code I am still developing, I have some concerns about doing it…

Is it possible that most of the compilation time comes from your project’s dependencies? In that case, the custom sysimage solution should solve most of your latency problem.


It might be that… In my previous code (where everything works well), my only dependency is the Optim package, and @time using Optim takes 14 seconds. In this new project, my dependencies are JuMP and GLPK, and the timings are 35 and 8 seconds, respectively.

You and @jmair are suggesting that every time I run my simple function on a worker, I have to wait 35 + 8 = 43 seconds until the dependencies are compiled, right? I thought that once they were compiled on the master process, they were compiled for all the processes. It is strange, because in my previous project I do not see the 14 seconds of Optim’s compilation on each worker…

@LaurentPlagne is probably onto something here with the dependencies. I believe the precompilation is shared between all workers, but precompilation doesn’t cover every use case.
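
You can see the per-process compilation cost with a toy example (the function here is made up, just to illustrate the point):

using Distributed
addprocs(2)

@everywhere heavy(x) = sum(sin.(x) .^ 2)  # toy stand-in for an expensive-to-compile method

# Package precompile caches are shared, but JIT compilation of a method happens
# independently in every process the first time it is called there.
@time heavy(rand(10^6))                        # compiles on the master
@time remotecall_fetch(heavy, 2, rand(10^6))   # compiles again on worker 2
@time remotecall_fetch(heavy, 3, rand(10^6))   # ...and again on worker 3
@time remotecall_fetch(heavy, 2, rand(10^6))   # second call on worker 2 is fast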

IMHO the simplest thing to do is to try building a custom sysimage with all the dependencies and see if the latency is gone. We once had such a problem with Cairo preventing us from getting any speedup with distributed parallel computing; a custom sysimage was a good solution in that case.


Thank you (to you both). I will try to create a sysimage (I will have to structure the code a little first, though) and see whether that solves the problem. Is it enough to create the sysimage with just the module that does the optimization, or do I have to include both the module and its dependencies in the sysimage? I’m sorry if the question is trivial: I am a researcher and I am not very used to structuring my code. I usually write each module in a single file (not with a MyModule/src/MyModule.jl tree structure).

Thanks in advance

Could you first run the simple function on all workers in parallel?
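
Something along these lines, i.e. one cheap call on every process before the real run (fsimple here is just a toy placeholder):

using Distributed
addprocs(2)

@everywhere fsimple(x) = sum(abs2, x)  # placeholder for the cheap objective

# Run the cheap call once on every process so each one pays its
# compilation cost at the same time, before the expensive run:
@everywhere fsimple(rand(100))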

Yes. With 2 workers I report:

nworkers = 2
# First call to fsimple (the same number of initial conditions as workers)
Optimizing 2 initial conditions...
Time distributing: 2.490978956222534
      From worker 2:	Optimizing initial condition 1
      From worker 3:	Optimizing initial condition 2
      From worker 2:	Finished optimization of initial condition 1. Time: 42.40286302566528
      From worker 3:	Finished optimization of initial condition 2. Time: 42.08672118186951
Time whole optimization: 82.36103010177612
# Second call to fsimple
Optimizing 2 initial conditions...
Time distributing: 0.0010709762573242188
      From worker 2:	Optimizing initial condition 1
      From worker 3:	Optimizing initial condition 2
      From worker 2:	Finished optimization of initial condition 1. Time: 19.636727809906006
      From worker 3:	Finished optimization of initial condition 2. Time: 19.82528281211853
Time whole optimization: 20.138563871383667

With 3 workers:

nworkers = 3
# First call to fsimple (the same number of initial conditions as workers)
Optimizing 3 initial conditions...
Time distributing: 1.744204044342041
      From worker 4:	Optimizing initial condition 3
      From worker 3:	Optimizing initial condition 2
      From worker 2:	Optimizing initial condition 1
      From worker 2:	Finished optimization of initial condition 1. Time: 51.95658302307129
      From worker 3:	Finished optimization of initial condition 2. Time: 52.18298292160034
      From worker 4:	Finished optimization of initial condition 3. Time: 52.61937713623047
Time whole optimization: 100.61571288108826
# Second call to fsimple
Optimizing 3 initial conditions...
Time distributing: 0.001116037368774414
      From worker 2:	Optimizing initial condition 1
      From worker 4:	Optimizing initial condition 3
      From worker 3:	Optimizing initial condition 2
      From worker 2:	Finished optimization of initial condition 1. Time: 33.42637801170349
      From worker 3:	Finished optimization of initial condition 2. Time: 33.787031173706055
      From worker 4:	Finished optimization of initial condition 3. Time: 33.80772399902344
Time whole optimization: 34.26401209831238

When I run it on a cluster with 48 threads per node, the first call to fsimple on all the workers takes more than 2 hours.

Maybe I missed it, but what version of Julia are you using?

Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

I know it is an old version, but it is the one installed on the cluster machine.

You can download a newer version from the Julia downloads page. There have been some improvements with regard to latency in Distributed.jl since 1.5.


I agree, but I would rather have the same version on my local machine as the one installed on the cluster machine (at least until I finish my current project and ask the support people in my organization for an update).

My suggestion was to install a newer version of Julia on the cluster yourself. You just have to download the tarball, unpack it, and you’re ready to go. Of course it would also be great if the sysadmins installed a newer version for everybody, but you don’t have to wait for them.


The PackageCompiler.jl documentation walks through the steps of compiling a system image.
What I suggest doing is the following:

  1. Start julia with julia --project --trace-compile=custom_precompile.jl
  2. Run a typical workload in the REPL, making sure that all functions that you want precompiled are used at some point (with the right types).
  3. Exit from your REPL. Now you should have a custom_precompile.jl file with a lot of entries for what you want precompiled (see the example entries after this list).
  4. Open a new REPL (julia --project -t auto) and make sure PackageCompiler is installed (] add PackageCompiler).
  5. Type the following (it can take a while to finish):
using PackageCompiler
compiled_packages = ["MyPackage", "Plots", ...] # fill up with the packages you want to use.
PackageCompiler.create_sysimage(compiled_packages;
    sysimage_path="example_sysimage.so",
    precompile_execution_file="custom_precompile.jl")
exit()
  6. Start Julia with julia -Jexample_sysimage.so --project; this should have everything precompiled.
  7. If you are using Distributed, when you do addprocs, run it like this:
addprocs(8; exeflags=["-Jexample_sysimage.so", "--project"])
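
For reference, the entries written by --trace-compile are plain precompile statements; they look something like this (illustrative only, the real file depends entirely on your workload):

precompile(Tuple{typeof(JuMP.optimize!), JuMP.Model})
precompile(Tuple{typeof(Base.sum), Vector{Float64}})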

I think this should be everything, but there may be some mistakes in here, so make sure to check the documentation. If anyone notices an issue here, let me know and I’ll update this.

It is definitely worth upgrading to v1.8+. You don’t need to compile it yourself from source on the cluster (although that is what I do); I think the following will work:

cd ~
wget https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz
tar -xvf julia-1.8.2-linux-x86_64.tar.gz
echo "alias julia=~/julia-1.8.2/bin/julia" >> ~/.bashrc

This means that when you type julia in bash, it will run the right executable. You may want to remove any module loads for Julia if you are using environment files.

p.s. You can of course change the version to the one you want.

Thank you very much for the detailed explanation. On my laptop I specify the number of processes with julia -p 2 myFile.jl. On the cluster machine, since I run it on more than one node, I use the --machine-file flag.

I created a precompile execution file as per the documentation and tried running julia --sysimage=JuMPGLPKSysimage.so -p 2 myFile.jl. These are the timings:

nworkers = 2
Optimizing 1 initial conditions...
Time distributing: 1.3093209266662598
      From worker 2:	Optimizing initial condition 1
      From worker 2:	Finished optimization of initial condition 1. Time: 17.932085037231445
Time whole optimization: 41.2694878578186
Optimizing 2 initial conditions...
Time distributing: 0.4331221580505371
      From worker 2:	Optimizing initial condition 1
      From worker 2:	Finished optimization of initial condition 1. Time: 14.45651912689209
      From worker 3:	Optimizing initial condition 2
      From worker 3:	Finished optimization of initial condition 2. Time: 18.3886079788208
Time whole optimization: 41.79726004600525

If we compare these results with the ones above, we see that in the first run the optimization time reported by worker 2 is already “optimal”: it coincides with the already-compiled timings I reported above (about 18 seconds). However, there is still some latency (whole optimization time: 41 seconds > 17 seconds). In the second run, worker 2 is fast and no latency is observed, but on worker 3 we see the same behavior as worker 2 in the first run (fast optimization, yet still some latency).

Should I generate the precompile execution file in parallel? Is that even possible? I mean, should the sysimage I generate be aware that it will be executed in parallel?

Thank you

Edit: the sysimage contains the JuMP and GLPK packages. Right now it is extremely difficult for me to convert each of my modules into a package.