Best practice for executing many instances in parallel

Dear community,
as indicated in the title, I would like to run many (thousands of) computational experiments on an HPC cluster using Julia code. Note that I’m new to Julia, so hopefully there are better options than those I suggest here. When I previously worked with C++, the procedure was easy: compile a statically linked binary and send the jobs to the cluster's queue, which processes as many jobs in parallel as it has capacity for. My main focus is on measuring the speed of the algorithms, excluding JIT compilation time.

As far as I know, there are some options:

  1. compiling the code using PackageCompiler and sending the jobs to the queue as described above. However, it seems that PackageCompiler offers no way to link statically (?). On my local machine everything works fine and fast (with the dynamic libs), but the software versions on the cluster are out of date, so I would need to link all libs statically (since I cannot update the cluster).

  2. installing Julia on the cluster…
    2a) …and sending “julia my_main.jl arg1 arg2 …” to the queue. This is a mess, since the run times are very bad and not competitive due to JIT compilation in each instance.
    2b) …and opening a REPL environment -> starting a mini-instance to get everything compiled -> starting the “real” instance. This is also very messy, since it produces a huge amount of unnecessary work.

As far as I read about Julia, it is said to be a kind of ‘newcomer language’ for scientific computing, and I think that it is quite natural in many scientific disciplines to perform a large number of computational experiments. So I wonder how you are performing such experiments, or what the best practice is in your opinion?

(So far, the best option for me would be 1): a statically linked binary that can be executed on different machines without any dependencies.)

Thank you in advance for your remarks :slight_smile:

I assume that the work that each simulation does is comparable to the compilation cost (which is why you are concerned about it).

Here is a similar situation. I have a list of frequencies for which simulations need to be run (many more than there are available cores). I start a certain number of workers, and then start the simulations for some frequencies. Then I watch for a successful completion (runit then returns true), and then the next round of runs is started.

function parallelrunner(meshfile, frequencies, phis, neigvs, runit)
    FrequencyList.makefrequencylist(meshfile, frequencies)
    while true
        results = []
        for p = workers()
            @show frequency = FrequencyList.nextfrequency(meshfile)
            if frequency !== nothing # there is another frequency for which to run
                FrequencyList.frequencydone(meshfile, frequency)
                rref = @spawnat p runit(meshfile, frequency, phis, neigvs)
                push!(results, rref)
            end
        end

        isempty(results) && break # no more runs: stop

        for rref = results
            if !fetch(rref) # false value indicates failure
                @warn "Failure:"
                @show rref
            end
        end
    end
    return true
end

Note that parallelrunner is run as (this is file sim_50_5000.jl):

using H2Libjl
using H2Libjl.FrequencyList
using H2Libjl.SimUtilities: serialrunner, parallelrunner
using Distributed

@everywhere meshfile = "cylinder-30mm-2116el.mesh"
@everywhere frequencies = 50:50:5000
@everywhere phis = (pi/180).*(0:2:90)
@everywhere neigvs = 120
@everywhere include("solve_homebrew_modal.jl"); # defines runit()

parallelrunner(meshfile, frequencies, phis, neigvs, runit)

with nohup julia -p 8 -e 'include("sim_50_5000.jl"); exit()' &
So the compilation cost is limited to once per one hundred simulations (in this case).
I should also make clear that in my case all simulations have precisely the same cost (theoretically).
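
When the runs do not all cost the same, a variation on this idea is to let `pmap` from the Distributed standard library hand out the work: it gives the next item to whichever worker is free instead of waiting for a whole round to finish. This is a different technique from the round-based loop above, shown only as a minimal sketch. `runit_demo` and its placeholder workload are hypothetical stand-ins, and the two local workers are an assumption for illustration.

```julia
using Distributed

# Assumption: the script was not launched with `julia -p N`, so start two
# local workers here. Skip this when workers already exist.
nprocs() == 1 && addprocs(2)

# Hypothetical stand-in for the real simulation; returns true on success.
@everywhere function runit_demo(frequency)
    s = sum(sin(frequency * k) for k in 1:10_000)  # placeholder workload
    return isfinite(s)
end

frequencies = 50:50:500

# pmap gives the next frequency to whichever worker becomes free. Each worker
# compiles runit_demo once and reuses the compiled code for later frequencies.
results = pmap(runit_demo, frequencies)

for (f, ok) in zip(frequencies, results)
    ok || @warn "Failure" frequency = f
end
println("completed ", count(results), " of ", length(frequencies), " runs")
```

As in the loop above, the compilation cost is paid once per worker rather than once per simulation.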

Thank you for your answer. Your assumption is almost right: for small instances, the compilation cost is comparable to the runtime of the algorithms. For larger instances the compilation cost is comparatively low; however, it distorts the results anyway.

Regarding your suggestion: as far as I understand it, you manage the workers yourself with these scripts, so that your code is compiled only once for all parallel jobs (?). However, as indicated above, I can submit each experiment to an SGE cluster via qsub, which manages the workers itself.

So I wonder if there is a possibility to submit each job via qsub such that the code is only compiled once, similar to your suggestion. (As stated above, sadly my compiled executable does not work on the cluster due to dynamic linking).

I’m performing large scale Quantum Monte Carlo simulations (>=1 million CPU hours) every month and use your point 2). I don’t think it is a mess and the JIT compilation time is completely negligible compared to my run time cost.


I completely understand the point in your case. However, many of the experiments I need to run are small. Here is a short example of the runtimes of a function:

  1. function solve(model): 13.6ms
  2. function solve(model): 2.60s

where 1) is the runtime of the precompiled executable and 2) is the runtime when starting the script with JIT compilation (for one and the same instance). Sadly, in my case the compilation cost matters.
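
Within a single Julia process, the usual way to keep compilation out of a measurement is to call the function once on a tiny input and time only a later call with the same argument types. A minimal sketch of that pattern follows; `solve_demo` is a hypothetical stand-in for `solve(model)`, and the input sizes are arbitrary.

```julia
# Hypothetical stand-in for solve(model); any function works the same way.
function solve_demo(n)
    total = 0.0
    for i in 1:n
        total += sqrt(i)
    end
    return total
end

# First call on a tiny input: this timing includes JIT compilation of
# solve_demo for an Int argument.
t_first = @elapsed solve_demo(10)

# Later call with the same argument type: the compiled code is reused, so
# this timing reflects the algorithm alone.
t_warm = @elapsed solve_demo(1_000_000)

println("first call (includes compilation): ", t_first, " s")
println("warm call (algorithm only):        ", t_warm, " s")
```

This only helps when several instances are solved in the same process; a fresh `julia my_main.jl` per instance pays the compilation cost again each time.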

Why not make one julia script start multiple small experiments on different worker processes? In this case you’ll only pay the JIT compilation price once.

Also, if your experiments are so small, why not do them in serial?

This is what I wanted to avoid, and to leave the management of the workers to the SGE cluster.

You are right, this is an option for the very small instances. However, the larger ones will run for several hours, and the idea was to come up with a solution that fits any instance size. It seems to me that managing the workers “manually” via a script is the best workaround so far.
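
One possible middle ground that leaves the scheduling to SGE is an array job in which each task solves a batch of instances in one Julia process, so compilation is paid once per batch rather than once per instance. The sketch below assumes the job is submitted as an array job with a range starting at 1 and step 1 (qsub -t 1-N), in which case SGE sets the `SGE_TASK_ID` and `SGE_TASK_LAST` environment variables. The instance names, `solve_demo`, and the helpers `env_int` and `my_instances` are hypothetical.

```julia
# Hypothetical instance list and solver; replace with the real ones.
instances = ["inst_$(i).dat" for i in 1:10]
solve_demo(name) = length(name)  # placeholder for the real algorithm

# Read an integer from the environment. Outside an array job the SGE variables
# are missing or set to "undefined", so fall back to the given default.
env_int(key, default) = something(tryparse(Int, get(ENV, key, "")), default)

task_id = env_int("SGE_TASK_ID", 1)
ntasks  = env_int("SGE_TASK_LAST", 1)

# Task k handles instances k, k + ntasks, k + 2*ntasks, ...
my_instances(instances, task_id, ntasks) = instances[task_id:ntasks:end]

batch = my_instances(instances, task_id, ntasks)

# Warm up once on a small instance so compilation is excluded from the timings.
solve_demo(first(instances))

for inst in batch
    t = @elapsed solve_demo(inst)
    println(inst, "\t", t)
end
```

With N tasks, each Julia process compiles the code once and then times roughly `length(instances) / N` instances. Very long instances could still be given a task of their own.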

Hi @mike_k, I have worked in HPC for twenty years, installing and managing HPC clusters. Would love to help you.
Is the software environment on your compute nodes and your login nodes really different?
I would normally say that the libraries on the compute nodes and on the login nodes should be the same. Otherwise you hit exactly the problems you describe!

There was also a good thread here recently on running the Julia compilation step on local node storage, to prevent caching problems on NFS storage. But that is a different matter.

Happy to take this offline if it helps!

ps. The ever present phrase “use a container” comes to mind here - as usual!
Would love to help you, as I say.