Using Julia on a cluster

I am new to parallel computing and want to parallelize a for loop. The loop is embarrassingly parallel: each iteration calls sciml_train with an initial guess and computes a parameter estimate, so in the end it is a multiple shooting approach. All the initial guesses are stored in the columns of a matrix, so the loop iterates over the columns. The parameter estimates after training should be stored in a global matrix, column-wise as well. I already tried the Threads.@threads macro, but this didn’t work at all. Can you tell me how to do this correctly? I can submit the job as a batch submission on the cluster, on a few hundred cores.

The part I want to parallelize is the following; train_model just calls sciml_train:

using Base.Threads  # brings @threads into scope

n_guesses = size(parameter_guesses, 2)
parameter_estimate = zeros(size(parameter_guesses))
estimates_errors = zeros(n_guesses)

println("Computing on $(Threads.nthreads()) threads")
@threads for i in 1:n_guesses
    p_guess = parameter_guesses[:, i]
    # Each iteration writes only to its own column and entry, so the writes do not overlap
    parameter_estimate[:, i], estimates_errors[i] = train_model(ODE, u0, tstart, tend, p_guess, time_indices, saves_t, data)
end

If you post the output of the script it will be much easier for people to help you. Chances are that whatever problem you see doesn’t have anything to do with parallelism.

Does it work if you remove the @threads macro?

About distributed computing: it should be possible to run the task you describe on a cluster. How easy it is depends a little on the type of cluster. The easiest route is probably to use the stdlib Distributed and search online for a package that plugs your type of cluster into Distributed.

You might want to start by spinning up processes locally on the host just to test the plumbing. When running distributed, I have found that errors thrown in the user code tend to end up in some cluster logfile, and all the user sees from the host process is some “worker not responding” error, so make sure the code works locally first.
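To make the local test concrete, here is a minimal sketch using only the Distributed stdlib. It starts two workers on the same machine, checks that each one answers, and shows that an error raised on a worker reaches the host when everything runs locally. The number of workers and the "boom" error are just placeholders for illustration.

```julia
using Distributed

# Start two worker processes on this machine; no cluster is involved.
addprocs(2)
worker_ids = workers()

# Ask every worker for its own id to confirm that it responds.
replies = [remotecall_fetch(myid, w) for w in worker_ids]
println("workers ", worker_ids, " replied with ", replies)

# Locally, an error raised on a worker comes back to the host as a RemoteException.
got_remote_exception = try
    remotecall_fetch(() -> error("boom"), first(worker_ids))
    false
catch err
    err isa RemoteException
end
println("remote error reached the host: ", got_remote_exception)

# Shut the workers down again.
rmprocs(worker_ids)
```

Once this works, swapping addprocs(2) for whatever your cluster package provides should be the only change needed on the plumbing side.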

A common pitfall when using process parallelism is to forget that the workers are independent OS processes which don’t share memory. This is a big difference compared to using threads and has a lot of implications which might be obvious in hindsight but are easy to overlook and lead to frustrating errors. I have found that it helps a lot to think of each worker as being pretty much the same thing as starting another instance of the REPL manually.
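A small sketch of what "no shared memory" means in practice, assuming two local workers. The name column_store is made up for the illustration. Every process gets its own copy of the array, and filling it on one worker leaves the host's copy untouched.

```julia
using Distributed
addprocs(2)

# Every process (host and workers) gets its own, separate global array.
@everywhere column_store = zeros(3)

# Fill the array on the first worker only.
w = first(workers())
remotecall_fetch(Core.eval, w, Main, :(column_store .= 1.0))

# Compare the worker's copy with the host's copy.
on_worker = remotecall_fetch(Core.eval, w, Main, :(sum(column_store)))
on_host = sum(column_store)
println("sum on worker: ", on_worker, "   sum on host: ", on_host)

rmprocs(workers())
```

The two sums differ because the worker changed only its own copy, just as two separate REPL sessions would not see each other's variables.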

For example, the loop in your post would most likely fail on workers because the variables and functions it uses do not exist there. Even if they did, they would be completely independent copies, so setting the i-th column on a worker does absolutely nothing to the same variable on the host or on the other machines.
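One way around both problems, sketched here with the Distributed stdlib: define the function on every worker with @everywhere, and have each call return its result to the host instead of writing into a shared matrix. fit_one is a hypothetical stand-in for train_model, and the toy matrix replaces your real guesses, so treat this as an outline rather than your actual setup.

```julia
using Distributed
addprocs(2)

# Hypothetical stand-in for train_model: returns (estimate, error) for one guess.
# @everywhere makes the definition available on every worker.
@everywhere function fit_one(p_guess)
    estimate = p_guess .* 2                  # placeholder for the real optimisation
    err = sum(abs2, estimate .- p_guess)
    return estimate, err
end

parameter_guesses = [1.0 2.0 3.0; 4.0 5.0 6.0]   # toy data: one guess per column
columns = [parameter_guesses[:, i] for i in axes(parameter_guesses, 2)]

# Each column is sent to a free worker; the results come back in input order.
results = pmap(fit_one, columns)

# Assemble the returned pieces on the host.
parameter_estimate = reduce(hcat, first.(results))
estimates_errors = last.(results)

rmprocs(workers())
```

Returning values and assembling them on the host avoids any need for the workers to see the host's matrices.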

Here is a thread with some general tips for distributed computing which could be good to read through: The ultimate guide to distributed computing

Thank you for the very long and detailed answer. So I think Distributed is indeed the better approach. In my case I kept my normal file, but used a job array to start many processes on different cores. This is a fairly simple workaround, I guess.
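For anyone finding this later, a rough sketch of what such a job-array script could look like. It assumes the scheduler is Slurm, which sets the SLURM_ARRAY_TASK_ID environment variable for each array task; other schedulers use different variable names. fit_one, the toy matrix, and the results directory are placeholders, not my actual code.

```julia
using DelimitedFiles

# Assumption: Slurm sets SLURM_ARRAY_TASK_ID for each array task.
# The default "1" lets the script also run outside the cluster.
task_id = parse(Int, get(ENV, "SLURM_ARRAY_TASK_ID", "1"))

parameter_guesses = [1.0 2.0 3.0; 4.0 5.0 6.0]   # toy data; load the real guesses here
p_guess = parameter_guesses[:, task_id]          # this task handles one column

# Hypothetical stand-in for train_model.
fit_one(p) = (p .* 2, 0.0)
estimate, err = fit_one(p_guess)

# One output file per task, so tasks never write to the same file.
outdir = mkpath("results")
outfile = joinpath(outdir, "estimate_$(task_id).csv")
writedlm(outfile, estimate, ',')
```

A separate small script can then read all the files and put the columns back together once the array has finished.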