I am using Distributed.jl and trying to parallelize a for loop on a Slurm cluster (Hyak klone at the University of Washington). I mark the loop with the @sync macro and enclose its body in an @async macro. Code at the end is supposed to display the results once all the @async calls have finished. Instead, the code after the loop executes immediately and the program ends.
I use "salloc -A stf -p compute -N 2 -c 4 --mem=10G --time=2:00:00" to get my nodes.
My code has the following form
@eval using Distributed
num_workers = 2 # create 2 workers for distributed computing
addprocs(num_workers, exeflags="--project=$(Base.active_project())")
# import a bunch of packages with @eval using Package
@eval @everywhere include("MyModule.jl")
@eval using .MyModule.jl
@everywhere dir = # my directory
cd(dir)
@everywhere push!(LOAD_PATH, dir)
print(Threads.nthreads())
# do a bunch of stuff before the for loop
task_list = [] # a list with one sub-list per worker; each entry of a sub-list holds the arguments for one call to test_func
for worker_index in range(start = 1, stop = num_workers, step = 1)
    push!(task_list, [])
end
# populate every task in task_list without parallelization
# perform tasks
@sync @distributed for pid in workers()
    @async begin
        # import the same packages as before using @eval
        for task in task_list[j]
            keyName, index_1, index_2 = task
            try
                diff_sqr_list[index_1, index_2] = remotecall_fetch(test_func, pid, keyName)
            catch e
                print(e)
                print("\n")
                print("Error at"*keyName)
            end
        end
    end
end
display(diff_sqr_list)
My Julia version is 1.10.0
I expect diff_sqr_list to be populated and then displayed, but instead it gets displayed right away and the program ends.
You are at least right to expect that everything in the @sync block should complete before the program moves on. I don’t have the hardware to test this out, but some odd things jump out at me:
Lots of @eval calls on import statements; I can’t see why they are needed.
using .MyModule.jl: It’s not impossible for a module to be named jl, but since you included a file "MyModule.jl", I’m guessing you intended using .MyModule (see the sketch after these notes). This should print a warning rather than an error.
for task in task_list[j]: where does the j come from? Was there another for-loop level?
diff_sqr_list[index_1, index_2] = ...: where was diff_sqr_list assigned?
Could you mark the comments that stand in for omitted code? They’re not immediately distinguishable from descriptive comments, so it took some effort to delineate. With so much code omitted, and without the exact output, it’s hard to say what happened. For example, was “Error at…” printed often, and why? If diff_sqr_list was never assigned or lacks the proper indices, it’s plausible that the entire @sync block blows through with UndefVarErrors or BoundsErrors that get caught and printed.
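For the module import and the array, here is a minimal sketch of the setup I would have expected, assuming the file defines module MyModule; the sizes n1 and n2 are placeholders, since diff_sqr_list was never shown:

using Distributed
num_workers = 2
addprocs(num_workers, exeflags = "--project=$(Base.active_project())")

@everywhere include("MyModule.jl")   # defines module MyModule on every process
@everywhere using .MyModule          # the module name, not the file name

n1, n2 = 10, 10                      # placeholder sizes
diff_sqr_list = zeros(n1, n2)        # must exist before the loop writes into it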
This does not do what you think it does. In particular, the @sync does not propagate to the workers. So what happens is that each worker starts a task with @async and then the program moves on.
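A minimal way to see this, with sleep standing in for the real work:

using Distributed
addprocs(2)

@time @sync @distributed for pid in workers()
    @async sleep(5)   # creates a Task on the worker and returns immediately
end
# the timed block finishes almost at once: each iteration is "done" as soon as
# its @async Task is created, even though the sleeps are still running on the workers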
But why do you want another layer of parallelization here? If each worker only starts a single task, it’s unnecessary. Just remove the inner @async (the outer @sync is still worth keeping so the main process waits for the @distributed loop to finish) and it should work.
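A hedged sketch of that simplified version: since the snippet never shows where diff_sqr_list comes from, this variant lets @distributed collect the results with a (vcat) reducer and fills the array on the main process afterwards (test_func and task_list are from your code; the rest is placeholder):

# each iteration returns a vector of (index_1, index_2, value) triples;
# with a reducer, @distributed blocks until all iterations are done
results = @distributed (vcat) for j in 1:length(task_list)
    map(task_list[j]) do task
        keyName, index_1, index_2 = task
        (index_1, index_2, test_func(keyName))   # test_func must be defined @everywhere
    end
end

# fill diff_sqr_list on the main process, then display it
for (i1, i2, v) in results
    diff_sqr_list[i1, i2] = v
end
display(diff_sqr_list)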
Or was your goal to use multiple threads on each node? (I just reread your SLURM command at the top - you allocate 2 nodes with 4 cores each, right?) Did you omit some loop around the @async? Anyway, in that case you need a @sync inside the @distributed loop body as well. Think of it like this: whatever happens inside @distributed for ... end happens on the workers. Each worker should start multiple tasks and wait for their completion. So the structure would be:
@sync @distributed for pid in workers()
    # per-worker setup
    @sync for chunk in chunks        # or whatever your loops look like
        Threads.@spawn begin         # use Threads.@spawn instead of @async
            # per-chunk setup
            compute_chunk(chunk)
        end
    end
end
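One caveat with the threaded variant: Threads.@spawn only helps if the worker processes themselves have more than one thread, which by default they do not (unless JULIA_NUM_THREADS is set in the environment). You would start them with something like addprocs(num_workers, exeflags = `--project=$(Base.active_project()) --threads=4`) - the thread count of 4 here is just a guess based on your -c 4 allocation.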
Thank you! Removing the @sync and @async macros helped in this case. I’ve discovered I had a few other errors in my implementation, so I might be asking more questions soon, but for this question, your post is the answer.