There is also a discussion of exactly this kind of load balancing here: Parallel load balancing · JuliaNotes.jl
But basically, you can do this:
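using ChunkSplitters
using Base.Threads: @spawn
function run(; nchunks=Threads.nthreads())
    # One cache per chunk (not per thread), so no two tasks ever share a cache
    threadcache = [create_cache() for _ in 1:nchunks]
    # Preallocate the output; the element type is left generic here
    store_results = Vector{Any}(undef, length(data))
    # Iterates (range, chunk index) tuples; newer ChunkSplitters releases may use a different call signature
    @sync for (i_range, i_chunk) in chunks(data, nchunks)
        @spawn for i in i_range
            cache = threadcache[i_chunk]
            result = compute_stuff(..., cache) # cache will be mutated in this function
            store_results[i] = result # each task writes only to its own indices
        end
    end
    return store_results
end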
You can set nchunks to any value, for instance 10 times nthreads(), to take advantage of dynamic scheduling. This stays thread-safe because each cache is indexed by chunk rather than by threadid(), so a task that migrates between threads still works on its own cache.
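For something you can paste and run directly, here is a minimal sketch of the same pattern. It assumes a few things that are not in the post above: `create_cache` and `compute_stuff` are hypothetical stand-ins, `run_chunked` is just an illustrative name, and the chunk ranges come from Base's `Iterators.partition` instead of ChunkSplitters, so the example does not depend on a particular package version (the chunks are a bit less evenly sized than what ChunkSplitters would give you).

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-ins for the functions in the post
create_cache() = zeros(3)            # a mutable scratch buffer

function compute_stuff(x, cache)
    cache .= x                       # mutate the buffer in place
    return sum(cache)                # equals 3x for this toy buffer
end

function run_chunked(data; nchunks=nthreads())
    results = Vector{Float64}(undef, length(data))
    # Split the indices into at most `nchunks` contiguous ranges
    chunk_len = max(1, cld(length(data), nchunks))
    ranges = collect(Iterators.partition(eachindex(data), chunk_len))
    # One cache per chunk, so tasks never share a buffer
    caches = [create_cache() for _ in ranges]
    @sync for (i_chunk, i_range) in enumerate(ranges)
        @spawn begin
            cache = caches[i_chunk]
            for i in i_range
                results[i] = compute_stuff(data[i], cache)
            end
        end
    end
    return results
end

data = collect(1.0:100.0)
results = run_chunked(data; nchunks=10 * nthreads())
```

Since every task only touches `caches[i_chunk]` and its own slice of `results`, it does not matter which thread the scheduler runs it on. Start Julia with more than one thread (for example `julia -t 4`) to see the tasks actually run in parallel.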