I have a function that I need to call tens of thousands of times, and each call takes only a few milliseconds. There is no I/O, and it doesn’t return anything, so no communication is needed; it is strictly compute. How do I best parallelize it?
I first tried threads. Evaluating a simple dummy function as many times as I have threads takes 5x longer than just evaluating it once:
julia> Threads.nthreads()
24
julia> foo(x) = rand(x,x,x)
julia> @time foo(100); # showing here and below just the best of 3 runs
0.006792 seconds (2 allocations: 7.629 MiB)
julia> @time Threads.@threads for _ in 1:Threads.nthreads()
           foo(100)
       end
0.036396 seconds (29.44 k allocations: 184.669 MiB)
julia> 0.036396 / 0.006792
5.358657243816254
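One thing I notice is that `foo` itself allocates 7.6 MiB per call, so part of the gap may be allocation cost rather than threading overhead. A minimal sketch to separate the two, using a hypothetical allocation-free `busywork` kernel in place of my real workload:

```julia
using Base.Threads

# busywork is a made-up, non-allocating stand-in for the real function
function busywork(n)
    s = 0.0
    for i in 1:n
        s += sin(i)  # pure compute, no allocations
    end
    return s
end

busywork(10)  # warm up so compilation isn't timed

@time busywork(10^7)                  # serial baseline
@time @threads for _ in 1:nthreads()  # one call per thread
    busywork(10^7)
end
```

If the threaded loop here scales well while the `foo` version doesn’t, that would point at allocations/GC rather than the threads themselves.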
If I try to scale it up, there is still about 2.5x overhead:
julia> @time Threads.@threads for _ in 1:1000
           foo(100)
       end
0.696078 seconds (31.39 k allocations: 7.452 GiB, 46.63% gc time)
julia> 0.696078 / 0.006792 / (1000/24)
2.4596395759717313
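Given the 46% GC time reported above, here is a sketch that preallocates one buffer per thread and fills it in place with `rand!` (this assumes my real workload could be rewritten to mutate a preallocated buffer, which may not hold):

```julia
using Random, Base.Threads

# one preallocated 100×100×100 buffer per thread
bufs = [Array{Float64}(undef, 100, 100, 100) for _ in 1:nthreads()]

# :static scheduling pins each iteration to its thread,
# so indexing by threadid() is safe
@time @threads :static for _ in 1:1000
    rand!(bufs[threadid()])  # fill in place, no allocation in the loop
end
```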
I next tried processes, as ideally I’d do 20,000 evaluations simultaneously, and no machine has that many cores.
julia> using Distributed
julia> addprocs(); # 24 hyperthreaded workers on my machine
julia> @everywhere foo(x) = rand(x,x,x)
Evaluating the same dummy function just once on a local worker process takes 2x as long:
julia> @time wait(@spawnat 2 foo(100));
0.013471 seconds (175 allocations: 8.188 KiB)
julia> 0.013471 / 0.006792
1.9833627797408715
Evaluating it as many times as I have workers takes 4x as long:
julia> @time @sync for p in workers()
           @spawnat p foo(100)
       end
0.027544 seconds (3.55 k allocations: 168.047 KiB)
julia> 0.027544 / 0.006792
4.0553592461719665
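Perhaps batching would amortize the per-message cost? A sketch with `pmap`’s `batch_size` keyword (50 is just a guess to be tuned):

```julia
using Distributed
addprocs()  # local workers, as above
@everywhere foo(x) = rand(x, x, x)

# ship evaluations to workers in batches of 50 rather than one by one;
# return nothing so the 7.6 MiB result isn't serialized back
@time pmap(_ -> (foo(100); nothing), 1:1000; batch_size=50)
```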
I can’t do any better when using multi-threaded tasks to coordinate the processes:
julia> @time begin
           ts = [Threads.@spawn remotecall_wait(foo, p, 100) for p in workers()]
           for t in ts; wait(t); end
       end
0.079460 seconds (49.64 k allocations: 2.571 MiB)
julia> @time Threads.@threads for p in workers()
           remotecall_wait(foo, p, 100)
       end
0.039949 seconds (19.57 k allocations: 1.013 MiB)
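I also wondered whether plain async tasks (no extra OS threads) are enough to coordinate the workers, since `remotecall_wait` just blocks its task while the worker computes. A sketch:

```julia
using Distributed
addprocs()  # local workers, as above
@everywhere foo(x) = rand(x, x, x)

# one lightweight task per worker; they all wait concurrently
@time asyncmap(p -> remotecall_wait(foo, p, 100), workers())
```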
Ideally, I’d like to spawn hundreds of workers on my cluster:
julia> using ClusterManagers
julia> addprocs_lsf(100; ssh_cmd=`ssh login1`, throttle=10)
julia> @everywhere foo(x) = rand(x,x,x)
julia> nworkers()
124
The time to evaluate once on a remote process is almost 2x faster than doing so on a local process:
julia> @time wait(@spawnat 125 foo(100));
0.007510 seconds (176 allocations: 8.203 KiB)
julia> 0.013471 / 0.007510
1.7937416777629827
Still, evaluating it as many times as I have workers takes 9x longer than just evaluating it once locally:
julia> @time @sync for p in workers()
           @spawnat p foo(100)
       end
0.059993 seconds (17.84 k allocations: 837.359 KiB)
julia> 0.059993 / 0.006792
8.832891637220259
Scaling up to 1000 evaluations, there is still 7x overhead:
julia> @time @sync for _ in 1:1000
           @spawnat :any foo(100)
       end
0.394211 seconds (375.43 k allocations: 13.295 MiB)
julia> 0.394211 / 0.006792 / (1000/124)
7.197020612485277
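Maybe `@distributed` would do better here, since it splits the range into one contiguous chunk per worker up front, i.e. one message per worker instead of one per evaluation:

```julia
using Distributed
# assuming workers are already added (locally or via ClusterManagers)
@everywhere foo(x) = rand(x, x, x)

# @sync waits for all chunks; without it @distributed returns immediately
@time @sync @distributed for _ in 1:1000
    foo(100)
end
```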
It seems as if there is a lot of overhead in the tasks managing the threads and processes. Does anyone have any ideas on how to do this more efficiently? Thanks!