I have a function that I need to call tens of thousands of times; each call takes only a few milliseconds. There is no I/O, and it doesn't return anything, so there is no communication: it is strictly compute. How do I best parallelize it?
I first tried threads. Evaluating a simple dummy function as many times as I have threads takes 5x longer than evaluating it just once:
```julia
julia> Threads.nthreads()
24

julia> foo(x) = rand(x,x,x)

julia> @time foo(100);  # showing here and below just the best of 3 runs
  0.006792 seconds (2 allocations: 7.629 MiB)

julia> @time Threads.@threads for _ in 1:Threads.nthreads()
           foo(100)
       end
  0.036396 seconds (29.44 k allocations: 184.669 MiB)

julia> 0.036396 / 0.006792
5.358657243816254
```
If I try to scale it up, there is still about 2.5x overhead:
```julia
julia> @time Threads.@threads for _ in 1:1000
           foo(100)
       end
  0.696078 seconds (31.39 k allocations: 7.452 GiB, 46.63% gc time)

julia> 0.696078 / 0.006792 / (1000/24)
2.4596395759717313
```
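One possible culprit: the 46.63% gc time above suggests much of the threaded slowdown is allocation (each `foo(100)` allocates 7.6 MiB), not scheduling. A sketch of an allocation-free variant, assuming the real workload can write into reused buffers (the per-thread buffers and `rand!` here are illustrative, not my actual code):

```julia
using Random

# One preallocated buffer per thread removes the per-call allocation, so GC
# no longer stalls the threads. The :static schedule (Julia >= 1.5) pins each
# iteration to a thread, making the threadid-indexed buffer access safe.
bufs = [Array{Float64}(undef, 100, 100, 100) for _ in 1:Threads.nthreads()]
Threads.@threads :static for _ in 1:1000
    rand!(bufs[Threads.threadid()])
end
```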
I next tried processes, since ideally I'd do 20,000 evaluations simultaneously, and no machine has that many cores.
```julia
julia> using Distributed

julia> addprocs();  # 24 hyperthreaded workers on my machine

julia> @everywhere foo(x) = rand(x,x,x)
```
Evaluating the same dummy function just once on a local process takes 2x as long:
```julia
julia> @time wait(@spawnat 2 foo(100));
  0.013471 seconds (175 allocations: 8.188 KiB)

julia> 0.013471 / 0.006792
1.9833627797408715
```
Evaluating it as many times as I have workers takes 4x as long:
```julia
julia> @time @sync for p in workers()
           @spawnat p foo(100)
       end
  0.027544 seconds (3.55 k allocations: 168.047 KiB)

julia> 0.027544 / 0.006792
4.0553592461719665
```
I can't do any better using multi-threaded tasks to coordinate the processes:
```julia
julia> @time begin
           ts = [Threads.@spawn remotecall_wait(foo, p, 100) for p in workers()]
           for t in ts; wait(t); end
       end
  0.079460 seconds (49.64 k allocations: 2.571 MiB)

julia> @time Threads.@threads for p in workers()
           remotecall_wait(foo, p, 100)
       end
  0.039949 seconds (19.57 k allocations: 1.013 MiB)
```
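Since nothing is returned, one variant I could try is batching: a single remote call per worker that runs a whole chunk of evaluations locally, amortizing the per-message round trip over many calls. A sketch (`run_chunk` is a hypothetical helper, not in my actual code; the `addprocs` guard just makes the snippet self-contained):

```julia
using Distributed
nprocs() == 1 && addprocs(4)       # assumption: spin up local workers if none yet
@everywhere foo(x) = rand(x, x, x)  # same dummy function as above

# Hypothetical helper: run n evaluations locally on the worker, return nothing.
@everywhere function run_chunk(n)
    for _ in 1:n
        foo(100)
    end
    return nothing
end

# One message per worker instead of one per evaluation.
nper = cld(1000, nworkers())
@sync for p in workers()
    @async remotecall_wait(run_chunk, p, nper)
end
```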
Ideally I'd like to spawn hundreds of workers on my cluster:
```julia
julia> using ClusterManagers

julia> addprocs_lsf(100; ssh_cmd=`ssh login1`, throttle=10)

julia> @everywhere foo(x) = rand(x,x,x)

julia> nworkers()
124
```
The time to evaluate once on a remote process is nearly 2x faster than on a local process:
```julia
julia> @time wait(@spawnat 125 foo(100));
  0.007510 seconds (176 allocations: 8.203 KiB)

julia> 0.013471 / 0.007510
1.7937416777629827
```
Still, evaluating it as many times as I have workers takes 9x longer than evaluating it just once locally:
```julia
julia> @time @sync for p in workers()
           @spawnat p foo(100)
       end
  0.059993 seconds (17.84 k allocations: 837.359 KiB)

julia> 0.059993 / 0.006792
8.832891637220259
```
Scaling up to 1000 evaluations, there is still 7x overhead:
```julia
julia> @time @sync for _ in 1:1000
           @spawnat :any foo(100)
       end
  0.394211 seconds (375.43 k allocations: 13.295 MiB)

julia> 0.394211 / 0.006792 / (1000/124)
7.197020612485277
```
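Another thing I could try here: `pmap` can do the batching itself via its `batch_size` keyword, so each message to a worker carries many evaluations instead of spawning one task per call. A sketch (the `addprocs` guard just makes the snippet self-contained; the result array is discarded since `foo`'s output isn't needed):

```julia
using Distributed
nprocs() == 1 && addprocs(4)        # assumption: local workers for the sketch
@everywhere foo(x) = rand(x, x, x)

# batch_size groups the 1000 evaluations into messages of 50; the function
# is still applied elementwise, pmap handles the grouping internally.
res = pmap(_ -> (foo(100); nothing), 1:1000; batch_size = 50)
```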
It seems as if there is a lot of overhead in the tasks managing the threads and processes. Does anyone have any ideas on how to do this more efficiently? Thanks!