So I think I follow how `Threads.@threads` works in typical cases, i.e. this pattern:

```julia
Threads.@threads for i = 1:N
    something(i)
end
```
where `something(i)` will be called N times, concurrently, by being scheduled onto, say, n threads in a threadpool, though not necessarily in any particular order. So if the threadpool has n < N threads, only n tasks run concurrently at a time; once one finishes, another is scheduled. One could set N very large and the block would keep scheduling tasks for `something(i)`, n at a time, and the overall block completes only after i has taken on every value from 1 to N.
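For instance (a minimal sketch to illustrate the scheduling; `which_threads` is just a name I made up, and the actual assignment of iterations to threads will differ from run to run):

```julia
using Base.Threads

# Record which thread ran each iteration; the assignment is
# nondeterministic and iterations need not run in order.
function which_threads(N)
    tid = zeros(Int, N)
    @threads for i = 1:N
        tid[i] = threadid()
    end
    tid
end

which_threads(8)   # e.g. [1, 1, 2, 2, 3, 3, 4, 4] on 4 threads
```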
My question is how to get this to terminate early once some criterion has been reached. The criterion would be something cumulative: say we're adding up uniform random numbers on [0, 1) and we want to stop when the sum exceeds some value S (this is obviously a contrived example, but I'm trying to keep it simple).

So I think you'd do that this way:
```julia
function f(S, N)
    done = Threads.Atomic{Bool}(false)
    rsum = Threads.Atomic{Float64}(0.0)
    Threads.@threads for i = 1:N
        Threads.atomic_or!(done, rsum[] >= S)
        done[] && break   # stop this thread's work once done
        Threads.atomic_add!(rsum, rand())
    end
    rsum[]
end
```
The sum of course may be as much as about S + Threads.nthreads(), since each thread can add one last rand() after its final check, but let's say we can accept a little overrun. In fact, as one might expect, it looks like on average it overruns by about Threads.nthreads() / 2 (each of those final uniform draws averaging 1/2):
```julia
julia> Threads.nthreads()
8

julia> mean([f(1000, typemax(Int64)) for _ in 1:1000])
1003.7479534404102
```
Now, my question was going to be: is there a way to have a pattern like

```julia
Threads.@threads while rsum[] < S
    something()
end
```

?
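In other words, something that would behave roughly like hand-rolling it with `Threads.@spawn`, one looping task per thread (a sketch of what I mean; `f_spawn` and its structure are my own invention, not an existing construct):

```julia
using Base.Threads

# One task per thread, each looping until the shared condition trips;
# the overall overshoot behaves like the @threads version above.
function f_spawn(S)
    rsum = Atomic{Float64}(0.0)
    tasks = [Threads.@spawn begin
                 while rsum[] < S
                     atomic_add!(rsum, rand())
                 end
             end for _ in 1:nthreads()]
    foreach(wait, tasks)
    rsum[]
end
```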
In part it's because I was thinking that what I have above would be pretty inefficient as N increases: once the done flag is true (on average that takes about 2S iterations for this code, since each uniform draw contributes 1/2 on average), Julia still has to schedule the remaining iterations. To my surprise, though, running out those remaining iterations seems to be quite cheap:
```julia
julia> @benchmark f(S, 1000)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  93.875 μs … 1.950 ms  ┊ GC (min … max): 0.00% … 88.44%
 Time  (median):     223.458 μs            ┊ GC (median):    0.00%
 Time  (mean ± σ):   225.653 μs ± 40.987 μs ┊ GC (mean ± σ): 0.08% ± 0.88%

                  ▁▂▃▄▄▄▆▆▆▇█▆▅▃▂▁
  ▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▃▃▄▄▆▆██████████████████▇▇▇▆▅▄▃▃▃▂▂▂▂▂▁▁▂▂▂▂▂ ▄
  93.9 μs          Histogram: frequency by time          344 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, 10_000)
BenchmarkTools.Trial: 8305 samples with 1 evaluation.
 Range (min … max):  245.584 μs … 2.311 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     570.125 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   601.192 μs ± 92.648 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

                             ▃█▆▂
  ▂▁▁▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▃▅████▇▇▆▆▅▅▄▄▄▄▄▅▄▅▄▄▄▅▆███▅▄▃▂▂▂ ▃
  246 μs           Histogram: frequency by time          790 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, 100_000)
BenchmarkTools.Trial: 8360 samples with 1 evaluation.
 Range (min … max):  293.292 μs … 946.250 μs ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     562.230 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   597.242 μs ± 88.290 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                           ▃█▆▃
  ▂▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▂▂▁▂▂▃▆█████▆▆▆▆▅▅▄▄▄▄▄▄▄▃▃▃▃▃▄▄▅▆▆██▆▅▃▃ ▃
  293 μs           Histogram: frequency by time          764 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, 1_000_000)
BenchmarkTools.Trial: 8373 samples with 1 evaluation.
 Range (min … max):  343.041 μs … 1.068 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     561.041 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   596.332 μs ± 87.863 μs ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▄██▄▁
  ▂▂▂▁▁▂▁▁▁▁▁▂▁▂▁▂▂▁▁▂▂▄▇█████▇▆▆▆▅▅▅▄▄▄▄▄▄▄▄▃▃▄▃▃▃▄▄▅▆▇▇█▇▆▄▃ ▄
  343 μs           Histogram: frequency by time          758 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, 10_000_000)
BenchmarkTools.Trial: 8371 samples with 1 evaluation.
 Range (min … max):  213.125 μs … 958.625 μs ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     559.542 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   596.468 μs ± 88.037 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                ▃█▆▂
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▂▂▁▁▂▂▂▂▂▂▅████▇▆▆▅▅▅▄▄▄▄▄▄▃▃▃▃▄▅▆▇█▇▆▄▃ ▃
  213 μs           Histogram: frequency by time          760 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, 100_000_000)
BenchmarkTools.Trial: 8357 samples with 1 evaluation.
 Range (min … max):  324.667 μs … 901.875 μs ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     561.458 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   597.500 μs ± 87.812 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                        ▆█▆▃
  ▂▁▁▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▂▂▁▂▂▄██████▇▆▆▅▅▅▄▄▄▄▄▄▄▄▄▃▃▃▃▄▄▆▆▇██▇▆▄▃▃ ▃
  325 μs           Histogram: frequency by time          766 μs <

 Memory estimate: 4.73 KiB, allocs estimate: 52.

julia> @benchmark f(S, typemax(Int64))
BenchmarkTools.Trial: 8006 samples with 1 evaluation.
 Range (min … max):  88.000 μs … 1.603 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     622.479 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   623.616 μs ± 101.896 μs ┊ GC (mean ± σ): 0.00% ± 0.00%

                                 █▅ ▁▁
  ▂▁▁▁▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃█████▇▇▅▆▆▇▇█▇▆▆▇███▆▅▄▃▃▃▂▂▃ ▃
  88 μs            Histogram: frequency by time          863 μs <

 Memory estimate: 4.58 KiB, allocs estimate: 47.
```
So in a case where we can pick N, and even when N is ridiculously large (like typemax(Int64), as you can see), I guess this pattern suffices.

Still, what if we don't know N? Would this pattern just have to be wrapped in an outer while !done[] loop, with N per pass picked from the specifics of the algorithm (i.e. bounded below by some minimum number of iterations)? And what if you don't want to give up those wasted cycles?
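Concretely, the outer-loop wrapping I have in mind would be something like this sketch (`f_batched` and `Nbatch` are names I'm inventing for the per-round function and batch size):

```julia
function f_batched(S, Nbatch)
    done = Threads.Atomic{Bool}(false)
    rsum = Threads.Atomic{Float64}(0.0)
    while !done[]
        Threads.@threads for i = 1:Nbatch
            done[] && continue   # drain the rest of this chunk cheaply
            Threads.atomic_add!(rsum, rand())
            rsum[] >= S && (done[] = true)
        end
    end
    rsum[]
end
```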
Sorry for the lengthy post, and I hope I simply haven't missed some existing `@threads while ...` construct!