Adding the GreedyScheduler trick I showed above, here's what I see on my 6-core machine with 6 threads:
julia> @time foreach(waste_time, work_units);
25.933839 seconds (2.01 M allocations: 6.321 GiB, 1.44% gc time, 0.01% compilation time)
julia> @time tforeach(waste_time, work_units);
17.283795 seconds (2.63 M allocations: 6.363 GiB, 0.83% gc time, 2.55% compilation time)
julia> @time tforeach(waste_time, work_units, scheduler=DynamicScheduler(nchunks=4*Threads.nthreads(), split=:scatter))
16.796721 seconds (2.03 M allocations: 6.323 GiB, 0.26% gc time, 0.06% compilation time)
julia> @time tforeach(chunks(work_units; n=2_000); scheduler=GreedyScheduler()) do inds
           foreach(waste_time, view(work_units, inds))
       end
13.069916 seconds (2.07 M allocations: 6.325 GiB, 0.42% gc time, 0.64% compilation time)
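To build some intuition for why the chunked GreedyScheduler call helps, the greedy strategy can be sketched with nothing but a Base Channel acting as a shared work queue: whichever task finishes its chunk first simply grabs the next one, so a task that gets stuck on an expensive chunk doesn't hold everything else up. The following is just an illustration of that idea (greedy_foreach, nchunks, and ntasks are names I made up here), not how OhMyThreads actually implements GreedyScheduler:

using ChunkSplitters: chunks   # same `chunks` as in the call above

function greedy_foreach(f, xs; nchunks = 2_000, ntasks = Threads.nthreads())
    # Fill a thread-safe queue with index chunks up front.
    queue = Channel{UnitRange{Int}}(Inf)
    for inds in chunks(xs; n = nchunks)
        put!(queue, inds)
    end
    close(queue)
    # Each task greedily pulls the next chunk as soon as it finishes one,
    # and stops once the queue is drained.
    tasks = map(1:ntasks) do _
        Threads.@spawn for inds in queue
            foreach(f, view(xs, inds))
        end
    end
    foreach(wait, tasks)
end

# greedy_foreach(waste_time, work_units)   # roughly what the tforeach call above does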
I don't think this is very surprising though when we look at work_units:
julia> histogram(work_units)
                    ┌                                        ┐
   [   0.0,  200.0) ┤███████████████████████████████  999 393
   [ 200.0,  400.0) ┤▏ 379
   [ 400.0,  600.0) ┤▏ 106
   [ 600.0,  800.0) ┤▏ 43
   [ 800.0, 1000.0) ┤▏ 16
   [1000.0, 1200.0) ┤▏ 17
   [1200.0, 1400.0) ┤▏ 9
   [1400.0, 1600.0) ┤▏ 8
   [1600.0, 1800.0) ┤▏ 9
   [1800.0, 2000.0) ┤  0
   [2000.0, 2200.0) ┤▏ 5
   [2200.0, 2400.0) ┤▏ 2
   [2400.0, 2600.0) ┤▏ 3
   [2600.0, 2800.0) ┤▏ 2
   [2800.0, 3000.0) ┤  0
   [3000.0, 3200.0) ┤  0
   [3200.0, 3400.0) ┤  0
   [3400.0, 3600.0) ┤▏ 1
   [3600.0, 3800.0) ┤  0
   [3800.0, 4000.0) ┤▏ 1
   [4000.0, 4200.0) ┤▏ 6
                    └                                        ┘
                                    Frequency
The highly expensive work-units are tiny needles in a very big haystack: 99.9393% of them fall between 0 and 200, and only 0.0006% of them are the very expensive entries up around 4000. Are there any multithreading systems out there that can get full thread utilization out of a case this pathological, even full-on work-stealing systems?
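(For the record, those fractions follow directly from the histogram: the bin counts sum to one million work units, so they could be checked with something like the following.)

count(<(200), work_units) / length(work_units)      # ≈ 0.999393, the cheap bulk
count(>=(4000), work_units) / length(work_units)    # ≈ 0.000006 (0.0006%), the six expensive outliers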
I think there just aren't enough objects in the long tail of the distribution to get efficient multithreading out of this, unless we do something intelligent like trying to predict how long a given work unit will take to run and then partitioning the work units based on that (sketched below).
I do expect though that if you add more work units, you'll get closer to full utilization, but since the distribution is so steep, it'll take a long time to converge.
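To sketch what the "predict the cost, then partition" idea could look like: given some hypothetical cost model predict_cost (any function roughly monotone in the true runtime; cost_aware_tforeach and predict_cost are names I made up, not part of OhMyThreads), you could sort the work units so the predicted-expensive ones are scheduled first and let the greedy scheduler fill in the cheap bulk behind them:

using OhMyThreads: tforeach, GreedyScheduler
using ChunkSplitters: chunks

function cost_aware_tforeach(f, xs, predict_cost; nchunks = 2_000)
    order  = sortperm(xs; by = predict_cost, rev = true)   # most expensive first
    sorted = view(xs, order)
    tforeach(chunks(sorted; n = nchunks); scheduler = GreedyScheduler()) do inds
        foreach(f, view(sorted, inds))
    end
end

# If the runtime of waste_time scales with the value of the work unit itself,
# identity is already a usable cost model:
# cost_aware_tforeach(waste_time, work_units, identity)

Sorting like this is essentially the classic longest-processing-time-first heuristic, so the rare expensive units start running immediately instead of landing at the end of someone's queue.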