Hey there!
So I looked a bit at your code, and it has a lot of memory allocations that may prevent threads from working well in parallel. Before reaching for multithreading, it's always a good idea to squeeze every bit of performance out of the serial code. A great resource for that is Julia's performance tips.
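One quick way to locate such allocations is Base's @allocated macro, which reports the bytes an expression allocates. A minimal sketch (the function names double / double! here are made up for illustration):

```julia
# An out-of-place version allocates a fresh array on every call,
# while an in-place version reuses a preallocated buffer.
double(xs) = xs .* 2                 # allocates a new array each call
double!(buf, xs) = (buf .= xs .* 2)  # writes into buf, no new array

function measure()
    xs = rand(1000)
    buf = similar(xs)
    double(xs); double!(buf, xs)     # warm up so compilation is not counted
    return @allocated(double(xs)), @allocated(double!(buf, xs))
end

out_bytes, inplace_bytes = measure()
```

Running hot loops through a check like this is a cheap way to confirm they are allocation-free before worrying about threads.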
For reference, here is the performance of your code on my computer:
julia> @benchmark yourmain()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.947 s …   3.277 s  ┊ GC (min … max):  8.43% … 14.76%
 Time  (median):     3.112 s              ┊ GC (median):    11.76%
 Time  (mean ± σ):   3.112 s ± 232.697 ms ┊ GC (mean ± σ):  11.76% ±  4.47%
  2.95 s         Histogram: frequency by time         3.28 s <
 Memory estimate: 19.91 GiB, allocs estimate: 725338.
julia> @benchmark yourmain_parallel()
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.027 s …   3.068 s  ┊ GC (min … max):  4.16% … 11.03%
 Time  (median):     3.047 s              ┊ GC (median):     7.62%
 Time  (mean ± σ):   3.047 s ± 28.451 ms  ┊ GC (mean ± σ):   7.62% ±  4.86%
  3.03 s         Histogram: frequency by time         3.07 s <
 Memory estimate: 19.91 GiB, allocs estimate: 725427.
From what I can see, most of the allocations in there (calls to copy or collect) are actually not needed. Besides, it seems that your A does not have the right format (remove the [...] around sample?). Here's the first version I tried:
using StatsBase: sample  # sample comes from StatsBase.jl

function main(num=50000)
    stops = [sample(1:num, 1000; replace=false) for _ in 1:num]
    for i in 1:2:num
        t1, t2 = i, i + 1
        t1_stops, t2_stops = stops[t1], stops[t2]
        j1, j2 = rand(1:length(t1_stops)), rand(1:length(t2_stops))
        insert!(t2_stops, j2, t1_stops[j1])
        splice!(t1_stops, j1)
    end
end
Let's see how it compares to yours (every time, the _parallel counterpart just has a @floop macro in front of the for loop):
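For reference, here is what such a _parallel variant looks like. This is a sketch assuming FLoops.jl for @floop and StatsBase.jl for sample; the body mirrors main above, and the return statement is only added so the result can be inspected:

```julia
using FLoops             # FLoops.jl provides the @floop macro
using StatsBase: sample  # StatsBase.jl provides sample

function main_parallel(num=50000)
    stops = [sample(1:num, 1000; replace=false) for _ in 1:num]
    # Each iteration only touches the pair (stops[i], stops[i + 1]),
    # so the iterations are independent and safe to run concurrently.
    @floop for i in 1:2:num
        t1, t2 = i, i + 1
        t1_stops, t2_stops = stops[t1], stops[t2]
        j1, j2 = rand(1:length(t1_stops)), rand(1:length(t2_stops))
        insert!(t2_stops, j2, t1_stops[j1])
        splice!(t1_stops, j1)
    end
    return stops  # returned here so the result can be inspected
end
```

After a run, each odd-indexed vector has lost one element and each even-indexed one has gained one.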
julia> @benchmark main()
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.329 s …   1.534 s  ┊ GC (min … max):  4.05% … 9.71%
 Time  (median):     1.500 s              ┊ GC (median):     9.72%
 Time  (mean ± σ):   1.466 s ± 94.484 ms  ┊ GC (mean ± σ):   8.43% ± 2.84%
  1.33 s         Histogram: frequency by time         1.53 s <
 Memory estimate: 1.67 GiB, allocs estimate: 425322.
julia> @benchmark main_parallel()
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  1.163 s …   1.401 s  ┊ GC (min … max):  1.34% … 14.10%
 Time  (median):     1.390 s              ┊ GC (median):    12.37%
 Time  (mean ± σ):   1.336 s ± 115.509 ms ┊ GC (mean ± σ):  10.42% ±  5.87%
  1.16 s         Histogram: frequency by time         1.4 s <
 Memory estimate: 1.67 GiB, allocs estimate: 425427.
Not bad, right? Still, the speedup obtained by multithreading is not great, so I decided to profile my function. It turns out that most of the time is spent initializing the big array, which we do not parallelize. So I wrote another version which modifies an existing array instead.
function init(num=50000)
    stops = [sample(1:num, 1000; replace=false) for _ in 1:num]
    return stops
end

function main!(stops)
    for i in 1:2:length(stops)
        t1, t2 = i, i + 1
        t1_stops, t2_stops = stops[t1], stops[t2]
        j1, j2 = rand(1:length(t1_stops)), rand(1:length(t2_stops))
        insert!(t2_stops, j2, t1_stops[j1])
        splice!(t1_stops, j1)
    end
end
Let's run another benchmark!
julia> @benchmark main!(stops) setup=(stops=init())
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  36.784 ms … 73.134 ms  ┊ GC (min … max):  0.00% … 48.15%
 Time  (median):     44.647 ms              ┊ GC (median):    14.84%
 Time  (mean ± σ):   49.803 ms ± 17.156 ms  ┊ GC (mean ± σ):  24.33% ± 23.17%
  36.8 ms        Histogram: frequency by time        73.1 ms <
 Memory estimate: 411.99 MiB, allocs estimate: 25000.
julia> @benchmark main_parallel!(stops) setup=(stops=init())
BenchmarkTools.Trial: 4 samples with 1 evaluation.
 Range (min … max):  29.238 ms … 157.372 ms  ┊ GC (min … max):  0.00% … 80.41%
 Time  (median):     29.898 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   61.602 ms ±  63.848 ms  ┊ GC (mean ± σ):  51.35% ± 40.20%
  29.2 ms        Histogram: frequency by time         157 ms <
 Memory estimate: 412.00 MiB, allocs estimate: 25111.
Unsurprisingly this is much faster, but there is still no benefit from multithreading. That is probably because the array insertions and deletions still allocate too much: growing a vector with insert! can force a reallocation, and that pressure on the GC serializes the threads. My best advice would be to take advantage of the bounded array size to circumvent those allocations entirely.
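For instance, if your algorithm allows exchanging a stop between the two vectors rather than moving one (a hypothetical variant, not necessarily equivalent to your original logic), a plain swap keeps both vectors at a fixed length and allocates nothing in the loop:

```julia
# Swap one random stop between consecutive pairs instead of using
# insert!/splice!: vector lengths never change, so nothing reallocates.
function main_swap!(stops)
    for i in 1:2:length(stops)
        t1_stops, t2_stops = stops[i], stops[i + 1]
        j1 = rand(1:length(t1_stops))
        j2 = rand(1:length(t2_stops))
        t1_stops[j1], t2_stops[j2] = t2_stops[j2], t1_stops[j1]
    end
    return stops
end
```

Because every iteration is allocation-free and touches only its own pair, a loop like this is also a much better candidate for @floop.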
All of this was run with 12 threads.