Hello all, I’m 3 days into my Julia journey. Coming from Python, I’m very excited by the speed. I’m currently toying with multithreading. I actually have tons of questions, but let’s start with this one…
In the actual script I’m writing, I have two 2D arrays, where each element of b is the sum of the 4 adjacent elements of b, plus the corresponding element of the second array, a. I wanted to make a very basic example to see whether a thread-safe version could be faster than my single-threaded version or not (to make it thread safe, it first does the even rows, syncs, and then does the odd rows, so that I’m never mixing in values of b that another thread is updating at the same time).
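Roughly, the update in my actual script looks like this - the function and array names here are made up for this post, and I’ve left out the boundary handling:

function update_b!(b::Matrix{Float32}, a::Matrix{Float32})
    h, w = size(b)
    # Even interior rows first, then odd interior rows, so the rows a thread reads
    # (jdx - 1 and jdx + 1) are never being written by another thread at the same time.
    for start in (2, 3)
        Threads.@threads for jdx = start:2:(h - 1)
            for kdx = 2:(w - 1)
                @inbounds b[jdx, kdx] = b[jdx - 1, kdx] + b[jdx + 1, kdx] +
                                        b[jdx, kdx - 1] + b[jdx, kdx + 1] +
                                        a[jdx, kdx]
            end
        end  # @threads blocks until all its iterations finish, which is my "sync"
    end
    return b
end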
However, in my tests I noticed some strange behaviour: above a certain array size, multithreading goes from being ~4.5x faster (with 8 threads; my machine has 8 cores / 16 logical processors) to only about 20% faster, the only difference being input arrays that are twice as large. Below is a minimal example that reproduces this on my machine.
using BenchmarkTools

function sum_lle(a::Array{Float32, 2}, b::Array{Float32, 2})
    h, w = size(b)
    Threads.@threads for jdx = 1:h
        for kdx = 1:w
            @inbounds b[jdx, kdx] = a[jdx, kdx + 1] + a[jdx + 2, kdx + 1] + a[jdx + 1, kdx] + a[jdx + 1, kdx + 2]
        end
    end
    return
end

function sum_seq(a::Array{Float32, 2}, b::Array{Float32, 2})
    h, w = size(b)
    for jdx = 1:h
        for kdx = 1:w
            @inbounds b[jdx, kdx] = a[jdx, kdx + 1] + a[jdx + 2, kdx + 1] + a[jdx + 1, kdx] + a[jdx + 1, kdx + 2]
        end
    end
    return
end

function main()
    sm = 5
    md = 14
    lg = 15
    a_sm = convert.(Float32, rand(64 + 2, 2^sm + 2))
    b_sm = zeros(Float32, 64, 2^sm)
    a_md = convert.(Float32, rand(64 + 2, 2^md + 2))
    b_md = zeros(Float32, 64, 2^md)
    a_lg = convert.(Float32, rand(64 + 2, 2^lg + 2))
    b_lg = zeros(Float32, 64, 2^lg)
    println(sizeof(a_sm))
    println(sizeof(a_md))
    println(sizeof(a_lg))
    @btime sum_seq($a_sm, $b_sm)
    @btime sum_seq($a_md, $b_md)
    @btime sum_seq($a_lg, $b_lg)
    @btime sum_lle($a_sm, $b_sm)
    @btime sum_lle($a_md, $b_md)
    @btime sum_lle($a_lg, $b_lg)
end

main()
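In case it matters: I’m running with 8 threads, i.e. I start Julia with something like

julia --threads=8 script.jl

(the file name is just a placeholder), and Threads.nthreads() reports 8.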
The results I get:
8976
4325904
8651280
1.620 μs (0 allocations: 0 bytes)
1.375 ms (0 allocations: 0 bytes)
5.526 ms (0 allocations: 0 bytes)
2.962 μs (41 allocations: 4.16 KiB)
299.600 μs (41 allocations: 4.16 KiB)
4.172 ms (41 allocations: 4.16 KiB)
A few questions:
- Can anyone explain this? My suspicion is that it has to do with the shared cache size of my processor (my shared L3 cache is 16MB), but I have no idea - it just seems suspiciously convenient, since for the large case a is 8,651,280 bytes and b is 8,388,608 bytes, so the two arrays together are about 17 MB, just past the cache size. I would have expected single-threaded and multithreaded performance to diverge more as the tasks get larger, and this seems to be true for smaller arrays, but the trend reverses at some point.
- Is there a good way to work around this?
- What the heck are the $s for, and why are they needed with the @btime macro?
- Any tips to make my code more Julian are welcome!
- Bonus question: why am I finding that explicitly writing out loops is faster than vectorized math (a sketch of the comparison I mean is right after this list)? I would have thought Julia had heavily optimized operations like
my_arr[1, :, :] .= 0.0
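To be concrete, this is roughly the comparison I have in mind - my_arr here just stands in for one of my 3D Float32 arrays, and the sizes are made up:

my_arr = zeros(Float32, 4, 64, 2^15)

# "vectorized" version
my_arr[1, :, :] .= 0.0

# explicit-loop version, which I’m finding faster
for k in axes(my_arr, 3), j in axes(my_arr, 2)
    @inbounds my_arr[1, j, k] = 0.0
end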
Thank you in advance for the help!
Edit:
One additional question: I’ve found that when I swap the order of the loops (kdx, which indexes the very large dimension, on the outside, and jdx, size 64, on the inside), I quadruple the speed for both serial and parallel execution, and the problem mentioned above goes away for 2^15 but is still present for 2^16. Why is THIS faster?
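Concretely, the swapped serial version looks like this (the _swapped name is just for this post; the parallel one is analogous, with Threads.@threads on its outer loop):

function sum_seq_swapped(a::Array{Float32, 2}, b::Array{Float32, 2})
    h, w = size(b)
    for kdx = 1:w          # the very large dimension, now on the outside
        for jdx = 1:h      # the size-64 dimension, now on the inside
            @inbounds b[jdx, kdx] = a[jdx, kdx + 1] + a[jdx + 2, kdx + 1] + a[jdx + 1, kdx] + a[jdx + 1, kdx + 2]
        end
    end
    return
end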