Threads.@threads slow down when arrays become too large

Hello all, I’m 3 days into my Julia journey - coming from python, I’m very excited by the speed. I’m currently toying with multithreading. I actually have tons of questions, but let’s start with this one…

In the actual script I’m writing, I have two 2D arrays, where each index of b is the sum of the 4 adjacent indices of b, plus the corresponding index of the second array, a. I wanted to make a very basic example to see if a thread safe version could be faster than my single threaded version or not (to make it thread safe, it first does the even rows, syncs, then does the odd rows, so that I’m never working with old values of b).

However in my tests I noticed some strange behaviours - above a certain array size, multithreading goes from being ~4.5x faster (for 8 threads, my machine has 8 cores/16 logical processors) to only about 20% faster, with the only difference being input arrays which are twice as large. Below is a minimal example to reproduce, on my machine.

using BenchmarkTools

function sum_lle(a::Array{Float32, 2}, b::Array{Float32, 2})
    h, w = size(b)
    Threads.@threads for jdx = 1:h
        for kdx = 1:w
            @inbounds b[jdx, kdx] = a[jdx, kdx + 1] + a[jdx + 2, kdx + 1] + a[jdx + 1, kdx] + a[jdx + 1, kdx + 2]

function sum_seq(a::Array{Float32, 2}, b::Array{Float32, 2})
    h, w = size(b)
    for jdx = 1:h
        for kdx = 1:w
            @inbounds b[jdx, kdx] = a[jdx, kdx + 1] + a[jdx + 2, kdx + 1] + a[jdx + 1, kdx] + a[jdx + 1, kdx + 2]

function main()
    sm = 5
    md = 14
    lg = 15
    a_sm = convert.(Float32, rand(64 + 2, 2^sm + 2))
    b_sm = zeros(Float32, 64, 2^sm)
    a_md = convert.(Float32, rand(64 + 2, 2^md + 2))
    b_md = zeros(Float32, 64, 2^md)
    a_lg = convert.(Float32, rand(64 + 2, 2^lg + 2))
    b_lg = zeros(Float32, 64, 2^lg)


    @btime sum_seq($a_sm, $b_sm)
    @btime sum_seq($a_md, $b_md)
    @btime sum_seq($a_lg, $b_lg)
    @btime sum_lle($a_sm, $b_sm)
    @btime sum_lle($a_md, $b_md)
    @btime sum_lle($a_lg, $b_lg)

The results I get:

  1.620 μs (0 allocations: 0 bytes)
  1.375 ms (0 allocations: 0 bytes)
  5.526 ms (0 allocations: 0 bytes)
  2.962 μs (41 allocations: 4.16 KiB)
  299.600 μs (41 allocations: 4.16 KiB)
  4.172 ms (41 allocations: 4.16 KiB)

A few questions:

  1. Can anyone explain this? My suspicion is that it has to do with the shared cache size of my processor (L1 is 16MB) but I have no idea, it just seems convenient. I would have expected single thread performance and multithread performance to diverge more as the tasks get larger, and this seems to be true for smaller arrays, but the trend reverses at some point.
  2. Is there a good way to work around this?
  3. What the heck are the $ for, and why is it needed in the @btime macro?
  4. Any tips to make my code more Julian are welcome!
  5. Bonus question: why am I finding that explicitly writing out loops is faster than vectorized math? I would have thought Julia had heavily optimized operations like my_arr[1, :, :] .= 0.0

Thank you in advance for the help!


One additional question. I’ve found that when I swap the order of the loops (kdx, which indexes the very large dimension, on the outside, followed by jdx, size 64, on the inside) I quadruple the speed for both serial and parallel execution, and the problem mentioned above goes away for 2^15, but is still present for 2^16. Why is THIS faster?

  1. might be when array size becomes too large, you’re essentially memory speed/bandwidth bound
  2. if your task is memory bound, there’s not much you can do, luckily, real-world example would hopefully have more actual computation so you see speed up scales better
  3. See the docs:

If the expression you want to benchmark depends on external variables, you should use $ to

you can directly rand a matrix with a specific type:

rand(Float32, 64 + 2, 2^sm + 2)
  1. Because vectorized math is just a loop in a not-so-slow language (SIMD vectorization is an ~orthogonal thing to what appear to be “vectorized style” here). In general, if something should be fast and is not, feel free to raise issue, people love fixing broadcast slower than loop performance pitfalls.

Julia is column-major when it comes to array memory layout, just like Fortran / MATLAB.

1 Like