How to speed up this simple code? Multithreading, simd, inbounds

As @foobar_lv2 pointed out above, once you use a for loop or generator for the inner summation, the benchmark is flawed since the compiler will figure out that it can replace the loop with a formula. You can see this by either looking at the native code, or estimate if the elapsed time makes sense at all for the number of operations done, or simply grow the problem by a factor 10 and see how the time is affected. For this problem, it should be roughly quadratic, so if you see a linear growth, something is wrong.

To prevent this optimization from happening, you can instead use an array of numbers:

function my_sum(v)
    acc = zero(eltype(v))
    for i = 1:length(v), j = 1:i
        acc += v[j] * v[j]
    end
    acc
end

julia> Random.seed!(0); v = rand(100);

julia> @btime my_sum($v);
  5.776 μs (0 allocations: 0 bytes)

julia> Random.seed!(0); v = rand(1000);

julia> @btime my_sum($v);
  557.419 μs (0 allocations: 0 bytes)

Is this benchmark more accurate? Let’s quickly check:

  • A 10x larger problem resulted in a ~100x run-time. Quadratic growth; this is what we’d expect.
  • A 1000 element input vector should result in 1000*1001/2 = 500500 additions. A time of 557 μs means ~1.1 ns per addition, or ~3.2 clock cycles on my system. This is very reasonable.

Now let’s see if we can parallelize this.

To reduce the overhead of threading, it usually makes sense to apply multithreading on as coarse of a level as possible. For this problem, we can split the input into one chunk per thread. So if the input size is 1000, and you have 8 threads available, split the input into 8 chunks of 125, let each thread finish its work, then sum up these 8 chunk-sums in the end.

To make all chunks do about the same amount of work in this problem, we can interleave the i indexes. This is important, since otherwise some threads will finish much faster than others, and will just idle while the remaining threads finish their work.

Below is a sample implementation.

function my_multithreaded_sum(v; T = Threads.nthreads())
    acc = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        for i = t:T:length(v), j = 1:i
            s += v[j] * v[j]
        end
        acc[t] = s
    end
    return sum(acc)
end

Let’s see how efficient it is:

julia> Random.seed!(0); v = rand(1000);

julia> @btime my_sum($v) # original version
  557.422 μs (0 allocations: 0 bytes)
168854.26023687015
  
julia> @btime my_multithreaded_sum($v)
  75.522 μs (2 allocations: 192 bytes)
168854.2602369164

julia> Threads.nthreads()
8

julia> 557.422 / 75.522
7.380922115410079

(Note: slightly different sums due to rounding errors, this is expected)

27 Likes