How to speed up this simple code? Multithreading, simd, inbounds


Oh, I was playing with and without Int32 and trying more modifications, but if the input is large you get overflow errors.
In general I’m interested in improving the speed using all tricks. But in this case I wanted to compare it to R using the same formula/algorithm on both platforms in order to get a fair comparison.
“Avoiding unnecessary allocations” and “removing type instabilities” were among my other aims.


R is an interpreted language, so even mildly idiomatic Julia code with loops will blow it out of the water. E.g.:


mysum <- function(n) {
    acc <- 0
    for (i in 1:n) {
        for (j in 1:i) {
            acc <- acc + j*j
        }
    }
    acc
}


> tic()
> mysum(10000)
[1] 8.336667e+14
> toc()
2.544 sec elapsed


using BenchmarkTools

function mysum(n)
    acc = 0
    for i in 1:n
        for j in 1:i
            acc += j*j
        end
    end
    return acc
end

julia> @btime mysum(10000)
  21.943 μs (0 allocations: 0 bytes)

This comparison was “fair”, but quite meaningless. This is not where R shines, and this is well known; your only hope for fast runtime in R is to find something already implemented in C or Fortran. No one uses R for speed, only for its large selection of excellent packaged libraries. One could optimize the Julia code further, I guess, but it would not make a huge difference to this conclusion.


Just because it is widely known does not make it meaningless. It is still quite meaningful to quantify why you can’t write performance-critical inner loops in R, and why you can in Julia.

(Many people have a vague notion that “loops in R are slow” but have no idea how slow, or why, or whether loops written by mere mortals—not just the godlike beings who write the “built-in” libraries—can be fast in a different dynamic programming environment.)

But this is a terrible example to study multi-threading with for the reason @bennedich pointed out above: this kind of summation (over only 1000 elements!) is too cheap to parallelize effectively.


In fact I’m using the data.table library, and sometimes it is faster than Julia, for example for calculating grouped means or reading CSV files.
But yes, I know Julia is fast, that’s why I want to move to it.




As @foobar_lv2 pointed out above, once you use a for loop or generator for the inner summation, the benchmark is flawed, since the compiler will figure out that it can replace the loop with a formula. You can see this by looking at the native code, by estimating whether the elapsed time makes sense at all for the number of operations performed, or simply by growing the problem by a factor of 10 and seeing how the time is affected. For this problem, growth should be roughly quadratic, so if you see linear growth, something is wrong.
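As a rough sketch of that scaling check (the function name `mysum_loop` and the sizes here are just for illustration; the printed ratio will vary by machine):

```julia
using BenchmarkTools

# Same double loop as the earlier benchmark: sum j^2 for j = 1:i, i = 1:n
function mysum_loop(n)
    acc = 0
    for i in 1:n, j in 1:i
        acc += j*j
    end
    return acc
end

t1 = @belapsed mysum_loop(10_000)
t2 = @belapsed mysum_loop(100_000)

# A genuinely quadratic algorithm should give t2/t1 ≈ 100;
# a ratio near 10 suggests the compiler replaced the inner loop
# with a closed-form formula, making the work linear in n.
println(t2 / t1)
```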

To prevent this optimization from happening, you can instead use an array of numbers:

function my_sum(v)
    acc = zero(eltype(v))
    for i = 1:length(v), j = 1:i
        acc += v[j] * v[j]
    end
    return acc
end

julia> using Random; Random.seed!(0); v = rand(100);

julia> @btime my_sum($v);
  5.776 μs (0 allocations: 0 bytes)

julia> Random.seed!(0); v = rand(1000);

julia> @btime my_sum($v);
  557.419 μs (0 allocations: 0 bytes)

Is this benchmark more accurate? Let’s quickly check:

  • A 10x larger problem resulted in ~100x the run-time. Quadratic growth; this is what we’d expect.
  • A 1000 element input vector should result in 1000*1001/2 = 500500 additions. A time of 557 μs means ~1.1 ns per addition, or ~3.2 clock cycles on my system. This is very reasonable.
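Spelling out the arithmetic behind the second bullet (the ~2.9 GHz clock is inferred from these numbers, not stated anywhere in the thread):

```julia
additions = 1000 * 1001 ÷ 2    # 500500 additions in total
t_ns = 557.419 * 1000          # 557.419 μs expressed in ns

ns_per_add = t_ns / additions  # ≈ 1.11 ns per addition
println(ns_per_add)

# ~3.2 clock cycles per addition at ~1.11 ns per addition implies
# a clock of roughly 3.2 / 1.11 ≈ 2.9 GHz.
```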

Now let’s see if we can parallelize this.

To reduce the overhead of threading, it usually makes sense to apply multithreading on as coarse of a level as possible. For this problem, we can split the input into one chunk per thread. So if the input size is 1000, and you have 8 threads available, split the input into 8 chunks of 125, let each thread finish its work, then sum up these 8 chunk-sums in the end.

To make all chunks do about the same amount of work in this problem, we can interleave the i indexes. This is important, since otherwise some threads will finish much faster than others, and will just idle while the remaining threads finish their work.

Below is a sample implementation.

function my_multithreaded_sum(v; T = Threads.nthreads())
    acc = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        for i = t:T:length(v), j = 1:i
            s += v[j] * v[j]
        end
        acc[t] = s
    end
    return sum(acc)
end

Let’s see how efficient it is:

julia> Random.seed!(0); v = rand(1000);

julia> @btime my_sum($v) # original version
  557.422 μs (0 allocations: 0 bytes)
julia> @btime my_multithreaded_sum($v)
  75.522 μs (2 allocations: 192 bytes)

julia> Threads.nthreads()
8

julia> 557.422 / 75.522
7.380922115410079

(Note: slightly different sums due to rounding errors, this is expected)

How to code faster parallel for loop

Thank you for your amazing answers!

In this case it seems that a single thread + SIMD is a perfect match:

function my_simd_sum(v)
    s = zero(eltype(v))
    for i = 1:length(v)
        @simd for j = 1:i
            @inbounds s += v[j] * v[j]
        end
    end
    return s
end

julia> @btime my_simd_sum($v);
  40.130 μs

julia> @btime my_multithreaded_sum($v; T=8);
  58.376 μs (2 allocations: 192 bytes)
  • Do you have any advice on how to implement a solution with Threads and SIMD?

Probably there is not enough computation in this example to make it worthwhile but, for the sake of knowledge, I would like to know how to approach a very simple problem mixing both @threads and @simd.

This is what I tried

function my_multithreaded_simd_sum(v; T = Threads.nthreads())
    partial_result_per_thread = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        len = div(length(v), T)
        domain_per_thread = ((t-1)*len + 1):(t*len)
        for i in domain_per_thread
            @simd for j in 1:i
                @inbounds s += v[j] * v[j]
            end
        end
        partial_result_per_thread[t] = s
    end
    return sum(partial_result_per_thread)
end

julia> @btime my_multithreaded_simd_sum($v);
  9.196 μs (2 allocations: 192 bytes)


That looks alright to me, but see my note above about interleaving the i indexes. The point of this is to let each thread do the same amount of work. Let’s say you have 8 threads and a 1000 element vector. With your approach, thread 1 will be doing 125*126/2 = 7875 additions, while thread 8 does (1000*1001 - 875*876)/2 = 117250 additions. This means that thread 1 (and 2, 3, …) will finish long before thread 8 and just sit and idle. In contrast, by interleaving the i indices, all threads will do approximately the same number of additions (this is very problem specific though). On my system, this doubles the performance (also with SIMD).
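The imbalance is easy to quantify without running any threads, just by counting the additions each thread would be assigned (8 threads and a 1000-element vector, as in the example above):

```julia
n, T = 1000, 8
len = n ÷ T

# Contiguous chunks: thread t handles i = (t-1)*len+1 : t*len,
# and row i costs i additions, so its total work is the sum of that range.
chunked = [sum((t-1)*len+1 : t*len) for t in 1:T]

# Interleaved indices: thread t handles i = t:T:n.
interleaved = [sum(t:T:n) for t in 1:T]

println(extrema(chunked))      # (7875, 117250): a ~15x imbalance
println(extrema(interleaved))  # (62125, 63000): nearly even
```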

Btw, the implementation above will also not work correctly if the vector size is not a multiple of the number of threads, e.g.:

julia> my_multithreaded_simd_sum([1 2 3])
0

To fix that, you could do something like this:

n = length(v)
domain_per_thread = 1+((t-1)*n÷T):t*n÷T
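A quick sanity check (with an assumed T = 8) that this partition covers every index exactly once, including lengths that are not multiples of the thread count:

```julia
T = 8
for n in (3, 1000, 1001)
    covered = Int[]
    for t in 1:T
        # the fixed per-thread range from above
        append!(covered, 1 + (t-1)*n÷T : t*n÷T)
    end
    @assert sort(covered) == 1:n  # each index appears exactly once
end
println("partition covers 1:n for all tested sizes")
```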


I am not sure what you did. Could you post your function with the change you mention that makes it twice as fast?

This type of exercise is helpful for learning a bit about how to use Threads.


Of course, it’s simply the code I posted in my previous post, but with @simd and @inbounds:

function my_multithreaded_sum_simd(v; T = Threads.nthreads())
    acc = zeros(eltype(v), T)
    Threads.@threads for t = 1:T
        s = zero(eltype(v))
        for i = t:T:length(v) # this is the "interleaving"
            @simd for j = 1:i
                @inbounds s += v[j] * v[j]
            end
        end
        acc[t] = s
    end
    return sum(acc)
end


Can you make this clearer to me? How can the compiler make something that is polynomial in N computable in O(1)? I have probably misunderstood some concepts here.


Because of math.


Excuse me, I need to ask more about that. LLVM is able to recognize mathematical code that has a known closed-form formula, and it just uses that formula? I’m ignorant about the world of compilers, and after my C and C++ experience this looks unbelievably good, so it’s hard to believe. Just to be sure, do I understand this correctly?

I don’t mean that C and C++ compilers are bad, I just never noticed them doing something like that. Maybe I just didn’t look very carefully.


LLVM and compilers in general do know about basic integer arithmetic identities and can transform e.g. the sum of a consecutive range of integers into the corresponding polynomial expression for the sum. It doesn’t go too much beyond that, and it doesn’t work for floating-point, since floating-point addition isn’t associative the way integer addition is.
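For the sum in this thread, the identity involved is 1² + 2² + … + i² = i(i+1)(2i+1)/6; a quick check that the loop and the closed form agree:

```julia
# Closed form for the sum of the first i squares
closed_form(i) = i * (i + 1) * (2i + 1) ÷ 6

# The straightforward loop the compiler starts from
loop_sum(i) = sum(j * j for j in 1:i)

@assert all(closed_form(i) == loop_sum(i) for i in 1:1000)
println(closed_form(100))   # 338350
```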


They do. If you write a C function that adds an integer sequence, they should do this optimization automatically there too (both clang and gcc at least; edit: apparently not gcc).

I actually kind of doubt how useful this is, though. Unless the pattern is somehow really hidden from the programmer, this is a fairly simple transform the programmer can do by hand. Anyway, most optimizations are trivial by themselves, so I’m certainly not complaining that compilers can do this…


Clang definitely does, but here gcc seems not to.


This looks nice. When I was using Clojure I relied on reducers quite heavily for performant code. This looks similar. They used Java’s fork/join framework under the hood. I’d love to see a reducers.jl package.


My Transducers.jl package (which obviously is inspired by Clojure) also does threading with mapreduce. But it’s very simple and not optimized ATM:

using Transducers

N = 10000  # e.g., the problem size
mapreduce(Map(i -> mapfoldl(MapSplat(*), +, zip(1:i, 1:i); simd=true)),
          +, 1:N; init=0)


Thank you for that information. Now it is clear.


I only used gcc, so this is maybe part of the reason. Thank you for that information.