How to speed up this simple code? Multithreading, simd, inbounds


I’m starting to use Julia and comparing its speed to R.

For this simple example their speeds are almost the same.
How would you make it faster?

function f2(N)
    S = 0
    Threads.@threads for i in 1:N
        se = 1:i
        S = S + sum(se .* se)
    end
end

@time f2(30000)

Julia reports it’s using 8 cores but I can’t detect any improvement by using Threads.@threads.
@inbounds doesn’t seem to improve either.
I guess @simd is not appropriate here because swapping the two inner lines won’t produce the same result.


I’m far from a threading expert but: the code sample you provided doesn’t look thread-safe; the loop body reads from and modifies the variable S, so different iterations of the loop may simultaneously operate on S.

Maybe you could explain a little what the code is supposed to do? It also looks like the function doesn’t return anything.

All that said, I think you can speed up significantly by allocating less. Consider two similar functions:

function f_allocate(N)
    s = 0
    for i in 1:N
        range = 1:i
        s += sum(range .* range)
    end
    return s
end

function f_fast(N)
    s = 0
    for i in 1:N
        s += sum(j^2 for j in 1:i)
    end
    return s
end

They’re the same except for allocating the full vector range .* range. And the fast (non-allocating) version is 100 times faster!

using BenchmarkTools
@show @benchmark f_allocate(1000)
@show @benchmark f_fast(1000)

julia> @show @benchmark f_allocate(1000)
#= REPL[30]:1 =# @benchmark(f_allocate(1000)) = Trial(552.859 μs)
  memory estimate:  3.96 MiB
  allocs estimate:  1000
  minimum time:     552.859 μs (0.00% GC)
  median time:      683.663 μs (0.00% GC)
  mean time:        838.032 μs (16.92% GC)
  maximum time:     32.806 ms (98.22% GC)
  samples:          5946
  evals/sample:     1

julia> @show @benchmark f_fast(1000)
#= REPL[31]:1 =# @benchmark(f_fast(1000)) = Trial(4.193 μs)
  memory estimate:  0 bytes
  allocs estimate:  0
  minimum time:     4.193 μs (0.00% GC)
  median time:      4.397 μs (0.00% GC)
  mean time:        4.582 μs (0.00% GC)
  maximum time:     17.832 μs (0.00% GC)
  samples:          10000
  evals/sample:     7


Nothing important, it’s just to test.
It creates vectors of increasing size, sums the squares of their elements, and then adds everything up.

I think it’s thread-safe because the sum is commutative.

I forgot to say “if possible, use the same algorithm”, because the idea is to compare with R using the same algorithm. Though maybe you are using the same algorithm; I need to think about it.


(I edited my post above with a performance suggestion.)

I am skeptical of that thread safety. Here’s a test:

function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end

julia> f_thread(1000)

julia> f_thread(1000)

julia> f_thread(1000)

That f_thread function has the same pattern of modifying a variable outside the threaded loop as your f2, and despite having no randomness, its result is nondeterministic because of race conditions.


And how would you add @simd or @inbounds or other interesting macro?


You are describing a property of the operation, and the correct conclusion is that the operation is parallelizable (I think you also need associativity, btw).

Thread safety is a property of the code, and your code is not thread-safe.
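One way to make the shared update safe (at the cost of serializing it) is a lock. A minimal sketch, with the made-up name f_thread_locked:

```julia
# Hypothetical sketch: a ReentrantLock serializes the shared update, so the
# read-add-write on `s` can no longer interleave between threads.
function f_thread_locked(n)
    s = 0
    lk = ReentrantLock()
    Threads.@threads for i in 1:n
        partial = sum(j*j for j in 1:i)  # independent work, no shared state
        lock(lk) do
            s += partial                 # only this update is serialized
        end
    end
    return s
end
```

Note the expensive part (the inner sum) happens outside the lock; only the cheap accumulation is serialized.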


OK, I didn’t know there could exist thread-safety problems.
Can I parallelize it with other tools such as @parallel?

Though it wasn’t my intention to modify the code I have tried with:

function f2(n)
    s = sum( sum(j*j for j in 1:i) for i in 1:n )
    return s
end

but it doesn’t work; it says:

UndefVarError: s not defined

 [1] top-level scope at In[17]:4


For thread-safety, each thread needs to accumulate into a different target. In order to not make your CPU weep, the targets need to live in different cache-lines. E.g.

function f_thread(n)
    svals = zeros(Int, 8*(1+Threads.nthreads()))
    Threads.@threads for i in 1:n
        svals[8*(1+Threads.threadid())] += sum(j*j for j in 1:i)
    end
    return sum(svals)
end

The 1+Threads.threadid() is probably overly paranoid. It is to handle the case that the allocator decides to put the jl_array struct and the first entry of svals into the same cache line, and the compiler also fails to hoist the load of pointer(svals) out of the loop. In that case, the updates of the first thread might dirty the cache entry for pointer(svals) on the other cores.
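An alternative that sidesteps both the shared variable and the padding question is to give each task its own local accumulator. A sketch (assuming Julia ≥ 1.3 for Threads.@spawn; the name f_spawn is made up):

```julia
# Hypothetical sketch: split 1:n into chunks, spawn one task per chunk with a
# purely task-local accumulator, then combine the partial sums at the end.
function f_spawn(n, ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            local s = 0
            for i in chunk
                s += sum(j*j for j in 1:i)
            end
            s
        end
    end
    return sum(fetch.(tasks))
end
```

Because no task ever writes to memory another task reads, there is nothing to pad and nothing to lock.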

Further, your example is no good.

julia> function f(N)
           s = 0
           for i=1:N for j=1:i s += j*j end end
           return s
       end

julia> @btime f(1000)
  3.067 μs (0 allocations: 0 bytes)

julia> @btime f(10000)
  30.536 μs (0 allocations: 0 bytes)

The result is in the wrong complexity class (you want to see a quadratic number of operations, not linear!). What happens is that LLVM manages to optimize the inner loop into an explicit expression: sum(j^p for j=1:N) is actually a polynomial in N; LLVM figures this out and then computes it in O(1).

I would consider this a benchmarking artifact.
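To make the claim concrete: the inner loop has a closed form, which is effectively what LLVM derives, and the whole double loop collapses to a polynomial as well (the helper names here are made up for illustration):

```julia
# Closed form of the inner loop sum(j*j for j in 1:i):
inner_sum(i)    = sum(j*j for j in 1:i)
inner_closed(i) = i * (i + 1) * (2i + 1) ÷ 6

# The full double loop likewise collapses to a polynomial in N:
f_loop(N)   = sum(inner_sum(i) for i in 1:N)
f_closed(N) = N * (N + 1)^2 * (N + 2) ÷ 12

@assert inner_sum(100) == inner_closed(100)
@assert f_loop(100) == f_closed(100)
```

So a computation that looks O(N²) can legitimately run in O(N) after this optimization, which is why the timings above scale linearly.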


Why can I write this?

but not this?

It produces the following error:

syntax: extra token ")" after end of expression


j^2 for j in 1:i is not a for loop, it’s a generator.
for j in 1:i j^2 end will be a for loop, and it will not return any useful value.
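A minimal illustration of the difference:

```julia
# `j^2 for j in 1:3` is a generator: a lazy iterable of values.
g = (j^2 for j in 1:3)
println(sum(g))               # 1 + 4 + 9 = 14

# A `for` loop, by contrast, is a statement that evaluates to `nothing`:
result = for j in 1:3
    j^2
end
println(result === nothing)   # true
```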


IMO this is the wrong end from which to approach multithreading. For several reasons, I’d rather recommend multithreading at as coarse a level as possible, not in your inner loops.

Nonetheless, let’s look into the thread safety of your code. This operation:

S += sum(se .* se)

is not thread safe, since the way this operation works internally is by doing:

1. read current value of S
2. add to this value
3. write new value back to S

Now, if you have two threads doing this in parallel, these operations might interleave to produce the wrong result:

1. thread 1 reads current value of S
2. thread 2 reads current value of S
3. thread 1 adds to its version of S
4. thread 2 adds to its version of S
5. thread 1 writes new value to S
6. thread 2 writes new value to S, overwriting what thread 1 just wrote

This is commonly referred to as a race condition. You can see it in action in the example that @evanfields posted above:

function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end

julia> f_thread(1000)

julia> f_thread(1000)

julia> f_thread(1000)

One solution to this problem is to use atomic operations:

function f_thread_atomic(n)
    s = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(s, sum(j*j for j in 1:i))
    end
    return s[]
end

julia> f_thread_atomic(1000)

julia> f_thread_atomic(1000)

Comparing to the analytical solution we can see that this yields the correct result:

f_analytical(n) = (n+1)^2*(n^2+2n)÷12

julia> f_analytical(1000)


What a nice explanation!


I’ve tried the code with Threads.Atomic and it’s slower. I guess Julia needs a lot of overhead to manage the threads.
What kinds of computations improve with multithreading, and what kinds don’t?


Calculations that can be separated into parts which are themselves significant enough to justify the overhead are more likely to benefit from parallel processing. Your example above is exactly the opposite: small and cheap calculations (+) get swamped by the overhead.

using KissThreading

function f2(N)
    tmapreduce(+, 1:N, init = 0) do i
        se = 1:i
        sum(se .* se)
    end
end


You might want to appreciate a summation formula here:

f2(n) = n * (n+1)^2 * (n+2) ÷ 12

Might be faster, didn’t benchmark :wink:
An overflow-resistant implementation needs some thought though.
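For what it’s worth, one possible sketch (the name f2_wide is made up): widen to Int128 before multiplying, since the intermediate product n*(n+1)^2*(n+2) overflows Int64 long before the division by 12 brings the value back down.

```julia
# Hypothetical sketch: widen to Int128 so the intermediate product
# n*(n+1)^2*(n+2) doesn't overflow Int64. It still overflows eventually,
# just much later.
function f2_wide(n::Integer)
    m = Int128(n)
    return m * (m + 1)^2 * (m + 2) ÷ 12
end
```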


I know that :slight_smile:
But I wanted to try the same formula on both R and Julia to see how they perform.


That’s fine. But then why are you asking for help to speed it up? If you just want to test R vs Julia running identical code, then I guess you have already achieved that.

Speeding up code typically consists of (off the top of my head):

  • removing type instabilities
  • avoiding unnecessary operations
  • avoiding unnecessary allocations
  • changing the underlying algorithm
  • adding performance annotations

As I understand it, you are exclusively interested in the last point, is that correct? Unless you allow any of the first four points, multithreading won’t work, @simd will do nothing, and there is nothing to @inbounds. @fastmath doesn’t apply to integer math, so what’s left, then?

One thing, though. Does your code even run? The type assertion 0::Int32 just gives me an error. Are you running on a 32-bit platform? Anyway, I suggest that you remove the type assertion, and ::Int32 in the function signature.


I meant using in-place functions that do the same algorithm but faster, for example by populating vectors and arrays without allocating a new array. Or adding macros such as @simd, @threads or @inbounds.


Does that mean that it is OK to change sum(se .* se)
into sum(abs2, 1:i) or sum(j^2 for j in 1:i)? Because those two (especially the last) will give enormous speedup, much more than multithreading ever could if it even were possible.
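For reference, the three formulations agree (a quick sanity check; the closed form on the right is the standard sum-of-squares formula):

```julia
i = 100
v = sum((1:i) .* (1:i))      # allocates a temporary vector
a = sum(abs2, 1:i)           # no allocation
g = sum(j^2 for j in 1:i)    # no allocation
@assert v == a == g == i * (i + 1) * (2i + 1) ÷ 6
```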