How to speed up this simple code? Multithreading, simd, inbounds

Juan · January 16, 2019, 1:50am

I’m starting to use Julia and comparing its speed to R.

For this simple example their speeds are almost the same.
How would you make it faster?

function f2(N)
    S = 0
Threads.@threads for i in 1:N 
    se = 1:i
    S = S + sum(se .* se)
    end 
end

@time f2(30000)

Julia reports it’s using 8 cores but I can’t detect any improvement by using Threads.@threads.
@inbounds doesn’t seem to improve either.
I guess @simd is not appropriate here because swapping the two inner lines won’t produce the same result.

evanfields · January 16, 2019, 2:16am

I’m far from a threading expert but: the code sample you provided doesn’t look thread-safe; the loop body reads from and modifies the variable S, so different iterations of the loop may simultaneously operate on S.

Maybe you could explain a little what the code is supposed to do? It also looks like the function doesn’t return anything.

All that said, I think you can speed up significantly by allocating less. Consider two similar functions:

function f_allocate(N)
    s = 0
    for i in 1:N
        range = 1:i
        s += sum(range .* range)
    end
    return s
end

function f_fast(N)
    s = 0
    for i in 1:N
        s += sum(j^2 for j in 1:i)
    end
    return s
end

They’re the same except for allocating the full vector range .* range. And the fast (non-allocating) version is 100 times faster!

using BenchmarkTools
@show @benchmark f_allocate(1000)
@show @benchmark f_fast(1000)


julia> @show @benchmark f_allocate(1000)
#= REPL[30]:1 =# @benchmark(f_allocate(1000)) = Trial(552.859 μs)
BenchmarkTools.Trial:
  memory estimate:  3.96 MiB
  allocs estimate:  1000
  --------------
  minimum time:     552.859 μs (0.00% GC)
  median time:      683.663 μs (0.00% GC)
  mean time:        838.032 μs (16.92% GC)
  maximum time:     32.806 ms (98.22% GC)
  --------------
  samples:          5946
  evals/sample:     1

julia> @show @benchmark f_fast(1000)
#= REPL[31]:1 =# @benchmark(f_fast(1000)) = Trial(4.193 μs)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.193 μs (0.00% GC)
  median time:      4.397 μs (0.00% GC)
  mean time:        4.582 μs (0.00% GC)
  maximum time:     17.832 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

Juan · January 16, 2019, 2:22am

Nothing important, it’s just to test.
It creates vectors of increasing size and sums the square of its elements, and then adds everything.

I think it’s thread-safe because the sum is commutative.

PS:
I forgot to say “if possible use the same algorithm” because the idea is to compare with R with the same algorithm. Though, maybe you are using the same algorithm, I need to think about it.

evanfields · January 16, 2019, 2:28am

(I edited my post above with a performance suggestion.)

I am skeptical of that thread safety. Here’s a test:

function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end

julia> f_thread(1000)
39967308241

julia> f_thread(1000)
4114724460

julia> f_thread(1000)
31381948695

That f_thread function has the same pattern of modifying a variable outside the threaded loop as your f2, and despite having no randomness, its result is nondeterministic - because of race conditions.

Juan · January 16, 2019, 2:31am

And how would you add @simd or @inbounds or other interesting macro?

yuyichao · January 16, 2019, 2:32am

You are describing a property of the operation and the correct conclusion is that the operation is parallelizable (I think you also need associativity btw)

Thread safety is a property of the code, your code is not.

Juan · January 16, 2019, 2:33am

OK, I didn’t know there could exist thread-safety problems.
Can I parallelize it with other tools such as @parallel?

Though it wasn’t my intention to modify the code I have tried with:

function f2(n)
       s = sum( sum(j*j for j in 1:i) for i in 1:n )
    end
    return s
end

but it doesn’t work it says:

UndefVarError: s not defined

Stacktrace:
 [1] top-level scope at In[17]:4

foobar_lv2 · January 16, 2019, 4:25am

For thread-safety, each thread needs to accumulate into a different target. In order to not make your CPU weep, the targets need to live in different cache-lines. E.g.

function f_thread(n)
    svals = zeros(Int, 8*(1+Threads.nthreads()))
    Threads.@threads for i in 1:n
        svals[8*(1+Threads.threadid())] += sum(j*j for j in 1:i)
    end
    return sum(svals)
end

The 1+Threads.threadid() is probably overly paranoid. It is to handle the case that the allocator decides to put the jl_array struct and the first entry of svals into the same cache line, and the compiler also fails to hoist the load of pointer(svals) out of the loop. In that case, the updates of the first thread might dirty the cache entry for pointer(svals) on the other cores.

Further, your example is no good.

julia> function f(N)
       s= 0
       for i=1:N for j=1:i s+= j*j end end
       s
       end;

julia> @btime f(1000)
  3.067 μs (0 allocations: 0 bytes)
83667083500

julia> @btime f(10000)
  30.536 μs (0 allocations: 0 bytes)
833666708335000

The result is in the wrong complexity class (you want to see quadratic number of operations, not linear!). What happens is that llvm manages to optimize the inner loop into an explicit expression: sum(j^p for j=1:N) actually is a polynomial in N, llvm figures this out and then computes this in O(1).

I would consider this a benchmarking artifact.

Juan · January 16, 2019, 11:19am

Why can I write this?

but not this?

It produces the following error:

syntax: extra token “)” after end of expression

yuyichao · January 16, 2019, 11:22am

j^2 for j in 1:i is not a for loop, it’s a generator.
for j in 1:i j^2 end will be a for loop, and it will not return any useful value.

bennedich · January 16, 2019, 11:29am

IMO this is the wrong end to approach multithreading. For several reasons, I’d rather recommend multithreading on as coarse of a level as possible, not in your inner loops.

Nonetheless, let’s look into the thread safety of your code. This operation:

S += sum(se .* se)

is not thread safe, since the way this operation works internally is by doing:

1. read current value of S
2. add to this value
3. write new value back to S

Now, if you have two threads doing this in parallel, these operations might interleave to produce the wrong result:

1. thread 1 reads current value of S
2. thread 2 reads current value of S
3. thread 1 adds to its version of S
4. thread 2 adds to its version of S
5. thread 1 writes new value to S
6. thread 2 writes new value to S, overwriting what thread 1 just wrote

This is commonly referred to as a race condition. You can see it in action in the example that @evanfields posted above:

function f_thread(n)
    s = 0
    Threads.@threads for i in 1:n
        s += sum(j*j for j in 1:i)
    end
    return s
end

julia> f_thread(1000)
43483971876

julia> f_thread(1000)
80136425181

julia> f_thread(1000)
57182835925

One solution to this problem is to use atomic operations:

function f_thread_atomic(n)
    s = Threads.Atomic{Int}(0)
    Threads.@threads for i in 1:n
        Threads.atomic_add!(s, sum(j*j for j in 1:i))
    end
    return s[]
end

julia> f_thread_atomic(1000)
83667083500

julia> f_thread_atomic(1000)
83667083500

Comparing to the analytical solution we can see that this yields the correct result:

f_analytical(n) = (n+1)^2*(n^2+2n)÷12

julia> f_analytical(1000)
83667083500

pablosanjose · January 16, 2019, 11:31am

What a nice explanation!

Juan · January 16, 2019, 12:31pm

I’ve tried the code with Threads.Atomic and it’s slower. I guess Julia needs a lot of resources to work with the threads.
What kind of things will improve with multithreading and what kind not?

Tamas_Papp · January 16, 2019, 12:37pm

Calculations which can be separated to parts which are themselves significant enough to justify the overhead are more likely to benefit from parallel processing. Your example above is exactly the opposite, small and cheap calculations (+) get swamped by the overhead.

mohamed82008 · January 16, 2019, 12:44pm

using KissThreading

function f2(N)
    tmapreduce(+, 1:N, init = 0) do i 
        se = 1:i
        sum(se .* se)
    end 
end

retrosnub · January 17, 2019, 7:31am

You might want to appreciate a summation formula here:

f2(n) = n * (n+1)^2 * (n+2) ÷ 12

Might be faster, didn’t benchmark
An overflow-resistant implementation needs some thought though.

Juan · January 17, 2019, 11:39am

I know that
But I wanted to try the same formula on both R and Julia to see how they perform.

DNF · January 17, 2019, 1:44pm

That’s fine. But then why are you asking for help to speed it up? If you just want to test R vs Julia running identical code, then I guess you have already achieved that.

Speeding up code typically consists of (off the top of my head):

removing type instabilities
avoiding unnecessary operations
avoiding unnecessary allocations
changing the underlying algorithm
adding performance annotations

As I understand it, you are exclusively interested in the last point, is that correct? Unless you allow any of the first four points, multithreading wont work, @simd will do nothing, and there is nothing to @inbounds. @fastmath doesn’t apply to integer math, so what’s left, then?

One thing, though. Does your code even run? The type assertion 0::Int32 just gives me an error. Are you running on a 32-bit platform? Anyway, I suggest that you remove the type assertion, and ::Int32 in the function signature.

Juan · January 17, 2019, 3:55pm

I meant using in-place functions that do the same algorithm but faster, for example by populating vectors and arrays without allocating a new array. Or adding macros such as @simd, @threads or @inbounds.

DNF · January 17, 2019, 4:12pm

Does that mean that it is OK to change sum(se .* se)
into sum(abs2, 1:i) or sum(j^2 for j in 1:i)? Because those two (especially the last) will give enormous speedup, much more than multithreading ever could if it even were possible.

Topic		Replies	Views
Threads maxing out all cores, but no performance increase General Usage performance , threads	16	1722	April 6, 2021
Slower execution with multi-threading using @threads macro Performance question , parallel , multithreading	5	716	August 13, 2020
Improving the code speed by employing parallelism for asynchronous task Performance	5	470	September 22, 2020
Problem on benchmarking multi-thread code Performance	3	376	February 10, 2021
Multi-threading doesn't seem to scale well on this counting problem Performance multithreading	15	2134	January 14, 2018

How to speed up this simple code? Multithreading, simd, inbounds

Related topics