What's the problem with this simple multi-threaded code?

I defined a very simple function to compute the dot product, which is supposed to be equivalent to dot(x,A,y) from the LinearAlgebra package. For performance, I used Threads.@threads to make it multi-threaded:

using BenchmarkTools, LinearAlgebra

const n=10000
x=rand(n)
A=randn(n,n)
y=rand(n)

function loop_dot(x, A, y)
    s = 0.0
    Threads.@threads for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

But the benchmark

@btime loop_dot(x,A,y)

result is

  2.720 s (200000042 allocations: 2.98 GiB)
-1397.5972007491994

while

@btime dot(x,A,y)

gives

  41.147 ms (1 allocation: 16 bytes)
4172.734572614711

So, what's wrong here? Why does loop_dot(x,A,y) allocate so much memory? And why is its result incorrect (not equal to dot(x,A,y))?

This is a reduction (on s), so the result will be wrong (if you are lucky enough to get one at all).
Data race: all threads write to s concurrently. The huge allocation count is a symptom of the same problem: s is reassigned inside the closure that Threads.@threads creates, so it gets boxed, and every update allocates. You should consider @tkf 's ThreadsX.jl reductions.
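
For example, a minimal sketch of what that could look like (loop_dot_tx is just an illustrative name, assuming ThreadsX is installed):

using ThreadsX

# ThreadsX.sum owns the parallel reduction, so there is no shared mutable state
function loop_dot_tx(x, A, y)
    return ThreadsX.sum(y[i] * sum(x[k] * A[k, i] for k in eachindex(x))
                        for i in eachindex(y))
end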


To stop your threads from stomping on each other:

function loop_dot(x, A, y)
    # one slot per (i, k) pair; note this allocates an n×n temporary
    s = Matrix{Float64}(undef, length(y), length(x))
    Threads.@threads for i = eachindex(y)
        for k = eachindex(x)
            s[i, k] = x[k]*A[k,i]*y[i]
        end
    end
    return sum(s)
end

Here’s some documentation around this topic

https://juliafolds.github.io/data-parallelism/


Some alternatives:

serial:

julia> function loop_dot_serial(x, A, y)
           s = 0.
           for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> @btime loop_dot_serial($x,$A,$y)
  259.091 ms (0 allocations: 0 bytes)
98.64233873489854

manual:

julia> function loop_dot_manual(x, A, y; ntasks = Threads.nthreads())
           s = zeros(ntasks)
           Threads.@threads for it in 1:ntasks
               for i in it:ntasks:length(y)
                   for k = eachindex(x)
                       s[it] += x[k]*A[k,i]*y[i]
                   end
               end
           end
           return sum(s)
       end
loop_dot_manual (generic function with 1 method)

julia> @btime loop_dot_manual($x,$A,$y)
  105.539 ms (42 allocations: 4.02 KiB)
98.64233873543105

using FLoops:

julia> using FLoops

julia> function loop_dot(x, A, y)
           @floop for i = eachindex(y)
               for k = eachindex(x)
                   @reduce s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot (generic function with 1 method)

julia> @btime loop_dot($x,$A,$y)
  91.441 ms (116 allocations: 5.89 KiB)
98.64233873490912

using LoopVectorization (which applies to this simple operation):

julia> using LoopVectorization

julia> function loop_dot_turbo(x, A, y)
           s = 0.
           @tturbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo (generic function with 1 method)

julia> @btime loop_dot_turbo($x,$A,$y)
  69.347 ms (0 allocations: 0 bytes)
98.64233873475337

using Tullio:

julia> using Tullio

julia> function loop_dot_tullio(x,A,y)
           s = 0.
           @tullio s += x[k]*A[k,i]*y[i]
           return s
       end
loop_dot_tullio (generic function with 1 method)

julia> @btime loop_dot_tullio($x,$A,$y)
  72.149 ms (203 allocations: 10.30 KiB)
98.64233873475337


Thank you. So I have a somewhat simplified version:

function loop_dot_auto(x, A, y; ntasks = Threads.nthreads())
    s = zeros(ntasks)
    @inbounds Threads.@threads for i = eachindex(y)
        id=Threads.threadid()
        @simd for k = eachindex(x)
            s[id] += x[k]*A[k,i]*y[i]
        end
    end
    return sum(s)
end

Please test it.

It seems FLoops does the job:

using BenchmarkTools, LinearAlgebra
using FLoops

const n=10000
x=rand(n)
A=randn(n,n)
y=rand(n)

function loop_dot(x, A, y)
    s = 0.0
    @floop for i = eachindex(y)
        for k = eachindex(x)
            @reduce s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

And by executing @btime loop_dot(x,A,y)
on my PC I found:

92.664 ms (54 allocations: 2.77 KiB)
3425.3843229334875

The use of threadid() is not a recommended pattern, because tasks may migrate between threads. That is why the "manual" version above is slightly more complicated than this one, but it is safe.
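
If you want the convenience of the automatic version without threadid(), a minimal sketch of a safe alternative (the chunking and the loop_dot_tasks name are mine): spawn one task per chunk of columns, each with its own local accumulator.

function loop_dot_tasks(x, A, y; ntasks = Threads.nthreads())
    # split the columns into ntasks contiguous chunks
    chunks = Iterators.partition(eachindex(y), cld(length(y), ntasks))
    tasks = map(chunks) do inds
        Threads.@spawn begin
            # t is local to this task: no data race, no threadid()
            t = 0.0
            for i in inds, k in eachindex(x)
                @inbounds t += x[k] * A[k, i] * y[i]
            end
            t
        end
    end
    return sum(fetch.(tasks))
end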


It may be useful to mention that it is very unlikely that MT parallelization allows for satisfactory speedups (if any) in real life for this kernel, because its computing density (CD, or arithmetic intensity) is very low. In other words, most of the time is spent on data access and not on arithmetic. This is (spuriously?) mitigated by the hot-cache setup provided by @btime.

In order to accelerate, the best strategy is to fuse different kernels to increase the CD.
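
To make that concrete: this kernel does about 3n² flops while loading the 8n² bytes of A, i.e. well under one flop per byte. A hypothetical illustration of fusion (assuming the surrounding application also needs z = A*y; fused_Ay_dot is an invented name): computing z and s in a single sweep reads A once instead of twice, doubling the arithmetic per byte loaded.

function fused_Ay_dot(x, A, y)
    z = zeros(length(x))   # will hold A*y
    s = 0.0                # will hold x'*A*y
    @inbounds for i in eachindex(y), k in eachindex(x)
        t = A[k, i] * y[i]
        z[k] += t
        s += x[k] * t
    end
    return z, s
end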


The benchmark might be tricky here, perhaps, because what I get is that @turbo provides a speedup similar to using threads (4 threads here). I am trying to avoid the hot cache by re-initializing the arrays every time. It may be that, for some reason, SIMD operations kick in inside the threaded loop but not in the serial loop…

julia> using LoopVectorization

julia> function loop_dot_serial(x, A, y)
           s = 0.
           for i = eachindex(y)
               for k = eachindex(x)
                   @inbounds s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> function loop_dot_manual(x, A, y; ntasks = Threads.nthreads())
           s = zeros(ntasks)
           Threads.@threads for it in 1:ntasks
               for i in it:ntasks:length(y)
                   for k = eachindex(x)
                       @inbounds s[it] += x[k]*A[k,i]*y[i]
                   end
               end
           end
           return sum(s)
       end
loop_dot_manual (generic function with 1 method)

julia> function loop_dot_turbo_parallel(x, A, y)
           s = 0.
           @tturbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo_parallel (generic function with 1 method)

julia> function loop_dot_turbo_serial(x, A, y)
           s = 0.
           @turbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo_serial (generic function with 1 method)

julia> const n = 10^4
10000

julia> @btime loop_dot_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  119.716 ms (0 allocations: 0 bytes)
1.2341963698459884e7

julia> @btime loop_dot_manual(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  30.759 ms (43 allocations: 4.05 KiB)
1.2442783679944485e7

julia> @btime loop_dot_turbo_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  33.684 ms (0 allocations: 0 bytes)
1.2526249273799863e7

julia> @btime loop_dot_turbo_parallel(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  31.227 ms (0 allocations: 0 bytes)
1.2405906908949133e7

By using simd in the inner loop of the serial version, we get almost the same performance as the threaded versions:

julia> function loop_dot_serial(x, A, y)
           s = 0.
           @inbounds for i = eachindex(y)
               @simd for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> @btime loop_dot_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  43.715 ms (0 allocations: 0 bytes)
1.237619409940113e7


@inbounds needs to go inside @threads; see What is the cause of this performance difference between Julia and Cython? - #4 by tkf

In general, try minimizing the scope of @inbounds.

I think this needs @inbounds to be compared properly with LoopVectorization.jl and Tullio.jl. The same applies to @threads and @floop.
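
Concretely, since @threads wraps the loop body in a closure, an @inbounds placed outside the macro does not reach the array accesses inside it; the "manual" version above would be annotated like this:

function loop_dot_manual(x, A, y; ntasks = Threads.nthreads())
    s = zeros(ntasks)
    Threads.@threads for it in 1:ntasks
        # @inbounds lives inside the body that @threads turns into a closure
        @inbounds for i in it:ntasks:length(y), k in eachindex(x)
            s[it] += x[k] * A[k, i] * y[i]
        end
    end
    return sum(s)
end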

GC may re-use the same memory region. Maybe a more robust approach is to allocate enough arrays that the total number of bytes is at least as large as (say) double the L3 cache size, and use them one by one.
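
A sketch of that setup (the 32 MiB L3 size is an assumption, adjust for your CPU; for n = 10^4 a single A already dwarfs it, so this matters more for smaller n):

l3_bytes = 32 * 2^20                             # assumed L3 cache size
bytes_per_triple = (n^2 + 2n) * sizeof(Float64)  # one (x, A, y) input set
nbuf = max(2, cld(2 * l3_bytes, bytes_per_triple))
pool = [(rand(n), rand(n, n), rand(n)) for _ in 1:nbuf]
@btime loop_dot(x, A, y) setup=((x, A, y) = rand($pool)) evals=1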


Indeed, with @inbounds FLoops is as fast as I can get here.

I am not sure the cache is a problem there; I tried different things, below, and none seems to make any difference:

julia> using FLoops

julia> function loop_dot(x, A, y)
           @floop for i = eachindex(y)
               @inbounds for k = eachindex(x)
                   @reduce s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot (generic function with 1 method)

julia> const n = 10^4
10000

julia> @btime loop_dot(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
  27.630 ms (116 allocations: 5.89 KiB)
1.2601847672491744e7

julia> buff = [ (rand(n),rand(n,n),rand(n)) for _ in 1:5 ];

julia> @btime loop_dot(x,A,y) setup=((x,A,y)=$(buff[rand(1:5)])) evals=1
  27.078 ms (115 allocations: 5.86 KiB)
1.2462334838512724e7

julia> (x,A,y) = buff[1];

julia> @btime loop_dot($x,$A,$y) 
  28.369 ms (115 allocations: 5.86 KiB)
1.2575566525724193e7

Maybe none of these strategies avoids the cache, or maybe all of them do.

PS: the histograms from @benchmark are slightly different, but nothing conclusive there either.

Matrix multiplication has low CD, but BLAS.gemm seems to always benefit from multiple threads, right? You can see how the BLAS version of the same operation (out)performs:

@btime transpose(x)*A*y

Nope.
gemm has a computational density proportional to the matrix size N
(N^3 flops vs N^2 floats). Hence the computational density can be arbitrarily large, which makes this kernel so purely computation-bound that it is used to measure the peak performance of computers, via LINPACK or Julia's peakflops function.

dot is the complete opposite and has a very low computational density.
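
For reference, that peak measurement is a one-liner:

using LinearAlgebra
peakflops(4096)  # times a 4096×4096 matrix multiply and returns flops/s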


I agree with the caution on the way the benchmark is done and also on the advice that tuning performance requires more thoughts on hiding memory latencies. But this statement as-is sounds a bit too broad to me. It’s not entirely crazy to imagine an application that pre-allocates x, y, and A, re-computes them just before calling this dot function, and then repeats this over and over again. In the end, the programmers need to go back to their actual application and then measure its performance. It may not satisfy experienced HPC programmers but I think dealing with low-hanging fruit can be OK for many users.


I do indeed appreciate low-hanging fruit, but I am concerned about possible disappointment for people giving M.T. parallelism a try. Actually, I thought that this specific kernel would not show significant speed-up with M.T. compared to properly vectorized code.

So I tried it on my machine and… I was wrong!
The thing is, I believe that the following results (I could not resist trying UnicodePlots ;)) are very specific to the Apple M1 architecture. I guess the situation will probably be very different on most x86 laptops with much lower memory bandwidth (but I would love to be proved wrong again).

On an M1 Max MacBook with 8 threads:

                                    GB/s (n=10)                
                   ┌                                        ┐ 
         floop_dot ┤ 0.03401852308582023                      
   serial_loop_dot ┤■■■■■■■■■■■■■■■■ 14.787430683918668       
    turbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 32.0   
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 32.0   
                   └                                        ┘ 
                                   GB/s (n=100)                
                   ┌                                        ┐ 
         floop_dot ┤■■ 2.2606852702225364                     
   serial_loop_dot ┤■■■■■■■ 8.656509695290858                 
    turbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■ 25.98161800526128   
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■ 22.966067635069184    
                   └                                        ┘ 
                                   GB/s (n=1000)               
                   ┌                                        ┐ 
         floop_dot ┤■■■■■■■■■■■ 45.787336138570794            
   serial_loop_dot ┤■■ 8.603306896088183                      
    turbo_loop_dot ┤■■■■■■ 25.793970659358376                 
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■■ 91.4589880748912   
                   └                                        ┘ 
                                  GB/s (n=10000)               
                   ┌                                        ┐ 
         floop_dot ┤■■■■■■■■■■■■■■■ 55.71654658230281         
   serial_loop_dot ┤■■ 8.505742220725514                      
    turbo_loop_dot ┤■■■■■■■ 24.53127382168756                 
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■ 76.60835653529132   
                   └                                        ┘ 
Here is the code to launch the benchmark:
using FLoops
using BenchmarkTools
using LoopVectorization
using DataFrames
using UnicodePlots

function floop_dot(x, A, y)
    @floop for i = eachindex(y)
        @inbounds for k = eachindex(x)
            @reduce s += x[k] * A[k, i] * y[i]
        end
    end
    return s
end

function serial_loop_dot(x, A, y)
    s = zero(eltype(x))
    for i = eachindex(y)
        for k = eachindex(x)
            @inbounds s += x[k] * A[k, i] * y[i]
        end
    end
    return s
end


function turbo_loop_dot(x, A, y)
    s = zero(eltype(x))
    @turbo for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

function tturbo_loop_dot(x, A, y)
    s = zero(eltype(x))
    @tturbo for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

gflops(t,n) = 3n^2/(t*1.e9)
gbs(t,n) = (n^2)*sizeof(Float64)/(t*1.e9)

function benchit(f,n)
    t = @belapsed $f(x, A, y) setup = (x = rand($n); y = rand($n); A = rand($n, $n)) evals = 10
end

function go(es)
    @show Threads.nthreads()
    
    df = DataFrame()
    for p ∈ es
        n=10^p
        @show n
        for f ∈ (floop_dot,serial_loop_dot,turbo_loop_dot,tturbo_loop_dot)
            t=benchit(f,n)
            append!(df,DataFrame(n = n,fn=string(f),t=t,GFlops=gflops(t,n), GBs=gbs(t,n)))
        end
    end
    df
end

es=1:4
df=go(es)

ps=[ filter(row -> row.n == n,df) |>  dfn -> barplot(dfn.fn,dfn.GBs,title="GB/s (n=$n)") for n ∈ unique(df.n)]
ps

Almost all of my pre-conceived ideas about MT have been proved wrong by actually using it.


Having the roofline model in mind, I was not so often surprised by M.T. itself as by the fact that this new machine challenges my intuition.

Here is what I got on my two-year-old x86 laptop (4 cores, 8 threads):

                                 GB/s (n=10000)        
                   ┌                                        ┐ 
         floop_dot ┤■■■■■■■■■■■■■■■■■■■■■ 26.65483671703491   
   serial_loop_dot ┤■■■■■ 6.886856472901777                   
    turbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■ 24.17812734836673     
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■ 24.651486451815757    
                   └                                        ┘ 

where my bet that M.T. does not bring any speedup compared to the vectorized (turbo) version turns out to be correct.

Compared to Apple silicon:

                                  GB/s (n=10000)               
                   ┌                                        ┐ 
         floop_dot ┤■■■■■■■■■■■■■■■ 55.71654658230281         
   serial_loop_dot ┤■■ 8.505742220725514                      
    turbo_loop_dot ┤■■■■■■■ 24.53127382168756                 
   tturbo_loop_dot ┤■■■■■■■■■■■■■■■■■■■■■ 76.60835653529132   
                   └                                        ┘ 

Yeah, I am always surprised when I notice that some of my intuitions about the performance of multi-threaded programs are actually very specific to the machine I happen to use.
