# What's the problem with this simple multi-thread code?

I defined a very simple function to compute the dot product, which is supposed to be equivalent to dot(x,A,y) in the package LinearAlgebra. For performance, I use Threads.@threads to make it multi-threaded:

``````
using BenchmarkTools, LinearAlgebra

const n=10000
x=rand(n)
A=randn(n,n)
y=rand(n)

function loop_dot(x, A, y)
    s = 0.0
    Threads.@threads for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end
``````

But the benchmark

@btime loop_dot(x,A,y)

result is

``````
  2.720 s (200000042 allocations: 2.98 GiB)
-1397.5972007491994
``````

while

@btime dot(x,A,y)

gives

``````
  41.147 ms (1 allocation: 16 bytes)
4172.734572614711
``````

So, what's wrong here? Why is the memory allocation so huge for loop_dot(x,A,y)? And why is its result incorrect (not equal to dot(x,A,y))?

This is a reduction (on `s`), so the result will be wrong (if you are lucky enough).
Data race: all threads write concurrently to `s`. You should consider @tkf's ThreadsX.jl reductions.
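
For the record, a minimal sketch of what such a reduction could look like with ThreadsX.jl (my own code, not from the thread; it relies on `ThreadsX.sum` mirroring the usual `sum(f, itr)` signature):

``````
using ThreadsX

# The accumulation is expressed as a reduction over the columns, so the parallel
# sum owns the accumulator instead of many threads racing on a shared `s`.
loop_dot_tx(x, A, y) = ThreadsX.sum(i -> y[i] * sum(x[k] * A[k, i] for k in eachindex(x)),
                                    eachindex(y))
``````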

3 Likes

``````
function loop_dot(x, A, y)
    s = Matrix{Float64}(undef, length(y), length(x))
    Threads.@threads for i = eachindex(y)
        for k = eachindex(x)
            s[i, k] = x[k]*A[k,i]*y[i]
        end
    end
    return sum(s)
end
``````

Here's some documentation around this topic:

https://juliafolds.github.io/data-parallelism/

2 Likes

Some alternatives:

serial:

``````
julia> function loop_dot_serial(x, A, y)
           s = 0.
           for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> @btime loop_dot_serial($x,$A,$y)
259.091 ms (0 allocations: 0 bytes)
98.64233873489854

``````

manual:

``````
julia> function loop_dot_manual(x, A, y; ntasks = Threads.nthreads())
           s = zeros(ntasks)
           Threads.@threads for it in 1:ntasks
               for i in it:ntasks:length(y)
                   for k = eachindex(x)
                       s[it] += x[k]*A[k,i]*y[i]
                   end
               end
           end
           return sum(s)
       end
loop_dot_manual (generic function with 1 method)

julia> @btime loop_dot_manual($x,$A,$y)
105.539 ms (42 allocations: 4.02 KiB)
98.64233873543105

``````

using FLoops:

``````
julia> using FLoops

julia> function loop_dot(x, A, y)
           @floop for i = eachindex(y)
               for k = eachindex(x)
                   @reduce s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot (generic function with 1 method)

julia> @btime loop_dot($x,$A,$y)
91.441 ms (116 allocations: 5.89 KiB)
98.64233873490912

``````

using LoopVectorization (applicable for this simple operation):

``````
julia> using LoopVectorization

julia> function loop_dot_turbo(x, A, y)
           s = 0.
           @tturbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo (generic function with 1 method)

julia> @btime loop_dot_turbo($x,$A,$y)
69.347 ms (0 allocations: 0 bytes)
98.64233873475337

``````

using Tullio:

``````
julia> using Tullio

julia> function loop_dot_tullio(x,A,y)
           s = 0.
           @tullio s += x[k]*A[k,i]*y[i]
           return s
       end
loop_dot_tullio (generic function with 1 method)

julia> @btime loop_dot_tullio($x,$A,$y)
72.149 ms (203 allocations: 10.30 KiB)
98.64233873475337

``````
7 Likes

Thank you. So I have a somewhat simplified version:

``````
function loop_dot_auto(x, A, y; ntasks = Threads.nthreads())
    s = zeros(ntasks)
    Threads.@threads for i = eachindex(y)
        id = Threads.threadid()
        @simd for k = eachindex(x)
            s[id] += x[k]*A[k,i]*y[i]
        end
    end
    return sum(s)
end
``````

It seems FLoops does the job:

``````
using BenchmarkTools, LinearAlgebra
using FLoops

const n=10000
x=rand(n)
A=randn(n,n)
y=rand(n)

function loop_dot(x, A, y)
    s = 0.0
    @floop for i = eachindex(y)
        for k = eachindex(x)
            @reduce s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

``````

And by executing @btime loop_dot(x,A,y)
on my PC I found:

92.664 ms (54 allocations: 2.77 KiB)
3425.3843229334875

The use of `threadid()` is not a recommended pattern, because the tasks may migrate between threads. That is why the "manual" version above is slightly more complicated than this, but it is safe.
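
For reference, one way to avoid `threadid()` entirely is to give each task its own accumulator and sum the partial results at the end. This is a sketch of my own (not the version benchmarked above), using contiguous column chunks:

``````
function loop_dot_tasks(x, A, y; ntasks = Threads.nthreads())
    # Split the columns into ntasks contiguous chunks; each task owns one chunk
    # and a local accumulator, so no mutable state is shared between tasks.
    chunks = Iterators.partition(eachindex(y), cld(length(y), ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            s = 0.0
            for i in chunk, k in eachindex(x)
                @inbounds s += x[k] * A[k, i] * y[i]
            end
            s
        end
    end
    return sum(fetch, tasks)
end
``````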

1 Like

It may be useful to mention that it is very unlikely that MT parallelization allows for satisfactory speedups (if any) in real life for this kernel, because its Computational Density (or Arithmetic Intensity) is very low. In other words, most of the time is spent in data access and not in arithmetic calculations. This is (spuriously?) mitigated by the hot-cache setup provided by `@btime`.

In order to accelerate, the best strategy is to fuse different kernels to increase the CD.
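
To make that concrete, a rough estimate of the arithmetic intensity of this kernel (my own numbers; they match the 3n^2-flop and 8n^2-byte counts used in the benchmark script later in the thread):

``````
# Each element of A contributes 2 multiplications + 1 addition and must be
# loaded once (8 bytes); x[k] and y[i] are reused from cache.
flops_per_element = 3
bytes_per_element = sizeof(Float64)                           # 8
arithmetic_intensity = flops_per_element / bytes_per_element  # ≈ 0.375 flop/byte
``````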

2 Likes

The benchmark might be tricky here, perhaps, because what I get is that `@turbo` provides a speedup similar to using threads (4 threads here). I am trying to avoid the hot cache by re-initializing the arrays every time. It may be that for some reason SIMD operations are kicking in inside the threaded loop and not in the serial loop…

``````
julia> using LoopVectorization

julia> function loop_dot_serial(x, A, y)
           s = 0.
           for i = eachindex(y)
               for k = eachindex(x)
                   @inbounds s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> function loop_dot_manual(x, A, y; ntasks = Threads.nthreads())
           s = zeros(ntasks)
           Threads.@threads for it in 1:ntasks
               for i in it:ntasks:length(y)
                   for k = eachindex(x)
                       @inbounds s[it] += x[k]*A[k,i]*y[i]
                   end
               end
           end
           return sum(s)
       end
loop_dot_manual (generic function with 1 method)

julia> function loop_dot_turbo_parallel(x, A, y)
           s = 0.
           @tturbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo_parallel (generic function with 1 method)

julia> function loop_dot_turbo_serial(x, A, y)
           s = 0.
           @turbo for i = eachindex(y)
               for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_turbo_serial (generic function with 1 method)

julia> const n = 10^4
10000

julia> @btime loop_dot_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
119.716 ms (0 allocations: 0 bytes)
1.2341963698459884e7

julia> @btime loop_dot_manual(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
30.759 ms (43 allocations: 4.05 KiB)
1.2442783679944485e7

julia> @btime loop_dot_turbo_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
33.684 ms (0 allocations: 0 bytes)
1.2526249273799863e7

julia> @btime loop_dot_turbo_parallel(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
31.227 ms (0 allocations: 0 bytes)
1.2405906908949133e7

``````

By using simd in the inner loop of the serial version, we get almost the same performance as the threaded versions:

``````
julia> function loop_dot_serial(x, A, y)
           s = 0.
           @inbounds for i = eachindex(y)
               @simd for k = eachindex(x)
                   s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot_serial (generic function with 1 method)

julia> @btime loop_dot_serial(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
43.715 ms (0 allocations: 0 bytes)
1.237619409940113e7

``````
3 Likes

`@inbounds` needs to go inside `@threads`: see What is the cause of this performance difference between Julia and Cython? - #4 by tkf

In general, try minimizing the scope of `@inbounds`.

I think this needs `@inbounds` to be compared properly with LoopVectorization.jl and Tullio.jl. This also applies to `@threads` and `@floop`.
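
A generic illustration of that placement (my own example, not from the thread): the `@inbounds` annotates the innermost loop running inside each task, rather than wrapping the macro call itself.

``````
# Each column writes to its own output slot, so there is no race, and
# @inbounds sits on the loop body inside the threaded region.
function col_sums!(out, A)
    Threads.@threads for j in axes(A, 2)
        acc = 0.0
        @inbounds for i in axes(A, 1)
            acc += A[i, j]
        end
        out[j] = acc
    end
    return out
end

# usage: col_sums!(zeros(size(A, 2)), A)
``````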

GC may re-use the same memory region. Maybe a more robust approach is to allocate a lot of arrays such that the total number of bytes is at least (say) double the L3 cache size, and use them one by one.
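
A sketch of how that could look (my own code; the number of input sets is an arbitrary assumption, it just needs to push the live working set well past the L3 size):

``````
n = 10^4
nsets = 3   # assumption: 3 live copies of a 10^4 x 10^4 Float64 matrix (~2.4 GB) dwarf any L3 cache
sets = [(rand(n), rand(n, n), rand(n)) for _ in 1:nsets]

# Cycle through independent input sets so each call sees (mostly) cold data.
function mean_time(f, sets)
    t = 0.0
    for (x, A, y) in sets
        t += @elapsed f(x, A, y)
    end
    return t / length(sets)
end

mean_time(loop_dot, sets)   # assumes one of the loop_dot variants above is defined
``````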

2 Likes

Indeed, with `@inbounds` `FLoops` is as fast as I can get here.

I am not sure the cache is a problem there; I tried the different approaches below, and none seems to make any difference:

``````
julia> using FLoops

julia> function loop_dot(x, A, y)
           @floop for i = eachindex(y)
               @inbounds for k = eachindex(x)
                   @reduce s += x[k]*A[k,i]*y[i]
               end
           end
           return s
       end
loop_dot (generic function with 1 method)

julia> const n = 10^4
10000

julia> @btime loop_dot(x,A,y) setup=(x=rand($n);y=rand($n);A=rand($n,$n)) evals=1
27.630 ms (116 allocations: 5.89 KiB)
1.2601847672491744e7

julia> buff = [ (rand(n),rand(n,n),rand(n)) for _ in 1:5 ];

julia> @btime loop_dot(x,A,y) setup=((x,A,y)=$(buff[rand(1:5)])) evals=1
27.078 ms (115 allocations: 5.86 KiB)
1.2462334838512724e7

julia> (x,A,y) = buff[1];

julia> @btime loop_dot($x,$A,$y)
28.369 ms (115 allocations: 5.86 KiB)
1.2575566525724193e7

``````

Maybe none of these strategies avoids the hot cache, or maybe all of them do.

ps: The histograms using `@benchmark` are slightly different, but nothing conclusive either.

Matrix multiplication has low CD, but BLAS.gemm seems to always benefit from multiple threads, right? You can see how the BLAS version of the same operation (out)performs:

``````
@btime transpose(x)*A*y
``````
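
As a side note (my own sketch, not from the thread): `transpose(x)*A*y` allocates an intermediate vector, so when benchmarking the BLAS route it may be fairer to reuse a buffer with `mul!`:

``````
using LinearAlgebra, BenchmarkTools

# Same x' * A * y, but the matrix-vector product lands in a preallocated buffer,
# leaving only the BLAS gemv call and a serial dot.
tmp = similar(y)
blas_dot(x, A, y, tmp) = dot(x, mul!(tmp, A, y))

@btime blas_dot($x, $A, $y, $tmp)
``````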

Nope.
`gemm` has a computational density proportional to the matrix dimension N
(N^3 flops vs N^2 values). Hence the computational density can be arbitrarily large, which makes this kernel so purely computation bound that it is used to measure the peak performance of computers via LINPACK, or in Julia via the `peakflops` function.

`dot` is the complete opposite and has a very low computation density.
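
For reference, this is the function mentioned above; it times a large double-precision `gemm` to estimate the peak flop rate (usage sketch, the number is machine dependent):

``````
using LinearAlgebra
LinearAlgebra.peakflops(4096)   # estimated peak flop/s achieved by BLAS gemm
``````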

1 Like

I agree with the caution on the way the benchmark is done and also on the advice that tuning performance requires more thought on hiding memory latencies. But this statement as-is sounds a bit too broad to me. It's not entirely crazy to imagine an application that pre-allocates `x`, `y`, and `A`, re-computes them just before calling this `dot` function, and then repeats this over and over again. In the end, the programmers need to go back to their actual application and measure its performance. It may not satisfy experienced HPC programmers, but I think dealing with low-hanging fruit can be OK for many users.

1 Like

I do indeed appreciate low-hanging fruit, but I am concerned about possible disappointments for people giving M.T. parallelism a try. Actually, I thought that this specific kernel would not show a significant speed-up with M.T. compared to properly vectorized code.

So I tried it on my machine and… I was wrong!
The thing is that I believe the following results (could not resist trying UnicodePlots ;)) are very specific to the Apple M1 architecture. I guess the situation will probably be very different on most x86 laptops with much lower memory bandwidth (but I would love to be proven wrong again).

on M1 max mac book and 8 Threads :

``````
GB/s (n=10)
  floop_dot          0.03401852308582023
  serial_loop_dot   14.787430683918668
  turbo_loop_dot    32.0
  tturbo_loop_dot   32.0

GB/s (n=100)
  floop_dot          2.2606852702225364
  serial_loop_dot    8.656509695290858
  turbo_loop_dot    25.98161800526128
  tturbo_loop_dot   22.966067635069184

GB/s (n=1000)
  floop_dot         45.787336138570794
  serial_loop_dot    8.603306896088183
  turbo_loop_dot    25.793970659358376
  tturbo_loop_dot   91.4589880748912

GB/s (n=10000)
  floop_dot         55.71654658230281
  serial_loop_dot    8.505742220725514
  turbo_loop_dot    24.53127382168756
  tturbo_loop_dot   76.60835653529132
``````
Here is the code to launch the benchmark
``````
using FLoops
using BenchmarkTools
using LoopVectorization
using DataFrames
using UnicodePlots

function floop_dot(x, A, y)
    @floop for i = eachindex(y)
        @inbounds for k = eachindex(x)
            @reduce s += x[k] * A[k, i] * y[i]
        end
    end
    return s
end

function serial_loop_dot(x, A, y)
    s = zero(eltype(x))
    for i = eachindex(y)
        for k = eachindex(x)
            @inbounds s += x[k] * A[k, i] * y[i]
        end
    end
    return s
end

function turbo_loop_dot(x, A, y)
    s = zero(eltype(x))
    @turbo for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

function tturbo_loop_dot(x, A, y)
    s = zero(eltype(x))
    @tturbo for i = eachindex(y)
        for k = eachindex(x)
            s += x[k]*A[k,i]*y[i]
        end
    end
    return s
end

gflops(t,n) = 3n^2/(t*1.e9)
gbs(t,n) = (n^2)*sizeof(Float64)/(t*1.e9)

function benchit(f,n)
    t = @belapsed $f(x, A, y) setup = (x = rand($n); y = rand($n); A = rand($n, $n)) evals = 10
end

function go(es)
    df = DataFrame()
    for p ∈ es
        n = 10^p
        @show n
        for f ∈ (floop_dot, serial_loop_dot, turbo_loop_dot, tturbo_loop_dot)
            t = benchit(f, n)
            append!(df, DataFrame(n = n, fn = string(f), t = t, GFlops = gflops(t, n), GBs = gbs(t, n)))
        end
    end
    df
end

es = 1:4
df = go(es)

ps = [ filter(row -> row.n == n, df) |> dfn -> barplot(dfn.fn, dfn.GBs, title = "GB/s (n=$n)") for n ∈ unique(df.n) ]
ps
``````
5 Likes

Almost all of my pre-conceived ideas about MT have been proved wrong by actually using it.

2 Likes

Having the roofline model in mind, I was not so much surprised by M.T. as by the fact that this new machine challenges my intuition.

Here is what I got on my 2-year-old x86 laptop (4 cores, 8 threads):

``````
GB/s (n=10000)
  floop_dot         26.65483671703491
  serial_loop_dot    6.886856472901777
  turbo_loop_dot    24.17812734836673
  tturbo_loop_dot   24.651486451815757
``````

where my bet that M.T. does not bring any speedup compared to the vectorized (`@turbo`) version turns out to be correct.
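
As a rough sanity check against the roofline model (my own back-of-the-envelope numbers, using the ~3 flops per 8 bytes of this kernel and the measured bandwidth above):

``````
# s += x[k]*A[k,i]*y[i] does ~3 flops per 8-byte element of A that is streamed in,
# so the attainable flop rate is capped by memory bandwidth, not by core count.
ai      = 3 / 8        # arithmetic intensity, flop per byte
bw_gbs  = 25.0         # ~ sustained GB/s measured on the x86 laptop above
ceiling = ai * bw_gbs  # ≈ 9.4 GFlop/s, regardless of how many threads run
``````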

Compared to Apple silicon:

``````
GB/s (n=10000)
  floop_dot         55.71654658230281
  serial_loop_dot    8.505742220725514
  turbo_loop_dot    24.53127382168756
  tturbo_loop_dot   76.60835653529132
``````
1 Like

Yeah, I am always surprised when I notice that some of my intuitions about the performance of multi-threaded programs are actually very specific to the machine I happen to use.

1 Like