# Bad performances when using Multithreading and Distributed with heavy LinearAlgebra calculations

Hello,

I have noticed that Julia multithreading and Distributed scale very badly when calculating relatively heavy computations involving BLAS and LAPACK. As an example, here I show you the calculation of the `expv` function, which calculates `exp(A) * b` (with `A` a square matrix and `b` a vector`) without directly calculating `exp(A)` but rather using the Arnoldi algorithm.

Here I show you the main code of the algorithm, but I donβt think it is necessary to study it in depth. It should be well written (correct me if it is not), involving only in-place matrix-vector multiplications `mul!(y, A, x)`, where `x` and `y` can also be a `view`, and then it involves `exp(B)` where `B` is a 30x30 matrix. I paste it here just for completeness

Arnoldi expv
``````struct ArnoldiSpace{VT<:AbstractMatrix{<:BlasFloat},HT<:AbstractMatrix{<:BlasFloat},mT<:Integer}
V::VT
H::HT
Hcopy::HT
m::mT
end

function Base.copy(AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}}) where {T1<:BlasFloat}
return ArnoldiSpace(copy(AS.V), copy(AS.H), copy(AS.Hcopy), AS.m)
end

function Base.deepcopy(AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}}) where {T1<:BlasFloat}
return ArnoldiSpace(deepcopy(AS.V), deepcopy(AS.H), deepcopy(AS.Hcopy), AS.m)
end

function arnoldi_init!(A, b::AbstractVector{T}, V::AbstractMatrix{T}, H::AbstractMatrix{T}) where {T<:BlasFloat}
vβ = view(V, :, 1)
vβ = view(V, :, 2)
vβ .= b
normalize!(vβ)

mul!(vβ, A, vβ)
H[1, 1] = dot(vβ, vβ)
axpy!(-H[1, 1], vβ, vβ)
H[2, 1] = norm(vβ)
return vβ ./= H[2, 1]
end

function arnoldi_step!(A, V::AbstractMatrix{T}, H::AbstractMatrix{T}, i::TI) where {T<:BlasFloat,TI<:Integer}
vα΅’ = view(V, :, i)
vα΅’ββ = view(V, :, i + 1)
mul!(vα΅’ββ, A, vα΅’)
for j in 1:i
vβ±Ό = view(V, :, j)
H[j, i] = dot(vβ±Ό, vα΅’ββ)
axpy!(-H[j, i], vβ±Ό, vα΅’ββ)
end
Ξ² = H[i+1, i] = norm(vα΅’ββ)
vα΅’ββ ./= H[i+1, i]
return Ξ²
end

function arnoldi!(
AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}},
A,
b::AbstractVector{T2},
) where {T1<:BlasFloat,T2<:BlasFloat}
n = size(A, 2)
V = AS.V
H = AS.H
m = AS.m

n == size(V, 1) || throw(DimensionMismatch())
n == length(b) || throw(DimensionMismatch())

arnoldi_init!(A, b, V, H)
for i in 2:m
arnoldi_step!(A, V, H, i)
end
return AS
end

function arnoldi(A, b::AbstractVector{T}, m::Integer) where {T<:BlasFloat}
n = size(A, 2)
V = similar(b, n, m + 1)
H = zeros(T, m + 1, m)
AS = ArnoldiSpace(V, H, copy(H), m)
return arnoldi!(AS, A, b)
end

### EXPV TOOLS ###

function expv!(
x::AbstractVector{T1},
AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}},
t::T2,
b::AbstractVector{T1},
) where {T1<:BlasFloat,T2<:Union{BlasFloat,BlasInt}}
H = AS.H
Hcopy = AS.Hcopy
V = AS.V
m = AS.m

Hm = view(H, 1:m, 1:m)
Vm = view(V, :, 1:m)
lmul!(t, Hm)

Hcopym = view(Hcopy, 1:m, 1:m)
copyto!(Hcopym, Hm)
expH = LinearAlgebra.exp!(Hcopym)
# expH = exp(Hm)

Ξ² = norm(b)
expHe = expH[:, 1] # the view doesn't work when using copyto! with CUDA
cache = similar(x, m)
copyto!(cache, expHe) # just in case expHe is different from cache (e.g., with CUDA)
mul!(x, Vm, cache)
lmul!(Ξ², x)

return x
end

function expv!(
x::AbstractVector{T1},
A,
t::T2,
b::AbstractVector{T1};
m::Int = min(30, cld(2 * length(b), 3)),
) where {T1<:BlasFloat,T2<:Union{BlasFloat,BlasInt}}
AS = arnoldi(A, b, m)
return expv!(x, AS, t, b)
end

function expv(
A,
t::T1,
b::AbstractVector{T2};
m::Int = min(30, cld(2 * length(b), 3)),
) where {T1<:BlasFloat,T2<:BlasFloat}
x = similar(b)
return expv!(x, A, t, b, m = m)
end
``````

And then I create a file called `expv.jl` to make some benchmarks

``````# expv.jl #

using Distributed

##

@everywhere begin
using LinearAlgebra
using LinearAlgebra: BlasFloat, BlasInt
using SparseArrays
end

using BenchmarkTools
using JLD2

##

N = parse(Int, ARGS[1])
T = ComplexF64

H = sprand(T, N, N, 0.1/N)
H = H + H' # Hermitianize

V = sprand(T, N, N, 0.01/N)
V = V + V' # Hermitianize

Ο0 = rand(T, N)
normalize!(Ο0)

t = 0.4

expv(-1im * H, t, Ο0)

function calc_serial(H, V, Ο0, t, g_list)
res = similar(g_list)
for i in eachindex(g_list)
g = g_list[i]
A = -1im * (H + g * V)
res[i] = abs2( dot(Ο0, expv(A, t, Ο0)) )
end
return res
end

function calc_multithread(H, V, Ο0, t, g_list)
res = similar(g_list)
g = g_list[i]
A = -1im * (H + g * V)
res[i] = abs2( dot(Ο0, expv(A, t, Ο0)) )
end
return res
end

function calc_distributed(H, V, Ο0, t, g_list)
res = pmap(g_list) do g
A = -1im * (H + g * V)
abs2( dot(Ο0, expv(A, t, Ο0)) )
end
return res
end

##

g_list = range(0.0, 0.1, 32)

# calc_serial(H, V, Ο0, t, g_list)

# calc_multithread(H, V, Ο0, t, g_list)

# calc_distributed(H, V, Ο0, t, g_list)

##

bench_serial = @benchmark calc_serial(\$H, \$V, \$Ο0, \$t, \$g_list)
mean_time_serial = 1e-9 * sum(bench_serial.times) / length(bench_serial.times)
jldopen("benchmarks.jld2", "a+") do file
end
end

bench_distributed = @benchmark calc_distributed(\$H, \$V, \$Ο0, \$t, \$g_list)

mean_time_distributed = 1e-9 * sum(bench_distributed.times) / length(bench_distributed.times)

jldopen("benchmarks.jld2", "a+") do file
end
``````

And then running the following bash script

``````#!/bin/bash

# for ((N=1000; N<=100000; N*=10)); do
#     julia -t \$threads --project expv.jl \$N
# done
N=100000
julia -t \$threads --project expv.jl \$N
done
``````

The benchmarks are very disappointing. Especially for the Distributed case, were I would expect better performances, due to the heavy computations. The Serial case is a clean horizontal line because I did only one benchmark.

The general system infos are

``````julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 Γ 13th Gen Intel(R) Core(TM) i9-13900KF
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 16 default, 0 interactive, 8 GC (on 32 virtual cores)
Environment:
JULIA_EDITOR = code
``````

Just to check if multithreading is working properly, I tested a simple matrix-vector multiplication

``````function matvec_mul!(y, A, x)
for i in 1:size(A, 1)
y[i] = 0
for j in 1:size(A, 2)
y[i] += A[i, j] * x[j]
end
end
y
end

y[i] = 0
for j in 1:size(A, 2)
y[i] += A[i, j] * x[j]
end
end
y
end
``````

And I get 88ms (Serial) vs 16ms (Multithreaded) on a `ComplexF64` 5000x5000 matrix, which seems more reasonable (but still just a factor 5.5) when using 16 threads.

1 Like

Do you think this is related?

1 Like

Not sure if this applies to here, but BLAS is already multi-threaded which can complicate measurements like this. Probably worth setting the environment variable `OPENBLAS_NUM_THREADS=1` to start with.

1 Like

I have just set `BLAS.set_num_threads(1)`, it should be the same, right?

Yes, I think that is the same.

1 Like

What I find surprisingly strange is the Distributed case, which shouldnβt be correlated to the BLAS multithreading.

You could try to use ThreadPinning.jl to see whether this improves things.

Note that this naive implementation is memory bound. So I am not surprised that throwing more threads at it only helps so much after a point.

About your `expv` code: It could be that you hit a memory bound as well. In that case improvements are difficult. But I am not an expert on these things.

I suspect there might be some memory contention here. You have mutiple threads trying to access the same memory all ar once.

I have just added `pinthreads(:cores)` before calling the multithreaded function, the results are pretty much the same. Here I show the benchmarks with a matrix smaller by a factor 10

Moreover, what do you mean by βhitting a memory boundβ?

But this shouldnβt hold for the Distributed case, right? In this case each process should have its own copy of the memory.

Well, but the bandwidth between the CPU and the memory (RAM) is limited, no matter how many cores you have. Which hardware are you using?

Or are you using Distributed with multiple machines?

• Ryzen 7950X: Max. Memory Bandwidthβ¦ 73.4 GB/s
• Apple M3 Max: Memory Bandwidth with 96GB RAM 300.0 GB/s
1 Like

Do you have a gold standard implementation to compare to that you can share?

Without sharing your `expv` itβs hard to know. Thereβs a lot of details in how it can and should be done. ExponentialUtilities.jl did a bunch of optimizations here and found that the `balance!` thatβs required in the Arnoldi tends to be a limiting factor in some cases, and IIUC that does not multithread well. Thatβs where GenericSchur.jl and stuff then starts to come in, though that does not fix the multithreading as that library needs an update.

Also, OpenBLAS gets much worse thread performance than MKL, so switching the backends helped there.

Your cpu has 8 performance cores, you should perhaps not expect much more speedup with such a highly optimized library as BLAS. Also you have a maximum memory bandwidth of 90 GB/s. With many BLAS operations you might exceed it. You can check such metrics with packages like LIKWID.jl or LinuxPerf.jl

https://www.intel.com/content/www/us/en/products/sku/230497/intel-core-i913900kf-processor-36m-cache-up-to-5-80-ghz/specifications.html

1 Like

The `expv` code is in the beginning of the thread. Actually, the code I have is much more complex, stil involving Arnoldi but also using LinearSolve.jl and diagonalization. I have chosen the `expv` code just to have a relative clean code to check with a heavy computation.

Ok, memory bandwidth seems a reasonable bottleneck. Studying the `expv` function, I think that is comes from the `mul!(y, A, x)` functions, since the other βexpensiveβ function should be `exp!`, which is applied to a small 30x30 matrix. Do you think this is correct? Is there a way to avoid such memory efforts?

BTW, 90 GB/s is very large in my opinion. I mean, in the last plot I showed I considered a system of size 10000x10000 of type ComplexF64. This means that a single operation `mul!(y, A, x)` should move at least 10000x16 bytes = 156 KB, which is far below 90 GB/s, even when parallelizing with 16 cores.

I was trying to use LIKWID.jl to see the memory bandwidth usage of certain Julia code. I started using a single core `mul!(y, A, x)` case, where A is a dense matrix and x and y are vectors. If the dimension is `N = 10000`, the benchmarks give

``````
julia> @benchmark mul!(\$y, \$A, \$x)
BenchmarkTools.Trial: 299 samples with 1 evaluation.
Range (min β¦ max):  16.439 ms β¦  17.241 ms  β GC (min β¦ max): 0.00% β¦ 0.00%
Time  (median):     16.799 ms               β GC (median):    0.00%
Time  (mean Β± Ο):   16.710 ms Β± 241.237 ΞΌs  β GC (mean Β± Ο):  0.00% Β± 0.00%

βββββ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
16.4 ms         Histogram: frequency by time         17.1 ms <

Memory estimate: 0 bytes, allocs estimate: 0.
``````

While LIKWID gives

``````julia> metrics, events = @perfmon "FLOPS_SP" mul!(y, A, x);
ERROR: The selected register PMC0 is in use.
Please run likwid with force option (-f, --force) to overwrite settings

Group: FLOPS_SP
βββββββββββββββββββββββββββββ¬ββββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬ββββ
βββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌββββ
β          ACTUAL_CPU_CLOCK β 1.02241e8 β 133354.0 β 323192.0 β 132258.0 β 131953.0 β 133884.0 β 137479.0 β 117064.0 β 128708.0 β   β―
β             MAX_CPU_CLOCK β 8.23769e7 β 130235.0 β 313180.0 β 128100.0 β 127610.0 β 130270.0 β 133035.0 β 126910.0 β 129395.0 β   β―
β      RETIRED_INSTRUCTIONS β       NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β   β―
β       CPU_CLOCKS_UNHALTED β 7.19432e7 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β RETIRED_SSE_AVX_FLOPS_ALL β    2.25e8 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β                     MERGE β       0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
βββββββββββββββββββββββββββββ΄ββββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄ββββ
23 columns omitted
ββββββββββββββββββββββββ¬ββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬ββββββ
ββββββββββββββββββββββββΌββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌββββββ
β  Runtime (RDTSC) [s] β 0.0170919 β  0.0170919 β  0.0170919 β  0.0170919 β  0.0170919 β  0.0170919 β  0.0170919 β  0.0170919 β  0. β―
β Runtime unhalted [s] β 0.0292671 β 3.81736e-5 β 9.25162e-5 β 3.78599e-5 β 3.77725e-5 β 3.83253e-5 β 3.93544e-5 β 3.35105e-5 β 3.6 β―
β          Clock [MHz] β   4335.71 β    3577.02 β    3605.04 β    3606.75 β    3612.25 β    3590.27 β    3610.05 β    3222.33 β     β―
β                  CPI β       Inf β        NaN β        NaN β        NaN β        NaN β        NaN β        NaN β        NaN β     β―
β         SP [MFLOP/s] β   13164.1 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β     β―
ββββββββββββββββββββββββ΄ββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄ββββββ
24 columns omitted
``````

I donβt understand why I have 2.25 * 10^8 FLOPS, since the correct one should be 2 * 10^8. Aside from that, the calculated MFLOPS/s seems reasonable, since I divided the FLOPS by the 16 ms of the benchmark.

Now I do the memory case

``````julia> metrics, events = @perfmon "MEM_SP" mul!(y, A, x);
ERROR: The selected register PMC0 is in use.
Please run likwid with force option (-f, --force) to overwrite settings

Group: MEM_SP
βββββββββββββββββββββββββββββ¬ββββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬βββββββββββ¬ββββ
βββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌβββββββββββΌββββ
β          ACTUAL_CPU_CLOCK β  1.0262e8 β 134993.0 β 140043.0 β 136181.0 β 136959.0 β 139807.0 β 135305.0 β 121054.0 β 132740.0 β   β―
β             MAX_CPU_CLOCK β 8.27059e7 β 130760.0 β 135450.0 β 131635.0 β 132475.0 β 135905.0 β 132020.0 β 129045.0 β 133315.0 β   β―
β      RETIRED_INSTRUCTIONS β       NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β      NaN β   β―
β       CPU_CLOCKS_UNHALTED β 7.19892e7 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β RETIRED_SSE_AVX_FLOPS_ALL β    2.25e8 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β                     MERGE β       0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β            DRAM_CHANNEL_0 β  888725.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
β            DRAM_CHANNEL_1 β  888544.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β      0.0 β   β―
βββββββββββββββββββββββββββββ΄ββββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄βββββββββββ΄ββββ
23 columns omitted
βββββββββββββββββββββββββββββββββββββ¬ββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬βββββββββββββ¬ββββββ
βββββββββββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌββββββ
β               Runtime (RDTSC) [s] β  0.016978 β   0.016978 β   0.016978 β   0.016978 β   0.016978 β   0.016978 β   0.016978 β   0 β―
β              Runtime unhalted [s] β 0.0293757 β 3.86428e-5 β 4.00884e-5 β 3.89828e-5 β 3.92056e-5 β 4.00208e-5 β 3.87321e-5 β 3.4 β―
β                       Clock [MHz] β   4334.49 β    3606.45 β    3611.81 β     3614.0 β     3611.6 β    3593.66 β    3580.28 β     β―
β                               CPI β       Inf β        NaN β        NaN β        NaN β        NaN β        NaN β        NaN β     β―
β                      SP [MFLOP/s] β   13252.4 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β     β―
β       Memory bandwidth [MBytes/s] β   26798.2 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β     β―
β       Memory data volume [GBytes] β  0.454981 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β        0.0 β     β―
β Operational intensity [FLOP/Byte] β  0.494526 β        NaN β        NaN β        NaN β        NaN β        NaN β        NaN β     β―
βββββββββββββββββββββββββββββββββββββ΄ββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄βββββββββββββ΄ββββββ
25 columns omitted
``````

I see 26 GB/s of memory bandwidth. So I guess that it is obtained by doing

``````sizeof(A) / 1024^3 / 0.0167
22.307127535700086
``````

Assuming that every is correct, this means that I could parallelize this operations just 3 times, before reaching the memory bandwidth limits of my hardware (assuming 75 GB/s). Is that right?

More or less, yes. Itβs not necessarily that simple because more threads may cause a different pattern of memory access, but more or less, yes.

Great. Iβm slightly confused, however. Is your CPU really this one? According to your `versioninfo()` output above it is. That would be strange because LIKWID doesnβt support Raptor Lake yet. Did you run the experiment on a different system/CPU?

Your estimation seems reasonable. Every element of `A` must must be read from memory at some point (because `sizeof(Float32) * 10000^2 / 1000^2` = 400 MB, `A` shoudnβt fit into cache). The values of `y` and `x` also need to be read (and written). But those vectors are much smaller and likely reside in cache. (You made a minor mistake though: MB = Mega = 1000^3 not Mebi = 1024^3.)

On a more general note, whether one is bounded by memory access or compute can depend on the specific implementation. For example, a naive implementation of a matrix-matrix product is memory bounded, a highly optimized implementation (say in BLAS) OTOH is bound by compute. Hence, estimating the data transfer isnβt always simple and requires care (and knowledge about the implementation).

Depending on the system you are working on, it can be non-trivial to achieve the maximum memory bandwidth. For example, there might be multiple memory channels that can only be accessed from certain cores. If your 3 parallel computations run on the wrong cores, you are not bounded by the maximal memory bandwidth (of the entire system) but of the same of a single memory channel. In such a case, thread pinning plays a crucial role.

1 Like

Yes, LIKWID.jl doesnβt support my current CPU, so I switched to another machine to do the benchmarks.

Could you explain better this? What do you mean by βmemory boundβ and βcompute bounds?