Bad performances when using Multithreading and Distributed with heavy LinearAlgebra calculations

Hello,

I have noticed that Julia multithreading and Distributed scale very badly when calculating relatively heavy computations involving BLAS and LAPACK. As an example, here I show you the calculation of the expv function, which calculates exp(A) * b (with A a square matrix and b a vector) without directly calculating exp(A)` but rather using the Arnoldi algorithm.

Here I show you the main code of the algorithm, but I don’t think it is necessary to study it in depth. It should be well written (correct me if it is not), involving only in-place matrix-vector multiplications mul!(y, A, x), where x and y can also be a view, and then it involves exp(B) where B is a 30x30 matrix. I paste it here just for completeness

Arnoldi expv
struct ArnoldiSpace{VT<:AbstractMatrix{<:BlasFloat},HT<:AbstractMatrix{<:BlasFloat},mT<:Integer}
    V::VT
    H::HT
    Hcopy::HT
    m::mT
end

function Base.copy(AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}}) where {T1<:BlasFloat}
    return ArnoldiSpace(copy(AS.V), copy(AS.H), copy(AS.Hcopy), AS.m)
end

function Base.deepcopy(AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}}) where {T1<:BlasFloat}
    return ArnoldiSpace(deepcopy(AS.V), deepcopy(AS.H), deepcopy(AS.Hcopy), AS.m)
end

function arnoldi_init!(A, b::AbstractVector{T}, V::AbstractMatrix{T}, H::AbstractMatrix{T}) where {T<:BlasFloat}
    v₁ = view(V, :, 1)
    vβ‚‚ = view(V, :, 2)
    v₁ .= b
    normalize!(v₁)

    mul!(vβ‚‚, A, v₁)
    H[1, 1] = dot(v₁, vβ‚‚)
    axpy!(-H[1, 1], v₁, vβ‚‚)
    H[2, 1] = norm(vβ‚‚)
    return vβ‚‚ ./= H[2, 1]
end

function arnoldi_step!(A, V::AbstractMatrix{T}, H::AbstractMatrix{T}, i::TI) where {T<:BlasFloat,TI<:Integer}
    vα΅’ = view(V, :, i)
    vα΅’β‚Šβ‚ = view(V, :, i + 1)
    mul!(vα΅’β‚Šβ‚, A, vα΅’)
    for j in 1:i
        vβ±Ό = view(V, :, j)
        H[j, i] = dot(vβ±Ό, vα΅’β‚Šβ‚)
        axpy!(-H[j, i], vβ±Ό, vα΅’β‚Šβ‚)
    end
    Ξ² = H[i+1, i] = norm(vα΅’β‚Šβ‚)
    vα΅’β‚Šβ‚ ./= H[i+1, i]
    return Ξ²
end

function arnoldi!(
    AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}},
    A,
    b::AbstractVector{T2},
) where {T1<:BlasFloat,T2<:BlasFloat}
    n = size(A, 2)
    V = AS.V
    H = AS.H
    m = AS.m

    n == size(V, 1) || throw(DimensionMismatch())
    n == length(b) || throw(DimensionMismatch())

    arnoldi_init!(A, b, V, H)
    for i in 2:m
        arnoldi_step!(A, V, H, i)
    end
    return AS
end

function arnoldi(A, b::AbstractVector{T}, m::Integer) where {T<:BlasFloat}
    n = size(A, 2)
    V = similar(b, n, m + 1)
    H = zeros(T, m + 1, m)
    AS = ArnoldiSpace(V, H, copy(H), m)
    return arnoldi!(AS, A, b)
end

### EXPV TOOLS ###

function expv!(
    x::AbstractVector{T1},
    AS::ArnoldiSpace{<:AbstractMatrix{T1},<:AbstractMatrix{T1}},
    t::T2,
    b::AbstractVector{T1},
) where {T1<:BlasFloat,T2<:Union{BlasFloat,BlasInt}}
    H = AS.H
    Hcopy = AS.Hcopy
    V = AS.V
    m = AS.m

    Hm = view(H, 1:m, 1:m)
    Vm = view(V, :, 1:m)
    lmul!(t, Hm)

    Hcopym = view(Hcopy, 1:m, 1:m)
    copyto!(Hcopym, Hm)
    expH = LinearAlgebra.exp!(Hcopym)
    # expH = exp(Hm)

    Ξ² = norm(b)
    expHe = expH[:, 1] # the view doesn't work when using copyto! with CUDA
    cache = similar(x, m)
    copyto!(cache, expHe) # just in case expHe is different from cache (e.g., with CUDA)
    mul!(x, Vm, cache)
    lmul!(Ξ², x)

    return x
end

function expv!(
    x::AbstractVector{T1},
    A,
    t::T2,
    b::AbstractVector{T1};
    m::Int = min(30, cld(2 * length(b), 3)),
) where {T1<:BlasFloat,T2<:Union{BlasFloat,BlasInt}}
    AS = arnoldi(A, b, m)
    return expv!(x, AS, t, b)
end

function expv(
    A,
    t::T1,
    b::AbstractVector{T2};
    m::Int = min(30, cld(2 * length(b), 3)),
) where {T1<:BlasFloat,T2<:BlasFloat}
    x = similar(b)
    return expv!(x, A, t, b, m = m)
end

And then I create a file called expv.jl to make some benchmarks

# expv.jl #

using Distributed
addprocs(Threads.nthreads())


##

@everywhere begin
    using LinearAlgebra
    using LinearAlgebra: BlasFloat, BlasInt
    using SparseArrays
    BLAS.set_num_threads(1)
end

using BenchmarkTools
using JLD2

##

N = parse(Int, ARGS[1])
T = ComplexF64

H = sprand(T, N, N, 0.1/N)
H = H + H' # Hermitianize 

V = sprand(T, N, N, 0.01/N)
V = V + V' # Hermitianize

ψ0 = rand(T, N)
normalize!(ψ0)

t = 0.4

expv(-1im * H, t, ψ0)

function calc_serial(H, V, ψ0, t, g_list)
    res = similar(g_list)
    for i in eachindex(g_list)
        g = g_list[i]
        A = -1im * (H + g * V)
        res[i] = abs2( dot(ψ0, expv(A, t, ψ0)) )
    end
    return res
end

function calc_multithread(H, V, ψ0, t, g_list)
    res = similar(g_list)
    Threads.@threads for i in eachindex(g_list)
        g = g_list[i]
        A = -1im * (H + g * V)
        res[i] = abs2( dot(ψ0, expv(A, t, ψ0)) )
    end
    return res    
end

function calc_distributed(H, V, ψ0, t, g_list)
    res = pmap(g_list) do g
        A = -1im * (H + g * V)
        abs2( dot(ψ0, expv(A, t, ψ0)) )
    end
    return res
end

##

g_list = range(0.0, 0.1, 32)

# calc_serial(H, V, ψ0, t, g_list)

# calc_multithread(H, V, ψ0, t, g_list)

# calc_distributed(H, V, ψ0, t, g_list)

##

if Threads.nthreads() == 1
    bench_serial = @benchmark calc_serial($H, $V, $ψ0, $t, $g_list)
    mean_time_serial = 1e-9 * sum(bench_serial.times) / length(bench_serial.times)
    jldopen("benchmarks.jld2", "a+") do file
        file["mean_time_serial_$(Threads.nthreads())_$N"] = mean_time_serial
    end
end

bench_multithread = @benchmark calc_multithread($H, $V, $ψ0, $t, $g_list)
bench_distributed = @benchmark calc_distributed($H, $V, $ψ0, $t, $g_list)

mean_time_multithread = 1e-9 * sum(bench_multithread.times) / length(bench_multithread.times)
mean_time_distributed = 1e-9 * sum(bench_distributed.times) / length(bench_distributed.times)

jldopen("benchmarks.jld2", "a+") do file
    file["mean_time_multithread_$(Threads.nthreads())_$N"] = mean_time_multithread
    file["mean_time_distributed_$(Threads.nthreads())_$N"] = mean_time_distributed
end

And then running the following bash script

#!/bin/bash

MAX_THREADS=32

for ((threads=1; threads<=MAX_THREADS; threads++)); do
    # for ((N=1000; N<=100000; N*=10)); do
    #     julia -t $threads --project expv.jl $N
    #     echo "$threads,$N"
    # done
    N=100000
    julia -t $threads --project expv.jl $N
    echo "$threads,$N"
done

The benchmarks are very disappointing. Especially for the Distributed case, were I would expect better performances, due to the heavy computations. The Serial case is a clean horizontal line because I did only one benchmark.

The general system infos are

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 Γ— 13th Gen Intel(R) Core(TM) i9-13900KF
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
Threads: 16 default, 0 interactive, 8 GC (on 32 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 16

Just to check if multithreading is working properly, I tested a simple matrix-vector multiplication

function matvec_mul!(y, A, x)
    for i in 1:size(A, 1)
        y[i] = 0
        for j in 1:size(A, 2)
            y[i] += A[i, j] * x[j]
        end
    end
    y
end

function matvec_mul_threaded!(y, A, x)
    Threads.@threads for i in 1:size(A, 1)
        y[i] = 0
        for j in 1:size(A, 2)
            y[i] += A[i, j] * x[j]
        end
    end
    y
end

And I get 88ms (Serial) vs 16ms (Multithreaded) on a ComplexF64 5000x5000 matrix, which seems more reasonable (but still just a factor 5.5) when using 16 threads.

1 Like

Do you think this is related?

1 Like

Not sure if this applies to here, but BLAS is already multi-threaded which can complicate measurements like this. Probably worth setting the environment variable OPENBLAS_NUM_THREADS=1 to start with.

1 Like

I have just set BLAS.set_num_threads(1), it should be the same, right?

Yes, I think that is the same.

1 Like

What I find surprisingly strange is the Distributed case, which shouldn’t be correlated to the BLAS multithreading.

You could try to use ThreadPinning.jl to see whether this improves things.

Note that this naive implementation is memory bound. So I am not surprised that throwing more threads at it only helps so much after a point.

About your expv code: It could be that you hit a memory bound as well. In that case improvements are difficult. But I am not an expert on these things.

I suspect there might be some memory contention here. You have mutiple threads trying to access the same memory all ar once.

I tried ThreadPinning.jl.

I have just added pinthreads(:cores) before calling the multithreaded function, the results are pretty much the same. Here I show the benchmarks with a matrix smaller by a factor 10

Moreover, what do you mean by β€œhitting a memory bound”?

But this shouldn’t hold for the Distributed case, right? In this case each process should have its own copy of the memory.

Well, but the bandwidth between the CPU and the memory (RAM) is limited, no matter how many cores you have. Which hardware are you using?

Or are you using Distributed with multiple machines?

  • Ryzen 7950X: Max. Memory Bandwidth… 73.4 GB/s
  • Apple M3 Max: Memory Bandwidth with 96GB RAM 300.0 GB/s
1 Like

Do you have a gold standard implementation to compare to that you can share?

Without sharing your expv it’s hard to know. There’s a lot of details in how it can and should be done. ExponentialUtilities.jl did a bunch of optimizations here and found that the balance! that’s required in the Arnoldi tends to be a limiting factor in some cases, and IIUC that does not multithread well. That’s where GenericSchur.jl and stuff then starts to come in, though that does not fix the multithreading as that library needs an update.

Also, OpenBLAS gets much worse thread performance than MKL, so switching the backends helped there.

Your cpu has 8 performance cores, you should perhaps not expect much more speedup with such a highly optimized library as BLAS. Also you have a maximum memory bandwidth of 90 GB/s. With many BLAS operations you might exceed it. You can check such metrics with packages like LIKWID.jl or LinuxPerf.jl

https://www.intel.com/content/www/us/en/products/sku/230497/intel-core-i913900kf-processor-36m-cache-up-to-5-80-ghz/specifications.html

1 Like

The expv code is in the beginning of the thread. Actually, the code I have is much more complex, stil involving Arnoldi but also using LinearSolve.jl and diagonalization. I have chosen the expv code just to have a relative clean code to check with a heavy computation.

Ok, memory bandwidth seems a reasonable bottleneck. Studying the expv function, I think that is comes from the mul!(y, A, x) functions, since the other β€œexpensive” function should be exp!, which is applied to a small 30x30 matrix. Do you think this is correct? Is there a way to avoid such memory efforts?

BTW, 90 GB/s is very large in my opinion. I mean, in the last plot I showed I considered a system of size 10000x10000 of type ComplexF64. This means that a single operation mul!(y, A, x) should move at least 10000x16 bytes = 156 KB, which is far below 90 GB/s, even when parallelizing with 16 cores.

I was trying to use LIKWID.jl to see the memory bandwidth usage of certain Julia code. I started using a single core mul!(y, A, x) case, where A is a dense matrix and x and y are vectors. If the dimension is N = 10000, the benchmarks give


julia> @benchmark mul!($y, $A, $x)
BenchmarkTools.Trial: 299 samples with 1 evaluation.
 Range (min … max):  16.439 ms …  17.241 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     16.799 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   16.710 ms Β± 241.237 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–†β–†β–‡β–ˆ                                                         
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–†β–ƒβ–‚β–β–β–β–β–β–β–‚β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–‚β–ƒβ–ƒβ–ˆβ–…β–„β–‡β–†β–ƒβ–ƒβ–ƒβ–…β–…β–ˆβ–„β–„β–„β–ƒβ–ƒβ–„β–„β–ˆβ–‡β–„β–„β–ƒβ–β–ƒ β–ƒ
  16.4 ms         Histogram: frequency by time         17.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

While LIKWID gives

julia> metrics, events = @perfmon "FLOPS_SP" mul!(y, A, x);
ERROR: The selected register PMC0 is in use.
Please run likwid with force option (-f, --force) to overwrite settings

Group: FLOPS_SP
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€
β”‚                     Event β”‚  Thread 1 β”‚ Thread 2 β”‚ Thread 3 β”‚ Thread 4 β”‚ Thread 5 β”‚ Thread 6 β”‚ Thread 7 β”‚ Thread 8 β”‚ Thread 9 β”‚ T β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€
β”‚          ACTUAL_CPU_CLOCK β”‚ 1.02241e8 β”‚ 133354.0 β”‚ 323192.0 β”‚ 132258.0 β”‚ 131953.0 β”‚ 133884.0 β”‚ 137479.0 β”‚ 117064.0 β”‚ 128708.0 β”‚   β‹―
β”‚             MAX_CPU_CLOCK β”‚ 8.23769e7 β”‚ 130235.0 β”‚ 313180.0 β”‚ 128100.0 β”‚ 127610.0 β”‚ 130270.0 β”‚ 133035.0 β”‚ 126910.0 β”‚ 129395.0 β”‚   β‹―
β”‚      RETIRED_INSTRUCTIONS β”‚       NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚   β‹―
β”‚       CPU_CLOCKS_UNHALTED β”‚ 7.19432e7 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚ RETIRED_SSE_AVX_FLOPS_ALL β”‚    2.25e8 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚                     MERGE β”‚       0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
└───────────────────────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴────
                                                                                                                   23 columns omitted
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€
β”‚               Metric β”‚  Thread 1 β”‚   Thread 2 β”‚   Thread 3 β”‚   Thread 4 β”‚   Thread 5 β”‚   Thread 6 β”‚   Thread 7 β”‚   Thread 8 β”‚   T β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚  Runtime (RDTSC) [s] β”‚ 0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0.0170919 β”‚  0. β‹―
β”‚ Runtime unhalted [s] β”‚ 0.0292671 β”‚ 3.81736e-5 β”‚ 9.25162e-5 β”‚ 3.78599e-5 β”‚ 3.77725e-5 β”‚ 3.83253e-5 β”‚ 3.93544e-5 β”‚ 3.35105e-5 β”‚ 3.6 β‹―
β”‚          Clock [MHz] β”‚   4335.71 β”‚    3577.02 β”‚    3605.04 β”‚    3606.75 β”‚    3612.25 β”‚    3590.27 β”‚    3610.05 β”‚    3222.33 β”‚     β‹―
β”‚                  CPI β”‚       Inf β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚     β‹―
β”‚         SP [MFLOP/s] β”‚   13164.1 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚     β‹―
└──────────────────────┴───────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴──────
                                                                                                                   24 columns omitted

I don’t understand why I have 2.25 * 10^8 FLOPS, since the correct one should be 2 * 10^8. Aside from that, the calculated MFLOPS/s seems reasonable, since I divided the FLOPS by the 16 ms of the benchmark.

Now I do the memory case

julia> metrics, events = @perfmon "MEM_SP" mul!(y, A, x);
ERROR: The selected register PMC0 is in use.
Please run likwid with force option (-f, --force) to overwrite settings

Group: MEM_SP
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€
β”‚                     Event β”‚  Thread 1 β”‚ Thread 2 β”‚ Thread 3 β”‚ Thread 4 β”‚ Thread 5 β”‚ Thread 6 β”‚ Thread 7 β”‚ Thread 8 β”‚ Thread 9 β”‚ T β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€
β”‚          ACTUAL_CPU_CLOCK β”‚  1.0262e8 β”‚ 134993.0 β”‚ 140043.0 β”‚ 136181.0 β”‚ 136959.0 β”‚ 139807.0 β”‚ 135305.0 β”‚ 121054.0 β”‚ 132740.0 β”‚   β‹―
β”‚             MAX_CPU_CLOCK β”‚ 8.27059e7 β”‚ 130760.0 β”‚ 135450.0 β”‚ 131635.0 β”‚ 132475.0 β”‚ 135905.0 β”‚ 132020.0 β”‚ 129045.0 β”‚ 133315.0 β”‚   β‹―
β”‚      RETIRED_INSTRUCTIONS β”‚       NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚      NaN β”‚   β‹―
β”‚       CPU_CLOCKS_UNHALTED β”‚ 7.19892e7 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚ RETIRED_SSE_AVX_FLOPS_ALL β”‚    2.25e8 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚                     MERGE β”‚       0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚            DRAM_CHANNEL_0 β”‚  888725.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
β”‚            DRAM_CHANNEL_1 β”‚  888544.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚      0.0 β”‚   β‹―
└───────────────────────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┴────
                                                                                                                   23 columns omitted
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€
β”‚                            Metric β”‚  Thread 1 β”‚   Thread 2 β”‚   Thread 3 β”‚   Thread 4 β”‚   Thread 5 β”‚   Thread 6 β”‚   Thread 7 β”‚   T β‹―
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€
β”‚               Runtime (RDTSC) [s] β”‚  0.016978 β”‚   0.016978 β”‚   0.016978 β”‚   0.016978 β”‚   0.016978 β”‚   0.016978 β”‚   0.016978 β”‚   0 β‹―
β”‚              Runtime unhalted [s] β”‚ 0.0293757 β”‚ 3.86428e-5 β”‚ 4.00884e-5 β”‚ 3.89828e-5 β”‚ 3.92056e-5 β”‚ 4.00208e-5 β”‚ 3.87321e-5 β”‚ 3.4 β‹―
β”‚                       Clock [MHz] β”‚   4334.49 β”‚    3606.45 β”‚    3611.81 β”‚     3614.0 β”‚     3611.6 β”‚    3593.66 β”‚    3580.28 β”‚     β‹―
β”‚                               CPI β”‚       Inf β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚     β‹―
β”‚                      SP [MFLOP/s] β”‚   13252.4 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚     β‹―
β”‚       Memory bandwidth [MBytes/s] β”‚   26798.2 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚     β‹―
β”‚       Memory data volume [GBytes] β”‚  0.454981 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚        0.0 β”‚     β‹―
β”‚ Operational intensity [FLOP/Byte] β”‚  0.494526 β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚        NaN β”‚     β‹―
└───────────────────────────────────┴───────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴──────
                                                                                                                   25 columns omitted

I see 26 GB/s of memory bandwidth. So I guess that it is obtained by doing

sizeof(A) / 1024^3 / 0.0167
22.307127535700086

Assuming that every is correct, this means that I could parallelize this operations just 3 times, before reaching the memory bandwidth limits of my hardware (assuming 75 GB/s). Is that right?

More or less, yes. It’s not necessarily that simple because more threads may cause a different pattern of memory access, but more or less, yes.

Some (hopefully helpful) comments.

Great. I’m slightly confused, however. Is your CPU really this one? According to your versioninfo() output above it is. That would be strange because LIKWID doesn’t support Raptor Lake yet. Did you run the experiment on a different system/CPU?

Your estimation seems reasonable. Every element of A must must be read from memory at some point (because sizeof(Float32) * 10000^2 / 1000^2 = 400 MB, A shoudn’t fit into cache). The values of y and x also need to be read (and written). But those vectors are much smaller and likely reside in cache. (You made a minor mistake though: MB = Mega = 1000^3 not Mebi = 1024^3.)

On a more general note, whether one is bounded by memory access or compute can depend on the specific implementation. For example, a naive implementation of a matrix-matrix product is memory bounded, a highly optimized implementation (say in BLAS) OTOH is bound by compute. Hence, estimating the data transfer isn’t always simple and requires care (and knowledge about the implementation).

Depending on the system you are working on, it can be non-trivial to achieve the maximum memory bandwidth. For example, there might be multiple memory channels that can only be accessed from certain cores. If your 3 parallel computations run on the wrong cores, you are not bounded by the maximal memory bandwidth (of the entire system) but of the same of a single memory channel. In such a case, thread pinning plays a crucial role.

2 Likes

Yes, LIKWID.jl doesn’t support my current CPU, so I switched to another machine to do the benchmarks.

Could you explain better this? What do you mean by β€œmemory bound” and β€œcompute bounds?