Yes, it does this.
Taking a matrix multiply example:
using LoopVectorization, BenchmarkTools, Cthulhu
# Cthulhu is for looking at the generated code
function AmulB!(C, A, B)
    # `indices` is like a broadcasted `axes` that checks they're equal
    @turbo for n = indices((C,B), 2), m = indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k = indices((A,B), (2,1))
            Cmn += A[m,k]*B[k,n]
        end
        C[m,n] = Cmn
    end
end
M,K,N = 128,128,128; T=Float64;A=rand(T,M,K);B=rand(T,K,N);C0=A*B;C1=similar(C0);
t = @belapsed AmulB!($C1,$A,$B); 2e-9M*K*N/t # GFLOPS
C0 ≈ C1
@descend AmulB!(C1,A,B)
On an Intel 9940X (with AVX512), I get
julia> t = @belapsed AmulB!($C1,$A,$B); 2e-9M*K*N/t # GFLOPS
106.23870314083081
julia> C0 ≈ C1
true
This is a high percentage of peak flops, but only because the matrices are small: @turbo does register-level tiling only, with no cache-level blocking, so performance falls off once the arrays outgrow the caches.
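For context on what "peak" means here: a Skylake-X core has two 512-bit FMA units, i.e. 32 Float64 flops per cycle. A rough back-of-the-envelope check (the AVX-512 clock below is an assumed value, not a measurement; it throttles and varies by chip and load):

fma_units       = 2                        # 512-bit FMA ports per Skylake-X core
lanes           = 8                        # Float64 lanes per 512-bit register
flops_per_fma   = 2                        # one multiply + one add
flops_per_cycle = fma_units * lanes * flops_per_fma    # 32

avx512_ghz  = 3.8                          # assumed sustained AVX-512 clock, GHz
peak_gflops = avx512_ghz * flops_per_cycle # ≈ 121.6 single-core GFLOPS
106.2 / peak_gflops                        # ≈ 0.87 of that estimate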
Using Cthulhu to descend into _turbo_!, turn off debuginfo, and show native code, you'll see this inner loop:
L432:
vbroadcastsd zmm28, qword ptr [r8]
vmovupd zmm29, zmmword ptr [rdi]
vmovupd zmm30, zmmword ptr [rdi + 64]
vmovupd zmm27, zmmword ptr [rdi + 128]
prefetcht0 byte ptr [rdi + rbx]
prefetcht0 byte ptr [rdi + rbx + 64]
prefetcht0 byte ptr [rdi + rbx + 128]
vfmadd231pd zmm18, zmm29, zmm28 # zmm18 = (zmm29 * zmm28) + zmm18
vfmadd231pd zmm9, zmm30, zmm28 # zmm9 = (zmm30 * zmm28) + zmm9
vbroadcastsd zmm31, qword ptr [r8 + rax]
vfmadd231pd zmm0, zmm27, zmm28 # zmm0 = (zmm27 * zmm28) + zmm0
vfmadd231pd zmm19, zmm29, zmm31 # zmm19 = (zmm29 * zmm31) + zmm19
vfmadd231pd zmm10, zmm30, zmm31 # zmm10 = (zmm30 * zmm31) + zmm10
vfmadd231pd zmm1, zmm27, zmm31 # zmm1 = (zmm27 * zmm31) + zmm1
vbroadcastsd zmm28, qword ptr [r8 + 2*rax]
vfmadd231pd zmm20, zmm29, zmm28 # zmm20 = (zmm29 * zmm28) + zmm20
vfmadd231pd zmm11, zmm30, zmm28 # zmm11 = (zmm30 * zmm28) + zmm11
vfmadd231pd zmm2, zmm27, zmm28 # zmm2 = (zmm27 * zmm28) + zmm2
vbroadcastsd zmm28, qword ptr [r8 + rcx]
vfmadd231pd zmm21, zmm29, zmm28 # zmm21 = (zmm29 * zmm28) + zmm21
vfmadd231pd zmm12, zmm30, zmm28 # zmm12 = (zmm30 * zmm28) + zmm12
vfmadd231pd zmm3, zmm27, zmm28 # zmm3 = (zmm27 * zmm28) + zmm3
vbroadcastsd zmm28, qword ptr [r8 + 4*rax]
vfmadd231pd zmm22, zmm29, zmm28 # zmm22 = (zmm29 * zmm28) + zmm22
vfmadd231pd zmm13, zmm30, zmm28 # zmm13 = (zmm30 * zmm28) + zmm13
vfmadd231pd zmm4, zmm27, zmm28 # zmm4 = (zmm27 * zmm28) + zmm4
vbroadcastsd zmm28, qword ptr [r8 + r15]
vfmadd231pd zmm23, zmm29, zmm28 # zmm23 = (zmm29 * zmm28) + zmm23
vfmadd231pd zmm14, zmm30, zmm28 # zmm14 = (zmm30 * zmm28) + zmm14
vfmadd231pd zmm5, zmm27, zmm28 # zmm5 = (zmm27 * zmm28) + zmm5
vbroadcastsd zmm28, qword ptr [r8 + 2*rcx]
vfmadd231pd zmm24, zmm29, zmm28 # zmm24 = (zmm29 * zmm28) + zmm24
vfmadd231pd zmm15, zmm30, zmm28 # zmm15 = (zmm30 * zmm28) + zmm15
vbroadcastsd zmm31, qword ptr [r8 + r12]
vfmadd231pd zmm6, zmm27, zmm28 # zmm6 = (zmm27 * zmm28) + zmm6
vfmadd231pd zmm25, zmm29, zmm31 # zmm25 = (zmm29 * zmm31) + zmm25
vfmadd231pd zmm16, zmm30, zmm31 # zmm16 = (zmm30 * zmm31) + zmm16
vfmadd231pd zmm7, zmm27, zmm31 # zmm7 = (zmm27 * zmm31) + zmm7
vbroadcastsd zmm28, qword ptr [r8 + 8*rax]
vfmadd231pd zmm26, zmm28, zmm29 # zmm26 = (zmm28 * zmm29) + zmm26
vfmadd231pd zmm17, zmm28, zmm30 # zmm17 = (zmm28 * zmm30) + zmm17
vfmadd231pd zmm8, zmm27, zmm28 # zmm8 = (zmm27 * zmm28) + zmm8
add rdi, r11
add r8, 8
cmp rdi, r10
jbe L432
Note, this loop features:
a. 3 contiguous loads (vmovupd), each loading 8 contiguous elements. It effectively loads A[m .+ (1:24), k].
b. 9 broadcast loads (vbroadcastsd). This loads B[k, n .+ (1:9)]; each load fills a separate vector register.
c. 27 fma instructions. These perform the outer-product update C[m .+ (1:24), n .+ (1:9)] += A[m .+ (1:24), k] * B[k, n .+ (1:9)] (see the sketch after this list). Note we have 0 loads or stores to C within the innermost loop; those have been hoisted out.
d. A few miscellaneous instructions: prefetches to hide latency (the hardware prefetchers tend not to handle strided access patterns like matrix A's well), incrementing our loop counters/array pointers, a cmp to check whether we're done with the loop, and the conditional jump if not.
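In array notation, each trip through that loop is one rank-1 update of a 24x9 block of C held entirely in registers (a sketch of what the instructions compute, not code LV emits; Ctile is a hypothetical name standing in for the 27 zmm accumulators):

# One k iteration of the register tile; nothing here touches C's memory.
Ctile .+= A[m .+ (1:24), k] .* B[k, n .+ (1:9)]'   # 3 vector loads, 9 broadcasts, 27 fmas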
So it's doing a non-square tile, 24x9. This tile is non-square either way you look at it: by total number of elements (24x9) or by number of registers (3x9).
LV solves an optimization problem to come up with this solution. It's constrained by the number of architectural registers (32 vector registers with AVX512; the hardware generally has far more physical vector registers, e.g. 168 on the Skylake-X system I tested on, or 384 on the upcoming Zen 5, but only 32 have architectural names we can actually use in the assembly). The cost function tries to minimize the cost of the loads (and stores, but there aren't any here); contiguous loads are treated as more expensive than broadcast loads, which is why we end up with only 3 contiguous vs 9 broadcast. This consumes 31 = 3*9 + 3 + 1 vector registers. Because our unrollings are nested, we're forced to carry the inner-most unrolling (m) across the outer-most, which is why we pay the full cost of 3 registers for it. The register holding the outer-unrolled broadcast gets to be reused; you can see that LLVM reused zmm28 and zmm31 for this (it could've reused only one of the two, but there's no reason not to use all 32 registers here).
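To give a flavor of that tradeoff, here is a toy version of it (the weights and the search below are made up for illustration; LV's actual cost model is considerably more detailed): choose unroll factors (mu, nu) that minimize load cost per fma, subject to the tile fitting in the 32 architectural registers.

# Registers needed for an (mu, nu) tile: mu*nu accumulators for C,
# mu contiguous vectors of A, and 1 register for the current broadcast of B.
regs(mu, nu) = mu*nu + mu + 1

# Load cost per fma; the weights are illustrative, but contiguous loads are
# charged more than broadcast loads, as described above.
cost(mu, nu; w_contig = 2.0, w_bcast = 1.0) = (w_contig*mu + w_bcast*nu) / (mu*nu)

tiles = [(mu, nu) for mu in 1:8, nu in 1:16 if regs(mu, nu) <= 32]
argmin(t -> cost(t...), tiles)   # -> (3, 9) under these assumptions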
I have been working on cache modeling lately, and think I now have a reasonably good handle on it. With that model, a naive triple loop should in the future be able to do much better than Octavian.jl, which was guided by an extremely poor cache model and understanding of how caches work.
TLDR: LV does not unroll or vectorize the inner loop if it can help it. For a matrix multiply, it can help it (note that the inner k loop is neither unrolled nor vectorized). It uses a non-square tile of registers as fast memory.
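Schematically, the kernel it picks here has this shape (a hand-written sketch assuming 24 divides M and 9 divides N, not LV's actual generated code, which also handles the remainders):

function amulb_sketch!(C, A, B)            # hypothetical name, illustration only
    M, K, N = size(A, 1), size(A, 2), size(B, 2)
    for n in 1:9:N, m in 1:24:M            # n unrolled 9x; m vectorized and unrolled 3x
        acc = zeros(eltype(C), 24, 9)      # 27 register accumulators in the real code
        for k in 1:K                       # k: plain loop, neither unrolled nor vectorized
            acc .+= @view(A[m:m+23, k]) .* transpose(@view(B[k, n:n+8]))
        end
        C[m:m+23, n:n+8] .= acc            # stores to C hoisted out of the k loop
    end
    return C
end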