# LoopVectorization, Tullio losing to matrix multiplication

Hi all,

I provide a simple MWE (minimal working example) below:

```julia
# MWE
using BenchmarkTools, Random, Distributions, Base.Threads, Tullio, LoopVectorization

Random.seed!(123)
s_ijt = randn(3000, 100) * 0.0005 .+ 0.1
gamma_empty = zeros(size(s_ijt, 1), size(s_ijt, 1))
beta = -5

function gamma(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt, 2)
            g += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end

function gamma_tul(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    return @tullio gamma_empty[j, k] += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
end

function gamma_mat(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    gamma_empty .= s_ijt * transpose(s_ijt .* beta .* invlen)  # optionally .* ownership_mat
    return gamma_empty
end

@btime gamma(s_ijt, gamma_empty, beta)
@btime gamma_mat(s_ijt, gamma_empty, beta)
@btime gamma_tul(s_ijt, gamma_empty, beta)
```

As you can see, `gamma_mat` performs best on my Linux cluster using only one core; with more cores the gap is even more pronounced. However, this function will later be used inside a distributed computation where each process has access to only one core, so multithreading across more CPUs (e.g. with `@tturbo`) is not an option. Why am I seeing this?

Hi there,
I have an idea where these speed differences might stem from: did you check the value of `BLAS.get_num_threads()` (you need `using LinearAlgebra` for that)? If you start Julia single-threaded, then by default BLAS still uses one thread per core (8 threads in my case), which of course is not a fair comparison.

When I set `BLAS.set_num_threads(1)`, Tullio.jl is slightly faster than your plain matrix-multiplication version (`gamma_mat`), and LoopVectorization.jl beats that by about 30%.
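For reference, checking and pinning the BLAS thread count looks like this (the default value printed first depends on your machine):

```julia
using LinearAlgebra

println(BLAS.get_num_threads())  # inspect the current BLAS thread count (machine-dependent)
BLAS.set_num_threads(1)          # pin BLAS to a single thread for a fair single-core comparison
println(BLAS.get_num_threads())  # now 1
```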

However, I think you could optimize the code a bit further by not computing the product of `invlen` and `beta` over and over. And your `gamma_mat` could also reuse the preallocated memory (which it doesn't right now) by using `mul!`. Here is a slightly modified benchmark script and some timings:

```julia
# MWE
using BenchmarkTools, Random, Tullio, LoopVectorization, LinearAlgebra

Random.seed!(123)
size_s = 3000
s_ijt = randn(size_s, 100) * 0.0005 .+ 0.1
gamma_empty = zeros(size_s, size_s)
beta = -5

function gamma(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt, 2)
            g += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end

function gamma_tul(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    return @tullio gamma_empty[j, k] += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
end

function gamma_mat(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    gamma_empty .= s_ijt * transpose(s_ijt .* beta .* invlen)
    return gamma_empty
end

function gamma_mat2(s_ijt, gamma_empty, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta / size(s_ijt, 2), false)
    return gamma_empty
end

@info("Correctness check",
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta),
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta),
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta))

@info "Benchmarks"
println("LV: ")
@btime gamma($s_ijt, $gamma_empty, $beta)
println("LA: ")
@btime gamma_mat($s_ijt, $gamma_empty, $beta)
println("Tullio: ")
@btime gamma_tul($s_ijt, $gamma_empty, $beta)
println("LA (v2): ")
@btime gamma_mat2($s_ijt, $gamma_empty, $beta);
```

Output:

```
julia> include("test.jl")
┌ Info: Correctness check
│   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta) = true
│   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta) = true
└   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta) = true
[ Info: Benchmarks
LV:
  43.153 ms (0 allocations: 0 bytes)
LA:
  67.933 ms (4 allocations: 70.95 MiB)
Tullio:
  59.067 ms (0 allocations: 0 bytes)
LA (v2):
  32.415 ms (0 allocations: 0 bytes)
```

If I instead use `BLAS.set_num_threads(8)`, which reflects my default, then I get:

```
julia> include("test.jl")
#...
LV:
  43.105 ms (0 allocations: 0 bytes)
LA:
  28.460 ms (4 allocations: 70.95 MiB)
Tullio:
  62.924 ms (0 allocations: 0 bytes)
LA (v2):
  18.774 ms (0 allocations: 0 bytes)
```

Also, a small note on style: in Julia we usually name functions that mutate one or more of their arguments with a trailing `!` and put the mutated arguments first. So it would be better style to write:

```julia
function gamma_mat2!(gamma_empty, s_ijt, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta / size(s_ijt, 2), false)
    return gamma_empty  # unnecessary, since mul! returns it anyway, but being explicit is almost never a bad thing
end
```

Hi,

1. My `BLAS.get_num_threads()` is equal to 1, so the thread count cannot explain the difference. For me, the comparison is fair, and LinearAlgebra still beats the others (see below).

2. Good point, it makes more sense to multiply `beta` by `invlen` once.

```julia
# MWE
using BenchmarkTools, Random, Distributions, Base.Threads, Tullio, LoopVectorization, LinearAlgebra

Random.seed!(123)
s_ijt = randn(3000, 100) * 0.0005 .+ 0.1
gamma_empty = zeros(size(s_ijt, 1), size(s_ijt, 1))
beta = -5

function gamma!(s_ijt, gamma_empty, beta)
    invlen = beta / size(s_ijt, 2)
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt, 2)
            g += s_ijt[j, a] * (s_ijt[k, a] * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end

function gamma_tul!(s_ijt, gamma_empty, beta)
    invlen = beta / size(s_ijt, 2)
    return @tullio gamma_empty[j, k] += s_ijt[j, a] * (s_ijt[k, a] * invlen)
end

function gamma_mat!(s_ijt, gamma_empty, beta)
    mult = beta / size(s_ijt, 2)
    gamma_empty .= s_ijt * transpose(s_ijt .* mult)  # optionally .* ownership_mat
    return gamma_empty
end

function gamma_mat2!(s_ijt, gamma_empty, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta / size(s_ijt, 2), false)
    return gamma_empty
end

print("gamma!: ");      @btime gamma!(s_ijt, gamma_empty, beta)
print("gamma_mat!: ");  @btime gamma_mat!(s_ijt, gamma_empty, beta)
print("gamma_mat2!: "); @btime gamma_mat2!(s_ijt, gamma_empty, beta)
print("gamma_tul!: ");  @btime gamma_tul!(s_ijt, gamma_empty, beta)
```

with the results:

```
gamma!:      115.997 ms (0 allocations: 0 bytes)
gamma_mat!:   84.447 ms (4 allocations: 70.95 MiB)
gamma_mat2!:  45.245 ms (0 allocations: 0 bytes)
gamma_tul!:   87.458 ms (0 allocations: 0 bytes)
```

Matrix multiplication is probably the most optimized piece of code within BLAS, which is itself one of the most optimized libraries in existence. It has been tuned continuously over decades, sometimes going as far as hand-written assembly and architecture-specific implementations to maximize performance.

(I will note that there are multiple BLAS libraries in existence, but all of the common ones are “very good.” Intel’s MKL is sometimes regarded as the most performant, at least on x86 processors. But for this post I will ignore the fact that there are several competing libraries and speak as if they are one.)

This is to say that if anything could decisively beat BLAS at its own game, it would not be long before the BLAS developers caught wind of it and matched its performance. A common exception is small matrices (below 10×10, for example), where specialized implementations can often win (see StaticArrays.jl).
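To illustrate that small-matrix exception, here is a minimal sketch, assuming the StaticArrays.jl package is installed:

```julia
using StaticArrays

# 3×3 stack-allocated matrices: the product is fully unrolled at compile
# time, with no heap allocation and no BLAS call at all. At this size that
# typically beats dispatching to BLAS, which is tuned for large matrices.
A = @SMatrix randn(3, 3)
B = @SMatrix randn(3, 3)
C = A * B  # C is again an SMatrix{3,3,Float64}
```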

LoopVectorization and Tullio fill a different niche. While BLAS addresses a relatively small number of ubiquitous operations, these packages are designed to generate performant kernels for functions that aren't necessarily on that short list. They usually generate "best practice" code, or something close to it (EDIT: thanks to the poster below! I mistakenly thought that LoopVectorization used blocking. It does not, so it actually leaves some performance on the table when blocking is profitable. Tullio does some basic cache-agnostic tiling, so it will do better, but still falls a little short), but that code doesn't get the extra profiling, analysis, and individual attention that goes into the few operations performed by BLAS. They are quite good, especially for machine-generated code, but don't claim to be 100% optimal.

In summary: I would consider it more of an issue if LV/Tullio did beat BLAS. For matrix multiplication, use `LinearAlgebra.mul!` for maximum performance. Use LV/Tullio for other operations that aren't covered by BLAS. In the special case of small matrices of fixed size (e.g., 3D modeling and simulation), use StaticArrays.

If you plan to multithread at a higher level in the code, then (as suggested above) you should definitely use `BLAS.set_num_threads(1)` for the same reasons you don’t plan to use `@tturbo`.


Hi, thanks, this is very helpful. Now I understand why, and I know I should use BLAS here for speed instead of LoopVectorization and the others.

Thank you!

I would also add that optimizing matrix multiplication involves “nonlocal” changes to the code — it’s not simply a matter of taking the naive 3-nested-loop algorithm and vectorizing/multithreading/fine-tuning the loops. The whole structure of the code is changed, e.g. to “block” the algorithm to improve cache performance (with multiple nested levels of “blocking” for multiple levels of the cache, even at the lowest level treating the registers as a kind of ideal cache). This is not the sort of transformation that something like LoopVectorization.jl does.
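To make the "blocking" idea concrete, here is a minimal single-level blocking sketch. Real BLAS implementations use several nested levels of blocking, each tuned to a cache level (and to the registers); the block size `bs` here is just an illustrative guess, not a tuned value:

```julia
# Compute C = A*B one bs×bs tile at a time, so that each tile of A, B, and C
# stays resident in cache while it is being worked on.
function blocked_matmul!(C, A, B; bs = 64)
    n, p = size(A)
    m = size(B, 2)
    fill!(C, zero(eltype(C)))  # we accumulate into C below
    for jb in 1:bs:m, kb in 1:bs:p, ib in 1:bs:n
        for j in jb:min(jb + bs - 1, m), k in kb:min(kb + bs - 1, p)
            @inbounds for i in ib:min(ib + bs - 1, n)
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
    return C
end

A = randn(200, 80)
B = randn(80, 150)
C = blocked_matmul!(zeros(200, 150), A, B; bs = 32)
```

Even this naive one-level version already changes the whole loop structure relative to the textbook triple loop, which is exactly the kind of "nonlocal" transformation described above.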