LoopVec, Tullio losing to Matrix multiplication

Hi all,

I provide a simple MWE (minimal working example) below:

#MWE
using BenchmarkTools, Random, Distributions, Base.Threads, Tullio, LoopVectorization


Random.seed!(123)
s_ijt = randn(3000,100)*0.0005 .+ 0.1
gamma_empty = zeros(size(s_ijt,1), size(s_ijt,1))
beta = -5


function gamma(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt,2))
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt,2)
            g += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end


function gamma_tul(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt,2))
    # note: += tells Tullio to add the reduction into the existing contents of gamma_empty
    return @tullio gamma_empty[j,k] += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
end
function gamma_mat(s_ijt, gamma_empty, beta)
    #gamma = Matrix{Float64}(undef, size(s_ijt,1), size(s_ijt,1));
    invlen = inv(size(s_ijt,2))
    gamma_empty .= s_ijt * transpose(s_ijt .* beta .* invlen) # .* ownership_mat
    return gamma_empty
end

@btime gamma(s_ijt, gamma_empty, beta)
@btime gamma_mat(s_ijt, gamma_empty, beta)
@btime gamma_tul(s_ijt, gamma_empty, beta)

[screenshot: benchmark timings showing gamma_mat fastest]
As you can see, gamma_mat does best on my Linux cluster when using only one core, and the gap widens with more cores. However, this function will later be used inside a distributed computation where each process has access to only one core, so multithreading (e.g. via @tturbo) is not an option. Why am I seeing this?

Hi there,
I have an idea where these speed differences might stem from: did you check the value of BLAS.get_num_threads() (you need using LinearAlgebra for that)? Even if you start Julia single-threaded, BLAS by default uses one thread per core (eight in my case), which of course is not a fair comparison.
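To check and pin this yourself, something like the following works (a minimal sketch, assuming the default OpenBLAS backend):

using LinearAlgebra

@show BLAS.get_num_threads()  # often defaults to the number of (physical) cores
BLAS.set_num_threads(1)       # pin BLAS to one thread for a fair single-core comparison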

When I set BLAS.set_num_threads(1) then Tullio.jl is slightly faster than your simple implementation and LoopVectorization.jl beats that by about 30%.

However, I think you could optimize the code a bit more and avoid computing the product of beta and invlen over and over inside the loop. Your gamma_mat could also reuse memory (which it doesn’t right now) by using the five-argument mul!. Here is a slightly modified benchmark script and some timings:

#MWE
using BenchmarkTools, Random, Tullio, LoopVectorization, LinearAlgebra

BLAS.set_num_threads(1)

Random.seed!(123)
size_s = 3000
s_ijt = randn(size_s,100)*0.0005 .+ 0.1
gamma_empty = zeros(size_s,size_s)
beta = -5


function gamma(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt,2))
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt,2)
            g += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end


function gamma_tul(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt,2))
    return @tullio gamma_empty[j,k] += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
end
function gamma_mat(s_ijt, gamma_empty, beta)
    #gamma = Matrix{Float64}(undef, size(s_ijt,1), size(s_ijt,1));
    invlen = inv(size(s_ijt,2))
    gamma_empty .= s_ijt * transpose(s_ijt .* beta .* invlen) # .* ownership_mat
    return gamma_empty
end
function gamma_mat2(s_ijt, gamma_empty, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta/size(s_ijt,2), false)
    return gamma_empty
end
@info("Correctness check",
gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta),
gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta),
gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta))

@info "Benchmarks"
println("LV: ")
@btime gamma($s_ijt, $gamma_empty, $beta)
println("LA: ")
@btime gamma_mat($s_ijt, $gamma_empty, $beta)
println("Tullio: ")
@btime gamma_tul($s_ijt, $gamma_empty, $beta)
println("LA (v2): ")
@btime gamma_mat2($s_ijt, $gamma_empty, $beta);

Output:

julia> include("test.jl")
┌ Info: Correctness check
│   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta) = true
│   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta) = true
└   gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta) = true
[ Info: Benchmarks
LV: 
  43.153 ms (0 allocations: 0 bytes)
LA: 
  67.933 ms (4 allocations: 70.95 MiB)
Tullio: 
  59.067 ms (0 allocations: 0 bytes)
LA (v2): 
  32.415 ms (0 allocations: 0 bytes)

If I instead set BLAS.set_num_threads(8), which reflects my default, then I get:

julia> include("test.jl")
#...
LV: 
  43.105 ms (0 allocations: 0 bytes)
LA: 
  28.460 ms (4 allocations: 70.95 MiB)
Tullio: 
  62.924 ms (0 allocations: 0 bytes)
LA (v2): 
  18.774 ms (0 allocations: 0 bytes)

Also, a small note on style: in Julia we usually name functions that mutate one or more of their arguments with a trailing ! and put the mutated arguments first. So it would be better style to write:

function gamma_mat2!(gamma_empty, s_ijt, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta/size(s_ijt,2), false)
    return gamma_empty # note: this return is unnecessary since mul! returns gamma_empty anyway, but being explicit is almost never a bad thing
end
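For reference, the five-argument method computes C = A*B*α + C*β in place (see the LinearAlgebra docs), so passing β = false above discards the old contents of gamma_empty instead of adding to them. A minimal sketch (the matrix sizes here are arbitrary):

using LinearAlgebra

A = rand(3, 4)
C = ones(3, 3)                        # pre-existing contents
mul!(C, A, transpose(A), 2.0, false)  # in place: C = 2*A*A', old contents of C discarded
C ≈ 2 .* (A * transpose(A))           # true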

Hi,

  1. My BLAS.get_num_threads() is already equal to 1, so the thread count cannot explain the difference. For me the comparison is fair, and LinearAlgebra still beats the others (see below).
    [screenshot: benchmark timings showing LinearAlgebra ahead]

  2. Good point, it makes more sense to multiply beta by invlen just once.

  3. I have incorporated all your comments into my MWE and provide new results:

#MWE
using BenchmarkTools, Random, Distributions, Base.Threads, Tullio, LoopVectorization, LinearAlgebra


Random.seed!(123)
s_ijt = randn(3000,100)*0.0005 .+ 0.1
gamma_empty = zeros(size(s_ijt,1), size(s_ijt,1))
beta = -5


function gamma!(s_ijt, gamma_empty, beta)
    mult = beta/size(s_ijt,2)
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt,2)
            g += s_ijt[j, a] * (s_ijt[k, a] * mult)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end
@show BLAS.get_num_threads()  # confirm BLAS is running single-threaded

function gamma_tul!(s_ijt, gamma_empty, beta)
    mult = beta/size(s_ijt,2)
    return @tullio gamma_empty[j,k] += s_ijt[j, a] * (s_ijt[k, a] * mult)
end
function gamma_mat!(s_ijt, gamma_empty, beta)
    #gamma = Matrix{Float64}(undef, size(s_ijt,1), size(s_ijt,1));
    mult = beta/size(s_ijt,2)
    gamma_empty .= s_ijt * transpose(s_ijt .* mult) # .* ownership_mat
    return gamma_empty
end
function gamma_mat2!(s_ijt, gamma_empty, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta/size(s_ijt,2), false)
    return gamma_empty
end
println("gamma","gamma_mat","gamma_mat2","gamma_tul")
@btime gamma(s_ijt, gamma_empty, beta)
@btime gamma_mat(s_ijt, gamma_empty, beta)
@btime gamma_mat2(s_ijt, gamma_empty, beta)
@btime gamma_tul(s_ijt, gamma_empty, beta)

with the results:
gamma: 115.997 ms (0 allocations: 0 bytes)
gamma_mat: 84.447 ms (4 allocations: 70.95 MiB)
gamma_mat2: 45.245 ms (0 allocations: 0 bytes)
gamma_tul: 87.458 ms (0 allocations: 0 bytes)

Matrix multiplication is probably the most optimized routine in BLAS, which is itself one of the most optimized libraries in existence. It has been tuned continuously over decades, sometimes going as far as hand-written assembly and architecture-specific implementations to maximize performance.

(I will note that there are multiple BLAS libraries in existence, but all of the common ones are “very good.” Intel’s MKL is sometimes regarded as the most performant, at least on x86 processors. But for this post I will ignore the fact that there are several competing libraries and speak as if they are one.)

This is to say: if anything could decisively beat BLAS at its own game, the BLAS developers would match its performance soon after catching wind of it. A common exception is small matrices (<10x10, for example), where specialized implementations can often win (see StaticArrays.jl).
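As a quick illustration of that small-matrix case (a minimal sketch; the 3x3 size is arbitrary):

using StaticArrays

a = @SMatrix rand(3, 3)  # the size is encoded in the type, so the product below
b = @SMatrix rand(3, 3)  # is fully unrolled at compile time with no heap allocation
c = a * b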

LoopVectorization and Tullio fill a different niche. While BLAS addresses a relatively small number of ubiquitous operations, these packages are designed to generate performant kernels for functions that aren’t necessarily on that short list. They usually generate “best practice” code (or something close to it), but don’t get the extra profiling, analysis, and individual attention that goes into the few operations covered by BLAS. (EDIT: thanks to the poster below! I mistakenly thought that LoopVectorization used blocking. It does not, so it actually leaves a bit on the table when blocking is profitable. Tullio claims to do some basic cache-agnostic tiling, so it will do better, but still fall a little short.) They are quite good, especially for machine-generated code, but don’t claim to be 100% optimal.

In summary: I would consider it more of an issue if LV/Tullio did beat BLAS. For matrix multiplication, use LinearAlgebra.mul! for maximum performance; use LV/Tullio for operations that aren’t covered by BLAS. In the special case of small matrices of fixed size (e.g., 3D modeling and simulation), use StaticArrays.

If you plan to multithread at a higher level in the code, then (as suggested above) you should definitely use BLAS.set_num_threads(1) for the same reasons you don’t plan to use @tturbo.
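For the distributed setup described in the original post, that would look something like this (a sketch only; the worker count is illustrative):

using Distributed
addprocs(4)  # illustrative; in practice, one worker per available core

@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)  # each worker gets exactly one BLAS thread,
                                     # since each worker already owns its single core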


Hi, thanks, this is very helpful. Now I understand why, and I know I should rely on BLAS for speed here rather than LoopVec and the others.

Thank you!

I would also add that optimizing matrix multiplication involves “nonlocal” changes to the code — it’s not simply a matter of taking the naive 3-nested-loop algorithm and vectorizing/multithreading/fine-tuning the loops. The whole structure of the code is changed, e.g. to “block” the algorithm to improve cache performance (with multiple nested levels of “blocking” for multiple levels of the cache, even at the lowest level treating the registers as a kind of ideal cache). This is not the sort of transformation that something like LoopVectorization.jl does.
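To make that concrete, here is a toy single level of blocking (a sketch only; the block size bs is an illustrative tuning knob, and real BLAS nests several such levels plus register- and SIMD-level kernels):

# One level of cache blocking for C = A*B.
function blocked_mul!(C, A, B; bs = 64)
    fill!(C, zero(eltype(C)))
    n, p = size(A)
    m = size(B, 2)
    @inbounds for jj in 1:bs:m, kk in 1:bs:p, ii in 1:bs:n
        # work on a bs-by-bs tile that (ideally) stays resident in cache
        for j in jj:min(jj + bs - 1, m), k in kk:min(kk + bs - 1, p)
            bkj = B[k, j]
            for i in ii:min(ii + bs - 1, n)  # innermost over i: column-major friendly
                C[i, j] += A[i, k] * bkj
            end
        end
    end
    return C
end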

See also this thread: Julia matrix-multiplication performance
