Hi there,

I have an idea where these speed differences might stem from: did you check the value of `BLAS.get_num_threads()` (you need `using LinearAlgebra` for that)? Even if you start Julia single-threaded, BLAS by default uses one thread per core (which in my case is 8), so this is of course not a fair comparison. When I set `BLAS.set_num_threads(1)`, Tullio.jl is slightly faster than your simple implementation, and LoopVectorization.jl beats that by about 30%.
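
For reference, a quick way to check and pin the BLAS thread count (the default you see will depend on your machine):

```
using LinearAlgebra

@show BLAS.get_num_threads()   # machine-dependent default, often one thread per core
BLAS.set_num_threads(1)        # pin to a single thread for a fair comparison
@show BLAS.get_num_threads()   # now 1
```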

However, I think you could optimize the code a bit more and not compute the product of `invlen` and `beta` over and over. Your `gamma_mat` could also reuse the memory (which it doesn't right now) by using `mul!`. Here is a slightly modified benchmark script and some timings:

```
# MWE
using BenchmarkTools, Random, Tullio, LoopVectorization, LinearAlgebra

BLAS.set_num_threads(1)
Random.seed!(123)

size_s = 3000
s_ijt = randn(size_s, 100) * 0.0005 .+ 0.1
gamma_empty = zeros(size_s, size_s)
beta = -5

function gamma(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    @turbo for j in axes(gamma_empty, 1), k in axes(gamma_empty, 2)
        g = zero(eltype(gamma_empty))
        for a in axes(s_ijt, 2)
            g += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
        end
        gamma_empty[j, k] = g
    end
    return gamma_empty
end

function gamma_tul(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    return @tullio gamma_empty[j, k] += s_ijt[j, a] * (s_ijt[k, a] * beta * invlen)
end

function gamma_mat(s_ijt, gamma_empty, beta)
    invlen = inv(size(s_ijt, 2))
    gamma_empty .= (s_ijt * transpose(s_ijt .* beta .* invlen)) # .* ownership_mat
    return gamma_empty
end

function gamma_mat2(s_ijt, gamma_empty, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta / size(s_ijt, 2), false)
    return gamma_empty
end

@info("Correctness check",
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta),
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta),
    gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta))

@info "Benchmarks"
println("LV: ")
@btime gamma($s_ijt, $gamma_empty, $beta)
println("LA: ")
@btime gamma_mat($s_ijt, $gamma_empty, $beta)
println("Tullio: ")
@btime gamma_tul($s_ijt, $gamma_empty, $beta)
println("LA (v2): ")
@btime gamma_mat2($s_ijt, $gamma_empty, $beta);
```

Output:

```
julia> include("test.jl")
┌ Info: Correctness check
│ gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat(s_ijt, zeros(size_s, size_s), beta) = true
│ gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_tul(s_ijt, zeros(size_s, size_s), beta) = true
└ gamma(s_ijt, zeros(size_s, size_s), beta) ≈ gamma_mat2(s_ijt, zeros(size_s, size_s), beta) = true
[ Info: Benchmarks
LV:
43.153 ms (0 allocations: 0 bytes)
LA:
67.933 ms (4 allocations: 70.95 MiB)
Tullio:
59.067 ms (0 allocations: 0 bytes)
LA (v2):
32.415 ms (0 allocations: 0 bytes)
```

If I change that line to `BLAS.set_num_threads(8)`, which reflects my default, then I get:

```
julia> include("test.jl")
#...
LV:
43.105 ms (0 allocations: 0 bytes)
LA:
28.460 ms (4 allocations: 70.95 MiB)
Tullio:
62.924 ms (0 allocations: 0 bytes)
LA (v2):
18.774 ms (0 allocations: 0 bytes)
```

Also, a small note on style: in Julia we usually name functions that mutate one or more of their arguments with a trailing `!` and put the mutated arguments first. So it would be better style to write:

```
function gamma_mat2!(gamma_empty, s_ijt, beta)
    mul!(gamma_empty, s_ijt, transpose(s_ijt), beta / size(s_ijt, 2), false)
    return gamma_empty # note: this return is unnecessary, mul! returns that anyway, but being explicit is almost never a bad thing
end
```
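
As a side note on the 5-argument method used here: `mul!(C, A, B, α, β)` computes `C = α*A*B + β*C` in place, and passing `β = false` acts as a "strong zero", so `C` is overwritten without ever being read. A tiny sketch with made-up data:

```
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
C = fill(NaN, 2, 2)                   # contents don't matter: β = false means C is never read
mul!(C, A, transpose(A), 0.5, false)  # C = 0.5 * A * Aᵀ, no allocation
C ≈ 0.5 .* (A * transpose(A))         # true
```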