Yeah, this is something I've thought a little bit about. I had hoped that Tullio.jl plus LoopVectorization.jl could automatically derive a custom kernel clever enough to beat BLAS for this, but that hope unfortunately did not materialize:
julia> using LinearAlgebra, BenchmarkTools, Tullio, LoopVectorization
julia> function fmul!(C, D, A, U)
           mul!(D, A, U)  # D = A * U (temporary)
           mul!(C, U', D) # C = U' * (A * U)
       end;
julia> ftullio!(C, A, U) = @tullio C[i, j] = U'[i, k] * A[k, l] * U[l, j];
julia> function favx!(C, A, U)
           n = size(C, 1)
           @assert (n, n) == size(C) == size(A) == size(U)
           ax = axes(C, 1)
           # fully fused loop nest: C[i, j] = Σ_{k,l} conj(U[k, i]) * A[k, l] * U[l, j]
           @avx for i ∈ ax, j ∈ ax
               Cij = zero(eltype(C))
               for k ∈ ax, l ∈ ax
                   Cij += conj(U[k, i]) * A[k, l] * U[l, j]
               end
               C[i, j] = Cij
           end
           C
       end
favx! (generic function with 1 method)
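As a quick sanity check (not part of the timing run below), all three implementations should agree up to roundoff, so something like this should return true:

julia> let n = 8
           A = randn(n, n); A = A + A'
           U = eigvecs(A)
           C1, C2, C3, D = similar(U), similar(U), similar(U), similar(U)
           ftullio!(C1, A, U); favx!(C2, A, U); fmul!(C3, D, A, U)
           C1 ≈ C2 ≈ C3
       end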
julia> foreach((5, 10, 20, 50, 100, 500)) do n
           @show n
           A = randn(n, n); A = A + A'  # random symmetric matrix
           U = eigvecs(A)               # orthogonal eigenvector matrix
           C = similar(U); D = similar(U)
           # times below print in this order: ftullio!, favx!, fmul!
           @btime ftullio!($C, $A, $U)
           @btime favx!($C, $A, $U)
           @btime fmul!($C, $D, $A, $U)
           println()
       end
n = 5
199.142 ns (0 allocations: 0 bytes)
197.363 ns (0 allocations: 0 bytes)
335.792 ns (0 allocations: 0 bytes)
n = 10
1.406 μs (0 allocations: 0 bytes)
1.409 μs (0 allocations: 0 bytes)
655.056 ns (0 allocations: 0 bytes)
n = 20
17.910 μs (0 allocations: 0 bytes)
17.800 μs (0 allocations: 0 bytes)
2.322 μs (0 allocations: 0 bytes)
n = 50
171.148 μs (73 allocations: 5.47 KiB)
788.392 μs (0 allocations: 0 bytes)
25.299 μs (0 allocations: 0 bytes)
n = 100
2.277 ms (73 allocations: 5.47 KiB)
11.757 ms (0 allocations: 0 bytes)
78.519 μs (0 allocations: 0 bytes)
n = 500
1.404 s (75 allocations: 5.53 KiB)
7.343 s (0 allocations: 0 bytes)
4.129 ms (0 allocations: 0 bytes)
(Note: due to https://github.com/mcabbott/Tullio.jl/issues/50, you'll need Tullio version 0.2.8 or earlier to reproduce these times, not the latest version.)
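For anyone reproducing this, pinning the older version is a one-liner using standard Pkg syntax:

pkg> add Tullio@0.2.8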
I think in theory, LoopVectorization.jl should be able to create a pretty good microkernel for this, but perhaps the optimal algorithm really is just one matmul after another. To some extent that makes sense: the fused quadruple loop does O(n^4) work, whereas factoring the product into two successive matmuls does only O(n^3), trading a temporary buffer's worth of extra memory for less arithmetic and better cache locality. But it could also be that this is just a use case that LoopVectorization.jl is not well tuned for.
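A back-of-the-envelope FLOP count (my own throwaway helpers, not from the benchmark above) makes the asymptotic gap concrete:

julia> flops_fused(n) = 2n^4;         # favx!/ftullio!: n^2 outputs, each an n^2-term reduction

julia> flops_factored(n) = 2 * 2n^3;  # fmul!: two dense n×n matmuls, ~2n^3 FLOPs each

julia> flops_fused(500) / flops_factored(500)
250.0

So at n = 500 the fused kernel does 250× more arithmetic, and BLAS additionally runs each matmul at a much higher fraction of peak, which is consistent with the roughly 1800× gap in the timings above.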
I'd be interested to know if anyone knowledgeable about LoopVectorization.jl and BLAS has any thoughts on this. @Elrod?