Yeah, this is something I've thought a little bit about. I had hoped that Tullio.jl plus LoopVectorization.jl could derive a sufficiently clever custom kernel automatically that it'd beat BLAS for this, but that hope unfortunately did not materialize:
julia> using LinearAlgebra, BenchmarkTools, Tullio, LoopVectorization
julia> function fmul!(C, D, A, U)
           mul!(D, A, U)
           mul!(C, U', D)
       end;
julia> ftullio!(C, A, U) = @tullio C[i, j] = U'[i, k] * A[k, l] * U[l, j];
julia> function favx!(C, A, U)
           n = size(C, 1)
           @assert (n, n) == size(C) == size(A) == size(U)
           ax = axes(C, 1)
           @avx for i ∈ ax, j ∈ ax
               Cij = zero(eltype(C))
               for k ∈ ax, l ∈ ax
                   Cij += conj(U[k, i]) * A[k, l] * U[l, j]
               end
               C[i, j] = Cij
           end
           C
       end
favx! (generic function with 1 method)
julia> foreach((5, 10, 20, 50, 100, 500)) do n
           @show n
           A = randn(n, n); A = A + A'
           U = eigvecs(A)
           C = similar(U); D = similar(U)
           @btime ftullio!($C, $A, $U)
           @btime favx!($C, $A, $U)
           @btime fmul!($C, $D, $A, $U)
           println()
       end
n = 5
199.142 ns (0 allocations: 0 bytes)
197.363 ns (0 allocations: 0 bytes)
335.792 ns (0 allocations: 0 bytes)
n = 10
1.406 μs (0 allocations: 0 bytes)
1.409 μs (0 allocations: 0 bytes)
655.056 ns (0 allocations: 0 bytes)
n = 20
17.910 μs (0 allocations: 0 bytes)
17.800 μs (0 allocations: 0 bytes)
2.322 μs (0 allocations: 0 bytes)
n = 50
171.148 μs (73 allocations: 5.47 KiB)
788.392 μs (0 allocations: 0 bytes)
25.299 μs (0 allocations: 0 bytes)
n = 100
2.277 ms (73 allocations: 5.47 KiB)
11.757 ms (0 allocations: 0 bytes)
78.519 μs (0 allocations: 0 bytes)
n = 500
1.404 s (75 allocations: 5.53 KiB)
7.343 s (0 allocations: 0 bytes)
4.129 ms (0 allocations: 0 bytes)
(Note: due to https://github.com/mcabbott/Tullio.jl/issues/50, you will need Tullio version 0.2.8 or earlier to reproduce these timings, not the latest version.)
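For reference, all three functions compute the similarity transform C = U′AU. Here's a minimal, self-contained sanity check (not part of the timing run above) of the BLAS route against the obvious allocating expression; it needs only LinearAlgebra, no Tullio or LoopVectorization:

```julia
using LinearAlgebra

# Same two-GEMM route as fmul! above, with D as a preallocated temporary.
function fmul!(C, D, A, U)
    mul!(D, A, U)   # D = A * U
    mul!(C, U', D)  # C = U' * (A * U)
end

n = 8
A = randn(n, n); A = A + A'   # real symmetric test matrix
U = eigvecs(A)
C = similar(U); D = similar(U)
fmul!(C, D, A, U)
C ≈ U' * A * U                # should hold to rounding error
```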
I think in theory, LoopVectorization.jl should be able to create a pretty good microkernel for this, but perhaps the optimal algorithm really is just one matmul after another. In some sense this is expected: the naive algorithm involving a temporary buffer trades extra memory usage for an asymptotically lower operation count (two GEMMs cost O(n³) multiply-adds, while the fused quadruple loop does O(n⁴)) as well as better cache locality. But it could also be that this is just a use case that LoopVectorization.jl is not well tuned for.
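A back-of-the-envelope count makes the gap at n = 500 less surprising; the fused quadruple loop simply performs far more arithmetic than the two GEMMs (and this ignores memory traffic, which makes the measured gap larger still):

```julia
# Multiply-add counts for C = U' * A * U on n×n matrices:
#   fused quadruple loop: n^2 output entries × n^2 terms each  ⇒ n^4
#   two GEMMs:            2 matmuls × n^3 multiply-adds each   ⇒ 2n^3
n = 500
flops_fused = n^4            # 6.25e10
flops_gemm  = 2 * n^3        # 2.5e8
flops_fused ÷ flops_gemm     # 250, i.e. n/2 times more work for the fused loop
```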
I'd be interested to know if anyone knowledgeable about LoopVectorization.jl and BLAS has any thoughts on this. @Elrod?