Some alternatives. Maybe I am not using Tullio quite properly here. I have 4 threads, and mul!
is generally multi-threaded because of BLAS:
julia> using Tullio, LinearAlgebra
julia> function f!(C,A,B)
for k in 1:10
C[:,:,k] = A[:,:,k] * B[:,:,k]
end
end
f! (generic function with 1 method)
julia> function f2!(C,A,B)
for k in 1:10
@views mul!(C[:,:,k],A[:,:,k],B[:,:,k])
end
end
f2! (generic function with 1 method)
julia> f3!(C,A,B) = @views @tullio C[:,:,k] = A[:,:,k] * B[:,:,k]
f3! (generic function with 1 method)
julia> A = rand(5,5,10); B = rand(5,5,10); C = zeros(5,5,10);
julia> @btime f!($C,$A,$B);
3.387 μs (30 allocations: 7.50 KiB)
julia> @btime f2!($C,$A,$B);
1.750 μs (0 allocations: 0 bytes)
julia> @btime f3!($C,$A,$B);
2.240 μs (10 allocations: 2.50 KiB)
(edit: added @views
to Tullio).
Octavian gives me this:
julia> using Octavian
julia> function f5!(C,A,B)
for k in 1:10
@views Octavian.matmul!(C[:,:,k],A[:,:,k],B[:,:,k])
end
end
f5! (generic function with 1 method)
julia> @btime f5!($C,$A,$B);
196.513 ns (0 allocations: 0 bytes)
to good to be true?
(the timings are probably very dependent on the matrix sizes, so you probably need to test with the actual data).