An example: reshape + view combined with ' for the transpose and mul! has some overhead.
using Random
using LinearAlgebra
using BenchmarkTools
buf = zeros(1000)                        # preallocated backing buffer
v1 = reshape(view(buf, 1:100), (20, 5))  # 20×5 matrix backed by buf
v2 = view(v1, 1:16, :)                   # 16×5 view of the reshaped view
m3 = zeros(5, 5)
m4 = zeros(5, 16);
rand!(v2); rand!(m3);
m2 = zeros(16, 5); m2 .= v2              # plain Matrix copy of v2
Then test:
julia> @btime mul!(m4, m3, v2')
249.301 ns (1 allocation: 112 bytes)
julia> @btime mul!(m4, m3, m2')
246.672 ns (1 allocation: 16 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, m2, 0.0, m4)
223.786 ns (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, v2, 0.0, m4)
220.938 ns (0 allocations: 0 bytes)
Using BLAS.gemm! directly is nice, but mul! will allocate if ' is used for the transpose, and when that happens, reshape + view makes the allocation much larger (112 bytes vs. 16 bytes above).
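If staying with mul! is preferred, one possible workaround (not from the benchmarks above) is to materialize the transpose into a preallocated scratch matrix, so no lazy Adjoint wrapper is passed to mul!. A minimal sketch, assuming a hypothetical scratch buffer v2t and trading one explicit copy for the allocation:
v2t = zeros(5, 16)    # hypothetical preallocated scratch for the transpose of v2
transpose!(v2t, v2)   # explicit transpose into the scratch buffer (one copy)
mul!(m4, m3, v2t)     # plain mul! on Matrix arguments, should not allocate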
P.S. When performing heavy calculations, Julia's GC sometimes seems inadequate (there can be GC-related memory peaks), and if we want to implement a memory pool ourselves, reshape + view is arguably the easiest way?
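As an illustration of that idea, here is a minimal sketch of a bump-allocator-style pool built only on reshape + view over one preallocated buffer; the names ArrayPool, acquire!, and reset! are made up for this sketch and are not an existing API:
mutable struct ArrayPool
    buf::Vector{Float64}   # one big preallocated backing buffer
    offset::Int            # number of elements already handed out
end

ArrayPool(n::Integer) = ArrayPool(zeros(n), 0)

# Hand out an m×n matrix backed by the pool's buffer: the data itself is
# never heap-allocated again, only the small view/reshape wrappers.
function acquire!(p::ArrayPool, m::Integer, n::Integer)
    len = m * n
    p.offset + len <= length(p.buf) || error("pool exhausted")
    a = reshape(view(p.buf, p.offset+1:p.offset+len), (m, n))
    p.offset += len
    return a
end

# Release everything at once, bump-allocator style.
reset!(p::ArrayPool) = (p.offset = 0; p)

# Usage:
pool = ArrayPool(10_000)
A = acquire!(pool, 20, 5)   # 20×5 matrix backed by pool.buf
B = acquire!(pool, 16, 5)
reset!(pool)                # the whole buffer becomes reusable space
The obvious caveat is that every acquired array aliases the same buffer, so after reset! the previously handed-out arrays must not be used again; lifetimes have to be managed manually.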