Overhead of view / reshape / transpose in linear algebra

An example: combining reshape + view with ' for the transpose and mul! incurs some overhead.

using Random
using LinearAlgebra
using BenchmarkTools
buf = zeros(1000)
v1 = reshape(view(buf, 1:100), (20, 5))
v2 = view(v1, 1:16, :)
m3 = zeros(5, 5)
m4 = zeros(5, 16);
rand!(v2); rand!(m3);
m2 = zeros(16, 5); m2 .= v2

then test:

julia> @btime mul!(m4, m3, v2')
  249.301 ns (1 allocation: 112 bytes)
julia> @btime mul!(m4, m3, m2')
  246.672 ns (1 allocation: 16 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, m2, 0.0, m4)
  223.786 ns (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, v2, 0.0, m4)
  220.938 ns (0 allocations: 0 bytes)

Using BLAS.gemm! works, but mul! allocates when ' is used for the transpose, and when that happens the reshape + view wrapper makes the allocation much larger.

PS: For heavy calculations, Julia's GC sometimes seems inadequate (I see GC-related memory peaks), and if we want to implement a memory pool on our own, reshape + view is arguably the easiest way?
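As a concrete illustration of that memory-pool idea, here is a minimal bump-allocator sketch built on reshape + view (`BumpPool` and `alloc!` are hypothetical names, not an existing API):

```julia
# Hypothetical bump-allocator pool: hand out arrays as
# reshape(view(...)) slices of one preallocated buffer.
struct BumpPool
    buf::Vector{Float64}
    offset::Base.RefValue{Int}   # number of elements already handed out
end
BumpPool(n::Integer) = BumpPool(zeros(n), Ref(0))

function alloc!(pool::BumpPool, dims::Integer...)
    len = prod(dims)
    pool.offset[] + len <= length(pool.buf) || error("pool exhausted")
    r = pool.offset[]+1 : pool.offset[]+len
    pool.offset[] += len
    return reshape(view(pool.buf, r), dims)
end
```

Allocations never overlap (each call advances the offset past the slice it handed out), so the disjointness issues discussed below still apply to the resulting arrays.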

You’re seeing benchmarking artifacts. You want to interpolate the arguments with $ so they’re treated as arguments:

julia> @btime mul!($m4, $m3, $v2');
  147.178 ns (0 allocations: 0 bytes)

julia> @btime mul!($m4, $m3, $m2');
  146.382 ns (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', 1.0, $m3, $m2, 0.0, $m4);
  142.447 ns (0 allocations: 0 bytes)

BenchmarkTools evaluates the expression you pass (warts and all) as though it were plopped inside a function body. If you write m2 without the $, it’s accessed as a type-unstable, non-constant global. The $ makes it behave as though it had been passed as an argument to that function body.
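The same effect can be seen outside BenchmarkTools with @code_warntype (the function names below are made up for illustration):

```julia
using LinearAlgebra

B = zeros(5, 5)            # non-constant globals
A = zeros(16, 5)
C = zeros(5, 16)

f_global() = mul!(C, B, A')        # reads the globals: type-unstable
f_args(C, B, A) = mul!(C, B, A')   # same call on arguments: fully inferred

# @code_warntype f_global()        # the global accesses show up as `Any`
# @code_warntype f_args(C, B, A)   # concrete types throughout
```

Interpolating with $ in @btime gives you the f_args behavior; without it you benchmark the f_global version, including the cost (and allocations) of the dynamic global lookups.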


Thanks for the correction. I ran into this problem in practice and tried to construct an MWE; clearly I made a mistake and this isn’t one. I’ll accept your answer for now and reopen the issue if I find a real MWE later.

I may have found another case, and since it’s so closely related to this topic, I’ve unaccepted your answer and am posting it here. I’m using Julia 1.10.2.

julia> using LinearAlgebra
julia> using BenchmarkTools
julia> buffer = zeros(ComplexF64, 10000);
julia> psi = reshape(view(buffer, 1:5000), 50, 100);
julia> hpsi = reshape(view(buffer, 5001:10000), 50, 100);
julia> function _f1!(psi, hpsi)
           view(psi, :, 1) .+= view(hpsi, :, 1)
       end
_f1! (generic function with 1 method)
julia> function _f2!(psi, hpsi)
           BLAS.axpy!(50, complex(1.0, 0.0), pointer(hpsi, 1), 1, pointer(psi, 1), 1)
       end
_f2! (generic function with 1 method)
julia> @btime _f1!(psi, hpsi);
  4.547 μs (3 allocations: 78.27 KiB)
julia> @btime _f2!(psi, hpsi);
  45.122 ns (1 allocation: 16 bytes)
julia> @btime _f1!($psi, $hpsi);
  4.585 μs (2 allocations: 78.17 KiB)
julia> @btime _f2!($psi, $hpsi);
  29.575 ns (0 allocations: 0 bytes)

I think they’re doing the same thing, so the last two measurements at least should be free of artifacts?

This is a different case now — psi and hpsi share the same buffer (albeit in a disjoint manner). Broadcast is making a defensive copy here because it wasn’t able to easily prove that the two don’t overlap… and if they did overlap you’d get very unexpected results from the raw BLAS calls as they won’t make these defensive copies.
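You can query this conservatism directly via Base.mightalias (internal and unexported, so version-dependent; the sketch below is against Julia 1.10). It returns true for the two reshaped views even though they are disjoint:

```julia
buffer = zeros(ComplexF64, 10_000)
psi  = reshape(view(buffer, 1:5000),     50, 100)
hpsi = reshape(view(buffer, 5001:10000), 50, 100)

# Both wrappers report `buffer` among their data ids, so the generic
# fallback cannot prove disjointness and broadcast copies defensively.
Base.mightalias(psi, hpsi)                        # true (conservative)
Base.mightalias(psi, zeros(ComplexF64, 50, 100))  # false: separate storage

# The underlying views *can* be proven disjoint, because SubArray has a
# specialized method that compares indices into the same parent:
Base.mightalias(parent(psi), parent(hpsi))        # false
```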


Then is there something like “unsafe broadcast”? As I mentioned in the post, I have to use a custom memory pool, otherwise there are GC peaks while my program runs. For now I’ve switched to pure BLAS/LAPACK wherever needed, but broadcasting is more concise and readable.

There is no “unsafe” broadcasting, but broadcasting is really the same as writing a loop. So in these cases you just need to write the broadcast operation out as a for loop.
Note: in contrast to Python/R and the like, explicit loops in Julia are not slow! There is no inherent need to “vectorize” Julia code to get performance.
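For the MWE above, the broadcast in _f1! written as an explicit loop (the name _f1_loop! is made up) avoids the aliasing check and the defensive copy:

```julia
# Same operation as `view(psi, :, 1) .+= view(hpsi, :, 1)`, but direct
# indexing never triggers broadcast's conservative unaliasing copy.
function _f1_loop!(psi, hpsi)
    axes(psi, 1) == axes(hpsi, 1) || throw(DimensionMismatch())
    @inbounds for i in axes(psi, 1)
        psi[i, 1] += hpsi[i, 1]
    end
    return psi
end
```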


It’s also a case where I think the project would accept straightforward improvements to the aliasing detection — especially if it’s cheap or even free. I think a few simple definitions like these would improve the world for everyone:

mightalias(A::ReshapedArray, B::ReshapedArray) = mightalias(A.parent, B.parent)
mightalias(A::ReshapedArray, B::SubArray) = mightalias(A.parent, B)
mightalias(A::SubArray, B::ReshapedArray) = mightalias(A, B.parent)

FastBroadcast.jl doesn’t have such checks.
