Overhead of view / reshape / transpose in linear algebra

An example: combining reshape + view with ' for the transpose and mul! incurs some overhead.

using Random
using LinearAlgebra
using BenchmarkTools
buf = zeros(1000)
v1 = reshape(view(buf, 1:100), (20, 5))
v2 = view(v1, 1:16, :)
m3 = zeros(5, 5)
m4 = zeros(5, 16);
rand!(v2); rand!(m3);
m2 = zeros(16, 5); m2 .= v2

then test:

julia> @btime mul!(m4, m3, v2')
  249.301 ns (1 allocation: 112 bytes)
julia> @btime mul!(m4, m3, m2')
  246.672 ns (1 allocation: 16 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, m2, 0.0, m4)
  223.786 ns (0 allocations: 0 bytes)
julia> @btime BLAS.gemm!('N', 'T', 1.0, m3, v2, 0.0, m4)
  220.938 ns (0 allocations: 0 bytes)

Using BLAS.gemm! works, but mul! allocates when ' is used for the transpose, and when that happens the reshape + view wrapper makes the allocation much larger.

PS: For heavy calculations, Julia's GC sometimes seems inadequate (I see GC-related memory peaks), and if we want to implement a memory pool on our own, reshape + view is arguably the easiest way?
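As a concrete illustration of that memory-pool idea, here is a minimal bump-allocator sketch built on reshape + view (`BumpPool` and `alloc!` are hypothetical names, not an existing API):

```julia
# Hypothetical bump-allocator pool: hand out arrays as
# reshape(view(...)) slices of one preallocated buffer.
struct BumpPool
    buf::Vector{Float64}
    offset::Base.RefValue{Int}   # number of elements already handed out
end
BumpPool(n::Integer) = BumpPool(zeros(n), Ref(0))

function alloc!(pool::BumpPool, dims::Integer...)
    len = prod(dims)
    pool.offset[] + len <= length(pool.buf) || error("pool exhausted")
    r = pool.offset[]+1 : pool.offset[]+len
    pool.offset[] += len
    return reshape(view(pool.buf, r), dims)
end
```

Allocations never overlap (each call advances the offset past the slice it handed out), so the disjointness issues discussed below still apply to the resulting arrays.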

You’re seeing benchmarking artifacts. You want to interpolate the arguments with $ so they’re treated as arguments:

julia> @btime mul!($m4, $m3, $v2');
  147.178 ns (0 allocations: 0 bytes)

julia> @btime mul!($m4, $m3, $m2');
  146.382 ns (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', 1.0, $m3, $m2, 0.0, $m4);
  142.447 ns (0 allocations: 0 bytes)

BenchmarkTools evaluates the expression you pass (warts and all) as though it were plopped inside a function body. If you write m2 without the $, it’s accessed as a type-unstable, non-constant global. The $ makes it behave as though it had been passed as an argument to that function body.
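The same effect can be seen outside BenchmarkTools with @code_warntype (the function names below are made up for illustration):

```julia
using LinearAlgebra

B = zeros(5, 5)            # non-constant globals
A = zeros(16, 5)
C = zeros(5, 16)

f_global() = mul!(C, B, A')        # reads the globals: type-unstable
f_args(C, B, A) = mul!(C, B, A')   # same call on arguments: fully inferred

# @code_warntype f_global()        # the global accesses show up as `Any`
# @code_warntype f_args(C, B, A)   # concrete types throughout
```

Interpolating with $ in @btime gives you the f_args behavior; without it you benchmark the f_global version, including the cost (and allocations) of the dynamic global lookups.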


Thanks for the correction. I ran into this problem in practice and tried to construct an MWE; clearly I made a mistake and this isn’t one. I’ll accept your answer for now and reopen the issue if I find a real MWE later.

I may have found another case, and since it’s so closely related to this topic, I’ve unaccepted your answer and am posting it here. I’m using Julia 1.10.2.

julia> using LinearAlgebra
julia> using BenchmarkTools
julia> buffer = zeros(ComplexF64, 10000);
julia> psi = reshape(view(buffer, 1:5000), 50, 100);
julia> hpsi = reshape(view(buffer, 5001:10000), 50, 100);
julia> function _f1!(psi, hpsi)
           view(psi, :, 1) .+= view(hpsi, :, 1)
       end
_f1! (generic function with 1 method)
julia> function _f2!(psi, hpsi)
           BLAS.axpy!(50, complex(1.0, 0.0), pointer(hpsi, 1), 1, pointer(psi, 1), 1)
       end
_f2! (generic function with 1 method)
julia> @btime _f1!(psi, hpsi);
  4.547 μs (3 allocations: 78.27 KiB)
julia> @btime _f2!(psi, hpsi);
  45.122 ns (1 allocation: 16 bytes)
julia> @btime _f1!($psi, $hpsi);
  4.585 μs (2 allocations: 78.17 KiB)
julia> @btime _f2!($psi, $hpsi);
  29.575 ns (0 allocations: 0 bytes)

I think they’re doing the same thing, so the last two measurements at least should be free of artifacts?

This is a different case now — psi and hpsi share the same buffer (albeit in a disjoint manner). Broadcast is making a defensive copy here because it wasn’t able to easily prove that the two don’t overlap… and if they did overlap you’d get very unexpected results from the raw BLAS calls as they won’t make these defensive copies.
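You can query this conservatism directly via Base.mightalias (internal and unexported, so version-dependent; the sketch below is against Julia 1.10). It returns true for the two reshaped views even though they are disjoint:

```julia
buffer = zeros(ComplexF64, 10_000)
psi  = reshape(view(buffer, 1:5000),     50, 100)
hpsi = reshape(view(buffer, 5001:10000), 50, 100)

# Both wrappers report `buffer` among their data ids, so the generic
# fallback cannot prove disjointness and broadcast copies defensively.
Base.mightalias(psi, hpsi)                        # true (conservative)
Base.mightalias(psi, zeros(ComplexF64, 50, 100))  # false: separate storage

# The underlying views *can* be proven disjoint, because SubArray has a
# specialized method that compares indices into the same parent:
Base.mightalias(parent(psi), parent(hpsi))        # false
```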


Then is there something like “unsafe broadcast”? As I mentioned in the post, I have to use a custom memory pool, otherwise there are GC peaks while my program runs. For now I’ve switched to pure BLAS/LAPACK wherever needed, but broadcasting is more concise and readable.

There is no “unsafe” broadcasting, but broadcasting is really the same as writing a loop. So in these cases you just need to write the broadcast operation out as a for loop.
Note: in contrast to Python/R and the like, explicit loops in Julia are not slow! There is no inherent need to “vectorize” Julia code to get performance.
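For the MWE above, the broadcast in _f1! written as an explicit loop (the name _f1_loop! is made up) avoids the aliasing check and the defensive copy:

```julia
# Same operation as `view(psi, :, 1) .+= view(hpsi, :, 1)`, but direct
# indexing never triggers broadcast's conservative unaliasing copy.
function _f1_loop!(psi, hpsi)
    axes(psi, 1) == axes(hpsi, 1) || throw(DimensionMismatch())
    @inbounds for i in axes(psi, 1)
        psi[i, 1] += hpsi[i, 1]
    end
    return psi
end
```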


It’s also a case where I think the project would accept straightforward improvements to the aliasing detection — especially if it’s cheap or even free. I think a few simple definitions like these would improve the world for everyone:

mightalias(A::ReshapedArray, B::ReshapedArray) = mightalias(A.parent, B.parent)
mightalias(A::ReshapedArray, B::SubArray) = mightalias(A.parent, B)
mightalias(A::SubArray, B::ReshapedArray) = mightalias(A, B.parent)

FastBroadcast.jl doesn’t have such checks.
