Why does `A .= sign.(view(A, diagind(A))) .* A` allocate?

zsoerenm · November 6, 2024, 12:41pm

Hey,

consider this function:

using LinearAlgebra
function correct_sign(A)
  A .= sign.(view(A, diagind(A))) .* A
end

Why would that allocate? Am I missing something?

using BenchmarkTools
B = randn(3,3)
@btime correct_sign($B)
# 93.727 ns (3 allocations: 160 bytes)

Context: I am trying to correct the sign of the upper triangular R of the QR decomposition to be positive.

Sevi · November 6, 2024, 1:14pm

Can’t say why it happens exactly, but at least part of the allocations seem to come from the view combined with the broadcast?

E.g. this avoids the view (and some allocations)

function correct_sign(A)
    A .= sign.(diag(A)) .* A
end

EDIT: Ah, but the remaining allocation in that version comes from allocating the diag I guess…

roflmaostc · November 6, 2024, 1:25pm

Note it only allocates a vector and not the full array.

I’ve seen that a couple of times that Julia does a copy to avoid aliasing. Since you assign on the left A, somehow there is a safety copy of view(A, diagind(A)). Because otherwise you might overwrite your sign already

If you need it allocation free, you can use a for loop. But I think because of the for loop order it is much slower.
You also see here why there needs to be a copy done. If you would go over the columns, you would overwrite A[i, i] and read it later again:

julia> using LinearAlgebra

julia> function correct_sign(A)
                A .= sign.(view(A, diagind(A))) .* A
              end
correct_sign (generic function with 1 method)

julia> @time correct_sign(A);
  0.297726 seconds (323.41 k allocations: 22.103 MiB, 3.25% gc time, 99.69% compilation time)

julia> @time correct_sign(A);
  0.000751 seconds (3 allocations: 8.203 KiB)


 julia> function correct_sign_loop(A)
           for i in axes(A, 1)
               s = sign(A[i, i])
               for j in axes(A, 2)
                   @inbounds A[i, j] *= s
               end
           end
           return A
       end
correct_sign_loop (generic function with 1 method)

julia> @time correct_sign_loop(A);
  0.047466 seconds (5.65 k allocations: 383.172 KiB, 75.01% compilation time)

julia> @time correct_sign_loop(A);
  0.012281 seconds

You can also write a for loop version where you collect the signs on the diagonal before. But my quick test showed it’s the same speeds as the broadcast (it’s mainly memory limited anway).

zsoerenm · November 6, 2024, 2:02pm

But then this shouldn’t allocate, right?

function correct_sign!(A, result)
  result .= sign.(view(A, diagind(A))) .* A
end

It does though:

B = randn(3,3)
D = similar(A)
@btime correct_sign!($B, $D)
# 52.309 ns (2 allocations: 80 bytes)

It’s one less though.

roflmaostc · November 6, 2024, 2:05pm

It doesn’t really allocate, it allocates 80Bytes but nothing large.
The

julia> res = similar(A);

julia> A = rand(1000, 1000);

julia> res = similar(A);

julia> using LinearAlgebra

julia> function correct_sign!(A, result)
         result .= sign.(view(A, diagind(A))) .* A
       end
correct_sign! (generic function with 1 method)

julia> @time correct_sign!(A, res);
  0.342393 seconds (360.13 k allocations: 24.659 MiB, 2.79% gc time, 98.78% compilation time)

julia> @time correct_sign!(A, res);
  0.002313 seconds (2 allocations: 80 bytes)

julia> using BenchmarkTools

julia> @btime view($A, diagind($A));
  19.529 ns (2 allocations: 80 bytes)

mbauman · November 6, 2024, 2:10pm

It’s two things: aliasing and reshape(::Array, ...) is slow and allocates · Issue #36313 · JuliaLang/julia · GitHub

When you take a view of a matrix with linear indices (like diagind produces), the matrix is reshaped to a vector behind the scenes.

danielwe · November 6, 2024, 4:21pm

Here’s a simple non-allocating version:

function correct_sign(A)
    @views for i in axes(A, 1)
        A[i, :] .= flipsign.(A[i, :], A[i, i])
    end
    return A
end

julia> B = randn(3, 3);

julia> @btime correct_sign(B);
  25.954 ns (0 allocations: 0 bytes)

mbauman · November 6, 2024, 4:40pm

It brings me a little bit of joy knowing that broadcast’s unaliasing pass does save folks’ bacons in real world usages.

Yeah, its false positives can make for an annoying performance snag. But it’s so much better than having your correct_sign be incorrect on the lower triangle! And it’s so much easier to identify than a silent sign flip would be.

Andrew_Swann · November 6, 2024, 5:20pm

For me using A[i,:] = -A[i,:] is quicker than flipsign if I just test the sign of A[i,i] once per row.

function adjust_signs(A)
  for i in axes(A,1)
    if sign(A[i,i]) < 0
      A[i, :] = -A[i, :]
    end
  end
  return A
end

The difference is a factor of 2 for 3x3 matrices, but is notably more as n gets larger.

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> 

julia> function correct_sign(A)
         A .= sign.(view(A, diagind(A))) .* A
       end
correct_sign (generic function with 1 method)

julia> 

julia> function correct_sign_loop(A)
         for i in axes(A, 1)
           s = sign(A[i, i])
           for j in axes(A, 2)
             @inbounds A[i, j] *= s
           end
         end
         return A
       end
correct_sign_loop (generic function with 1 method)

julia> 

julia> function flip_signs(A)
         @views for i in axes(A, 1)
           A[i, :] .= flipsign.(A[i, :], A[i, i])
         end
         return A
       end
flip_signs (generic function with 1 method)

julia> 

julia> function adjust_signs(A)
         for i in axes(A,1)
           if sign(A[i,i]) < 0
             A[i, :] = -A[i, :]
           end
         end
         return A
       end
adjust_signs (generic function with 1 method)

julia> 

julia> n = 3
3

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  79.071 ns …  1.360 μs  ┊ GC (min … max): 0.00% … 90.95%
 Time  (median):     86.260 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   89.630 ns ± 43.153 ns  ┊ GC (mean ± σ):  3.20% ±  5.94%

             ▂▄▄▇██▆▆▅▃▁                                       
  ▁▁▁▁▁▁▂▂▃▅▆████████████▆▄▄▄▃▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  79.1 ns         Histogram: frequency by time         104 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):   9.500 ns … 21.916 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.250 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.148 ns ±  0.469 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▅  ▅ ▂  ▁                                ▄  █ ▆  ▃ ▂  ▁   ▂
  █▁█▁▁█▁█▁▁█▁█▁▁▄▁▇▁▁▆▁▁▅▁▄▁▁▁▁▁▁▁▄▁▁▁▁▄▁▄▃▁█▁▁█▁█▁▁█▁█▁▁█▁▇ █
  9.5 ns       Histogram: log(frequency) by time      10.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  9.259 ns … 20.562 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.301 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.349 ns ±  0.376 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           █         ▂                                        
  ▃▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂ ▂
  9.26 ns        Histogram: frequency by time        9.51 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.791 ns … 14.291 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.292 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.311 ns ±  0.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▁                               ▅   █   ▄  ▃   ▅    ▁
  ▄▁▁█▁▁▁█▁▁▅▁▁▁▃▁▁▁▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁█▁▁▁█▁▁▁█▁▁█▁▁▁█▁▁▇ █
  5.79 ns      Histogram: log(frequency) by time     6.46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

julia> n = 30
30

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
 Range (min … max):  255.224 ns …  2.571 μs  ┊ GC (min … max): 0.00% … 88.50%
 Time  (median):     266.280 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   271.691 ns ± 55.239 ns  ┊ GC (mean ± σ):  1.57% ±  5.73%

  ▂██▄▄▁▁                                                      ▂
  ████████▇▇▄▃▃▄▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅ █
  255 ns        Histogram: log(frequency) by time       574 ns <

 Memory estimate: 336 bytes, allocs estimate: 3.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 320 evaluations.
 Range (min … max):  264.972 ns … 354.947 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     292.319 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   287.701 ns ±   9.718 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▅▅           ▁             █▃          ▂▁            ▁
  ▃▁▁▁▁▁▁▁▁██▆▆▆▆▆▆▆█▆▆▇█▅▆▇▆▆▆▆▆▆▆▆▅▆██▅▃▄▅▅██▅▄▄██▅▄▄▄▅▆▇▇▅▅▅ █
  265 ns        Histogram: log(frequency) by time        311 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 320 evaluations.
 Range (min … max):  272.784 ns … 376.694 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     292.837 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   293.674 ns ±   4.581 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                  ▂         █▆          ▂▃                      ▁
  ▆▁▁▄▁▁▁▁▃▄▁▁▃▁▁▆█▃▁▁▁▁▃▁▃▁██▆▃▃▁▅▅▇▄▅▄██▅▅▅▅▄▅▇▇▇▆▅▅▄▁▅▄▄▃▄▃▃ █
  273 ns        Histogram: log(frequency) by time        318 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  20.206 ns … 38.327 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     21.710 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.773 ns ±  0.549 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                              ▃█▅▂▃▁   ▁▃▁                    ▁
  █▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁██████▇▅▆███▆▅▅▁▃▃▁▃▃▄▁▄▁▄▃▄▃▃▄ █
  20.2 ns      Histogram: log(frequency) by time      23.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

julia> n = 1000
1000

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  164.917 μs …  2.012 ms  ┊ GC (min … max): 0.00% … 90.28%
 Time  (median):     178.542 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   179.446 μs ± 22.207 μs  ┊ GC (mean ± σ):  0.17% ±  1.25%

                                   ▅██▃                         
  ▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▂▄█████▅▃▃▄▄▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  165 μs          Histogram: frequency by time          188 μs <

 Memory estimate: 7.97 KiB, allocs estimate: 4.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  364.917 μs … 463.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     381.333 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   381.398 μs ±   3.586 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                  ▂▅▇██▄▁                        
  ▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▄▅▇▇▅▄▃▃▄▅▇████████▇▇▆▅▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  365 μs           Histogram: frequency by time          393 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  381.708 μs … 498.958 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     390.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   391.702 μs ±   4.005 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                    ▂▃▅▆▇▇███▇▆▅▅▅▅▅▅▅▄▄▃▃▁▁▁▁▁▁▁▁▁        ▁▁▁  ▃
  ▃▄▃▁▁▄▁▁▁▁▁▁▃▃▃▁▆████████████████████████████████████████████ █
  382 μs        Histogram: log(frequency) by time        403 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  600.657 ns … 792.371 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     641.382 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   643.524 ns ±   8.269 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                               █▇▂       ▁ ▃▃                   ▁
  ▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇███▆▄▃▃▃▄██▄██▆▃▄▄▅▆▆▇█▇▆▆▆▅▅▅▅▅ █
  601 ns        Histogram: log(frequency) by time        684 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

roflmaostc · November 6, 2024, 5:23pm

No allocation but performance is terrible:

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(1024, 1024);

julia> function correct_sign(A)
         A .= sign.(view(A, diagind(A))) .* A
       end
correct_sign (generic function with 1 method)

julia> @benchmark correct_sign($A)
BenchmarkTools.Trial: 9741 samples with 1 evaluation.
 Range (min … max):  211.793 μs …   2.461 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     552.317 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   507.512 μs ± 252.929 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █               ▂                                           
  ▁▂██▃▂▂▂▁▂▁▁▁▁▂▂▃▄▇█▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  212 μs           Histogram: frequency by time         1.61 ms <

 Memory estimate: 8.20 KiB, allocs estimate: 3.

julia> function correct_sign_2(A)
           @views for i in axes(A, 1)
               A[i, :] .= flipsign.(A[i, :], A[i, i])
           end
           return A
       end
correct_sign (generic function with 1 method)

julia> @benchmark correct_sign_2($A)
BenchmarkTools.Trial: 494 samples with 1 evaluation.
 Range (min … max):   4.647 ms … 16.680 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.529 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.123 ms ±  2.998 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃                                       ▂▂▇█               
  ▃██▃▂▃▂▂▂▂▁▂▂▁▂▂▁▁▂▂▂▂▂▁▁▂▂▂▁▂▂▂▃▄▃▃▃▃▃▄▇▇████▇▅▄▄▃▃▃▄▂▃▃▂▄ ▃
  4.65 ms         Histogram: frequency by time        14.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.