Why does `A .= sign.(view(A, diagind(A))) .* A` allocate?

Hey,

consider this function:

using LinearAlgebra
function correct_sign(A)
  A .= sign.(view(A, diagind(A))) .* A
end

Why would that allocate? Am I missing something?

using BenchmarkTools
B = randn(3,3)
@btime correct_sign($B)
# 93.727 ns (3 allocations: 160 bytes)

Context: I am trying to correct the sign of the upper triangular R of the QR decomposition to be positive.

Can’t say why it happens exactly, but at least part of the allocations seem to come from the view combined with the broadcast?

E.g. this avoids the view (and some allocations)

function correct_sign(A)
    A .= sign.(diag(A)) .* A
end

EDIT: Ah, but the remaining allocation in that version comes from allocating the diag I guess…

1 Like

Note it only allocates a vector and not the full array.

I’ve seen that a couple of times that Julia does a copy to avoid aliasing. Since you assign on the left A, somehow there is a safety copy of view(A, diagind(A)). Because otherwise you might overwrite your sign already

If you need it allocation free, you can use a for loop. But I think because of the for loop order it is much slower.
You also see here why there needs to be a copy done. If you would go over the columns, you would overwrite A[i, i] and read it later again:

julia> using LinearAlgebra

julia> function correct_sign(A)
                A .= sign.(view(A, diagind(A))) .* A
              end
correct_sign (generic function with 1 method)

julia> @time correct_sign(A);
  0.297726 seconds (323.41 k allocations: 22.103 MiB, 3.25% gc time, 99.69% compilation time)

julia> @time correct_sign(A);
  0.000751 seconds (3 allocations: 8.203 KiB)


 julia> function correct_sign_loop(A)
           for i in axes(A, 1)
               s = sign(A[i, i])
               for j in axes(A, 2)
                   @inbounds A[i, j] *= s
               end
           end
           return A
       end
correct_sign_loop (generic function with 1 method)

julia> @time correct_sign_loop(A);
  0.047466 seconds (5.65 k allocations: 383.172 KiB, 75.01% compilation time)

julia> @time correct_sign_loop(A);
  0.012281 seconds

You can also write a for loop version where you collect the signs on the diagonal before. But my quick test showed it’s the same speeds as the broadcast (it’s mainly memory limited anway).

2 Likes

But then this shouldn’t allocate, right?

function correct_sign!(A, result)
  result .= sign.(view(A, diagind(A))) .* A
end

It does though:

B = randn(3,3)
D = similar(A)
@btime correct_sign!($B, $D)
# 52.309 ns (2 allocations: 80 bytes)

It’s one less though.

It doesn’t really allocate, it allocates 80Bytes but nothing large.
The

julia> res = similar(A);

julia> A = rand(1000, 1000);

julia> res = similar(A);

julia> using LinearAlgebra

julia> function correct_sign!(A, result)
         result .= sign.(view(A, diagind(A))) .* A
       end
correct_sign! (generic function with 1 method)

julia> @time correct_sign!(A, res);
  0.342393 seconds (360.13 k allocations: 24.659 MiB, 2.79% gc time, 98.78% compilation time)

julia> @time correct_sign!(A, res);
  0.002313 seconds (2 allocations: 80 bytes)

julia> using BenchmarkTools

julia> @btime view($A, diagind($A));
  19.529 ns (2 allocations: 80 bytes)

It’s two things: aliasing and reshape(::Array, ...) is slow and allocates Β· Issue #36313 Β· JuliaLang/julia Β· GitHub

When you take a view of a matrix with linear indices (like diagind produces), the matrix is reshaped to a vector behind the scenes.

4 Likes

Here’s a simple non-allocating version:

function correct_sign(A)
    @views for i in axes(A, 1)
        A[i, :] .= flipsign.(A[i, :], A[i, i])
    end
    return A
end
julia> B = randn(3, 3);

julia> @btime correct_sign(B);
  25.954 ns (0 allocations: 0 bytes)
3 Likes

It brings me a little bit of joy knowing that broadcast’s unaliasing pass does save folks’ bacons in real world usages.

Yeah, its false positives can make for an annoying performance snag. But it’s so much better than having your correct_sign be incorrect on the lower triangle! And it’s so much easier to identify than a silent sign flip would be.

3 Likes

For me using A[i,:] = -A[i,:] is quicker than flipsign if I just test the sign of A[i,i] once per row.

function adjust_signs(A)
  for i in axes(A,1)
    if sign(A[i,i]) < 0
      A[i, :] = -A[i, :]
    end
  end
  return A
end

The difference is a factor of 2 for 3x3 matrices, but is notably more as n gets larger.

julia> using LinearAlgebra

julia> using BenchmarkTools

julia> 

julia> function correct_sign(A)
         A .= sign.(view(A, diagind(A))) .* A
       end
correct_sign (generic function with 1 method)

julia> 

julia> function correct_sign_loop(A)
         for i in axes(A, 1)
           s = sign(A[i, i])
           for j in axes(A, 2)
             @inbounds A[i, j] *= s
           end
         end
         return A
       end
correct_sign_loop (generic function with 1 method)

julia> 

julia> function flip_signs(A)
         @views for i in axes(A, 1)
           A[i, :] .= flipsign.(A[i, :], A[i, i])
         end
         return A
       end
flip_signs (generic function with 1 method)

julia> 

julia> function adjust_signs(A)
         for i in axes(A,1)
           if sign(A[i,i]) < 0
             A[i, :] = -A[i, :]
           end
         end
         return A
       end
adjust_signs (generic function with 1 method)

julia> 

julia> n = 3
3

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 968 evaluations.
 Range (min … max):  79.071 ns …  1.360 ΞΌs  β”Š GC (min … max): 0.00% … 90.95%
 Time  (median):     86.260 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   89.630 ns Β± 43.153 ns  β”Š GC (mean Β± Οƒ):  3.20% Β±  5.94%

             β–‚β–„β–„β–‡β–ˆβ–ˆβ–†β–†β–…β–ƒβ–                                       
  β–β–β–β–β–β–β–‚β–‚β–ƒβ–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–„β–„β–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  79.1 ns         Histogram: frequency by time         104 ns <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):   9.500 ns … 21.916 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     10.250 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   10.148 ns Β±  0.469 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁ β–…  β–… β–‚  ▁                                β–„  β–ˆ β–†  β–ƒ β–‚  ▁   β–‚
  β–ˆβ–β–ˆβ–β–β–ˆβ–β–ˆβ–β–β–ˆβ–β–ˆβ–β–β–„β–β–‡β–β–β–†β–β–β–…β–β–„β–β–β–β–β–β–β–β–„β–β–β–β–β–„β–β–„β–ƒβ–β–ˆβ–β–β–ˆβ–β–ˆβ–β–β–ˆβ–β–ˆβ–β–β–ˆβ–β–‡ β–ˆ
  9.5 ns       Histogram: log(frequency) by time      10.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  9.259 ns … 20.562 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     9.301 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   9.349 ns Β±  0.376 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

           β–ˆ         β–‚                                        
  β–ƒβ–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–ˆβ–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–‚β–β–β–β–β–β–β–β–β–‚ β–‚
  9.26 ns        Histogram: frequency by time        9.51 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.791 ns … 14.291 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     6.292 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   6.311 ns Β±  0.169 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         ▁                               β–…   β–ˆ   β–„  β–ƒ   β–…    ▁
  β–„β–β–β–ˆβ–β–β–β–ˆβ–β–β–…β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‡β–β–β–ˆβ–β–β–β–ˆβ–β–β–β–ˆβ–β–β–ˆβ–β–β–β–ˆβ–β–β–‡ β–ˆ
  5.79 ns      Histogram: log(frequency) by time     6.46 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

julia> n = 30
30

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 343 evaluations.
 Range (min … max):  255.224 ns …  2.571 ΞΌs  β”Š GC (min … max): 0.00% … 88.50%
 Time  (median):     266.280 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   271.691 ns Β± 55.239 ns  β”Š GC (mean Β± Οƒ):  1.57% Β±  5.73%

  β–‚β–ˆβ–ˆβ–„β–„β–β–                                                      β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–„β–ƒβ–ƒβ–„β–β–β–β–β–β–β–β–β–ƒβ–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–„β–… β–ˆ
  255 ns        Histogram: log(frequency) by time       574 ns <

 Memory estimate: 336 bytes, allocs estimate: 3.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 320 evaluations.
 Range (min … max):  264.972 ns … 354.947 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     292.319 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   287.701 ns Β±   9.718 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

           β–…β–…           ▁             β–ˆβ–ƒ          ▂▁            ▁
  β–ƒβ–β–β–β–β–β–β–β–β–ˆβ–ˆβ–†β–†β–†β–†β–†β–†β–†β–ˆβ–†β–†β–‡β–ˆβ–…β–†β–‡β–†β–†β–†β–†β–†β–†β–†β–†β–…β–†β–ˆβ–ˆβ–…β–ƒβ–„β–…β–…β–ˆβ–ˆβ–…β–„β–„β–ˆβ–ˆβ–…β–„β–„β–„β–…β–†β–‡β–‡β–…β–…β–… β–ˆ
  265 ns        Histogram: log(frequency) by time        311 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 320 evaluations.
 Range (min … max):  272.784 ns … 376.694 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     292.837 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   293.674 ns Β±   4.581 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                  β–‚         β–ˆβ–†          β–‚β–ƒ                      ▁
  β–†β–β–β–„β–β–β–β–β–ƒβ–„β–β–β–ƒβ–β–β–†β–ˆβ–ƒβ–β–β–β–β–ƒβ–β–ƒβ–β–ˆβ–ˆβ–†β–ƒβ–ƒβ–β–…β–…β–‡β–„β–…β–„β–ˆβ–ˆβ–…β–…β–…β–…β–„β–…β–‡β–‡β–‡β–†β–…β–…β–„β–β–…β–„β–„β–ƒβ–„β–ƒβ–ƒ β–ˆ
  273 ns        Histogram: log(frequency) by time        318 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  20.206 ns … 38.327 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     21.710 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   21.773 ns Β±  0.549 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                              β–ƒβ–ˆβ–…β–‚β–ƒβ–   ▁▃▁                    ▁
  β–ˆβ–ƒβ–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–†β–ˆβ–ˆβ–ˆβ–†β–…β–…β–β–ƒβ–ƒβ–β–ƒβ–ƒβ–„β–β–„β–β–„β–ƒβ–„β–ƒβ–ƒβ–„ β–ˆ
  20.2 ns      Histogram: log(frequency) by time      23.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> 

julia> n = 1000
1000

julia> B = randn(n,n);

julia> @benchmark correct_sign($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  164.917 ΞΌs …  2.012 ms  β”Š GC (min … max): 0.00% … 90.28%
 Time  (median):     178.542 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   179.446 ΞΌs Β± 22.207 ΞΌs  β”Š GC (mean Β± Οƒ):  0.17% Β±  1.25%

                                   β–…β–ˆβ–ˆβ–ƒ                         
  β–‚β–‚β–β–β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–β–β–‚β–β–β–β–β–β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–ƒβ–ƒβ–„β–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  165 ΞΌs          Histogram: frequency by time          188 ΞΌs <

 Memory estimate: 7.97 KiB, allocs estimate: 4.

julia> B = randn(n,n);

julia> @benchmark correct_sign_loop($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  364.917 ΞΌs … 463.750 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     381.333 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   381.398 ΞΌs Β±   3.586 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                  β–‚β–…β–‡β–ˆβ–ˆβ–„β–                        
  β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–ƒβ–„β–…β–‡β–‡β–…β–„β–ƒβ–ƒβ–„β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–…β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  365 ΞΌs           Histogram: frequency by time          393 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark flip_signs($B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  381.708 ΞΌs … 498.958 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     390.750 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   391.702 ΞΌs Β±   4.005 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                    β–‚β–ƒβ–…β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–…β–…β–…β–…β–…β–„β–„β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–        ▁▁▁  β–ƒ
  β–ƒβ–„β–ƒβ–β–β–„β–β–β–β–β–β–β–ƒβ–ƒβ–ƒβ–β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β–ˆ
  382 ΞΌs        Histogram: log(frequency) by time        403 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> B = randn(n,n);

julia> @benchmark adjust_signs($B)
BenchmarkTools.Trial: 10000 samples with 178 evaluations.
 Range (min … max):  600.657 ns … 792.371 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     641.382 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   643.524 ns Β±   8.269 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                               β–ˆβ–‡β–‚       ▁ β–ƒβ–ƒ                   ▁
  β–„β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‡β–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–„β–ˆβ–ˆβ–„β–ˆβ–ˆβ–†β–ƒβ–„β–„β–…β–†β–†β–‡β–ˆβ–‡β–†β–†β–†β–…β–…β–…β–…β–… β–ˆ
  601 ns        Histogram: log(frequency) by time        684 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
2 Likes

No allocation but performance is terrible:

julia> using LinearAlgebra, BenchmarkTools

julia> A = rand(1024, 1024);

julia> function correct_sign(A)
         A .= sign.(view(A, diagind(A))) .* A
       end
correct_sign (generic function with 1 method)

julia> @benchmark correct_sign($A)
BenchmarkTools.Trial: 9741 samples with 1 evaluation.
 Range (min … max):  211.793 ΞΌs …   2.461 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     552.317 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   507.512 ΞΌs Β± 252.929 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

     β–ˆ               β–‚                                           
  β–β–‚β–ˆβ–ˆβ–ƒβ–‚β–‚β–‚β–β–‚β–β–β–β–β–‚β–‚β–ƒβ–„β–‡β–ˆβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  212 ΞΌs           Histogram: frequency by time         1.61 ms <

 Memory estimate: 8.20 KiB, allocs estimate: 3.

julia> function correct_sign_2(A)
           @views for i in axes(A, 1)
               A[i, :] .= flipsign.(A[i, :], A[i, i])
           end
           return A
       end
correct_sign (generic function with 1 method)

julia> @benchmark correct_sign_2($A)
BenchmarkTools.Trial: 494 samples with 1 evaluation.
 Range (min … max):   4.647 ms … 16.680 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     11.529 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   10.123 ms Β±  2.998 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ˆβ–ƒ                                       β–‚β–‚β–‡β–ˆ               
  β–ƒβ–ˆβ–ˆβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–β–‚β–‚β–β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–β–‚β–‚β–‚β–ƒβ–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–„β–„β–ƒβ–ƒβ–ƒβ–„β–‚β–ƒβ–ƒβ–‚β–„ β–ƒ
  4.65 ms         Histogram: frequency by time        14.1 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.