Uniform scaling inplace addition with matrix

jarl · April 24, 2021, 12:12pm

I want to do an inplace add of a multiple of the identity to a matrix. I can get the efficiency I want with a view of the diagonal:

using LinearAlgebra
function f1!(A,s)
   D = view(A, diagind(A, 0))
   D .+= s;
end

julia> s=1e-5;
julia> A=randn(1000,1000); x=diag(A);
julia> f1!(A,s);
julia> norm(diag(A) - (x.+s))
0.0
julia> @btime f1!(A,s)
  4.627 μs (3 allocations: 144 bytes)

For various reasons, I would prefer to use the UniformScaling. Is it possible to achieve the same efficiency with UniformScaling?

For reference:

function f2!(A,s)
    A[:,:] += s*I;
end

julia> @btime f2!(A,s);
  5.750 ms (8 allocations: 15.26 MiB)

As far as I understand, there are two problems with f2!:

1. Extra memory allocation (from the parse-time conversion of += into A[:,:] = A[:,:] + s*I).
2. The +(A::Matrix,J::UniformScaling) method is a for-loop (referenced below), rather than a BLAS level-1 axpy call, which is what I assume we get in f1!.

github.com

JuliaLang/julia/blob/f9720dc2ebd6cd9e3086365f281e62506444ef37/stdlib/LinearAlgebra/src/uniformscaling.jl#L215-L222


      
          function (+)(A::AbstractMatrix, J::UniformScaling)
              checksquare(A)
              B = copy_oftype(A, Base._return_type(+, Tuple{eltype(A), typeof(J)}))
              @inbounds for i in axes(A, 1)
                  B[i,i] += J
              end
              return B
          end
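
The first point is easy to check with @allocated from Base. This is only a sketch: f1! and f2! are the functions from this post (changed here so that both return A), the 200×200 size is arbitrary, and the exact byte counts will depend on the Julia version.

```julia
using LinearAlgebra

# Same as f2! above: `A[:,:] += s*I` is lowered to `A[:,:] = A[:,:] + s*I`
function f2!(A, s)
    A[:, :] += s * I
    return A
end

# Same as f1! above: only the diagonal is touched, through a view
function f1!(A, s)
    D = view(A, diagind(A, 0))
    D .+= s
    return A
end

A = randn(200, 200)
f1!(A, 1e-5); f2!(A, 1e-5)            # warm up so compilation is not measured
bytes_view = @allocated f1!(A, 1e-5)
bytes_full = @allocated f2!(A, 1e-5)
println((bytes_view, bytes_full, sizeof(A)))
```

f2! should report at least two full copies of A: one for A[:,:] on the right-hand side and one made inside +. That matches the 15.26 MiB above, which is twice the size of a 1000×1000 Float64 matrix.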

stevengj · April 24, 2021, 12:19pm

I don’t know of a way to do this with I, but a loop works and is efficient (faster than your f1! on my machine):

function f3!(A,s)
    m = min(size(A)...)
    for i = 1:m; A[i,i] += s; end
    return A
end

f1! calls the broadcast machinery under the hood, not BLAS. It’s possible but a bit tricky to use a BLAS axpy function for this operation:

function f4!(A::Matrix{T}, s) where {T}
    m = min(size(A)...)
    incA = stride(A,1) + stride(A,2)
    sr = Ref{T}(s)
    GC.@preserve sr LinearAlgebra.BLAS.axpy!(m, one(T), Base.unsafe_convert(Ptr{T}, sr), 0, A, incA)
    return A
end
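
To see why incA = stride(A,1) + stride(A,2) walks along the diagonal, here is a small check on a 3×4 matrix (the sizes are arbitrary and the variable names are only for illustration). For a column-major Matrix, stride(A,1) is 1 and stride(A,2) is the number of rows, so each step moves one row down and one column right.

```julia
using LinearAlgebra

A = reshape(collect(1.0:12.0), 3, 4)
step_diag = stride(A, 1) + stride(A, 2)        # 1 + 3 for this matrix
m = min(size(A)...)                            # length of the diagonal
idx = 1:step_diag:(1 + (m - 1) * step_diag)    # linear indices of the diagonal

println(collect(idx))
println(A[idx] == diag(A))
```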

but on my machine it’s only 5% faster than my looping implementation f3! for your 1000×1000 benchmark. And even this small performance difference goes away if I use @inbounds in my f3! loop.
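
A sketch of the @inbounds variant mentioned above. The name f3_inbounds! is made up for this example, and skipping bounds checks like this assumes A uses ordinary 1-based indexing (a plain Array, for instance).

```julia
# Variant of f3! with bounds checks disabled inside the loop.
# Assumes 1-based indexing; every (i, i) with i <= min(size(A)...) is in range.
function f3_inbounds!(A, s)
    m = min(size(A)...)
    @inbounds for i in 1:m
        A[i, i] += s
    end
    return A
end

A = zeros(3, 4)
f3_inbounds!(A, 2.0)
```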

jarl · April 24, 2021, 12:57pm

Thanks! Learning a lot. I thought f1! would effectively be the same as f4!. So, the BLAS incx=0 trick for vec .+ scalar is never used in Julia?

stevengj · April 24, 2021, 1:00pm

Not as far as I know. (There’s little or no benefit to BLAS over a simple for loop for axpy anyway, especially for vector .+ scalar; compilers are good at axpy-like loops and there’s no possibility of fancy blocking hand optimizations like there is for BLAS-3 / matrix multiply / gemm, while loops are far more versatile in supporting more types etcetera.)

(As far as I know, BLAS axpy is not even used for vector + vector in Julia. There’s no point. Besides, if you care about performance you’re much better off combining multiple vector operations into a single loop, or using “dot fusion”, than breaking your calculation up into a sequence of axpy and other elementary operations.)

In general, performance optimization in Julia doesn’t rely on “mining” the standard library in the hope of finding a “vectorized / built-in” function that does exactly what you want (unlike e.g. Matlab or Python). Properly written user code and loops are fast.
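
As a small illustration of that point (my own sketch, with made-up function names and sizes), here are three ways to compute z = a*x + b*y. The unfused version builds temporary arrays for a*x and b*y before adding them. The dot-fused version and the explicit loop each make a single pass and write straight into z.

```julia
# Unfused: a*x and b*y each allocate a temporary, then + allocates the result
unfused(a, x, b, y) = a * x + b * y

# Dot fusion: all the dotted operations compile to one loop writing into z
function fused!(z, a, x, b, y)
    z .= a .* x .+ b .* y
    return z
end

# The same thing as a hand-written loop
function looped!(z, a, x, b, y)
    for i in eachindex(z, x, y)
        z[i] = a * x[i] + b * y[i]
    end
    return z
end

x = randn(1000); y = randn(1000)
a, b = 2.0, 3.0
z1 = unfused(a, x, b, y)
z2 = fused!(similar(x), a, x, b, y)
z3 = looped!(similar(x), a, x, b, y)
```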

jarl · April 24, 2021, 6:04pm

Great. Thanks. That’s helpful in other places in my code.

I’ll leave this post open for a while since an efficient version involving I would be helpful.

ChrisRackauckas · April 29, 2021, 3:31pm

You can use diagind in order to do this inplace on the diagonal. It’s how we do it in OrdinaryDiffEq to make it GPU-compatible:

github.com

SciML/OrdinaryDiffEq.jl/blob/v5.52.7/src/derivative_utils.jl#L379-L383


      
          if MT <: UniformScaling
            copyto!(W, J)
            idxs = diagind(W)
            λ = -mass_matrix.λ
            @.. @view(W[idxs]) = muladd(λ, invdtgamma, @view(J[idxs]))
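
For readers without the surrounding OrdinaryDiffEq code, here is a self-contained sketch of the same idea. It is not the package's implementation: diag_shift! is a hypothetical name, and I use ordinary dot broadcasting from Base instead of the @.. macro in the snippet. The names W, J, λ and invdtgamma follow the snippet.

```julia
using LinearAlgebra

# Hypothetical helper: W = J + (λ * invdtgamma) * I, touching only the diagonal
function diag_shift!(W, J, λ, invdtgamma)
    copyto!(W, J)                        # off-diagonal entries are just those of J
    idxs = diagind(W)                    # linear indices of the diagonal
    Wd = view(W, idxs)
    Jd = view(J, idxs)
    Wd .= muladd.(λ, invdtgamma, Jd)     # λ*invdtgamma + J[i,i], fused, in place
    return W
end

J = [1.0 2.0; 3.0 4.0]
W = similar(J)
diag_shift!(W, J, -1.0, 0.5)
```

Because everything goes through views and broadcasting rather than scalar indexing, this style is what makes the approach usable with GPU array types, as mentioned above.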

jarl · April 30, 2021, 6:17am

Thanks. Does that compile to the same code as the use of diagind in f1!?

Edit: Ah. Now I understand your point. It does make it “work” for I.

jarl · May 4, 2021, 5:46pm

Wouldn’t an extension of mul! be a natural place to put this functionality?

import LinearAlgebra.mul!
function mul!(X::StridedMatrix{T},a::Bool,B::UniformScaling{T},alpha::Bool,beta::Bool) where {T}
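   if (a & alpha & beta)
      D = view(X, diagind(X)) # Or more efficient version
      D .+= B.λ               # X = X + B: add λ to every diagonal entry in place
   else
      easytoimplement()       # placeholder: the other Bool combinations are not written out here
   end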
   return X
end
function f6!(A::StridedMatrix,s)
   mul!(A,true,s*I,true,true);
end

julia> A=randn(1000,1000);
julia> @btime f6!(A,3.0);
  3.732 μs (2 allocations: 80 bytes)

ChrisRackauckas · May 5, 2021, 6:32am

That’s probably the right way to add it.

jarl · May 5, 2021, 9:54am

Okay. I will add a PR eventually. It doesn’t really involve a multiplication, so mul! is not an obvious choice for users who don’t know how five-argument mul! can be used.
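
For anyone who hasn’t used it: five-argument mul!(C, A, B, α, β) overwrites C with A*B*α + C*β. A small sketch with ordinary matrices (the values are arbitrary):

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [0.0 1.0; 1.0 0.0]
C = ones(2, 2)

expected = A * B * 2.0 + C * 3.0   # out-of-place reference, computed before C is overwritten
mul!(C, A, B, 2.0, 3.0)            # in place: C = A*B*2.0 + C*3.0
```

With α = β = true the update becomes C = A*B + C, which is why an in-place addition can be phrased as a mul! call.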

josuagrw · May 5, 2021, 10:07am

Yes, mul!() is a strange function for an addition method. I think I would look for this functionality in axpby!():

axpby!() in manual:

jarl · May 5, 2021, 11:07am

I agree. axpby! does have a heritage from BLAS (even the manual refers to BLAS.axpby!) and this feature is far from BLAS. mul! is Julia-specific. I wonder if axpby! is really meant to be used as an add! analogous to mul!.
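
For reference, a minimal sketch of what LinearAlgebra.axpby! does on ordinary arrays (values chosen arbitrarily): axpby!(a, X, b, Y) overwrites Y with X*a + Y*b.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
axpby!(2.0, x, 0.5, y)   # in place: y = 2.0*x + 0.5*y
println(y)
```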

jarl · May 6, 2021, 11:09am

Discussion can be continued here: Five arg mul! for UniformScaling and improvement in exp! by jarlebring · Pull Request #40731 · JuliaLang/julia · GitHub

DNF · May 7, 2021, 1:01pm

The elegant solution would be for this to work:

A .+= (s*I)

But it doesn’t.

There is something blocking this, but I’m not sure what: Broadcasting UniformScaling Operations · Issue #23197 · JuliaLang/julia · GitHub
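
Until something like that works, one way to keep I in the call site is a small helper that dispatches on UniformScaling. This is only a sketch: add_diag! is a hypothetical name, not a LinearAlgebra function, and it uses the same diagonal loop as the earlier replies.

```julia
using LinearAlgebra

# Hypothetical helper: A = A + J in place, for J = s*I
function add_diag!(A::AbstractMatrix, J::UniformScaling)
    LinearAlgebra.checksquare(A)   # A + J also requires a square matrix
    for i in diagind(A)            # linear indices of the diagonal
        A[i] += J.λ
    end
    return A
end

A = zeros(3, 3)
add_diag!(A, 2.0 * I)
```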
