Uniform scaling inplace addition with matrix

I want to do an inplace add of a multiple of the identity to a matrix. I can get the efficiency I want with a view of the diagonal:

using LinearAlgebra
function f1!(A,s)
   D = view(A, diagind(A, 0))
   D .+= s;
end
julia> s=1e-5;
julia> A=randn(1000,1000); x=diag(A);
julia> f1!(A,s);
julia> norm(diag(A) - (x.+s))
0.0
julia> @btime f1!(A,s)
  4.627 μs (3 allocations: 144 bytes)

For various reasons, I would prefer to use the UniformScaling. Is it possible to achieve the same efficiency with UniformScaling?

For reference:

function f2!(A,s)
    A[:,:] += s*I;
end
julia> @btime f2!(A,s);
  5.750 ms (8 allocations: 15.26 MiB)

As far as I understand, there are two problems with f2!:

  • Extra memory allocation (associated parse time conversion of +=)
  • The +(A::Matrix,J::UniformScaling) is a for-loop referenced below, rather than a blas level-1 axpy-call which is what we get in f1!.

I don’t know of a way to do this with I, but a loop works and is efficient (faster than your f1! on my machine):

function f3!(A,s)
    m = min(size(A)...)
    for i = 1:m; A[i,i] += s; end
    return A
end

f1! calls the broadcast machinery under the hood, not BLAS. It’s possible but a bit tricky to use a BLAS axpy function for this operation:

function f4!(A::Matrix{T}, s) where {T}    m = min(size(A)...)
    m = min(size(A)...)
    incA = stride(A,1) + stride(A,2)
    sr = Ref{T}(s)
    GC.@preserve sr LinearAlgebra.BLAS.axpy!(m, one(T), Base.unsafe_convert(Ptr{T}, sr), 0, A, incA)
    return A
end

but on my machine it’s only 5% faster than my looping implementation f3! for your 1000×1000 benchmark. And even this small performance difference goes away if I use @inbounds in my f3! loop.

Thanks! Learning a lot. I thought f1! would be effectively be the same as f4!. So, the BLAS incx=0-trick for vec .+ scalar is never used in julia?

Not as far as I know. (There’s little or no benefit to BLAS over a simple for loop for axpy anyway, especially for vector .+ scalar; compilers are good at axpy-like loops and there’s no possibility of fancy blocking hand optimizations like there is for BLAS-3 / matrix multiply / gemm, while loops are far more versatile in supporting more types etcetera.)

(As far as I know, BLAS axpy is not even used for vector + vector in Julia. There’s no point. Besides, if you care about performance you’re much better off combining multiple vector operations into a single loop, or using “dot fusion”, than breaking your calculation up into a sequence of axpy and other elementary operations.)

In general, performance optimization in Julia doesn’t rely on “mining” the standard library in the hope of finding a “vectorized / built-in” function that does exactly what you want (unlike e.g. Matlab or Python). Properly written user code and loops are fast.

Great. Thanks. That’s helpful in other places in my code.

I’ll leave this post open for a while since an efficient version involving I would be helpful.

You can use diagind in order to do this inplace on the diagonal. It’s how we do it in OrdinaryDiffEq to make it GPU-compatible:

Thanks. Does that compile to the same as the use of diagind in f1! ?

Edit: Ah. Now I understand your point. It does make it “work” for I.

Wouldn’t an extension of mul! be a natural place to put this functionality?

import LinearAlgebra.mul!
function mul!(X::StridedMatrix{T},a::Bool,B::UniformScaling{T},alpha::Bool,beta::Bool) where {T}
   if (a  & alpha & beta)  
         D = view(X, diagind(X)) # Or more efficient version
         D .+= B.λ
   else
        easytoimplement()
   end
   return X
end
function f6!(A::StridedMatrix,s)
   mul!(A,true,s*I,true,true);
end
julia> A=randn(1000,1000);
julia> @btime f6!(A,3.0);
  3.732 μs (2 allocations: 80 bytes)

That’s probably the right way to add it.

Okay. I will add a PR eventually. It doesn’t really involve a multiplication, so mul! is not an obvious for users who don’t know how five-argument mul! can be used.

Yes, mul!() is a strange function for an addition method. I think I would look for this functionality in axpby!():

axpby!() in manual:

I agree. axpby! does have a heritage from blas (even the manual is referring to BLAS.axpby!) and this feature is far from blas. mul! is julia specific. I wonder if axpby! is really meant to be used as add! analogous to mul!.

Discussion can be continued here: Five arg mul! for UniformScaling and improvement in exp! by jarlebring · Pull Request #40731 · JuliaLang/julia · GitHub

The elegant solution would be for this to work:

A .+= (s*I)

But it doesn’t.

There is something blocking this, but I’m not sure what: Broadcasting UniformScaling Operations · Issue #23197 · JuliaLang/julia · GitHub