Uniform scaling inplace addition with matrix

I want to do an inplace add of a multiple of the identity to a matrix. I can get the efficiency I want with a view of the diagonal:

using LinearAlgebra
function f1!(A,s)
   D = view(A, diagind(A, 0))
   D .+= s;
end
julia> s=1e-5;
julia> A=randn(1000,1000); x=diag(A);
julia> f1!(A,s);
julia> norm(diag(A) - (x.+s))
0.0
julia> @btime f1!(A,s)
  4.627 μs (3 allocations: 144 bytes)

For various reasons, I would prefer to use the UniformScaling. Is it possible to achieve the same efficiency with UniformScaling?

For reference:

function f2!(A,s)
    A[:,:] += s*I;
end
julia> @btime f2!(A,s);
  5.750 ms (8 allocations: 15.26 MiB)

As far as I understand, there are two problems with f2!:

  • Extra memory allocation (the += is parsed as A[:,:] = A[:,:] + s*I, so a full temporary matrix is allocated)
  • The +(A::Matrix,J::UniformScaling) is the for-loop referenced below, rather than a BLAS level-1 axpy call, which is what we get in f1!.
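A quick way to see the first point is to check that the += form really does lower to a full out-of-place sum before assigning back (the small 4×4 example below is just illustrative, not the benchmarked f2!):

```julia
using LinearAlgebra

A = randn(4, 4)
B = copy(A)
s = 1e-5
A[:, :] += s * I          # lowers to A[:, :] = A[:, :] + s * I, allocating a temporary
@assert diag(A) ≈ diag(B) .+ s
@assert A[1, 2] == B[1, 2]  # off-diagonal entries are untouched
```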

I don’t know of a way to do this with I, but a loop works and is efficient (faster than your f1! on my machine):

function f3!(A,s)
    m = min(size(A)...)
    for i = 1:m; A[i,i] += s; end
    return A
end

f1! calls the broadcast machinery under the hood, not BLAS. It’s possible but a bit tricky to use a BLAS axpy function for this operation:

function f4!(A::Matrix{T}, s) where {T}
    m = min(size(A)...)
    incA = stride(A,1) + stride(A,2)
    sr = Ref{T}(s)
    GC.@preserve sr LinearAlgebra.BLAS.axpy!(m, one(T), Base.unsafe_convert(Ptr{T}, sr), 0, A, incA)
    return A
end

but on my machine it’s only 5% faster than my looping implementation f3! for your 1000×1000 benchmark. And even this small performance difference goes away if I use @inbounds in my f3! loop.
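For completeness, the @inbounds variant mentioned above might look like this (f3_inbounds! is my name for it; it assumes A is a dense 1-based array so eliding bounds checks is safe):

```julia
using LinearAlgebra

function f3_inbounds!(A, s)
    m = min(size(A)...)
    @inbounds for i in 1:m   # skip bounds checks in the hot loop
        A[i, i] += s
    end
    return A
end

A = randn(6, 6)
x = diag(A)
f3_inbounds!(A, 1e-5)
@assert diag(A) ≈ x .+ 1e-5
```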


Thanks! Learning a lot. I thought f1! would effectively be the same as f4!. So, the BLAS incx=0 trick for vec .+ scalar is never used in Julia?

Not as far as I know. (There’s little or no benefit to BLAS over a simple for loop for axpy anyway, especially for vector .+ scalar; compilers are good at axpy-like loops and there’s no possibility of fancy blocking hand optimizations like there is for BLAS-3 / matrix multiply / gemm, while loops are far more versatile in supporting more types etcetera.)

(As far as I know, BLAS axpy is not even used for vector + vector in Julia. There’s no point. Besides, if you care about performance you’re much better off combining multiple vector operations into a single loop, or using “dot fusion”, than breaking your calculation up into a sequence of axpy and other elementary operations.)
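As an illustration of "dot fusion": the @. macro fuses a whole dotted expression into a single loop with one pass over the data, instead of materializing an intermediate for each elementary operation (the variables below are just illustrative):

```julia
a, b = 2.0, 3.0
x = randn(1000)
y = randn(1000)
z = @. a * x + b * y      # fused into a single loop with one output allocation
@assert z ≈ a * x + b * y # same result as the unfused chain of vector operations
```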

In general, performance optimization in Julia doesn’t rely on “mining” the standard library in the hope of finding a “vectorized / built-in” function that does exactly what you want (unlike e.g. Matlab or Python). Properly written user code and loops are fast.


Great. Thanks. That’s helpful in other places in my code.

I’ll leave this post open for a while since an efficient version involving I would be helpful.

You can use diagind in order to do this inplace on the diagonal. It’s how we do it in OrdinaryDiffEq to make it GPU-compatible:
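A minimal sketch of that idea with a UniformScaling argument (the OrdinaryDiffEq code itself isn't reproduced here; f5! and its signature are my own):

```julia
using LinearAlgebra

function f5!(A, J::UniformScaling)
    A[diagind(A)] .+= J.λ   # in-place broadcast over the diagonal indices
    return A
end

A = randn(5, 5)
x = diag(A)
f5!(A, 1e-5 * I)
@assert diag(A) ≈ x .+ 1e-5
```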


Thanks. Does that compile to the same code as the use of diagind in f1!?

Edit: Ah. Now I understand your point. It does make it “work” for I.

Wouldn’t an extension of mul! be a natural place to put this functionality?

import LinearAlgebra.mul!
function mul!(X::StridedMatrix{T}, a::Bool, B::UniformScaling{T}, alpha::Bool, beta::Bool) where {T}
    if a & alpha & beta
        D = view(X, diagind(X)) # Or a more efficient version
        D .+= B.λ
    else
        easytoimplement() # placeholder for the remaining scaling cases
    end
    return X
end
function f6!(A::StridedMatrix,s)
   mul!(A,true,s*I,true,true);
end
julia> A=randn(1000,1000);
julia> @btime f6!(A,3.0);
  3.732 μs (2 allocations: 80 bytes)

That’s probably the right way to add it.

Okay. I will add a PR eventually. It doesn’t really involve a multiplication, so mul! is not an obvious place to look for users who don’t know how five-argument mul! can be used.

Yes, mul!() is a strange function for an addition method. I think I would look for this functionality in axpby!():
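A sketch of what such an axpby! method might look like (this extension is not in the stdlib, and extending a stdlib function on stdlib types like this is type piracy, so it is only meant to illustrate the interface):

```julia
using LinearAlgebra

# axpby!(a, X, b, Y) computes Y = a*X + b*Y in place; for X = J::UniformScaling
# only the diagonal of Y changes beyond the overall scaling by b.
function LinearAlgebra.axpby!(α, J::UniformScaling, β, Y::AbstractMatrix)
    β == 1 || rmul!(Y, β)          # scale Y in place unless β == 1
    Y[diagind(Y)] .+= α * J.λ      # add the scaled identity on the diagonal
    return Y
end

Y = randn(4, 4)
x = diag(Y)
axpby!(1.0, 1e-5 * I, 1.0, Y)      # Y = 1.0*(1e-5*I) + 1.0*Y
@assert diag(Y) ≈ x .+ 1e-5
```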

See axpby!() in the manual.

I agree. axpby! does have a heritage from BLAS (even the manual refers to BLAS.axpby!) and this feature is far from BLAS, whereas mul! is Julia-specific. I wonder if axpby! is really meant to be used as an add! analogous to mul!.


Discussion can be continued here: Five arg mul! for UniformScaling and improvement in exp! by jarlebring · Pull Request #40731 · JuliaLang/julia · GitHub

The elegant solution would be for this to work:

A .+= (s*I)

But it doesn’t.

There is something blocking this, but I’m not sure what: Broadcasting UniformScaling Operations · Issue #23197 · JuliaLang/julia · GitHub
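Concretely, the broadcast form currently throws, because UniformScaling has no broadcast behavior defined (my reading of the linked issue; the exact error may vary by Julia version):

```julia
using LinearAlgebra

A = randn(3, 3)
err = try
    A .+= 2 * I     # not supported: UniformScaling does not participate in broadcasting
    nothing
catch e
    e
end
@assert err !== nothing
```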
