I have a `d \times d` matrix `R` and a `d \times n` matrix `X`, and I want to multiply `R` with `X` and store the result in `X`. Is there a way in Julia to do this without making a copy of `X`?

You can operate on a few columns at a time to avoid copying the whole matrix, e.g. to do it one column at a time:

```
b = similar(X, size(X,1)) # pre-allocate array for storing the result of R * column
for i = 1:size(X, 2)
    X[:,i] = A_mul_B!(b, R, @view(X[:,i])) # b = R * X[:,i] without allocating
end
```

(It would probably be more efficient to multiply a few columns at a time, because that makes better use of BLAS.)
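For instance, here is a sketch of the blocked variant (the function name and the default block size of 64 are arbitrary choices here; this is written for Julia 0.7+, where `A_mul_B!` was renamed `mul!`):

```julia
using LinearAlgebra

# Overwrite X with R*X, copying only `blocksize` columns at a time.
function mul_columns_blocked!(R, X; blocksize=64)
    d, n = size(X)
    buf = similar(X, d, min(blocksize, n))   # scratch space for one block
    for j in 1:blocksize:n
        cols = j:min(j + blocksize - 1, n)
        b = view(buf, :, 1:length(cols))
        copyto!(b, view(X, :, cols))         # save the block
        mul!(view(X, :, cols), R, b)         # X[:,cols] = R*b (a level-3 BLAS call)
    end
    return X
end
```

Each block multiply is a matrix-matrix product, so BLAS can use its blocked kernels, while the extra storage stays at `d × blocksize` instead of `d × n`.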

@stevengj, can this convoluted for loop be replaced by the dot syntax you discussed in your post?

Not without doing something even more convoluted, as far as I can tell.

You could just do

```
@views for i = 1:size(X, 2)
    X[:,i] = R * X[:,i]
end
```

in 0.6 with the new `@views` macro, but you need to call `A_mul_B!` if you want to do the `matrix*vector` products with a preallocated output vector.
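For readers on Julia 0.7 and later, where `A_mul_B!` became `mul!`, the same pattern can be sketched as a small helper (the function name is made up here):

```julia
using LinearAlgebra

# Overwrite each column of X with R * column, reusing one scratch vector.
function rotate_columns!(R, X)
    b = zeros(eltype(X), size(X, 1))  # preallocated output vector
    @views for i in 1:size(X, 2)
        mul!(b, R, X[:, i])           # b = R * X[:,i], no per-iteration allocation
        X[:, i] .= b
    end
    return X
end
```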

Related to this, I tried a few things and have some questions about the benchmark results.

Below, I have two functions that apply a 2-by-2 matrix `A` to a 2-by-1000 matrix, each of whose columns is a point on the unit circle.

```
julia> VERSION
v"0.5.1-pre+31"

julia> function applyA(A)
           θ = linspace(0, 2π, 1001)[1:end-1] # 1000 angles
           X = [cos(θ) sin(θ)]'
           Z = A*X
           return Z
       end
applyA (generic function with 1 method)

julia> function applyAinplace(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           x = Vector{Float64}(2)
           y = Vector{Float64}(2)
           for k = 1:n # 1000 angles
               x[1] = cos(k*∆θ)
               x[2] = sin(k*∆θ)
               BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
               copy!(view(Z,:,k), y)
           end
           return Z
       end
applyAinplace (generic function with 1 method)
```

Here is the benchmark result.

```
julia> @benchmark applyA($A)
BenchmarkTools.Trial:
memory estimate: 63.34 KiB
allocs estimate: 13
mean time: 57.686 μs (0.78% GC)
julia> @benchmark applyAinplace($A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
mean time: 97.919 μs (1.53% GC)
```

So, even though the in-place version uses fewer allocations, it is slower. Is this because the in-place version uses level-2 BLAS whereas the other uses level-3 BLAS?

Or use this package to make things a little easier:

Just writing out the 2x2 multiply (unrolling the loop for multiplying each column of the 2x1000 matrix; it's only 16 operations) and sticking `@simd` in front will surely be faster than calling out to BLAS for such a small matrix? And using the StaticArrays package will unroll the loops for you.

Note also that writing through `view(Z,:,k)` is unnecessary: `Z[:,k] = y` will write in-place into `Z`. (Slicing on the left-hand side of an assignment operator calls `setindex!` on the slice.) But with only two components it is probably faster to unroll the loop and assign them individually.

i.e. just do

```
@simd for k = 1:n
    θ = k*∆θ
    y = A * SVector(cos(θ), sin(θ))
    Z[1,k] = y[1]
    Z[2,k] = y[2]
end
```

where `A` is a 2x2 `SMatrix` (from StaticArrays).

Didn’t know about `@simd`. Thanks!

A few comments.

- I put `view(Z,:,k)` because it made a *huge* difference in the number of allocations. Compare the `allocs estimate` below (≈ 500) with the one shown above (= 3):

```
function applyAinplace2(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    x = Vector{Float64}(2)
    y = Vector{Float64}(2)
    for k = 1:n # 1000 angles
        x[1] = cos(k*∆θ)
        x[2] = sin(k*∆θ)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
        Z[:,k] = y
    end
    return Z
end

julia> @benchmark applyAinplace2($A)
BenchmarkTools.Trial:
memory estimate: 23.58 KiB
allocs estimate: 492
mean time: 115.919 μs (2.25% GC)
```

Similar trend in Julia v0.6:

```
julia> VERSION
v"0.6.0-dev.2673"
julia> @benchmark applyAinplace(A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
median time: 90.076 μs (0.00% GC)
julia> @benchmark applyAinplace2(A)
BenchmarkTools.Trial:
memory estimate: 39.20 KiB
allocs estimate: 1492
mean time: 140.926 μs (1.55% GC)
```

Should I file an issue on this?

`StaticArrays` helps a lot, but `@simd` doesn’t seem to be giving an extra speedup. Am I doing something wrong? I built Julia with `make -j 8`, though not sure if that’s relevant.

```
julia> function applyAstatic(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstatic (generic function with 1 method)

julia> function applyAstaticSIMD(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           @simd for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstaticSIMD (generic function with 1 method)

julia> A = rand(2,2); SA = SArray{size(A)}(A);

julia> @benchmark applyAstatic($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 31.877 μs (0.00% GC)

julia> @benchmark applyAstaticSIMD($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 30.492 μs (0.00% GC)
```

Are you passing an `SMatrix` for `A`?

Yes. In the code quoted in my previous posting, you can see that I created a static-array version of `A` as follows:

```
julia> A = rand(2,2); SA = SArray{size(A)}(A);
```

I passed `SA` to the functions.

Technically, `SA` is an `SArray` and not an `SMatrix`, but I guess `@simd` should work for both cases? I also tried `SA = @SMatrix rand(2,2)`, and the result was no different.

`@simd` just turns on LLVM auto-vectorization optimizations, but compilers are pretty limited in what they can vectorize.
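For a sense of where the annotation does pay off, here is the classic case that auto-vectorizes well: a plain `Float64` reduction, where `@simd` licenses reassociating the floating-point additions (function name is just for illustration):

```julia
# Sum the elements of x; @simd allows LLVM to reorder the additions
# into parallel SIMD accumulators.
function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```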

Most of the time in your static versions is spent computing sines and cosines, not in matrix products. If you need to compute these every time, try a vector math library (e.g. Yeppp or MKL). SIMD conversions of integers to floats are also not always available - you might need to build a native system image to get that.

If you want this really fast, precompute the sine/cosine arrays, rearrange to make `k` the first index of `Z`, and go back to ordinary matrices: BLAS libraries have well-tuned code for this case.
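A sketch of that layout (the function name is made up here): build the n-by-2 trigonometric table once, with `k` as the first index, and do a single matrix-matrix product by multiplying on the right by `A'`:

```julia
# Row k of the result is (A * [cos θₖ; sin θₖ])', computed by one BLAS gemm.
function applyA_rows(A, n)
    θ = (1:n) .* (2π / n)
    T = [cos.(θ) sin.(θ)]   # n×2 table of precomputed sines/cosines
    return T * A'           # n×2 result via a single level-3 BLAS call
end
```

With `k` as the first (fastest-varying) index, the trig table is filled contiguously and the whole transform is one `gemm`.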