I have a `d \times d` matrix `R` and a `d \times n` matrix `X`, and I want to multiply `R` with `X` and store the result in `X`. Is there a way in Julia to do this without making a copy of `X`?

You can operate on a few columns at a time to avoid copying the whole matrix, e.g. to do it one column at a time:

```
b = similar(X, size(X,1)) # pre-allocate array for storing the result of R * column
for i = 1:size(X, 2)
    X[:,i] = A_mul_B!(b, R, @view(X[:,i])) # b = R * X[:,i] without allocating
end
```

(It would probably be more efficient to multiply a few columns at a time, because that makes better use of BLAS.)
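For instance, here is a sketch of the blocked variant (the function name and the default block size of 64 are arbitrary choices here; this is written for Julia 0.7+, where `A_mul_B!` was renamed `mul!`):

```julia
using LinearAlgebra

# Overwrite X with R*X, copying only `blocksize` columns at a time.
function mul_columns_blocked!(R, X; blocksize=64)
    d, n = size(X)
    buf = similar(X, d, min(blocksize, n))   # scratch space for one block
    for j in 1:blocksize:n
        cols = j:min(j + blocksize - 1, n)
        b = view(buf, :, 1:length(cols))
        copyto!(b, view(X, :, cols))         # save the block
        mul!(view(X, :, cols), R, b)         # X[:,cols] = R*b (a level-3 BLAS call)
    end
    return X
end
```

Each block multiply is a matrix-matrix product, so BLAS can use its blocked kernels, while the extra storage stays at `d × blocksize` instead of `d × n`.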

@stevengj, can this convoluted for loop be replaced by the dot syntax you discussed in your post?

Not without doing something even more convoluted, as far as I can tell.

You could just do

```
@views for i = 1:size(X, 2)
    X[:,i] = R * X[:,i]
end
```

in 0.6 with the new `@views` macro, but you need to call `A_mul_B!` if you want to do the `matrix*vector` products with a preallocated output vector.
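For readers on Julia 0.7 and later, where `A_mul_B!` became `mul!`, the same pattern can be sketched as a small helper (the function name is made up here):

```julia
using LinearAlgebra

# Overwrite each column of X with R * column, reusing one scratch vector.
function rotate_columns!(R, X)
    b = zeros(eltype(X), size(X, 1))  # preallocated output vector
    @views for i in 1:size(X, 2)
        mul!(b, R, X[:, i])           # b = R * X[:,i], no per-iteration allocation
        X[:, i] .= b
    end
    return X
end
```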

Related to this, I tried a few things and have some questions about the benchmark results.

Below, I have two functions that apply a 2-by-2 matrix `A` to a 2-by-1000 matrix, each of whose columns is a point on the unit circle.

```
julia> VERSION
v"0.5.1-pre+31"

julia> function applyA(A)
           θ = linspace(0, 2π, 1001)[1:end-1] # 1000 angles
           X = [cos(θ) sin(θ)]'
           Z = A*X
           return Z
       end
applyA (generic function with 1 method)

julia> function applyAinplace(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           x = Vector{Float64}(2)
           y = Vector{Float64}(2)
           for k = 1:n # 1000 angles
               x[1] = cos(k*∆θ)
               x[2] = sin(k*∆θ)
               BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
               copy!(view(Z,:,k), y)
           end
           return Z
       end
applyAinplace (generic function with 1 method)
```

Here is the benchmark result.

```
julia> @benchmark applyA($A)
BenchmarkTools.Trial:
memory estimate: 63.34 KiB
allocs estimate: 13
mean time: 57.686 μs (0.78% GC)
julia> @benchmark applyAinplace($A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
mean time: 97.919 μs (1.53% GC)
```

So, even though the in-place version uses fewer allocations, it is slower. Is this because the in-place version uses level-2 BLAS whereas the other uses level-3 BLAS?

Or use this package to make things a little easier:

Just writing out the 2x2 multiply (unrolling the loop for multiplying each column of the 2x1000 matrix; it's only 16 operations) and sticking `@simd` in front will surely be faster than calling out to BLAS for such a small matrix? And using the StaticArrays package will unroll the loops for you.

Note also that writing through `view(Z,:,k)` is unnecessary: `Z[:,k] = y` will write in-place into `Z`. (Slicing on the left-hand side of an assignment operator calls `setindex!` on the slice.) But with only two components it is probably faster to unroll the loop and assign them individually.

i.e. just do

```
@simd for k = 1:n
    θ = k*∆θ
    y = A * SVector(cos(θ), sin(θ))
    Z[1,k] = y[1]
    Z[2,k] = y[2]
end
```

where `A` is a 2x2 `SMatrix` (from StaticArrays).

Didn’t know about `@simd`. Thanks!

A few comments.

- I put `view(Z,:,k)` because it made a *huge* difference in the number of allocations. Compare the `allocs estimate` below (≈ 500) with the one shown above (= 3):

```
function applyAinplace2(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    x = Vector{Float64}(2)
    y = Vector{Float64}(2)
    for k = 1:n # 1000 angles
        x[1] = cos(k*∆θ)
        x[2] = sin(k*∆θ)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
        Z[:,k] = y
    end
    return Z
end

julia> @benchmark applyAinplace2($A)
BenchmarkTools.Trial:
memory estimate: 23.58 KiB
allocs estimate: 492
mean time: 115.919 μs (2.25% GC)
```

Similar trend in Julia v0.6:

```
julia> VERSION
v"0.6.0-dev.2673"
julia> @benchmark applyAinplace(A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
median time: 90.076 μs (0.00% GC)
julia> @benchmark applyAinplace2(A)
BenchmarkTools.Trial:
memory estimate: 39.20 KiB
allocs estimate: 1492
mean time: 140.926 μs (1.55% GC)
```

Should I file an issue on this?

`StaticArrays` helps a lot, but `@simd` doesn’t seem to be giving an extra speedup. Am I doing something wrong? I built Julia with `make -j 8`, though not sure if that’s relevant.

```
julia> function applyAstatic(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstatic (generic function with 1 method)

julia> function applyAstaticSIMD(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           @simd for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstaticSIMD (generic function with 1 method)

julia> A = rand(2,2); SA = SArray{size(A)}(A);

julia> @benchmark applyAstatic($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 31.877 μs (0.00% GC)

julia> @benchmark applyAstaticSIMD($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 30.492 μs (0.00% GC)
```

Are you passing an `SMatrix` for `A`?

Yes. In the code quoted in my previous posting, you can see that I created a static-array version of `A` as follows:

```
julia> A = rand(2,2); SA = SArray{size(A)}(A);
```

I passed `SA` to the functions.

Technically, `SA` is an `SArray` and not an `SMatrix`, but I guess `@simd` should work for both cases? I also tried `SA = @SMatrix rand(2,2)`, and the result was no different.

`@simd` just turns on LLVM auto-vectorization optimizations, but compilers are pretty limited in what they can vectorize.
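For a sense of where the annotation does pay off, here is the classic case that auto-vectorizes well: a plain `Float64` reduction, where `@simd` licenses reassociating the floating-point additions (function name is just for illustration):

```julia
# Sum the elements of x; @simd allows LLVM to reorder the additions
# into parallel SIMD accumulators.
function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```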

Most of the time in your static versions is spent computing sines and cosines, not in matrix products. If you need to compute these every time, try a vector math library (e.g. Yeppp or MKL). SIMD conversions of integers to floats are also not always available - you might need to build a native system image to get that.

If you want this really fast, precompute the sine/cosine arrays, rearrange to make `k` the first index of `Z`, and go back to ordinary matrices: BLAS libraries have well-tuned code for this case.
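A sketch of that layout (the function name is made up here): build the n-by-2 trigonometric table once, with `k` as the first index, and do a single matrix-matrix product by multiplying on the right by `A'`:

```julia
# Row k of the result is (A * [cos θₖ; sin θₖ])', computed by one BLAS gemm.
function applyA_rows(A, n)
    θ = (1:n) .* (2π / n)
    T = [cos.(θ) sin.(θ)]   # n×2 table of precomputed sines/cosines
    return T * A'           # n×2 result via a single level-3 BLAS call
end
```

With `k` as the first (fastest-varying) index, the trig table is filled contiguously and the whole transform is one `gemm`.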