I have a d × d matrix R and an n × d matrix X. I want to multiply R with X and store the result in X. Is there a way in Julia to do this without making a copy of X?
In-place multiplication by a square matrix
You can operate on a few columns at a time to avoid copying the whole matrix, e.g. to do it one column at a time:
b = similar(X, size(X,1)) # preallocate array for storing results of R * column
for i = 1:size(X, 2)
    X[:,i] = A_mul_B!(b, R, @view(X[:,i]))
end
(It would probably be more efficient to multiply a few columns at a time, because that makes better use of BLAS.)
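For instance, a blocked variant might look like the following sketch (written in post-0.7 syntax, where A_mul_B! was renamed mul!; the block size blk and the small test dimensions are arbitrary choices for illustration):

```julia
using LinearAlgebra

d, n, blk = 4, 10, 3
R = rand(d, d)
X = rand(d, n)            # columns of X get overwritten with R * column
Xref = R * X              # reference result, computed only to check the loop

buf = similar(X, d, blk)  # preallocated buffer for one block of columns
for j in 1:blk:n
    cols = j:min(j + blk - 1, n)
    b = view(buf, :, 1:length(cols))
    mul!(b, R, view(X, :, cols))  # level-3 BLAS on the whole block, no allocation
    X[:, cols] .= b               # copy the block back into X in place
end
```

The only extra storage is the d × blk buffer, and each block multiply is a single gemm call rather than blk separate gemv calls.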
@stevengj, can this convoluted for loop be replaced by the dot syntax you discussed in your post?
Not without doing something even more convoluted, as far as I can tell.
You could just do
@views for i = 1:size(X, 2)
X[:,i] = R * X[:,i]
end
in 0.6 with the new @views macro, but you need to call A_mul_B! if you want to do the matrix*vector with a preallocated output vector.
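For reference, the preallocated matrix*vector call on its own looks like this (a sketch in post-0.7 syntax, where A_mul_B!(y, R, x) became mul!(y, R, x)):

```julia
using LinearAlgebra

R = rand(3, 3)
x = rand(3)
y = similar(x)   # preallocated output vector
mul!(y, R, x)    # overwrites y with R * x; the product allocates nothing
```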
Related to this, I tried a few things and got some questions about the benchmark result.
Below, I have two functions that apply a 2×2 matrix A to a 2×1000 matrix whose columns are points on the unit circle.
julia> VERSION
v"0.5.1-pre+31"
julia> function applyA(A)
    θ = linspace(0, 2π, 1001)[1:end-1] # 1000 angles
    X = [cos(θ) sin(θ)]'
    Z = A*X
    return Z
end
applyA (generic function with 1 method)
julia> function applyAinplace(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    x = Vector{Float64}(2)
    y = Vector{Float64}(2)
    for k = 1:n # 1000 angles
        x[1] = cos(k*∆θ)
        x[2] = sin(k*∆θ)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x with preallocated y
        view(Z,:,k) .= y
    end
    return Z
end
applyAinplace (generic function with 1 method)
Here is the benchmark result.
julia> @benchmark applyA($A)
BenchmarkTools.Trial:
memory estimate: 63.34 KiB
allocs estimate: 13
mean time: 57.686 μs (0.78% GC)
julia> @benchmark applyAinplace($A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
mean time: 97.919 μs (1.53% GC)
So, even though the in-place version uses fewer allocations, it is slower. Is this because the in-place version uses level-2 BLAS whereas the other uses level-3 BLAS?
Just writing out the 2×2 multiply (unrolling the loop for multiplying each column of the 2×1000 matrix; it's only 16 operations) and sticking @simd in front will surely be faster than calling out to BLAS for such a small matrix? And using the StaticArrays package will unroll the loops for you.
Note also that view(Z,:,k) .= y is unnecessary: Z[:,k] = y will write in place into Z. (Slicing on the left-hand side of an assignment operator calls setindex! rather than making a copy.) But with only two components it is probably faster to unroll the loop and assign them individually.
i.e. just do
@simd for k = 1:n
θ = k*∆θ
y = A * SVector(cos(θ), sin(θ))
Z[1,k] = y[1]
Z[2,k] = y[2]
end
where A
is a 2x2 SMatrix
(from StaticArrays).
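A self-contained version of that loop might look like this (a sketch assuming the StaticArrays package is installed; applyA_static is just an illustrative name):

```julia
using StaticArrays

function applyA_static(A, n)
    Δθ = 2π / n
    Z = Matrix{Float64}(undef, 2, n)
    @simd for k = 1:n
        θ = k * Δθ
        y = A * SVector(cos(θ), sin(θ))  # stack-allocated: the 2×2 product unrolls completely
        Z[1, k] = y[1]
        Z[2, k] = y[2]
    end
    return Z
end

A = @SMatrix rand(2, 2)  # 2×2 static matrix
Z = applyA_static(A, 1000)
```

The only heap allocation is the output matrix Z itself; each SVector lives entirely in registers.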
Didn’t know about @simd
. Thanks!
A few comments.
1. I put view(Z,:,k) because it made a huge difference in the number of allocations. Compare the allocs estimate below (≈ 500) with the one shown above (= 3):
function applyAinplace2(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    x = Vector{Float64}(2)
    y = Vector{Float64}(2)
    for k = 1:n # 1000 angles
        x[1] = cos(k*∆θ)
        x[2] = sin(k*∆θ)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y)
        Z[:,k] = y
    end
    return Z
end
julia> @benchmark applyAinplace2($A)
BenchmarkTools.Trial:
memory estimate: 23.58 KiB
allocs estimate: 492
mean time: 115.919 μs (2.25% GC)
Similar trend in Julia v0.6:
julia> VERSION
v"0.6.0-dev.2673"
julia> @benchmark applyAinplace(A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
median time: 90.076 μs (0.00% GC)
julia> @benchmark applyAinplace2(A)
BenchmarkTools.Trial:
memory estimate: 39.20 KiB
allocs estimate: 1492
mean time: 140.926 μs (1.55% GC)
Should I file an issue on this?

2. StaticArrays helps a lot, but @simd doesn't seem to be giving an extra speedup. Am I doing something wrong? I built Julia with make -j 8, though not sure if that's relevant.
julia> function applyAstatic(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    for k = 1:n
        θ = k*∆θ
        y = A * SVector(cos(θ), sin(θ))
        Z[1,k] = y[1]
        Z[2,k] = y[2]
    end
    return Z
end
applyAstatic (generic function with 1 method)
julia> function applyAstaticSIMD(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    @simd for k = 1:n
        θ = k*∆θ
        y = A * SVector(cos(θ), sin(θ))
        Z[1,k] = y[1]
        Z[2,k] = y[2]
    end
    return Z
end
applyAstaticSIMD (generic function with 1 method)
julia> A = rand(2,2); SA = SArray{size(A)}(A);
julia> @benchmark applyAstatic($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 31.877 μs (0.00% GC)
julia> @benchmark applyAstaticSIMD($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 30.492 μs (0.00% GC)
Yes. In the code quoted in my previous posting, you can see that I created a static array version of A
as follows:
julia> A = rand(2,2); SA = SArray{size(A)}(A);
I passed SA
to the functions.
Technically, SA
is SArray
and not SMatrix
, but I guess @simd
should work for both cases? I also tried SA = @SMatrix rand(2,2)
, and the result was no different.
@simd just turns on LLVM autovectorization optimizations, but compilers are pretty limited in what they can vectorize.
Most of the time in your static versions is spent computing sines and cosines, not in matrix products. If you need to compute these every time, try a vector math library (e.g. Yeppp or MKL). SIMD conversions of integers to floats are also not always available; you might need to build a native system image to get that.
If you want this really fast, precompute the sine/cosine arrays, rearrange to make k the first index of Z, and go back to ordinary matrices: BLAS libraries have well-tuned code for this case.
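That rearranged version collapses to a single level-3 call (a sketch; with k as the first index, row k of Z holds A applied to the k-th point):

```julia
A = rand(2, 2)
n = 1000
θ = (2π / n) .* (1:n)   # precomputed angles
X = [cos.(θ) sin.(θ)]   # n×2: row k is (cos θₖ, sin θₖ)
Z = X * A'              # one gemm call; Z[k, :] == A * X[k, :] for real A
```

With the points in rows, both the trig evaluation and the matrix product are done in bulk, with no per-point loop in Julia at all.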