I have a `d \times d` matrix `R` and a `d \times n` matrix `X`, and I want to multiply `R` with `X` and store the result in `X`. Is there a way in Julia to do this without making a copy of `X`?

You can operate on a few columns at a time to avoid copying the whole matrix, e.g. to do it one column at a time:

```
b = similar(X, size(X,1)) # pre-allocate array for storing the result of R * column
for i = 1:size(X, 2)
    X[:,i] = A_mul_B!(b, R, @view(X[:,i])) # b = R * X[:,i] without allocating
end
```

(It would probably be more efficient to multiply a few columns at a time, because that makes better use of BLAS.)
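For instance, here is a sketch of the blocked variant (the function name and the default block size of 64 are arbitrary choices here; this is written for Julia 0.7+, where `A_mul_B!` was renamed `mul!`):

```julia
using LinearAlgebra

# Overwrite X with R*X, copying only `blocksize` columns at a time.
function mul_columns_blocked!(R, X; blocksize=64)
    d, n = size(X)
    buf = similar(X, d, min(blocksize, n))   # scratch space for one block
    for j in 1:blocksize:n
        cols = j:min(j + blocksize - 1, n)
        b = view(buf, :, 1:length(cols))
        copyto!(b, view(X, :, cols))         # save the block
        mul!(view(X, :, cols), R, b)         # X[:,cols] = R*b (a level-3 BLAS call)
    end
    return X
end
```

Each block multiply is a matrix-matrix product, so BLAS can use its blocked kernels, while the extra storage stays at `d × blocksize` instead of `d × n`.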

@stevengj, can this convoluted for loop be replaced by the dot syntax you discussed in your post?

Not without doing something even more convoluted, as far as I can tell.

You could just do

```
@views for i = 1:size(X, 2)
    X[:,i] = R * X[:,i]
end
```

in 0.6 with the new `@views` macro, but you need to call `A_mul_B!` if you want to do the `matrix*vector` products with a preallocated output vector.
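For readers on Julia 0.7 and later, where `A_mul_B!` became `mul!`, the same pattern can be sketched as a small helper (the function name is made up here):

```julia
using LinearAlgebra

# Overwrite each column of X with R * column, reusing one scratch vector.
function rotate_columns!(R, X)
    b = zeros(eltype(X), size(X, 1))  # preallocated output vector
    @views for i in 1:size(X, 2)
        mul!(b, R, X[:, i])           # b = R * X[:,i], no per-iteration allocation
        X[:, i] .= b
    end
    return X
end
```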

Related to this, I tried a few things and have some questions about the benchmark results.

Below, I have two functions that apply a 2-by-2 matrix `A` to a 2-by-1000 matrix, each of whose columns is a point on the unit circle.

```
julia> VERSION
v"0.5.1-pre+31"

julia> function applyA(A)
           θ = linspace(0, 2π, 1001)[1:end-1] # 1000 angles
           X = [cos(θ) sin(θ)]'
           Z = A*X
           return Z
       end
applyA (generic function with 1 method)

julia> function applyAinplace(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           x = Vector{Float64}(2)
           y = Vector{Float64}(2)
           for k = 1:n # 1000 angles
               x[1] = cos(k*∆θ)
               x[2] = sin(k*∆θ)
               BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
               copy!(view(Z,:,k), y)
           end
           return Z
       end
applyAinplace (generic function with 1 method)
```

Here is the benchmark result.

```
julia> @benchmark applyA($A)
BenchmarkTools.Trial:
memory estimate: 63.34 KiB
allocs estimate: 13
mean time: 57.686 μs (0.78% GC)
julia> @benchmark applyAinplace($A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
mean time: 97.919 μs (1.53% GC)
```

So, even though the in-place version uses fewer allocations, it is slower. Is this because the in-place version uses level-2 BLAS whereas the other uses level-3 BLAS?

Or use this package to make things a little easier:

Just writing out the 2x2 multiply (unrolling the loop for multiplying each column of the 2x1000 matrix; it's only 16 operations) and sticking `@simd` in front will surely be faster than calling out to BLAS for such a small matrix? And using the StaticArrays package will unroll the loops for you.

Note also that writing through `view(Z,:,k)` is unnecessary: `Z[:,k] = y` will write in-place into `Z`. (Slicing on the left-hand side of an assignment operator calls `setindex!` on the slice.) But with only two components it is probably faster to unroll the loop and assign them individually.

i.e. just do

```
@simd for k = 1:n
    θ = k*∆θ
    y = A * SVector(cos(θ), sin(θ))
    Z[1,k] = y[1]
    Z[2,k] = y[2]
end
```

where `A` is a 2x2 `SMatrix` (from StaticArrays).

Didn’t know about `@simd`. Thanks!

A few comments.

- I put `view(Z,:,k)` because it made a *huge* difference in the number of allocations. Compare the `allocs estimate` below (≈ 500) with the one shown above (= 3):

```
function applyAinplace2(A)
    n = 1000
    ∆θ = 2π/n
    Z = Array{Float64}(2,n)
    x = Vector{Float64}(2)
    y = Vector{Float64}(2)
    for k = 1:n # 1000 angles
        x[1] = cos(k*∆θ)
        x[2] = sin(k*∆θ)
        BLAS.gemv!('N', 1.0, A, x, 0.0, y) # y = A*x, in place
        Z[:,k] = y
    end
    return Z
end

julia> @benchmark applyAinplace2($A)
BenchmarkTools.Trial:
memory estimate: 23.58 KiB
allocs estimate: 492
mean time: 115.919 μs (2.25% GC)
```

Similar trend in Julia v0.6:

```
julia> VERSION
v"0.6.0-dev.2673"
julia> @benchmark applyAinplace(A)
BenchmarkTools.Trial:
memory estimate: 15.94 KiB
allocs estimate: 3
median time: 90.076 μs (0.00% GC)
julia> @benchmark applyAinplace2(A)
BenchmarkTools.Trial:
memory estimate: 39.20 KiB
allocs estimate: 1492
mean time: 140.926 μs (1.55% GC)
```

Should I file an issue on this?

`StaticArrays` helps a lot, but `@simd` doesn’t seem to be giving an extra speedup. Am I doing something wrong? I built Julia with `make -j 8`, though not sure if that’s relevant.

```
julia> function applyAstatic(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstatic (generic function with 1 method)

julia> function applyAstaticSIMD(A)
           n = 1000
           ∆θ = 2π/n
           Z = Array{Float64}(2,n)
           @simd for k = 1:n
               θ = k*∆θ
               y = A * SVector(cos(θ), sin(θ))
               Z[1,k] = y[1]
               Z[2,k] = y[2]
           end
           return Z
       end
applyAstaticSIMD (generic function with 1 method)

julia> A = rand(2,2); SA = SArray{size(A)}(A);

julia> @benchmark applyAstatic($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 31.877 μs (0.00% GC)

julia> @benchmark applyAstaticSIMD($SA)
BenchmarkTools.Trial:
memory estimate: 15.75 KiB
allocs estimate: 1
mean time: 30.492 μs (0.00% GC)
```

Are you passing an `SMatrix` for `A`?

Yes. In the code quoted in my previous posting, you can see that I created a static-array version of `A` as follows:

```
julia> A = rand(2,2); SA = SArray{size(A)}(A);
```

I passed `SA` to the functions.

Technically, `SA` is an `SArray` and not an `SMatrix`, but I guess `@simd` should work for both cases? I also tried `SA = @SMatrix rand(2,2)`, and the result was no different.

`@simd` just turns on LLVM auto-vectorization optimizations, but compilers are pretty limited in what they can vectorize.
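For a sense of where the annotation does pay off, here is the classic case that auto-vectorizes well: a plain `Float64` reduction, where `@simd` licenses reassociating the floating-point additions (function name is just for illustration):

```julia
# Sum the elements of x; @simd allows LLVM to reorder the additions
# into parallel SIMD accumulators.
function sum_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        @inbounds s += x[i]
    end
    return s
end
```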

Most of the time in your static versions is spent computing sines and cosines, not in matrix products. If you need to compute these every time, try a vector math library (e.g. Yeppp or MKL). SIMD conversions of integers to floats are also not always available - you might need to build a native system image to get that.

If you want this really fast, precompute the sine/cosine arrays, rearrange to make `k` the first index of `Z`, and go back to ordinary matrices: BLAS libraries have well-tuned code for this case.
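A sketch of that layout (the function name is made up here): build the n-by-2 trigonometric table once, with `k` as the first index, and do a single matrix-matrix product by multiplying on the right by `A'`:

```julia
# Row k of the result is (A * [cos θₖ; sin θₖ])', computed by one BLAS gemm.
function applyA_rows(A, n)
    θ = (1:n) .* (2π / n)
    T = [cos.(θ) sin.(θ)]   # n×2 table of precomputed sines/cosines
    return T * A'           # n×2 result via a single level-3 BLAS call
end
```

With `k` as the first (fastest-varying) index, the trig table is filled contiguously and the whole transform is one `gemm`.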