Hi, I have vectors (say 20 of them) of the same size (say 256 Float64).

Now I want to add two of them and store the result in a third one, with all three indices given as input.

*TL;DR:* what is the fastest way to perform linear algebra operations on such vectors on CPU?

Sample code for adding two vectors:

```
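# 20 preallocated vectors of 256 Float64 each
vectors1 = Vector{Vector{Float64}}([zeros(Float64, 256) for _ = 1:20]);

# In-place add: writes mem[in1] + mem[in2] into the existing mem[out]
@inline function add_vec(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end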
```

When I check the performance, it's unsatisfactory, because I'm doing this operation *a lot*, so it adds up.

```
using BenchmarkTools
@benchmark add_vec(vectors1, 3, 1, 2)
```

```
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.430 ns … 95.037 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.739 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.967 ns ±  1.484 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

 30.4 ns  Histogram: log(frequency) by time  36.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```
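A side note on the measurement itself: `vectors1` is a non-constant global, and the BenchmarkTools documentation recommends interpolating such variables with `$` so that the global lookup is not part of the timed code. A minimal sketch of the interpolated call follows; it is not what produced the numbers above, and I am not claiming it changes them much.

```julia
using BenchmarkTools

# Same setup as in the sample code above
vectors1 = [zeros(Float64, 256) for _ = 1:20]

function add_vec(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end

# `$vectors1` splices the value into the benchmark expression, so the
# untyped global is not looked up inside the timed code.
@benchmark add_vec($vectors1, 3, 1, 2)
```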

Of course, it could be worse. This is what I started with:

```
# Allocates a new 256-element vector on every call, then stores it in mem[out]
function add_vec_unoptimized(mem, out, in1, in2)
    mem[out] = mem[in1] + mem[in2]
end

@benchmark add_vec_unoptimized(vectors1, 3, 1, 2)
```

```
BenchmarkTools.Trial: 10000 samples with 681 evaluations.
 Range (min … max):  164.420 ns …   4.327 μs  ┊ GC (min … max):  0.00% … 86.74%
 Time  (median):     214.507 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   304.479 ns ± 341.656 ns  ┊ GC (mean ± σ):  17.15% ± 13.98%

 164 ns  Histogram: log(frequency) by time  2.15 μs <

 Memory estimate: 2.12 KiB, allocs estimate: 1.
```

When looking at the underlying code using `code_warntype` and `code_native`, I see it's doing lots of things internally: checking dimensions, `materialize`, `broadcast` (in the `.+` version), `axes`, `shape`, `convert`, `alias`, etc.
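For anyone who wants to reproduce the inspection, here is a minimal sketch using the macro forms of those tools. The `debuginfo=:none` option is my own addition to keep the output short; it is not something the observations above depend on.

```julia
using InteractiveUtils  # loaded automatically in the REPL; needed in scripts

vectors1 = [zeros(Float64, 256) for _ = 1:20]

function add_vec(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end

# Type-inference view of the call; non-concrete types would be flagged here.
@code_warntype add_vec(vectors1, 3, 1, 2)

# Generated machine code; debuginfo=:none hides source-location comments.
@code_native debuginfo=:none add_vec(vectors1, 3, 1, 2)
```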

I also tried using StaticArrays.jl to hint the compiler about the vector size, but it was slower.
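To make that concrete, here is a rough sketch of what a StaticArrays.jl version can look like. This is an illustration only: it assumes `SVector` elements and a hypothetical function name `add_vec_static`, and it may differ from the exact variant that was benchmarked (an `MVector` with in-place broadcasting would be another option).

```julia
using StaticArrays

# Assumption: immutable, fixed-size vectors with the length in the type
vectors_s = [zeros(SVector{256,Float64}) for _ = 1:20]

# Hypothetical name; SVectors are immutable, so the sum replaces the
# element in the outer vector instead of being written in place.
function add_vec_static(mem, out, in1, in2)
    @inbounds mem[out] = mem[in1] + mem[in2]
end

add_vec_static(vectors_s, 3, 1, 2)
```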

What is the correct way to do this?

I'm asking about this one operation for the sake of simplicity and analysis, but I'm looking for a solution for all linear algebra operations.

Preferably single-threaded CPU.

Thanks!