I’m excited to announce LoopVectorization.

LoopVectorization provides the `@avx` macro. This macro can be prepended to a `for` loop or a broadcast statement, and it tries its best to vectorize the loop. It can handle nested loops: it models the cost of evaluating the loop nest, and then unrolls either one or two of the loops.

## Examples

Let’s jump straight into the deep end with the classic: matmul.

```
function gemmavx!(C, A, B)
    @avx for i ∈ 1:size(A,1), j ∈ 1:size(B,2)
        Cᵢⱼ = 0.0
        for k ∈ 1:size(A,2)
            Cᵢⱼ += A[i,k] * B[k,j]
        end
        C[i,j] = Cᵢⱼ
    end
end
```

For comparison, we have:

- Intel MKL
- Simple Julia
- Clang + Polly
- GFortran
- GFortran, matmul intrinsic

You can see the code I used to run the benchmarks here.

E.g., include the file, and then run:

```
sizes = # abstract vector of `Int`s or `NTuple{3,Int}`s for sizes
br = benchmark_gemm(sizes)
plot(br)
```

Aside from an assortment of Julia libraries (like `PrettyTables` and `VegaLite`), running the benchmarks also requires `gfortran` and Clang+Polly. The Julia dependencies can be installed with:

```
using Pkg; Pkg.add(["VectorizationBase", "LoopVectorization", "PrettyTables", "VegaLite", "BenchmarkTools"])
```

I used the following loop order for each:

```
function jgemm_nkm!(C, A, B)
    C .= 0
    M, N = size(C); K = size(B,1)
    @inbounds for n ∈ 1:N, k ∈ 1:K
        @simd ivdep for m ∈ 1:M
            C[m,n] += A[m,k] * B[k,n]
        end
    end
end
```

This plot (and the remaining plots) cover all square matrices of sizes 2x2 through 256x256.

The loop modeling doesn’t consider data locality / cache yet, so naturally it starts to fall behind as the matrices increase in size.

Note that the new Flang (LLVM-Fortran) compiler being developed and mainlined into LLVM will use an MLIR dialect (called FIR) for performing loop optimizations. This promises to have the same benefits as Polly, so it will be interesting to test once they have a release. GCC’s polyhedral model (which I thought I activated with the flag `-floop-nest-optimize`) does not seem to do anything, as far as I could tell.

#### A’ * B

Matrix multiplication again, but this time `A` is transposed. `k` is the inner loop, so we’re effectively taking `M*N` dot products.
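The loop structure here can be sketched in plain Julia (a hypothetical illustration, with the `@avx` annotation left off; the function name and argument layout are my own, not the exact benchmark code):

```julia
# Sketch of the Aᵀ * B kernel: `k` is the innermost loop, so each entry
# C[m,n] is the dot product of column m of A with column n of B.
function jgemm_tn!(C, A, B)
    @inbounds for n ∈ axes(C, 2), m ∈ axes(C, 1)
        Cₘₙ = 0.0
        for k ∈ axes(A, 1)
            Cₘₙ += A[k, m] * B[k, n]  # A is indexed as if transposed
        end
        C[m, n] = Cₘₙ
    end
end
```

Because `k` strides down columns of both `A` and `B`, each `C[m,n]` is computed as an independent reduction, which is why this pattern behaves like `M*N` dot products.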

#### Dot Products

To get into a little of how it works, let’s start with a simple example, a dot product:

```
function jdotavx(a, b)
    s = 0.0
    @avx for i ∈ eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end
```

Most recent CPUs are capable of 2x aligned loads and 2x fma per clock cycle. Because the dot product requires 2 loads per fma, that bottlenecks us at 1 fma per clock cycle.

However, it takes 4 clock cycles to complete an fma instruction on many recent CPUs, and to compute one iteration of `s_new = s_old + a[i] * b[i]`, we need to have already calculated `s_old`.

For this reason, LoopVectorization unrolls the loop by a factor of 4, so that we have four copies of `s`, and a new `fma` instruction can be started on every clock cycle.
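The effect of that unrolling can be sketched by hand (a plain-Julia illustration of the idea, not the code `@avx` actually generates):

```julia
# Hand-unrolled sketch: four independent accumulators break the dependency
# chain, so a new fma can begin each cycle instead of waiting ~4 cycles
# for the previous `s += a[i] * b[i]` to complete.
function jdot_unrolled4(a, b)
    s₁ = s₂ = s₃ = s₄ = 0.0
    i = firstindex(a)
    @inbounds while i + 3 <= lastindex(a)
        s₁ += a[i]   * b[i]
        s₂ += a[i+1] * b[i+1]
        s₃ += a[i+2] * b[i+2]
        s₄ += a[i+3] * b[i+3]
        i += 4
    end
    @inbounds while i <= lastindex(a)  # remainder loop
        s₁ += a[i] * b[i]
        i += 1
    end
    (s₁ + s₂) + (s₃ + s₄)
end
```

Note that this reassociates the additions, which is why the result can differ from the sequential loop in the last bits.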

Now, consider the `selfdot`:

```
function jselfdotavx(a)
    s = 0.0
    @avx for i ∈ eachindex(a)
        s += a[i] * a[i]
    end
    s
end
```

Here we only need a single load per `fma`, meaning that it could potentially perform 2 fmas per clock cycle, requiring 8 concurrent instances. But LLVM still unrolls it by a factor of 4.

#### Three-argument dot product

This is effectively `x' * A * y`, a quadratic form. It can be written as two nested loops, or as a `gemv` followed by a dot product. What I’m calling BLAS here is just the Julia-builtin 3-arg dot. I did not check its implementation. I suspect it does not actually call BLAS, as MKL normally shows much steadier performance with respect to input size.
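The two-nested-loop version can be sketched like so (plain Julia, without the `@avx` annotation; the name `jdot3` is my own illustration):

```julia
# x' * A * y as two nested loops: the inner loop computes (A * y)[i],
# and the outer loop accumulates x[i] times that partial result.
function jdot3(x, A, y)
    s = 0.0
    @inbounds for i ∈ eachindex(x)
        t = 0.0
        for j ∈ eachindex(y)
            t += A[i, j] * y[j]
        end
        s += x[i] * t
    end
    s
end
```

Fusing the two loops this way avoids materializing the intermediate `A * y` vector that the `gemv`-then-dot formulation needs.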

#### Matrix * vector

`gemv` benchmarks:
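A `gemv` kernel of the kind benchmarked here can be sketched in plain Julia (hypothetical code with the `@avx` annotation left off; `jgemv!` and the argument names are my own illustration, not the exact benchmark code):

```julia
# y = A * x with a column-major-friendly loop order: the inner loop over m
# streams down one column of A at a time, which vectorizes well.
function jgemv!(y, A, x)
    fill!(y, 0.0)
    @inbounds for n ∈ eachindex(x)
        xₙ = x[n]
        for m ∈ eachindex(y)
            y[m] += A[m, n] * xₙ
        end
    end
    y
end
```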

#### Sum of Squared Error

I talked about matrix-vector multiplication here. Rather than showing that `LoopVectorization` will perform all the optimizations I implemented there, I’ll show a similar example, the sum of squared error: `sum(abs2, y .- X * β)`.

```
function jOLSlp_avx(y, X, β)
    lp = 0.0
    @avx for i ∈ eachindex(y)
        δ = y[i]
        for j ∈ eachindex(β)
            δ -= X[i,j] * β[j]
        end
        lp += δ * δ
    end
    lp
end
```

#### Broadcasting

Let’s start with something simple:

```
b = @avx exp.(a)
@avx @. b = log(a)
```

`exp` benchmarks:

Base Julia cannot vectorize elemental functions like `exp`, so the advantage here is substantial.

It can also handle multidimensional broadcasts:

```
a = rand(M); B = rand(M, N); c = rand(N);
D = @avx @. a + B * c'
@avx @. D = a + B * c'
```

Currently, arrays are not allowed to have dimensions of size `1`. Either I’ll allow users to work around this by providing information on which dimensions are of size `1`, or I’ll replace all strides associated with those dimensions with `0`.

## Current Limitations

- It only handles double precision at the moment. Each variable carries with it “elementbytes” to indicate size (e.g., 4 bytes for `Float32`), but I didn’t add anything yet to actually specify or infer this.
- It only handles rectangular iteration spaces.
- It always uses cartesian indexing for broadcasts. If the broadcast includes multidimensional arrays that are all of the same size, it may be worthwhile to `vec` them first.
- It does not model memory movement or locality in making optimization decisions. I plan to do this eventually. The intention would be for it to consider creating a macro and a micro kernel: the micro-kernel handles register-level optimizations, and the macro kernel handles cache-friendliness, following this work.
- Aliasing between distinct arrays is forbidden. You can’t `@avx` a `cumsum`.
- It doesn’t do loop “splitting”. Sometimes it would be better to break a loop up into two or more loops.
- It cannot handle multiple loops at the same level within the loop nest.

I’m also not married to the `@avx` name, given that the name is only relevant for recent x86 CPUs. The name `@vectorize` is currently reserved, because an older (and much simpler) macro already uses it. I may eventually deprecate that macro, but I have about a dozen unregistered packages still depending on it for the time being.

Before using `@avx`, I’d always recommend testing it for both correctness and performance.

I’d like both improved modeling and new strategies for lowering loops. Better/more performance, documentation, tests, etc. are all always wanted. PRs and issues are welcome.