I thought LoopVectorization 0.12 brought enough exciting improvements to warrant an update.
Probably the most exciting is the addition of @avxt, or equivalently @avx thread=true, which can automatically multithread for loops (I’ll add support for broadcasting later).
The implementation is simple at the moment, but should get better as I develop an actual performance model.
Multithreading
Taking ThreadsX’s first README example, the LoopVectorization equivalent would be:
julia> using LoopVectorization, BenchmarkTools, ThreadsX
julia> function relative_prime_count(x, N)
           c = 0
           @avxt for i ∈ 1:N
               c += gcd(x, i) == 1
           end
           c
       end
relative_prime_count (generic function with 1 method)
julia> @btime relative_prime_count(42, 10_000)
3.298 μs (0 allocations: 0 bytes)
2857
julia> @btime ThreadsX.sum(gcd(42, i) == 1 for i in 1:10_000)
136.617 μs (3096 allocations: 240.36 KiB)
2857
Some of LoopVectorization’s performance benefit comes from SIMD (note: gcd requires the vplzcnt(d/q) instructions to be fast, which are AVX512-only).
LoopVectorization’s threading and optimizations are limited to loops working with Union{Bool,Base.HWReal}s and strided arrays, where static scheduling works great because each chunk takes a predictable amount of time.
I regularly use ThreadsX.jl for things like parallelizing derivatives through solves of ordinary differential equations.
A more interesting example, which I’ll work on integrating into the NNlib ecosystem soonᵀᴹ, is convolutions.
Code for creating a batch of one hundred 260x260 Float32 images with 3 input channels, and convolving them with a 5x5 kernel producing 6 output channels (and 256x256 output images):
using NNlib, LoopVectorization, Static
img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);
dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);
function kernaxes(::DenseConvDims{2,K,C_in,C_out}) where {K,C_in,C_out} # LoopVectorization can take advantage of static size information
    K₁ = StaticInt(1):StaticInt(K[1])
    K₂ = StaticInt(1):StaticInt(K[2])
    Cᵢₙ = StaticInt(1):StaticInt(C_in)
    Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end
function convlayer!(
    out::AbstractArray{<:Any,4}, img, kern,
    dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
    @avxt for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), d ∈ axes(out,4), o ∈ Cₒᵤₜ
        s = zero(eltype(out))
        for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
            s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
        end
        out[j₁, j₂, o, d] = s
    end
    out
end
The NNlib.DenseConvDims type contains static size information that kernaxes translates into a form LoopVectorization will understand. Results:
julia> NNlib.conv!(out2, img, kern, dcd);
julia> convlayer!(out1, img, kern, dcd);
julia> out1 ≈ out2
true
julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.377 ms (0.00% GC)
median time: 5.432 ms (0.00% GC)
mean time: 5.433 ms (0.00% GC)
maximum time: 5.682 ms (0.00% GC)
--------------
samples: 920
evals/sample: 1
julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 675.02 MiB
allocs estimate: 195
--------------
minimum time: 182.749 ms (0.00% GC)
median time: 190.472 ms (0.60% GC)
mean time: 197.527 ms (4.98% GC)
maximum time: 300.536 ms (35.82% GC)
--------------
samples: 26
evals/sample: 1
NNlib.conv! on the CPU calls conv_im2col!, which translates the convolution into a BLAS call. It uses Threads.@threads across the batches; we can set BLAS.set_num_threads(1) to prevent oversubscription:
julia> using LinearAlgebra
julia> BLAS.set_num_threads(1)
julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
memory estimate: 675.02 MiB
allocs estimate: 195
--------------
minimum time: 124.177 ms (0.00% GC)
median time: 128.609 ms (0.93% GC)
mean time: 133.574 ms (5.36% GC)
maximum time: 235.760 ms (45.17% GC)
--------------
samples: 38
evals/sample: 1
That leaves @avxt 23x faster. Broadly, the class of problems handled by “how do I make this look like gemm?” is a class where LoopVectorization is likely to do well (or can be tuned/modified to perform well).
Something else important to emphasize here is the 0 allocations and consistent timings. This is a feature of @avxt, even for code taking just a few microseconds, but not of code built on Julia’s threading primitives unless the inconsistency can be amortized over much longer durations:
julia> @benchmark relative_prime_count(42, 10_000)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.325 μs (0.00% GC)
median time: 3.502 μs (0.00% GC)
mean time: 3.503 μs (0.00% GC)
maximum time: 13.655 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 8
julia> @benchmark ThreadsX.sum(gcd(42, i) == 1 for i in 1:10_000)
BenchmarkTools.Trial:
memory estimate: 240.14 KiB
allocs estimate: 3089
--------------
minimum time: 133.668 μs (0.00% GC)
median time: 246.671 μs (0.00% GC)
mean time: 300.477 μs (19.57% GC)
maximum time: 588.215 ms (99.96% GC)
--------------
samples: 10000
evals/sample: 1
Thus the minimum times are representative of @avxt performance, but often not representative of Threads.@threads or Threads.@spawn.
It also naturally does well with gemm itself:
Competing head to head with Octavian and MKL, while OpenBLAS and (for now) Tullio are left behind.
Tullio actually does better at larger sizes that bust the L2 cache. I haven’t looked into why (more threads → suffers less? Or does it do some cache-level optimizations?).
With 80x80 matrices, LoopVectorization and Octavian were already exceeding 450 GFLOPS of double precision.
LoopVectorization’s (and Tullio’s) strength is the ability to bring this performance to other routines, like the convolution above, in a manner more natural than the “can I make it look like gemm?” approach.
Another note here is that LoopVectorization will now take advantage of the information provided by ArrayInterface.indices:
function lvmul_threads!(C, A, B)
    @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
indices broadcasts the second (integer) argument over the arrays when calling axes on them. Thus indices((C,B), 2) is similar to:
@assert axes(C,2) == axes(B,2); axes(C,2)
and indices((A,B), (2,1)) is like:
@assert axes(A,2) == axes(B,1); axes(A,2)
If the macro sees the indices call, it disables the @asserts (I wanted to shave as many nanoseconds off certain benchmarks as possible) and uses this, along with the types of the arrays, to find out whether it implies any of their strides are equivalent, allowing minor optimizations.
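Roughly, the semantics are as follows (a sketch only; the actual ArrayInterface.indices implementation differs, and the range it returns can carry static information):
# Sketch of the semantics, not the actual ArrayInterface implementation:
function indices_sketch(arrays::Tuple, dims::Tuple)
    axs = map(axes, arrays, dims)    # axes(arrays[i], dims[i]) for each pair
    @assert all(==(first(axs)), axs) # all of the requested axes must match
    first(axs)
end
# An integer second argument is broadcast across all of the arrays:
indices_sketch(arrays::Tuple, dim::Integer) = indices_sketch(arrays, map(_ -> dim, arrays))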
An example of the small problem sizes at which threading is capable of increasing performance:
julia> N = 10_000; x = rand(N); y = rand(N);
julia> @btime dot($x, $y) # LinearAlgebra
1.114 μs (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dotavx($x, $y) # `@avx`, single threaded
761.621 ns (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dotavxt($x, $y) # `@avxt`, multithreaded
622.723 ns (0 allocations: 0 bytes)
2480.296446711209
julia> @btime dotbaseline($x, $y) # `@inbounds @simd`, single threaded
1.294 μs (0 allocations: 0 bytes)
2480.2964467112097
julia> @btime cdot($x, $y) # OpenMP C implementation
6.109 μs (0 allocations: 0 bytes)
2480.2964467112092
cdot uses OpenMP; maybe someone more familiar with it can tell me if I’m doing anything obviously wrong.
Function definitions
function dotavxt(a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
    s = zero(T)
    @avxt for i ∈ eachindex(a,b)
        s += a[i] * b[i]
    end
    s
end
function dotbaseline(a::AbstractArray{T}, b::AbstractArray{T}) where {T}
    s = zero(T)
    @fastmath @inbounds @simd for i ∈ eachindex(a,b)
        s += a[i]' * b[i]
    end
    s
end
#include<omp.h>
// gcc -Ofast -march=native -mprefer-vector-width=512 -fopenmp -shared -fPIC openmp.c -o libomptest.so
double dot(double* a, double* b, long N){
    double s = 0.0;
    #pragma omp parallel for reduction(+: s)
    for(long n = 0; n < N; n++){
        s += a[n]*b[n];
    }
    return s;
}
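Not shown above: dotavx is identical to dotavxt but with @avx in place of @avxt, and cdot is a thin ccall wrapper around the C function; something along these lines (the exact wrapper and library path may differ):
function dotavx(a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
    s = zero(T)
    @avx for i ∈ eachindex(a,b) # single threaded
        s += a[i] * b[i]
    end
    s
end
# Assuming the shared library was built as ./libomptest.so per the comment above:
cdot(a::Vector{Float64}, b::Vector{Float64}) =
    ccall((:dot, "./libomptest.so"), Float64, (Ptr{Float64}, Ptr{Float64}, Clong), a, b, length(a))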
Benchmarks from 1_000:1_000:20_000, showing minimum and median times connected in a weird way:
Regular strided memory accesses
We can make the problem a bit more complex to show off the other major new feature in LoopVectorization 0.12.
Previous versions of LoopVectorization modeled 3 kinds of load/store:
- Scalar
- Contiguous vector
- Discontiguous vector
Scalar and contiguous vector accesses are fast, while discontiguous accesses were slow, requiring gather/scatter instructions.
Now, category three has been split, and we have 4 kinds:
- Scalar
- Contiguous vector
- Regular stride → unroll and shuffle
- Irregular stride → gather/scatter
A few example cases where this helps:
- Dealing with arrays of structs. For example, Vector{Complex{Float64}}. While LoopVectorization still does not support Complex{Float64}, we can reinterpret(Float64, A), or as of Julia 1.6, reinterpret(reshape, Float64, A). Three different ways to write a complex dot product (if benchmarking, you may want to make all or none of them @avxt; currently the second is @avx for the sake of having different names):
# _j16 requires Julia VERSION ≥ v"1.6.0-rc1" (older ones, e.g. v"1.6.0-DEV.1581", also work)
function dotavxt_j16(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(reshape, T, ca)
    b = reinterpret(reshape, T, cb)
    re = zero(T); im = zero(T)
    @avxt for i ∈ axes(a,2)
        re += a[1,i] * b[1,i] + a[2,i] * b[2,i]
        im += a[1,i] * b[2,i] - a[2,i] * b[1,i]
    end
    Complex(re, im)
end
function dotavx(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(T, ca);
    b = reinterpret(T, cb);
    re = zero(T); im = zero(T)
    @avx for i ∈ 1:2:length(a)
        re += a[i] * b[i ] + a[i+1] * b[i+1]
        im += a[i] * b[i+1] - a[i+1] * b[i ]
    end
    Complex(re, im)
end
function dotavxt(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(T, ca);
    b = reinterpret(T, cb);
    re = zero(T); im = zero(T)
    # Using a multiplier on the loop index: 2i-1 and 2i pick out the real and imaginary
    # parts, so i ∈ 1:length(a)>>>1 covers every complex element.
    @avxt for i ∈ 1:length(a)>>>1
        re += a[2i-1] * b[2i-1] + a[2i] * b[2i ]
        im += a[2i-1] * b[2i ] - a[2i] * b[2i-1]
    end
    Complex(re, im)
end
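A quick consistency check against LinearAlgebra.dot (all three compute conj(a) ⋅ b):
using LinearAlgebra
ca = rand(ComplexF64, 1_000); cb = rand(ComplexF64, 1_000);
dotavxt_j16(ca, cb) ≈ dot(ca, cb) # true
dotavx(ca, cb) ≈ dot(ca, cb)      # true
dotavxt(ca, cb) ≈ dot(ca, cb)     # true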
Something else to highlight here is that strided loops, i.e. the 1:2:length(a) above, are now supported. Loops are no longer restricted to being over AbstractUnitRanges.
Benchmarks showing minimum and median times over 1_000:1_000:20_000:
While the single-threaded performance is better than @simd, LinearAlgebra.dot seems to be doing better; I’ll have to look into what they’re doing differently. cdot again seems to be having problems.
Finally, we can get good performance with the 3-argument dot for Complex{Float64} arrays. Without a BLAS routine backing LinearAlgebra.dot here, we can do well; benchmarking from 100:100:1_000:
Motivated by this discourse thread.
The function definition:
function dotavxt(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(reshape, T, ca)
    b = reinterpret(reshape, T, cb)
    re = zero(T); im = zero(T)
    @avx for i ∈ axes(a,2) # adjoint(a[i]) * b[i]
        re += a[1,i] * b[1,i] + a[2,i] * b[2,i]
        im += a[1,i] * b[2,i] - a[2,i] * b[1,i]
    end
    Complex(re, im)
end
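A 3-argument x' * A * y version can be written along the same lines with reinterpret(reshape, ...); a sketch (not the exact code used in the benchmarks):
function dot3avxt(cx::AbstractVector{Complex{T}}, cA::AbstractMatrix{Complex{T}}, cy::AbstractVector{Complex{T}}) where {T}
    x = reinterpret(reshape, T, cx) # 2 × length(cx)
    A = reinterpret(reshape, T, cA) # 2 × size(cA,1) × size(cA,2)
    y = reinterpret(reshape, T, cy) # 2 × length(cy)
    re = zero(T); im = zero(T)
    @avxt for n ∈ axes(A,3), m ∈ axes(A,2)
        # conj(x[m]) * A[m,n], split into real and imaginary parts
        xAre = x[1,m] * A[1,m,n] + x[2,m] * A[2,m,n]
        xAim = x[1,m] * A[2,m,n] - x[2,m] * A[1,m,n]
        # ... then multiplied by y[n] and accumulated
        re += xAre * y[1,n] - xAim * y[2,n]
        im += xAre * y[2,n] + xAim * y[1,n]
    end
    Complex(re, im)
end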
@avx (single threaded) starts off with the best performance among the single-threaded implementations, but then declines and (though hard to see) ends up last among them. It’s less cache friendly, and apparently needs an extra layer of blocking. All single-threaded versions should essentially be as fast as sum(A).
- Transposed arrays. For example, old benchmarks calculating C = A' * B':
Using LoopVectorization 0.12:
Limited StaticArrays.SArray support
using LoopVectorization, StaticArrays
@inline function inline_AmulB!(C, A, B)
    @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
@inline function smul(A::SMatrix{M,K,TA}, B::SMatrix{K,N,TB}) where {TA,TB,M,K,N}
    C = MMatrix{M,N,promote_type(TA,TB)}(undef)
    inline_AmulB!(C, A, B)
    SMatrix(C)
end
A = @SMatrix rand(7,7);
B = @SMatrix rand(7,7);
@btime $(Ref(A))[] * $(Ref(B))[]
@btime smul($(Ref(A))[], $(Ref(B))[])
Yielding:
julia> @btime $(Ref(A))[] * $(Ref(B))[]
60.955 ns (0 allocations: 0 bytes)
7×7 SMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
1.31901 1.56177 0.562737 1.40351 1.09281 1.11956 1.10401
2.20513 2.32908 0.814014 2.17807 1.55059 1.79158 1.25284
1.81742 2.32854 1.2245 2.04733 1.7858 2.54968 1.48319
1.26337 1.3801 0.445864 1.17922 0.923933 1.25471 0.646562
1.44048 1.56591 0.679518 1.44741 1.20476 1.46411 0.891856
1.91483 1.87127 1.25527 1.53752 1.61665 1.59461 1.52743
1.66651 2.45191 0.705221 2.15954 1.96442 2.04098 1.49243
julia> @btime smul($(Ref(A))[], $(Ref(B))[])
29.875 ns (0 allocations: 0 bytes)
7×7 SMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
1.31901 1.56177 0.562737 1.40351 1.09281 1.11956 1.10401
2.20513 2.32908 0.814014 2.17807 1.55059 1.79158 1.25284
1.81742 2.32854 1.2245 2.04733 1.7858 2.54968 1.48319
1.26337 1.3801 0.445864 1.17922 0.923933 1.25471 0.646562
1.44048 1.56591 0.679518 1.44741 1.20476 1.46411 0.891856
1.91483 1.87127 1.25527 1.53752 1.61665 1.59461 1.52743
1.66651 2.45191 0.705221 2.15954 1.96442 2.04098 1.49243
This is still much slower than what you get with MMatrix:
julia> AM = MMatrix(A); BM = MMatrix(B); CM = similar(AM);
julia> @btime(inline_AmulB!($CM, $AM, $BM)); CM
11.940 ns (0 allocations: 0 bytes)
7×7 MMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
1.31901 1.56177 0.562737 1.40351 1.09281 1.11956 1.10401
2.20513 2.32908 0.814014 2.17807 1.55059 1.79158 1.25284
1.81742 2.32854 1.2245 2.04733 1.7858 2.54968 1.48319
1.26337 1.3801 0.445864 1.17922 0.923933 1.25471 0.646562
1.44048 1.56591 0.679518 1.44741 1.20476 1.46411 0.891856
1.91483 1.87127 1.25527 1.53752 1.61665 1.59461 1.52743
1.66651 2.45191 0.705221 2.15954 1.96442 2.04098 1.49243
But it’s the best I can do so far without being able to get a Ptr.
AVX512 special functions
Finally, I also optimized exp2, log, log2, and log10 for AVX512. I also changed exp and exp10 for AVX512, but their performance improved only slightly in microbenchmarks on systems with two 512-bit FMA units (and didn’t improve on systems with just a single unit). The “real world” performance gain may be bigger, as they’re now using tables with just 16 entries instead of 256, thus taking up much less cache space.
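These implementations get used whenever the corresponding functions appear inside @avx/@avxt loops, e.g. in a trivial elementwise kernel like:
function vexp2!(y, x)
    @avxt for i ∈ eachindex(y, x)
        y[i] = exp2(x[i]) # hits the SIMD exp2 implementation
    end
    y
end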
exp and exp10 are almost 30% slower than exp2:
julia> using VectorizationBase
julia> vu = VecUnroll(ntuple(_ -> Vec(ntuple(_ -> 5randn(), pick_vector_width(Float64))...), Val(4))) # 32 numbers
4 x Vec{8, Float64}
Vec{8, Float64}<8.142911604059428, 1.06553874163889, 2.8884167945859303, 0.6605812992796163, -7.172960280554572, -10.672154414657193, -0.8483893260667725, -4.5423075816324685>
Vec{8, Float64}<-7.187336629600654, 1.1271494431115698, 7.30946901052441, -0.6909342798882181, 8.157191675210004, -8.959861534856481, 2.4266837514035435, 7.202665536311091>
Vec{8, Float64}<3.2211478577589245, 3.3214825667919805, 8.770620491481353, 8.215935583288026, -1.2030909726747399, -2.1450713726417394, 2.3191029746681635, 7.4191044105642145>
Vec{8, Float64}<1.3867372083360363, 9.578246938475889, -7.567849870319881, -0.977707699977351, 1.0456300537320184, 3.842246946586108, 2.9014877927441374, -2.1716597637506454>
julia> @benchmark exp2($(Ref(vu))[]) # 3.67 `exp2`s/ns!
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 8.721 ns (0.00% GC)
median time: 8.823 ns (0.00% GC)
mean time: 8.828 ns (0.00% GC)
maximum time: 32.249 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
julia> @benchmark exp($(Ref(vu))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 11.167 ns (0.00% GC)
median time: 11.274 ns (0.00% GC)
mean time: 11.299 ns (0.00% GC)
maximum time: 34.928 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 999
The reason is that I needed to add extra arithmetic for them to keep the same accuracy exp2 has.
They’re AVX512-only because they take advantage of AVX512-only instructions, e.g. replacing the sequence of instructions for getting the exponent and mantissa of a number in the log functions with simply getexp and getmant.
I haven’t updated the exp benchmark in LoopVectorization’s docs, but I did update the logdettriangle benchmark, where LoopVectorization’s log is now much faster, while the log shipping with Intel’s oneAPI compilers seems to be slower.
Final disclaimer:
You can nest @avxt code inside CheapThreads.batch (and future CheapThreads APIs), as well as of course Distributed, but it doesn’t support nesting inside Threads.@spawn or Threads.@threads yet.
All benchmarks in this post were done with:
julia> versioninfo()
Julia Version 1.7.0-DEV.707
Commit d1d0646390* (2021-03-14 18:11 UTC)
Platform Info:
OS: Linux (x86_64-generic-linux)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
JULIA_NUM_THREADS = 36