ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses

I thought LoopVectorization 0.12 brought enough exciting improvements to warrant an update.

Probably most exciting is the addition of @avxt or equivalently @avx thread=true, which can automatically multithread for loops (I’ll add support for broadcasting later).
The implementation is simple at the moment, but should get better as I develop an actual performance model.

Multithreading

Taking ThreadsX’s first README example, the LoopVectorization equivalent would be:

julia> using LoopVectorization, BenchmarkTools, ThreadsX

julia> function relative_prime_count(x, N)
           c = 0
           @avxt for i ∈ 1:N
               c += gcd(x, i) == 1
           end
           c
       end
relative_prime_count (generic function with 1 method)

julia> @btime relative_prime_count(42, 10_000)
  3.298 μs (0 allocations: 0 bytes)
2857

julia> @btime ThreadsX.sum(gcd(42, i) == 1 for i in 1:10_000)
  136.617 μs (3096 allocations: 240.36 KiB)
2857

Some of LoopVectorization’s performance benefit comes from SIMD (note: for gcd to be fast, it requires the vplzcnt(d/q) instructions, which are AVX512-only).
LoopVectorization’s threading and optimizations are limited to loops working with Union{Bool,Base.HWReal}s and strided arrays, where static scheduling works well because each chunk takes a predictable amount of time.
I regularly use ThreadsX.jl for things like parallelizing derivatives through solves of ordinary differential equations.

A more interesting example, which I’ll work on integrating into the NNlib ecosystem soonᵀᴹ, is convolutions.
Code for creating a batch of one hundred 260x260 Float32 images with 3 input channels, and convolving them with a 5x5 kernel to produce 6 output channels (so the outputs are 256x256):

using NNlib, LoopVectorization, Static

img = rand(Float32, 260, 260, 3, 100);
kern = rand(Float32, 5, 5, 3, 6);
out1 = Array{Float32}(undef, size(img,1)+1-size(kern,1), size(img,2)+1-size(kern,2), size(kern,4), size(img,4));
out2 = similar(out1);

dcd = NNlib.DenseConvDims(img, kern, flipkernel = true);

function kernaxes(::DenseConvDims{2,K,C_in, C_out}) where {K,C_in, C_out} # LoopVectorization can take advantage of static size information
    K₁ =  StaticInt(1):StaticInt(K[1])
    K₂ =  StaticInt(1):StaticInt(K[2])
    Cᵢₙ =  StaticInt(1):StaticInt(C_in)
    Cₒᵤₜ = StaticInt(1):StaticInt(C_out)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ)
end

function convlayer!(
    out::AbstractArray{<:Any,4}, img, kern,
    dcd::DenseConvDims{2, <:Any, <:Any, <:Any, (1, 1), (0, 0, 0, 0), (1, 1), true}
)
    (K₁, K₂, Cᵢₙ, Cₒᵤₜ) = kernaxes(dcd)
    @avxt for j₁ ∈ axes(out,1), j₂ ∈ axes(out,2), d ∈ axes(out,4), o ∈ Cₒᵤₜ
        s = zero(eltype(out))
        for k₁ ∈ K₁, k₂ ∈ K₂, i ∈ Cᵢₙ
            s += img[j₁ + k₁ - 1, j₂ + k₂ - 1, i, d] * kern[k₁, k₂, i, o]
        end
        out[j₁, j₂, o, d] = s
    end
    out
end

The NNlib.DenseConvDims type contains static size information that kernaxes translates to a form that LoopVectorization will understand. Results:

julia> NNlib.conv!(out2, img, kern, dcd);

julia> convlayer!(out1, img, kern, dcd);

julia> out1 ≈ out2
true

julia> @benchmark convlayer!($out1, $img, $kern, $dcd)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.377 ms (0.00% GC)
  median time:      5.432 ms (0.00% GC)
  mean time:        5.433 ms (0.00% GC)
  maximum time:     5.682 ms (0.00% GC)
  --------------
  samples:          920
  evals/sample:     1

julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
  memory estimate:  675.02 MiB
  allocs estimate:  195
  --------------
  minimum time:     182.749 ms (0.00% GC)
  median time:      190.472 ms (0.60% GC)
  mean time:        197.527 ms (4.98% GC)
  maximum time:     300.536 ms (35.82% GC)
  --------------
  samples:          26
  evals/sample:     1

NNlib.conv! on the CPU calls conv_im2col!, which translates the convolution into a BLAS call. It uses Threads.@threads across the batches; we can set BLAS.set_num_threads(1) to prevent oversubscription:

julia> using LinearAlgebra

julia> BLAS.set_num_threads(1)

julia> @benchmark NNlib.conv!($out2, $img, $kern, $dcd)
BenchmarkTools.Trial:
  memory estimate:  675.02 MiB
  allocs estimate:  195
  --------------
  minimum time:     124.177 ms (0.00% GC)
  median time:      128.609 ms (0.93% GC)
  mean time:        133.574 ms (5.36% GC)
  maximum time:     235.760 ms (45.17% GC)
  --------------
  samples:          38
  evals/sample:     1

Leaving @avxt 23x faster. Broadly, the class of problems handled by “how do I make this look like gemm?” is one where LoopVectorization is likely to do well (or can be tuned/modified to perform well).
Something else important to emphasize here is the 0 allocations and the consistent timings. This is a feature of @avxt, even for code taking just a few microseconds, but not of code built on Julia’s threading primitives, unless the inconsistency can be amortized over much longer durations:

julia> @benchmark relative_prime_count(42, 10_000)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.325 μs (0.00% GC)
  median time:      3.502 μs (0.00% GC)
  mean time:        3.503 μs (0.00% GC)
  maximum time:     13.655 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     8

julia> @benchmark ThreadsX.sum(gcd(42, i) == 1 for i in 1:10_000)
BenchmarkTools.Trial:
  memory estimate:  240.14 KiB
  allocs estimate:  3089
  --------------
  minimum time:     133.668 μs (0.00% GC)
  median time:      246.671 μs (0.00% GC)
  mean time:        300.477 μs (19.57% GC)
  maximum time:     588.215 ms (99.96% GC)
  --------------
  samples:          10000
  evals/sample:     1

Thus the minimum times are representative of @avxt performance, but often not representative of Threads.@threads or Threads.@spawn.

It also naturally does well with gemm itself:


Competing head to head with Octavian and MKL, while OpenBLAS and (for now) Tullio are left behind.
Tullio actually does better at larger sizes that bust the L2 cache. I haven’t looked at why (more threads → suffers less? Or does it do some cache level optimizations?).
With 80x80 matrices, LoopVectorization and Octavian were already exceeding 450 GFLOPS of double precision.
LoopVectorization’s (and Tullio’s) strength is the ability to bring this performance to other routines, like the convolution above, in a manner more natural than the “can I make it look like gemm?” approach.

Another note here is that LoopVectorization will now take advantage of the information provided by ArrayInterface.indices:

function lvmul_threads!(C, A, B)
    @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

indices broadcasts its second (integer) argument across calls to axes on the arrays.
Thus indices((C,B), 2) is similar to:

@assert axes(C,2) == axes(B,2); axes(C,2)

and indices((A,B), (2,1)) is like

@assert axes(A,2) == axes(B,1); axes(A,2)

If the macro sees the indices call, it drops the @asserts (I wanted to shave as many nanoseconds off certain benchmarks as possible :sweat_smile:) and uses this, along with the types of the arrays, to work out whether it implies any of their strides are equal, allowing minor optimizations.

An example of the small problem sizes at which threading is capable of increasing performance:

julia> N = 10_000; x = rand(N); y = rand(N);

julia> @btime dot($x, $y) # LinearAlgebra
  1.114 μs (0 allocations: 0 bytes)
2480.296446711209

julia> @btime dotavx($x, $y) # `@avx`, single threaded
  761.621 ns (0 allocations: 0 bytes)
2480.296446711209

julia> @btime dotavxt($x, $y)  # `@avxt`, multithreaded
  622.723 ns (0 allocations: 0 bytes)
2480.296446711209

julia> @btime dotbaseline($x, $y) # `@inbounds @simd`, single threaded
  1.294 μs (0 allocations: 0 bytes)
2480.2964467112097

julia> @btime cdot($x, $y) # OpenMP C implementation
  6.109 μs (0 allocations: 0 bytes)
2480.2964467112092

cdot uses OpenMP; maybe someone more familiar with it can tell me if I’m doing anything obviously wrong.

Function definitions
function dotavxt(a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
	s = zero(T)
	@avxt for i ∈ eachindex(a,b)
	    s += a[i] * b[i]
	end
	s
end
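# The single-threaded `dotavx` used in the benchmark isn't shown in the post;
# presumably it is the same loop with `@avx` instead of `@avxt` (a sketch, my reconstruction):
function dotavx(a::AbstractArray{T}, b::AbstractArray{T}) where {T <: Real}
    s = zero(T)
    @avx for i ∈ eachindex(a,b)
        s += a[i] * b[i]
    end
    s
end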
function dotbaseline(a::AbstractArray{T}, b::AbstractArray{T}) where {T}
	s = zero(T)
	@fastmath @inbounds @simd for i ∈ eachindex(a,b)
	    s += a[i]' * b[i]
	end
	s
end
#include<omp.h>
//  gcc -Ofast -march=native -mprefer-vector-width=512 -fopenmp -shared -fPIC openmp.c -o libomptest.so
double dot(double* a, double* b, long N){
  double s = 0.0;
  #pragma omp parallel for reduction(+: s)
  for(long n = 0; n < N; n++){
    s += a[n]*b[n];
  }
  return s;
}
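The Julia-side cdot wrapper isn't shown; presumably it is a plain ccall into the shared library built with the compile command above (a sketch, my reconstruction):

cdot(a::Vector{Float64}, b::Vector{Float64}) =
    ccall((:dot, "libomptest.so"), Float64,
          (Ptr{Float64}, Ptr{Float64}, Clong), a, b, length(a))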

Benchmarks from 1_000:1_000:20_000 showing minimum and median times connected in a weird way:

Regular strided memory accesses

We can make the problem a bit more complex to show off the other major new feature in LoopVectorization 0.12.
Previous versions of LoopVectorization modeled 3 kinds of load/store:

  1. Scalar
  2. Contiguous vector
  3. Discontiguous vector

Scalar and contiguous-vector accesses are fast, while discontiguous accesses were slow, requiring gather/scatter instructions.
Now, category three has been split, and we have 4 kinds:

  1. Scalar
  2. Contiguous vector
  3. Regular stride → unroll and shuffle
  4. Irregular stride → gather/scatter

A few example cases where this helps:

  1. Dealing with arrays of structs.
    For example, Vector{Complex{Float64}}. While LoopVectorization still does not support Complex{Float64} directly, we can reinterpret(Float64, A), or, as of Julia 1.6, reinterpret(reshape, Float64, A). Here are three different ways to write a complex dot product (if benchmarking, you may want to make all or none of them @avxt; currently dotavx uses @avx for the sake of having a different name):
# _j16 requires Julia VERSION ≥ v"1.6.0-rc1" (older ones, e.g. v"1.6.0-DEV.1581", also work)
function dotavxt_j16(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(reshape, T, ca)
    b = reinterpret(reshape, T, cb)
    re = zero(T); im = zero(T)
    @avxt for i ∈ axes(a,2)
        re += a[1,i] * b[1,i] + a[2,i] * b[2,i]
        im += a[1,i] * b[2,i] - a[2,i] * b[1,i]
    end
    Complex(re, im)
end
function dotavx(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(T, ca);
    b = reinterpret(T, cb);
    re = zero(T); im = zero(T)
    @avx for i ∈ 1:2:length(a)
        re += a[i] * b[i  ] + a[i+1] * b[i+1]
        im += a[i] * b[i+1] - a[i+1] * b[i  ]
    end
    Complex(re, im)
end
function dotavxt(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(T, ca);
    b = reinterpret(T, cb);
    re = zero(T); im = zero(T)
    # indexing with a multiplier: `i` runs over 1:length(a)>>>1, so a[2i-1] and a[2i]
    # are the real and imaginary parts of the i-th complex number
    @avxt for i ∈ 1:length(a)>>>1
        re += a[2i-1] * b[2i-1] + a[2i] * b[2i  ]
        im += a[2i-1] * b[2i  ] - a[2i] * b[2i-1]
    end
    Complex(re, im)
end

Something else to highlight here is that strided loops, like the 1:2:length(a) above, are now supported. Loops are no longer restricted to being over AbstractUnitRanges.
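For instance, a reduction over every other element can now be written directly with a stepped range (a trivial sketch of my own to illustrate, not from the benchmarks above):

function sum_odd_indices(x)
    s = zero(eltype(x))
    @avx for i ∈ 1:2:length(x)  # stepped range, no longer restricted to unit ranges
        s += x[i]
    end
    s
end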

Benchmarks showing minimum and median times over 1_000:1_000:20_000:

While the single-threaded performance is better than @simd, LinearAlgebra.dot seems to be doing better still. I’ll have to look into what it’s doing differently. cdot again seems to be having problems.

Finally, we can get good performance with the 3-argument dot for Complex{Float64} arrays. Without a BLAS routine backing LinearAlgebra.dot here, we can do well; benchmarking over 100:100:1_000:


Motivated by this discourse thread.

The function definition:

function dotavxt(ca::AbstractVector{Complex{T}}, cb::AbstractVector{Complex{T}}) where {T}
    a = reinterpret(reshape, T, ca)
    b = reinterpret(reshape, T, cb)
    re = zero(T); im = zero(T)
    @avx for i ∈ axes(a,2) # adjoint(a[i]) * b[i]
        re += a[1,i] * b[1,i] + a[2,i] * b[2,i]
        im += a[1,i] * b[2,i] - a[2,i] * b[1,i]
    end
    Complex(re, im)
end
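The 3-argument dot itself (x' * A * y) isn't shown; here is a sketch of how it could be written with the same reinterpret(reshape, ...) trick (the name dot3avxt and this exact loop structure are my assumptions, not necessarily what was benchmarked):

function dot3avxt(x::AbstractVector{Complex{T}}, A::AbstractMatrix{Complex{T}}, y::AbstractVector{Complex{T}}) where {T}
    xr = reinterpret(reshape, T, x)
    Ar = reinterpret(reshape, T, A)
    yr = reinterpret(reshape, T, y)
    re = zero(T); im = zero(T)
    @avxt for n ∈ axes(Ar,3), m ∈ axes(Ar,2)
        # t = conj(x[m]) * A[m,n]
        tr = xr[1,m] * Ar[1,m,n] + xr[2,m] * Ar[2,m,n]
        ti = xr[1,m] * Ar[2,m,n] - xr[2,m] * Ar[1,m,n]
        # accumulate t * y[n]
        re += tr * yr[1,n] - ti * yr[2,n]
        im += tr * yr[2,n] + ti * yr[1,n]
    end
    Complex(re, im)
end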

The single-threaded @avx version’s performance starts off best among the single-threaded implementations, but then declines and (though it is hard to see) ends up last among them. It’s less cache friendly, and apparently needs an extra layer of blocking. All single-threaded versions should essentially be as fast as sum(A).

  2. Transposed arrays. For example, old benchmarks calculating C = A' * B':

    Using LoopVectorization 0.12:

Limited StaticArrays.SArray support

using LoopVectorization, StaticArrays
@inline function inline_AmulB!(C, A, B)
    @avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end
@inline function smul(A::SMatrix{M,K,TA}, B::SMatrix{K,N,TB}) where {TA,TB,M,K,N}
    C = MMatrix{M,N,promote_type(TA,TB)}(undef)
    inline_AmulB!(C, A, B)
    SMatrix(C)
end

A = @SMatrix rand(7,7);
B = @SMatrix rand(7,7);
@btime $(Ref(A))[] * $(Ref(B))[]
@btime smul($(Ref(A))[], $(Ref(B))[])

Yielding:

julia> @btime $(Ref(A))[] * $(Ref(B))[]
  60.955 ns (0 allocations: 0 bytes)
7×7 SMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
 1.31901  1.56177  0.562737  1.40351  1.09281   1.11956  1.10401
 2.20513  2.32908  0.814014  2.17807  1.55059   1.79158  1.25284
 1.81742  2.32854  1.2245    2.04733  1.7858    2.54968  1.48319
 1.26337  1.3801   0.445864  1.17922  0.923933  1.25471  0.646562
 1.44048  1.56591  0.679518  1.44741  1.20476   1.46411  0.891856
 1.91483  1.87127  1.25527   1.53752  1.61665   1.59461  1.52743
 1.66651  2.45191  0.705221  2.15954  1.96442   2.04098  1.49243

julia> @btime smul($(Ref(A))[], $(Ref(B))[])
  29.875 ns (0 allocations: 0 bytes)
7×7 SMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
 1.31901  1.56177  0.562737  1.40351  1.09281   1.11956  1.10401
 2.20513  2.32908  0.814014  2.17807  1.55059   1.79158  1.25284
 1.81742  2.32854  1.2245    2.04733  1.7858    2.54968  1.48319
 1.26337  1.3801   0.445864  1.17922  0.923933  1.25471  0.646562
 1.44048  1.56591  0.679518  1.44741  1.20476   1.46411  0.891856
 1.91483  1.87127  1.25527   1.53752  1.61665   1.59461  1.52743
 1.66651  2.45191  0.705221  2.15954  1.96442   2.04098  1.49243

This is still much slower than what you get with MMatrix:

julia> AM = MMatrix(A); BM = MMatrix(B); CM = similar(AM);

julia> @btime(inline_AmulB!($CM, $AM, $BM)); CM
  11.940 ns (0 allocations: 0 bytes)
7×7 MMatrix{7, 7, Float64, 49} with indices SOneTo(7)×SOneTo(7):
 1.31901  1.56177  0.562737  1.40351  1.09281   1.11956  1.10401
 2.20513  2.32908  0.814014  2.17807  1.55059   1.79158  1.25284
 1.81742  2.32854  1.2245    2.04733  1.7858    2.54968  1.48319
 1.26337  1.3801   0.445864  1.17922  0.923933  1.25471  0.646562
 1.44048  1.56591  0.679518  1.44741  1.20476   1.46411  0.891856
 1.91483  1.87127  1.25527   1.53752  1.61665   1.59461  1.52743
 1.66651  2.45191  0.705221  2.15954  1.96442   2.04098  1.49243

But it’s the best I can do so far without being able to get a Ptr. :man_shrugging:

AVX512 special functions

Finally, I also optimized exp2, log, log2, and log10 for AVX512. I also changed exp and exp10 for AVX512, but their performance improved only slightly in microbenchmarks on systems with two 512-bit FMA units (and didn’t improve on systems with just a single unit). The “real world” performance gain may be bigger, as they’re now using tables with just 16 entries instead of 256, thus taking up much less cache space.
Here is why exp and exp10 are almost 30% slower than exp2:

julia> using VectorizationBase

julia> vu = VecUnroll(ntuple(_ -> Vec(ntuple(_ -> 5randn(), pick_vector_width(Float64))...), Val(4))) # 32 numbers
4 x Vec{8, Float64}
Vec{8, Float64}<8.142911604059428, 1.06553874163889, 2.8884167945859303, 0.6605812992796163, -7.172960280554572, -10.672154414657193, -0.8483893260667725, -4.5423075816324685>
Vec{8, Float64}<-7.187336629600654, 1.1271494431115698, 7.30946901052441, -0.6909342798882181, 8.157191675210004, -8.959861534856481, 2.4266837514035435, 7.202665536311091>
Vec{8, Float64}<3.2211478577589245, 3.3214825667919805, 8.770620491481353, 8.215935583288026, -1.2030909726747399, -2.1450713726417394, 2.3191029746681635, 7.4191044105642145>
Vec{8, Float64}<1.3867372083360363, 9.578246938475889, -7.567849870319881, -0.977707699977351, 1.0456300537320184, 3.842246946586108, 2.9014877927441374, -2.1716597637506454>

julia> @benchmark exp2($(Ref(vu))[]) # 3.67 `exp2`s/ns!
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.721 ns (0.00% GC)
  median time:      8.823 ns (0.00% GC)
  mean time:        8.828 ns (0.00% GC)
  maximum time:     32.249 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark exp($(Ref(vu))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.167 ns (0.00% GC)
  median time:      11.274 ns (0.00% GC)
  mean time:        11.299 ns (0.00% GC)
  maximum time:     34.928 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

The difference is because I needed to add extra arithmetic for them to keep the same accuracy exp2 has.
They’re AVX512-only because they take advantage of AVX512-only instructions, e.g. replacing the sequence of instructions for getting the exponent and mantissa of a number in the log functions with simply getexp and getmant.
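For illustration, here is a scalar equivalent of the decomposition those instructions provide (my own sketch; the real SIMD kernels evaluate the reduced log with a polynomial or table rather than calling log):

function log_via_getexp_getmant(x::Float64)
    m = significand(x)  # roughly what vgetmantpd produces per lane: mantissa in [1, 2)
    e = exponent(x)     # roughly what vgetexppd produces per lane: x == m * 2^e
    # log(x) = e*log(2) + log(m); a real kernel replaces log(m) with a
    # polynomial/table over the narrow interval [1, 2)
    return e * log(2.0) + log(m)
end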
I haven’t updated the exp benchmark in LoopVectorization’s docs, but I did update the logdettriangle benchmark, where LoopVectorization’s log is now much faster, while the log shipping with Intel’s oneAPI compilers seems to be slower.

Final disclaimer:

You can nest @avxt code inside CheapThreads.batch (and future CheapThreads APIs) as well as of course Distributed, but it doesn’t support nesting inside Threads.@spawn or Threads.@threads yet.

All benchmarks in this post were done with:

julia> versioninfo()
Julia Version 1.7.0-DEV.707
Commit d1d0646390* (2021-03-14 18:11 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cascadelake)
Environment:
  JULIA_NUM_THREADS = 36

This didn’t fit in the above post (character limits) and was hidden anyway, so I split it off into a reply.

Handling Static Arrays -- can we do better?

I’m open to ideas. Maybe I should file an issue with LLVM or post on the LLVM discourse to ask for advice/suggestions.
For those curious, here is the assembly from smul:

# julia> @cn smul(A,B)
        .text
        sub     rsp, 664
        vmovups zmm0, zmmword ptr [rsi]
        vmovups zmm1, zmmword ptr [rsi + 64]
        vmovups zmm2, zmmword ptr [rsi + 128]
        vmovups zmm3, zmmword ptr [rsi + 192]
        vmovups zmm4, zmmword ptr [rsi + 256]
        vmovups zmm5, zmmword ptr [rsi + 320]
        mov     rax, qword ptr [rsi + 384]
        vmovups zmmword ptr [rsp - 128], zmm0
        vmovups zmmword ptr [rsp - 64], zmm1
        vmovups zmmword ptr [rsp], zmm2
        vmovups zmmword ptr [rsp + 64], zmm3
        vmovups zmmword ptr [rsp + 128], zmm4
        vmovups zmmword ptr [rsp + 192], zmm5
        mov     qword ptr [rsp + 256], rax
        mov     al, 127
        kmovd   k1, eax
        vexpandpd       zmm6 {k1} {z}, zmmword ptr [rsp - 128]
        vmulpd  zmm0, zmm6, qword ptr [rdx]{1to8}
        vmulpd  zmm1, zmm6, qword ptr [rdx + 56]{1to8}
        vmulpd  zmm2, zmm6, qword ptr [rdx + 112]{1to8}
        vmulpd  zmm3, zmm6, qword ptr [rdx + 168]{1to8}
        vmulpd  zmm4, zmm6, qword ptr [rdx + 224]{1to8}
        vmulpd  zmm5, zmm6, qword ptr [rdx + 280]{1to8}
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rsp - 72]
        vmulpd  zmm6, zmm6, qword ptr [rdx + 336]{1to8}
        vfmadd231pd     zmm0, zmm7, qword ptr [rdx + 8]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rdx + 64]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rdx + 120]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rdx + 176]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rdx + 232]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rdx + 288]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vexpandpd       zmm8 {k1} {z}, zmmword ptr [rsp - 16]
        vfmadd231pd     zmm6, zmm7, qword ptr [rdx + 344]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vfmadd231pd     zmm0, zmm8, qword ptr [rdx + 16]{1to8} # zmm0 = (zmm8 * mem) + zmm0
        vfmadd231pd     zmm1, zmm8, qword ptr [rdx + 72]{1to8} # zmm1 = (zmm8 * mem) + zmm1
        vfmadd231pd     zmm2, zmm8, qword ptr [rdx + 128]{1to8} # zmm2 = (zmm8 * mem) + zmm2
        vfmadd231pd     zmm3, zmm8, qword ptr [rdx + 184]{1to8} # zmm3 = (zmm8 * mem) + zmm3
        vfmadd231pd     zmm4, zmm8, qword ptr [rdx + 240]{1to8} # zmm4 = (zmm8 * mem) + zmm4
        vfmadd231pd     zmm5, zmm8, qword ptr [rdx + 296]{1to8} # zmm5 = (zmm8 * mem) + zmm5
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rsp + 40]
        vfmadd231pd     zmm6, zmm8, qword ptr [rdx + 352]{1to8} # zmm6 = (zmm8 * mem) + zmm6
        vfmadd231pd     zmm0, zmm7, qword ptr [rdx + 24]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rdx + 80]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rdx + 136]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rdx + 192]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rdx + 248]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rdx + 304]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vexpandpd       zmm8 {k1} {z}, zmmword ptr [rsp + 96]
        vfmadd231pd     zmm6, zmm7, qword ptr [rdx + 360]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vfmadd231pd     zmm0, zmm8, qword ptr [rdx + 32]{1to8} # zmm0 = (zmm8 * mem) + zmm0
        vfmadd231pd     zmm1, zmm8, qword ptr [rdx + 88]{1to8} # zmm1 = (zmm8 * mem) + zmm1
        vfmadd231pd     zmm2, zmm8, qword ptr [rdx + 144]{1to8} # zmm2 = (zmm8 * mem) + zmm2
        vfmadd231pd     zmm3, zmm8, qword ptr [rdx + 200]{1to8} # zmm3 = (zmm8 * mem) + zmm3
        vfmadd231pd     zmm4, zmm8, qword ptr [rdx + 256]{1to8} # zmm4 = (zmm8 * mem) + zmm4
        mov     rax, rdi
        vfmadd231pd     zmm5, zmm8, qword ptr [rdx + 312]{1to8} # zmm5 = (zmm8 * mem) + zmm5
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rsp + 152]
        vfmadd231pd     zmm6, zmm8, qword ptr [rdx + 368]{1to8} # zmm6 = (zmm8 * mem) + zmm6
        vfmadd231pd     zmm0, zmm7, qword ptr [rdx + 40]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rdx + 96]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rdx + 152]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rdx + 208]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rdx + 264]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rdx + 320]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rdx + 376]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vmovupd zmm7 {k1} {z}, zmmword ptr [rsp + 208]
        vfmadd231pd     zmm0, zmm7, qword ptr [rdx + 48]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rdx + 104]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rdx + 160]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rdx + 216]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rdx + 272]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rdx + 328]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rdx + 384]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vmovupd zmmword ptr [rsp + 272] {k1}, zmm0
        vmovupd zmmword ptr [rsp + 328] {k1}, zmm1
        vmovupd zmmword ptr [rsp + 384] {k1}, zmm2
        vmovupd zmmword ptr [rsp + 440] {k1}, zmm3
        vmovupd zmmword ptr [rsp + 496] {k1}, zmm4
        vmovupd zmmword ptr [rsp + 552] {k1}, zmm5
        vmovupd zmmword ptr [rsp + 608] {k1}, zmm6
        vmovups zmm0, zmmword ptr [rsp + 272]
        vmovups zmm1, zmmword ptr [rsp + 336]
        vmovups zmm2, zmmword ptr [rsp + 400]
        vmovups zmm3, zmmword ptr [rsp + 464]
        vmovups zmm4, zmmword ptr [rsp + 528]
        vmovups zmm5, zmmword ptr [rsp + 592]
        vmovups zmmword ptr [rdi], zmm0
        vmovups zmmword ptr [rdi + 64], zmm1
        vmovups zmmword ptr [rdi + 128], zmm2
        vmovups zmmword ptr [rdi + 192], zmm3
        mov     rcx, qword ptr [rsp + 656]
        vmovups zmmword ptr [rdi + 256], zmm4
        vmovups zmmword ptr [rdi + 320], zmm5
        mov     qword ptr [rdi + 384], rcx
        add     rsp, 664
        vzeroupper
        ret
        nop     word ptr [rax + rax]

It is needlessly copying data onto the stack (rsp is the stack pointer), and then off again. Before the matrix multiplication, it copies all the data from A onto the stack, and then for each column of A it wants to load, it uses vexpandpd to load the 7 elements into an 8-element vector register. Each vexpandpd is followed by 7 broadcasts, and vexpandpd is also slower than a masked move.

Once it is done, it copies C onto the stack (using masked stores, 7 numbers apart), and then back off again (with loads 8 numbers apart) before copying it to the destination (now without masking) and returning. It should just store to the destination directly.

Generated asm when all matrices are MMatrix is much nicer:

# julia> @cn inline_AmulB!(CM, AM, BM)
        .text
        mov     qword ptr [rsp - 8], rsi
        mov     rcx, qword ptr [rsi + 8]
        mov     rax, qword ptr [rsi + 16]
        mov     dl, 127
        kmovd   k1, edx
        vexpandpd       zmm6 {k1} {z}, zmmword ptr [rcx]
        vmulpd  zmm0, zmm6, qword ptr [rax]{1to8}
        vmulpd  zmm1, zmm6, qword ptr [rax + 56]{1to8}
        vmulpd  zmm2, zmm6, qword ptr [rax + 112]{1to8}
        vmulpd  zmm3, zmm6, qword ptr [rax + 168]{1to8}
        vmulpd  zmm4, zmm6, qword ptr [rax + 224]{1to8}
        vmulpd  zmm5, zmm6, qword ptr [rax + 280]{1to8}
        vmulpd  zmm6, zmm6, qword ptr [rax + 336]{1to8}
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rcx + 56]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 8]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 64]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 120]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 176]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 232]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 288]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 344]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rcx + 112]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 16]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 72]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 128]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 184]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 240]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 296]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 352]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rcx + 168]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 24]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 80]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 136]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 192]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 248]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 304]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 360]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rcx + 224]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 32]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 88]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 144]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 200]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 256]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 312]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 368]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vexpandpd       zmm7 {k1} {z}, zmmword ptr [rcx + 280]
        mov     rdx, qword ptr [rsi]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 40]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 96]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 152]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 208]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 264]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 320]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 376]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vmovupd zmm7 {k1} {z}, zmmword ptr [rcx + 336]
        vfmadd231pd     zmm0, zmm7, qword ptr [rax + 48]{1to8} # zmm0 = (zmm7 * mem) + zmm0
        vfmadd231pd     zmm1, zmm7, qword ptr [rax + 104]{1to8} # zmm1 = (zmm7 * mem) + zmm1
        vfmadd231pd     zmm2, zmm7, qword ptr [rax + 160]{1to8} # zmm2 = (zmm7 * mem) + zmm2
        vfmadd231pd     zmm3, zmm7, qword ptr [rax + 216]{1to8} # zmm3 = (zmm7 * mem) + zmm3
        vfmadd231pd     zmm4, zmm7, qword ptr [rax + 272]{1to8} # zmm4 = (zmm7 * mem) + zmm4
        vfmadd231pd     zmm5, zmm7, qword ptr [rax + 328]{1to8} # zmm5 = (zmm7 * mem) + zmm5
        vfmadd231pd     zmm6, zmm7, qword ptr [rax + 384]{1to8} # zmm6 = (zmm7 * mem) + zmm6
        vmovupd zmmword ptr [rdx] {k1}, zmm0
        vmovupd zmmword ptr [rdx + 56] {k1}, zmm1
        vmovupd zmmword ptr [rdx + 112] {k1}, zmm2
        vmovupd zmmword ptr [rdx + 168] {k1}, zmm3
        vmovupd zmmword ptr [rdx + 224] {k1}, zmm4
        vmovupd zmmword ptr [rdx + 280] {k1}, zmm5
        vmovupd zmmword ptr [rdx + 336] {k1}, zmm6
        movabs  rax, offset jl_system_image_data
        vzeroupper
        ret

Although it’s still mostly using vexpandpd.


Congrats on the release! Those AVX-512 exp functions are beautiful!


They’re based on a mix of your implementations and Intel’s optimization manuals, which basically say “we added these instructions meant to be used in this way to implement SIMD special functions, for example you can use vscalef to implement exp via…”. Those special instructions largely replace the magic / bit-hacky parts with faster alternatives that also handle special cases correctly, letting you skip the NaN/Inf/etc. checks.

But I need to look more closely again to see how you got the indices for the table (without reading anything I did/said earlier… since I want to keep this MIT and probably already crossed some lines) and accurately determined the reduction.
The approach I used for exp2 didn’t work for other powers without either (a) losing a lot of accuracy or (b) doing a bunch of extra stuff that makes it slower. So I’m sure looking more closely/working at it with pen and paper for a bit would reveal a better answer than what I have now.


Basically, the code is

N_float = round(x*LogBo256INV($base, T))
N = trunc(Int32, N_float)
r = muladd(N_float, LogBo256U($base, T), x)
r = muladd(N_float, LogBo256L($base, T), r)
k = N >> 8
jU, jL = table[N&255 +1] # +1 since 1 indexed
small_part = muladd(jU, expm1b_kernel($base, r), jL) + jU

where the first constant is 256/log(base, 2), and the second is -log(base, 2)/256 split into upper and lower bits to keep accuracy for the reduction. The only trickyish part is to do 2 fmas for the reduction to keep accuracy for bigger inputs. Does that help at all?
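For concreteness, those constants could be derived like this (the helper names are mine; Base hardcodes the resulting Float64 values rather than computing them):

LogBo256INV(base, ::Type{Float64}) = Float64(256 / log(BigFloat(base), BigFloat(2)))
function LogBo256UL(base, ::Type{Float64})
    exact = -log(BigFloat(base), BigFloat(2)) / 256  # exact reduction constant in extended precision
    hi = Float64(exact)                              # upper part
    lo = Float64(exact - hi)                         # lower part: the rounding error of hi
    hi, lo
end

Doing the reduction as two muladds, first with the upper part and then the lower, is what keeps r accurate for larger inputs.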


I’m stunned as always… :rocket:


Me too.

Can we bump the compat section in Zygote so that we can try this?


That helped.
The implementation is better now. vscalef is great for returning the value: it effectively calculates ldexp to combine N_float and vfma(js, expm1(r), js), where js are from the table, and I’m still using the 16-entry tables indexed via iperm2pd (a vector-shuffle instruction).
But now I’m getting r and the indices into the table following that approach for all but exp2.
For exp2, vreducepd works well.
That is, I can get r and N_float divided by 16 via:

r = vsreduce(16.0x, Val(0)) * 0.0625
N_float_over_16 = x - r
N_float = N_float_over_16 * 16.0 # If I need it

I want to use N_float_over_16 for vscalef anyway, but depending on how I get the indices for the table, I may or may not need N_float. That is, I could use your approach:

inds = ((trunc(Int, 16.0*N_float)%UInt)) & 0x0f

Or

convert(UInt, vsreduce(N_float_over_16, Val(1))*16.0)

FWIW, llvm-mca claims the former will require fewer clock cycles despite using 1 more instruction per exp2.

While getting r using vreduce is super slick for exp2, trying to use it for the other bases got messy. I’d have to look more closely to understand why it is so reliant on exact rounding while the trunc approach isn’t.

I’ll start working on an NNlibCPU library that you can load to get LoopVectorization-based implementations.
I’ll probably also either use Octavian.jl for this, or make LoopVectorization actually start handling the cache instead of leaving it up to users.

Aside from batched matrix multiplication, some low-hanging fruit is adding support for f.(A * W .+ b), where b is a vector and f an activation function. By avoiding extra passes over the memory and also threading f at the same time, I think the bias add and activation should become nearly free. While they’re much cheaper than matrix multiplication anyway for large matrices (O(N^2) vs the multiplication’s O(N^3)), they can take a significant amount of time for smaller matrices, especially if you aren’t evaluating them with multiple threads or with SIMD (as is the case now by default).
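As a sketch of the kind of fused layer this describes (my own example, written as W * x .+ b with tanh as the activation; not an existing NNlib/NNlibCPU API), the bias add and activation happen in the same pass as the matrix multiplication:

function dense_tanh!(out, W, x, b)
    @avxt for n ∈ indices((out, x), 2), m ∈ indices((out, W), 1)
        acc = zero(eltype(out))
        for k ∈ indices((W, x), (2, 1))
            acc += W[m, k] * x[k, n]
        end
        out[m, n] = tanh(acc + b[m])  # bias + activation fused into the same pass
    end
    out
end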


If we could somehow design NNlibCPU to work with a generic sliding window iterator then it could be generalized to work for a lot of what ImageFiltering is doing. I have some code lying around for a sliding window iterator that uses OptionallyStaticStepRange. Happy to send it your way if you want, although I’m not sure if it’s what you would want to feed into LLVM.


I’d be happy to take a look at an example. Simple 2d image filtering kernels should be well tested, but so far only with a step of 1. Would also be good to add a test with a reinterpreted array of structs (i.e., colors). In the NNlib example, the “channels in” were the third dimension, not the first, which effectively gives it a struct of arrays layout.


I’m really excited about the multithreaded capabilities! I have a couple of (possibly naive) questions about that.

Race conditions

I couldn’t figure out how LoopVectorization handles race conditions automatically. For example, in

@avxt for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
    Cmn = zero(eltype(C))
    for k ∈ indices((A,B), (2,1))
        Cmn += A[m,k] * B[k,n]
    end
    C[m,n] = Cmn
end

can it figure out on its own that the inner loop has to be run sequentially? In other words, how can it automatically handle

    for k ∈ indices((A,B), (2,1))
        Cmn += A[m,k] * B[k,n]
    end

in a smart way (avoiding exceeding the cache as well as race conditions)? Would that work with more complex index spaces? For example, what if the index space depended on m or n, or there were a nontrivial transformation between k and the actual indices of A and B?

This is a bit clearer to me in the Transducer-based approach, where one basically has to implement a parallel reduce with some clever trick, and then everything else should follow.

KernelAbstractions

Now that LoopVectorization has good multithreading support, would it make sense to add it as a possible backend to KernelAbstractions?

  1. It doesn’t avoid exceeding the cache yet. I’ll get around to implementing that soon; for now, performance falls off a cliff when it runs out of cache.
  2. LoopVectorization already needs to recognize reductions and handle them specially anyway because SIMD is another form of parallelism. Right now, if the reduction is not returned but stored in an array, it does not thread the reduction (it could still decide to SIMD the reduction). I figured it’s heuristically not profitable if you have any other alternative loops to thread instead. But it can thread reductions that are returned, like in the various dot examples.

Ah, I see, I’m curious to look at the implementation when that happens.

That makes a lot of sense. Could you point me to the code that implements multithreaded reduction? I’m curious to see what tricks can be used to make this fast.

LoopVectorization/src/codegen/lower_threads.jl contains the threading code.

It splits up the iteration space among threads, and then combines their return values.
Combining their return values means that only certain reductions are supported (i.e. additive/multiplicative and max/min) – it has to know how to combine and initialize them – otherwise you’ll get a “reduction not recognized” error.
The low overhead comes from using ThreadingUtilities.launch to run threads.
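Conceptually, the split-then-combine looks like the sketch below, written with plain tasks just to illustrate (LoopVectorization uses ThreadingUtilities rather than Threads.@spawn, and generates the chunking at macro-expansion time; this is not its actual code):

function threaded_mapsum(f, r::AbstractUnitRange)  # assumes a nonempty range
    nchunks = Threads.nthreads()
    chunks = Iterators.partition(r, cld(length(r), nchunks))   # split the iteration space
    tasks = [Threads.@spawn mapreduce(f, +, c) for c in chunks]
    sum(fetch, tasks)  # combine; only reductions with a known combine/init work this way
end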


Can this be used with automatic differentiation, e.g. ForwardDiff.jl?


The iterator I made produced a window (tuple of indices) for each iteration and was constructed given dilations, strides, window size, and start/stop indices. I’m assuming it would be easy to optimize the reduction within each iteration/window, but that we’d want to do this asynchronously so we can put each reduction on a different thread (if optimal).

Support for that is the 3rd or 4th major change I have planned.

Unless the window is very large, it would be more efficient to do multiple window reductions in parallel (via SIMD and/or threads) than to parallelize the window itself.
That’s one of the optimizations it applies to loops like:

function filter!(out, A, kern)
    @avx for I in CartesianIndices(out)
        tmp = zero(eltype(out))
        for J in CartesianIndices(kern)
            tmp += A[I+J]*kern[J]
        end
        out[I] = tmp
    end
    out
end