[ANN]: StaticKernels.jl - Fast Kernel Operations on Arrays

I did try, but LoopVectorization does not seem to see through the relative index accesses in Window. I’m hoping that the optimizations will eventually happen at a lower level within the compiler.
StaticKernels is pretty fast already though: applying a Laplace operator in two dimensions with map!(k, b, a) comes within a factor of 2 of copyto!(b, a), which is the hard lower limit. Only larger dense kernels would benefit noticeably.
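
For reference, here is a rough sketch of that comparison (array sizes, names, and the copy baseline are illustrative, not the exact benchmark script):

using StaticKernels, BenchmarkTools

a = rand(1024, 1024)
b = similar(a, size(a) .- 2)    # output shrinks by one on each side for a 3×3 window

# 5-point Laplace stencil written as a window function
@inline laplace(w) = @inbounds w[0, -1] + w[-1, 0] - 4*w[0, 0] + w[1, 0] + w[0, 1]
k = StaticKernels.Kernel{(-1:1, -1:1)}(laplace)

@btime map!($k, $b, $a)               # kernel application
@btime copyto!($(similar(a)), $a)     # plain copy of a, as a memory-bound baseline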

Feel free to open pull requests for these things to get the contributor count up. A test suite exists, and I will eventually get to setting up CI/coverage and writing documentation if no one beats me to it.
I don’t quite see the value of CompatHelper (StaticKernels currently has no dependencies) and TagBot (I tag the git repository automatically when writing the release commit), though.

1 Like

That’s not going to happen. Auto-vectorization “already exists” in LLVM, and has for years, but it doesn’t do as well as LoopVectorization.jl, and I don’t expect it to reach that level anytime in the near future. I think this is kind of a must, because if we’re spending time on the library side to redo everything to something faster, we’re going to want to pick the fastest option and right now that looks like straight LoopVectorization.jl. I hope we don’t need to, though; I’m rooting for you.

1 Like

That’s not going to happen. Auto-vectorization “already exists” in LLVM

I was thinking about the Julia compiler doing vectorization passes on lowered code before passing it to LLVM.

we’re going to want to pick the fastest option and right now that looks like straight LoopVectorization.jl

Sure. I re-evaluated LoopVectorization and, as I see it, it’s currently a bad fit for this package because:

  • it would add a 6k LOC dependency (plus recursive dependencies) to a <400 LOC package that has no dependencies
  • it makes strong assumptions about the loop body and is brittle regarding the loop contents (which are user-defined in StaticKernels)
  • it does not seem to analyze the function bodies of user-defined functions within the loop
  • (minor) Expr(:curly) / Val(::Expr) is not handled yet

But maybe @Elrod can comment on some of these.

I have no hard feelings, but I’m still convinced that for the special case of dense linear filters (note: LoopVectorization is much more general) you can make the same transformation that LoopVectorization does without 90% of the magic.
I plan to have a special kernel type for dense linear filters anyway, and at that point the outer-loop vectorization will be a nice addition.
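
For what it’s worth, here is roughly what that hand-written transformation can look like for a fixed, dense 3×3 kernel – a sketch under assumed 1-based indexing, not the package’s planned implementation: fully unroll the kernel loop and let @inbounds/@simd vectorize along the contiguous first axis.

# Hand-unrolled dense 3×3 filter; assumes 1-based `out` and `A` with size(out) == size(A) .- 2.
function filter3x3!(out::AbstractMatrix, A::AbstractMatrix, kern::AbstractMatrix)
    @assert size(kern) == (3, 3) && size(out) == size(A) .- 2
    for j in axes(out, 2)
        @inbounds @simd for i in axes(out, 1)   # contiguous axis: vectorizable
            out[i, j] =
                kern[1,1]*A[i,   j  ] + kern[2,1]*A[i+1, j  ] + kern[3,1]*A[i+2, j  ] +
                kern[1,2]*A[i,   j+1] + kern[2,2]*A[i+1, j+1] + kern[3,2]*A[i+2, j+1] +
                kern[1,3]*A[i,   j+2] + kern[2,3]*A[i+1, j+2] + kern[3,3]*A[i+2, j+2]
        end
    end
    return out
end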

4 Likes

Something that can give you an idea is

julia> using LoopVectorization

julia> q = :(for m in 1:M, k in 1:K
                 C[m] += A[m,k] * B[k]
             end);

julia> ls = LoopVectorization.LoopSet(q);

julia> LoopVectorization.choose_order(ls)
([:m, :k], :k, :m, :m, 4, 6)

The vector of symbols gives the order of loops, from outer-most to inner-most.
This is problematic because k is the inner-most loop, and A can potentially have a large stride with respect to k, meaning the inner loop moves through memory extremely quickly. In this case, if A is square, performance plummets once M = K > 48.
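
A small illustration of the stride issue (sizes are arbitrary): Julia arrays are column-major, so stepping the second index of A[m, k] jumps by the number of rows on every iteration.

A = rand(64, 64)
strides(A)   # (1, 64): unit stride along the first index (m), stride 64 along the second (k)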

Here is an example, courtesy of @tkf, where even a very naive blocking attempt greatly improved performance.

Convolutions are trickier, however. Using the example from this thread (although after writing I realized I and J swapped meanings):

julia> using LoopVectorization, OffsetArrays

julia> T = Float64
Float64

julia> A = rand(T, 100, 100);

julia> kern = OffsetArray(rand(T, 3, 3), -1:1, -1:1);

julia> out = OffsetArray(similar(A, size(A).-2), 1, 1);   # stay away from the edges of A

julia> lsgeneric = LoopVectorization.@avx_debug for I in CartesianIndices(out)
              tmp = zero(eltype(out))
              for J in CartesianIndices(kern)
                  tmp += A[I+J]*kern[J]
              end
              out[I] = tmp
          end;
OPS = Tuple{:numericconstant,Symbol("##zero#275"),LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000002, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:I,:I,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x02),:J,:J,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.loopvalue, 0x00, 0x03),:LoopVectorization,:+,LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000203, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000000, 0x0000000000000004, LoopVectorization.memload, 0x01, 0x05),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x06),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000002, 0x0000000000000000, 0x0000000000050601, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:identity,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000000007, LoopVectorization.compute, 0x00, 0x01),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000002, 0x0000000000000000, 0x0000000000000008, LoopVectorization.memstore, 0x03, 0x07)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000002, 0x0000000000000004, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:kern,Symbol("##vptr##_kern")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:out,Symbol("##vptr##_out")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(1, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:I,:J}
LB = Tuple{CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}},CartesianIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x000055f6a6ecf280, (100,)), LoopVectorization.OffsetStridedPointer{Float64,2,VectorizationBase.PackedStridedPointer{Float64,1}}(VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00007f136954ef30, (3,)), (-3, -3)), LoopVectorization.OffsetStridedPointer{Float64,2,VectorizationBase.PackedStridedPointer{Float64,1}}(VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x000055f6a3f17d80, (98,)), (0, 0)))

julia> LoopVectorization.choose_order(lsgeneric)
([Symbol("I#2#"), Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#")], Symbol("J#2#"), Symbol("I#2#"), Symbol("I#1#"), 3, 6)

The loop order, outer-to-inner, is Symbol("I#2#"), Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#"), which looks right.
If we were accessing memory element by element, like in a broadcast, that’d be optimal.

However, the other return values from choose_order are:

  1. First unrolled loop: J#2
  2. Second unrolled loop: I#2
  3. Vectorized loop: I#1
  4. First unroll factor: 3
  5. Second unroll factor: 6

[Aside: I’ll adjust it eventually to make the two unroll factors more equal, although that’ll reduce performance in the case where kern does happen to have exactly 3 columns.]

On each iteration, it’ll load from the following columns of A:

A_0_0 = A[*,I2 + J2]
A_0_1 = A[*,I2 + J2 + 1]
A_0_2 = A[*,I2 + J2 + 2]
A_1_0 = A_0_1
A_1_1 = A_0_2
A_1_2 = A[*,I2 + J2 + 3]
A_2_0 = A_1_1
A_2_1 = A_1_2
A_2_2 = A[*,I2 + J2 + 4]
A_3_0 = A_2_1
A_3_1 = A_2_2
A_3_2 = A[*,I2 + J2 + 5]
A_4_0 = A_3_1
A_4_1 = A_3_2
A_4_2 = A[*,I2 + J2 + 6]
A_5_0 = A_4_1
A_5_1 = A_4_2
A_5_2 = A[*,I2 + J2 + 7]

After we’ve loaded these, the corresponding parts of A are hot in the L1 cache.
But then, we run down the rows of A, and they’re eventually evicted.

Once we’ve gone through all the rows, I2 increases by 6, and we now load

A_4_2 = A[*,I2 + J2 + 6]
A_5_2 = A[*,I2 + J2 + 7]

again, after they’ve long since been evicted from the L1 cache.
I suspect a blocking pattern that lets us reuse some of that memory while it is still in cache, instead of moving on to new rows and entirely new memory, should be able to reduce cache-bandwidth pressure and increase performance.

But I’m not sure what the optimal tiling pattern would be.
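
Just to make the idea concrete, here is one very naive possibility (hypothetical function, untuned block height, and without the vectorization/unrolling discussed above), using the same indexing convention as the @avx example, where out[I] pairs with A[I + J]: process a band of rows across all output columns before moving down, so the few columns of A the kernel touches are still hot when neighbouring output columns need them.

# Naive row-blocked convolution sketch; `out` and `kern` are indexed as in the example above.
function blocked_filter!(out, A, kern; rowblock = 32)
    for iblock in Iterators.partition(axes(out, 1), rowblock)
        for j in axes(out, 2), i in iblock   # sweep all columns within the row band
            tmp = zero(eltype(out))
            for J in CartesianIndices(kern)
                tmp += A[CartesianIndex(i, j) + J] * kern[J]
            end
            out[i, j] = tmp
        end
    end
    return out
end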

Could you give more background on the rest of your comment?

I take it that they’ll be convolving with finite difference coefficient vectors to get the finite difference derivatives, but what are the rest of the inputs?
Do some of the arrays contain f(x)? Does dx refer to an axis, or to specific inputs they’re differentiating?

I confess I know little about these sorts of applications, but I could read if there is a good reference/summary.

For now it can’t introspect into user-supplied functions.

However, Julia’s compiler team was open to the idea of eventually creating an external API for packages to define their own compiler passes.
This would give LoopVectorization simpler access to type information, and a much lower-level view of the loop body, allowing it to support much higher-level code. E.g., while it only supports native float and integer types, it could just run Julia’s compiler passes to turn Complex{T} numbers and operations into their elemental components with respect to T. That would apply to user-provided functions as well, as long as they are inlined.
SSA code would also be easier to understand; things like misinterpreting nested expressions have been the source of most of the issues filed there, and these would disappear.
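
To make the Complex{T} point concrete, this is the kind of scalar decomposition such a pass would effectively expose (my illustration, not anything LoopVectorization currently emits): the complex arithmetic reduces to native floating-point operations the loop optimizer already understands.

# Illustration only: a complex multiply written in terms of its real elemental operations.
function complex_mul(ar, ai, br, bi)
    # real and imaginary parts of (ar + ai*im) * (br + bi*im)
    return (ar*br - ai*bi, ar*bi + ai*br)
end

complex_mul(reim(1.0 + 2.0im)..., reim(3.0 - 1.0im)...)   # == reim((1.0 + 2.0im) * (3.0 - 1.0im))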

If you encounter bugs, please file issues! Many of them are easy to fix.

What does handling that look like/mean?

Sure, you can always write low-level code yourself. You may also be able to coax the auto-vectorizer into doing what you want, but I’ve not always found that to be reliable.

One of my biggest motivations in writing LoopVectorization was that I wanted to make my code fast, but didn’t want to spend all my time writing low-level code.
I deleted several thousand lines of code from PaddedMatrices.jl after writing LoopVectorization, because that code had become obsolete. In my future work, the plan is that I won’t have to write that code to begin with, instead writing more code that actually looks like Julia.
And as icing on the cake: it enables user-written code to be just as fast as mine.

I’m just elaborating on the package’s motivations. It’s not perfect.
OpenBLAS beats LoopVectorization “without the magic”, and has very little overhead the first time you call any of the methods.

While LoopVectorization loads faster on my machine than the likes of DataFrames, Zygote, or ApproxFun, slim libraries like StaticKernels or BenchmarkTools load much faster still.

> ./julia -e "@time using DataFrames"
  0.652549 seconds (1.28 M allocations: 80.771 MiB, 2.99% gc time)
> ./julia -e "@time using LoopVectorization"
  0.404863 seconds (948.38 k allocations: 66.117 MiB)
> ./julia -e "@time using StaticKernels"
  0.017532 seconds (28.75 k allocations: 1.857 MiB)

Plus there’s overhead the first time @avx is used and the function is called.

I get where you’re coming from; your approach has merits, and it could run just as fast in some scenarios while compiling much faster, yielding a more responsive experience.

When code is bottlenecked by memory bandwidth, you also don’t need every ounce of CPU compute, because any compute in excess of what the bandwidth can feed is wasted (but tiling can help).

You should be able to implement the optimizations if you know a priori which ones to apply. But you can’t really place bounds on the number of cases requiring different behavior in user code. (Using LoopVectorization 0.7.3.) Setup:

using StaticKernels, LoopVectorization, Images, OffsetArrays, BenchmarkTools

function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @avx for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end

kern = Images.Kernel.gaussian((1, 1), (3, 3))

const g11, g21, g31, g12, g22, g32, g13, g23, g33 = kern
@inline function f(w)
    return @inbounds g11*w[-1, -1] + g21*w[0, -1] + g31*w[1, -1] +
                     g12*w[-1,  0] + g22*w[0,  0] + g32*w[1,  0] +
                     g13*w[-1,  1] + g23*w[0,  1] + g33*w[1,  1]
end
k = StaticKernels.Kernel{(-1:1,-1:1)}(f)

a = rand(1024, 1024);
b = similar(a, (size(a) .- 2));

A = copy(a);
B = OffsetArray(similar(A, size(A).-2), 1, 1);

Running the earlier benchmarks, but now throwing in transposes:

julia> @btime map!($k, $b, $a);
  762.792 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $A, $kern);
  796.890 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $(b'), $a);
  3.732 ms (0 allocations: 0 bytes)

julia> @btime filter2davx!($(B'), $A, $kern);
  1.632 ms (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $b, $(a'));
  7.403 ms (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $(A'), $kern);
  1.709 ms (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $(b'), $(a'));
  7.513 ms (0 allocations: 0 bytes)

julia> @btime filter2davx!($(B'), $(A'), $kern);
  800.970 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

Nice – StaticKernels is now about 20% faster than it was 11 days ago, when I made my first post in this thread.
Using smaller arrays that fit in the L2 cache, so that we can focus more on compute performance:

julia> a = rand(100, 100);

julia> b = similar(a, (size(a) .- 2));

julia> A = copy(a);

julia> B = OffsetArray(similar(A, size(A).-2), 1, 1);

julia> @btime map!($k, $b, $a);
  5.117 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $A, $kern);
  3.391 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $(b'), $a);
  26.040 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($(B'), $A, $kern);
  6.541 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $b, $(a'));
  26.427 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $(A'), $kern);
  7.279 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

julia> @btime map!($k, $(b'), $(a'));
  26.258 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($(B'), $(A'), $kern);
  3.522 μs (0 allocations: 0 bytes)

julia> b ≈ parent(B)
true

As is, the performance of StaticKernels fell off a cliff as soon as any of the arrays was transposed.
LoopVectorization got a fair bit slower too when just one of the arrays was transposed – it’s impossible to index both A and B contiguously in successive iterations, after all – but, considering:

julia> At = similar(A'); size(A)
(100, 100)

julia> @btime copyto!($At, $(A'));
  13.130 μs (0 allocations: 0 bytes)

julia> a = rand(1024, 1024); at = similar(a'); typeof(at)
Array{Float64,2}

julia> @btime copyto!($at, $(a'));
  3.604 ms (0 allocations: 0 bytes)

Given that transposing an array while convolving it with a kernel using LoopVectorization is about twice as fast as just transposing the thing, LoopVectorization handled things rather gracefully. I didn’t want to keep optimizing code for different cases. That’s why I wrote LoopVectorization to do it for me.

Maybe you aren’t interested in transposed arrays; I don’t suppose they come up very often in work like image processing. But when it comes to supporting generic code: LoopVectorization may still be brittle (again, please file issues!), but when it works, the computational kernels are fast.

10 Likes

Thank you for the very detailed writeup!

I’ve opened two bug reports on LoopVectorization that might be relevant to you, but there are many more expressions that just bail out (curly-brace type expressions like A{1}, call expressions where the first argument is itself an expression, like (f+g)(c+d), …). I’m not sure whether you eventually plan to support the whole Julia syntax inside your loops, but if you do, then you are essentially recreating a compiler in some sense.
The ability to write your own compiler passes sounds very intriguing, though!

Your benchmarks are nice, and thank you for giving credit to the improvements in StaticKernels. But if you start looking at matrix transposes, you could go on and look at every kind of array with a non-standard memory layout. Even LoopVectorization will not be able to understand these as long as the memory layout is not specified somehow (will there be an interface for specifying the memory layout of custom array types?).
Anyhow, I think that is certainly out of scope for StaticKernels.

1 Like