[ANN]: StaticKernels.jl - Fast Kernel Operations on Arrays

stev47 · April 15, 2020, 6:15pm

StaticKernels.jl is a new package that aims to enable easy and fast execution of kernel operations on arrays, e.g. finite differences, convolutions, image filters and morphological operations, etc.

custom kernel functions in arbitrary dimensions
custom boundary handling
allocation-free execution
small size (currently ~300 loc) and dependency free

Introduction

You can think of a Kernel as a function that takes a neighbourhood Window view on the underlying data as an argument. The kernel function aggregates the values within the neighbourhood and outputs a new value. The well-known function map(k, a) applied to the kernel k and a data array a then applies the kernel as a filter.

using StaticKernels
a = rand(1000, 1000)

# Laplace
k = Kernel{(-1:1,-1:1)}(w -> w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1])
map(k, a, inner=true)

# Erosion
k = Kernel{(-1:1,-1:1)}(w -> minimum(Tuple(w)))
map(k, a)

The window can be indexed using relative coordinates and implements a standard array interface. When accessed out of bounds (e.g. when positioned on the boundaries of the array) it evaluates to nothing.

# Laplace, zero boundary condition
k = Kernel{(-1:1,-1:1)}(w -> something(w[0,-1], 0.) + something(w[-1,0], 0.) - 4*w[0,0] + something(w[1,0], 0.) + something(w[0,1], 0.))
map(k, a)

# Forward-Gradient (non-skalar Kernel), neumann boundary condition
k = Kernel{(0:1, 0:1)}(w -> (something(w[1,0], w[0,0]) - w[0,0], something(w[0,1], w[0,0]) - w[0,0]))
map(k, a)

Why a new package?

Existing solutions didn’t meet at least one of:

non-allocating, fast operations
easily being able to write custom kernels
custom and efficient handling of boundary conditions
arbitrary types and dimensions.
simple implementation and easy to extend

Benchmarks on Linear Filters

Here a quick comparison for linear filtering (see test/benchmark.jl).
LocalFilters might be worse off than it should be, since it allocates within the loop.

Array: (10, 10), Kernel: (3, 3)
  StaticKernels      206.048 ns (0 allocations: 0 bytes)
  LoopVectorization  131.591 ns (0 allocations: 0 bytes)
  NNlib              13.238 μs (35 allocations: 6.89 KiB)
  ImageFiltering     8.997 μs (16 allocations: 3.73 KiB)
  LocalFilters       45.291 μs (2353 allocations: 36.77 KiB)
Array: (10, 10), Kernel: (5, 5)
  StaticKernels      567.505 ns (0 allocations: 0 bytes)
  LoopVectorization  273.347 ns (0 allocations: 0 bytes)
  NNlib              15.427 μs (35 allocations: 9.45 KiB)
  ImageFiltering     15.772 μs (16 allocations: 5.41 KiB)
  LocalFilters       93.978 μs (5809 allocations: 90.77 KiB)
Array: (10, 10), Kernel: (7, 7)
  StaticKernels      625.294 ns (0 allocations: 0 bytes)
  LoopVectorization  115.621 ns (0 allocations: 0 bytes)
  NNlib              14.544 μs (35 allocations: 8.52 KiB)
  ImageFiltering     119.226 μs (222 allocations: 27.36 KiB)
  LocalFilters       152.574 μs (10093 allocations: 157.70 KiB)
Array: (100, 100), Kernel: (3, 3)
  StaticKernels      26.029 μs (0 allocations: 0 bytes)
  LoopVectorization  17.417 μs (0 allocations: 0 bytes)
  NNlib              248.317 μs (36 allocations: 677.66 KiB)
  ImageFiltering     169.551 μs (17 allocations: 81.06 KiB)
  LocalFilters       5.208 ms (266413 allocations: 4.07 MiB)
Array: (100, 100), Kernel: (5, 5)
  StaticKernels      141.370 μs (0 allocations: 0 bytes)
  LoopVectorization  30.132 μs (0 allocations: 0 bytes)
  NNlib              780.498 μs (36 allocations: 1.76 MiB)
  ImageFiltering     312.253 μs (17 allocations: 82.73 KiB)
  LocalFilters       12.308 ms (732109 allocations: 11.17 MiB)
Array: (100, 100), Kernel: (7, 7)
  StaticKernels      323.081 μs (0 allocations: 0 bytes)
  LoopVectorization  53.485 μs (0 allocations: 0 bytes)
  NNlib              1.400 ms (36 allocations: 3.31 MiB)
  ImageFiltering     839.447 μs (232 allocations: 731.58 KiB)
  LocalFilters       22.810 ms (1420033 allocations: 21.67 MiB)
Array: (1000, 1000), Kernel: (3, 3)
  StaticKernels      3.356 ms (0 allocations: 0 bytes)
  LoopVectorization  2.709 ms (0 allocations: 0 bytes)
  NNlib              60.562 ms (40 allocations: 68.39 MiB)
  ImageFiltering     22.945 ms (17 allocations: 7.63 MiB)
  LocalFilters       703.648 ms (26964013 allocations: 411.44 MiB)
Array: (1000, 1000), Kernel: (5, 5)
  StaticKernels      15.446 ms (0 allocations: 0 bytes)
  LoopVectorization  6.850 ms (0 allocations: 0 bytes)
  NNlib              161.323 ms (40 allocations: 189.21 MiB)
  ImageFiltering     39.113 ms (17 allocations: 7.63 MiB)
  LocalFilters       1.866 s (74820109 allocations: 1.11 GiB)
Array: (1000, 1000), Kernel: (7, 7)
  StaticKernels      39.406 ms (0 allocations: 0 bytes)
  LoopVectorization  9.057 ms (0 allocations: 0 bytes)
  NNlib              323.726 ms (40 allocations: 369.37 MiB)
  ImageFiltering     88.782 ms (241 allocations: 68.77 MiB)
  LocalFilters       3.652 s (146496433 allocations: 2.18 GiB)

oschulz · April 15, 2020, 7:20pm

Oh, very nice - will this generalize to GPU arrays?

Oscar_Smith · April 15, 2020, 7:40pm

@tim.holy you were looking for something like this to speed up Images right?

stillyslalom · April 15, 2020, 8:36pm

This looks great! You may want to harness some of @elrod’s work in LoopVectorization:

using StaticKernels, LoopVectorization, Images, OffsetArrays, BenchmarkTools

function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @avx for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end

# StaticKernels
k = StaticKernels.Kernel{(-1:1,-1:1)}(w -> w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1])
a = rand(1024, 1024)
b = similar(a, (size(a) .- 2))

# LoopVectorization
kern = convert(AbstractArray, Images.Kernel.Laplacian())
A = copy(a)
B = OffsetArray(similar(A, size(A).-2), 1, 1)

@btime map!($k, $b, $a, inner=true) # 11.687 ms (0 allocations: 0 bytes)
@btime filter2davx!($B, $A, $kern)  # 1.848 ms (0 allocations: 0 bytes)

** edit: with @inbounds @inline, your package edges out LoopVectorization:

@inline function f(w)
    return @inbounds w[0,-1] + w[-1,0] + 4*w[0,0] + w[1,0] + w[0,1]
end
k = StaticKernels.Kernel{(-1:1,-1:1)}(f)

@btime map!($k, $b, $a, inner=true) #  1.492 ms (0 allocations: 0 bytes)

*** edit 2: LoopVectorization is faster for fully-filled kernels:

kern = Images.Kernel.gaussian((1, 1), (3, 3))

const g11, g21, g31, g12, g22, g32, g13, g23, g33 = kern
@inline function f(w)
    return @inbounds g11*w[-1, -1] + g21*w[0, -1] + g31*w[1, -1] +
                     g12*w[-1,  0] + g22*w[0,  0] + g32*w[1,  0] +
                     g13*w[-1,  1] + g23*w[0,  1] + g33*w[1,  1]
end
k = StaticKernels.Kernel{(-1:1,-1:1)}(f)

@btime map!($k, $b, $a, inner=true) # 2.886 ms (0 allocations: 0 bytes)
@btime filter2davx!($B, $A, $kern)  # 1.775 ms (0 allocations: 0 bytes)

stev47 · April 15, 2020, 9:39pm

@oschulz With a few changes it should work on GPUArrays, though I doubt it will be ever be as fast as manually fine-tuned gpu kernels.

@stillyslalom Could you retry with @inline and @inbounds? I get the following

# ... continuing from your post
k = Kernel{(-1:1,-1:1)}(@inline function(w) @inbounds w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1] end)
@btime filter2davx!($B, $A, $kern)  # 3.809 ms (0 allocations: 0 bytes)
@btime map!($k, $b, $a, inner=true) # 2.892 ms (0 allocations: 0 bytes)

Elrod · April 15, 2020, 10:24pm

On my computer, LoopVectorization still wins:

julia> @btime map!($k, $b, $a, inner=true);
  950.944 μs (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $A, $kern);
  800.251 μs (0 allocations: 0 bytes)

We can make it a little faster if wemake kern statically sized, taking code from the unit tests:

using LoopVectorization: StaticUnitRange
struct SizedOffsetMatrix{T,LR,UR,LC,RC} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.axes(::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC} = (StaticUnitRange{LR,UR}(),StaticUnitRange{LC,UC}())
@generated function LoopVectorization.stridedpointer(A::SizedOffsetMatrix{T,LR,UR,LC,RC}) where {T,LR,UR,LC,RC}
    quote
        $(Expr(:meta,:inline))
        LoopVectorization.OffsetStridedPointer(
        LoopVectorization.StaticStridedPointer{$T,Tuple{1,$(UR-LR+1)}}(pointer(A.data)),
            ($(LR-2), $(LC-2))
        )
    end
end
Base.getindex(A::SizedOffsetMatrix, i, j) = LoopVectorization.vload(LoopVectorization.stridedpointer(A), (i,j))
skern = SizedOffsetMatrix{eltype(kern),-1,1,-1,1}(parent(kern));

Yielding

julia> @btime filter2davx!($B, $A, $skern);
  771.416 μs (0 allocations: 0 bytes)

Small but tangible improvement.

Of course, using the dense kernel

julia> kern = Images.Kernel.gaussian((1, 1), (3, 3))
3×3 OffsetArray(::Array{Float64,2}, -1:1, -1:1) with eltype Float64 with indices -1:1×-1:1:
 0.0751136  0.123841  0.0751136
 0.123841   0.20418   0.123841
 0.0751136  0.123841  0.0751136

julia> const g11, g21, g31, g12, g22, g32, g13, g23, g33 = kern
3×3 OffsetArray(::Array{Float64,2}, -1:1, -1:1) with eltype Float64 with indices -1:1×-1:1:
 0.0751136  0.123841  0.0751136
 0.123841   0.20418   0.123841
 0.0751136  0.123841  0.0751136

julia> @inline function f(w)
           return @inbounds g11*w[-1, -1] + g21*w[0, -1] + g31*w[1, -1] +
                            g12*w[-1,  0] + g22*w[0,  0] + g32*w[1,  0] +
                            g13*w[-1,  1] + g23*w[0,  1] + g33*w[1,  1]
       end
f (generic function with 1 method)

julia> k = StaticKernels.Kernel{(-1:1,-1:1)}(f)
Kernel{(-1:1, -1:1)} with window function

CodeInfo(
1 ─       nothing
│         $(Expr(:inbounds, true))
│   %3  = Base.getindex(w, -1, -1)
│   %4  = Main.g11 * %3
│   %5  = Base.getindex(w, 0, -1)
│   %6  = Main.g21 * %5
│   %7  = Base.getindex(w, 1, -1)
│   %8  = Main.g31 * %7
│   %9  = Base.getindex(w, -1, 0)
│   %10 = Main.g12 * %9
│   %11 = Base.getindex(w, 0, 0)
│   %12 = Main.g22 * %11
│   %13 = Base.getindex(w, 1, 0)
│   %14 = Main.g32 * %13
│   %15 = Base.getindex(w, -1, 1)
│   %16 = Main.g13 * %15
│   %17 = Base.getindex(w, 0, 1)
│   %18 = Main.g23 * %17
│   %19 = Base.getindex(w, 1, 1)
│   %20 = Main.g33 * %19
│         val = %4 + %6 + %8 + %10 + %12 + %14 + %16 + %18 + %20
│         $(Expr(:inbounds, :pop))
└──       return val
)

julia> @btime map!($k, $b, $a, inner=true);
  2.263 ms (0 allocations: 0 bytes)

julia> @btime filter2davx!($B, $A, $kern);
  795.978 μs (0 allocations: 0 bytes)

julia> skern = SizedOffsetMatrix{eltype(kern),-1,1,-1,1}(parent(kern));

julia> @btime filter2davx!($B, $A, $skern);
  774.363 μs (0 allocations: 0 bytes)

julia> size.((B, A, b, a))
((1022, 1022), (1024, 1024), (1022, 1022), (1024, 1024))

julia> parent(B) ≈ b
true

Still, your benchmarks against the other libraries look really good for 300 loc. LoopVectorization has around 10 times that and several dependencies itself.

EDIT:
For the Laplacian kernel example, you could completely unroll the kernel for LoopVectorization as well.

julia> function filter2davx_laplacian!(out::AbstractMatrix, A::AbstractMatrix)
           @avx for n in axes(out,2), m in axes(out,1)
               out[m,n] = A[m,n-1] + A[m-1,n] + 4*A[m,n] + A[m+1,n] + A[m,n+1]
           end
           out
       end
filter2davx_laplacian! (generic function with 2 methods)

julia> @btime map!($k, $b, $a, inner=true);
  935.221 μs (0 allocations: 0 bytes)

julia> @btime filter2davx_laplacian!($B, $A);
  755.373 μs (0 allocations: 0 bytes)

julia> parent(B) ≈ b
true

julia> axes(B)
(2:1023, 2:1023)

Although, that’s not much improvement over the 800ns I saw with the completely generic version on this machine. I’d probably recommend preferring the kern = convert(AbstractArray, Images.Kernel.Laplacian()) approach with the generic filter2davx!.

Regarding Images.jl, @tim.holy has been making PRs to LoopVectorization (e.g., the one that added CartesianIndexing support), so hopefully Images.jl will be able to take advantage of this fairly soon.

stev47 · April 16, 2020, 5:47am

wow, thank you for the pointers! I’ve added LoopVectorization to the benchmark and it really does have quite an edge on dense and especially larger kernels, I’m impressed.

I liked the idea of being able to use the kernel function for user-defined mutations as well which is why I didn’t want to make too many assumptions on the generated loops (currently they are not annotated and thus assumption-free).
But the idea of outer vectorization is very intriguing and I’ll experiment with it.
Do you think the llvm auto-vectorizer will ever become aware of this so everyone can profit?

thisrod · April 16, 2020, 6:47am

The conventional wisdom on doing convolutions efficiently, for multidimensional arrays, is that it’s all about optimizing the sequence in which elements are looked up. Locality of memory access and cache optimization are supposed to be everything. I understand that’s why DiffEqOperators uses NNlib. Have you thought about that?

stev47 · April 16, 2020, 7:04am

Locality of memory access and cache optimization are supposed to be
everything.

It is simply looping in the order the output array is laid out in memory which
leads to reasonable locality. It is thus cache-oblivious and the Julia
autovectorization takes care of the rest.

Boundary conditions are specialized and get spliced into these loops for every
dimension. This is contrary to other approaches like wrapping the array by
copying or using ghost nodes (like DiffEqOperators).

DiffEqOperators
uses
NNlib. Have you thought about that?

I have benchmarked against NNlib (see my first post).

Elrod · April 16, 2020, 1:54pm

Outer vectorization is part of it. You can look at the benchmarks here, which show the same set of loops vectorized by a variety of compilers.
All but LoopVectorization will try vectorizing the inner loop by default.
Because the kernel is 3x3 in the example – too short to benefit from vectorization – all but LoopVectorization are slow.

LoopVectorization vectorizes an outer loop to avoid having to reduce a vector before storing into out.

Once you make the kernels statically sized, the compilers realize that vectorizing the inner loop is not profitable, and therefore don’t. Most of them will then decide to vectorize an outer loop instead, improving performance. “Most” means “all of them I tested except Julia”, but once you unroll the inner loop, Julia gets the same performance as the rest – except ifort (which is slow for some reason) and LoopVectorization.

LoopVectorization performs better than the others, because it realizes that there are two loops it can unroll to eliminate redundant loads. Basically, because A[i,j+(k+1)] = A[i,(j+1)+k], it unrolls the j and k loops so that it only has to load one of the above, and set the other as equal to it.
The actual amount to unroll these by is a function of the number of floating point registers available on the host CPU (32 for CPUs with AVX512, 16 for other x86_64 chips). The more you unroll, the more loads you can save, until all the registers are occupied. Once they are occupied, extra loads will “spill” registers onto the stack, causing a lot of extra loads and stores as data is shuffled between the registers and stack, probably resulting in worse performance than not unrolling at all.

Although, maybe I should run these benchmarks at larger sizes. It looks like LoopVectorization’s performance advantage is starting to disappear as out increases in size beyond 220 x 220 on my computer. By the time we get to 256 x 256, A and out no longer fit in the L2 cache:

julia> (256 * 256 + 258 * 258) * 8
1056800

julia> using VectorizationBase: CACHE_SIZE

julia> CACHE_SIZE[2]
1048576

and thus memory bandwidth starts becoming more important than compute performance, and they should all be more or less equally fast. So maybe this becomes irrelevant by the time we get to size(B) == (1022, 1022), like in the example benchmarks from this thread.
I’ll try benchmarking those other implementations shortly.

Running my benchmark script for B sizes 1021:1054:

julia> filter2d_dynamic_bench = benchmark_filter2ddynamic(sizes)#512:-1:2)
┌──────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Size │              Julia │              Clang │           GFortran │                icc │              ifort │  LoopVectorization │
├──────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ 1021 │ 1.8178403910939909 │ 2.4448306821853394 │ 2.5599758758545472 │ 2.3462737316727744 │ 0.9316425099693625 │  6.388229132045654 │
│ 1022 │ 1.4650474013811996 │ 2.4574988668329794 │ 2.5565097150729046 │  2.368345881830715 │ 0.9575968962523743 │  7.120309061395235 │
│ 1023 │ 1.4523050934125357 │  2.456864412268391 │ 2.5581045546385814 │ 2.2563586382703487 │ 0.9439406425526551 │ 6.7386551470755345 │
│ 1024 │ 1.4620156104343172 │ 2.4406820133219034 │ 2.5497492989037354 │ 2.3508886689694903 │ 0.8230009329853537 │ 6.4987169022202576 │
│ 1025 │ 1.4653099596405736 │  2.453439172772119 │ 2.5610229107458826 │ 2.3523993502020546 │ 0.9240179613286665 │ 12.653379666036146 │
│ 1026 │ 1.4649694063529215 │  2.457283912915538 │  2.559811252904115 │  2.348638634181143 │ 0.9489334085530406 │  8.997384570676422 │
│ 1027 │ 1.4644498501850818 │  2.454107686567634 │ 2.5582532749347178 │  2.348745420177809 │ 0.9334177468757741 │ 10.073033793079814 │
│ 1028 │ 1.4645535198737467 │ 2.4831666078310444 │  2.559519021516521 │   2.35243139281692 │ 0.9290141677115838 │ 16.591517193355756 │
│ 1029 │  1.463922708108225 │  2.456257102559596 │ 2.5594085077031945 │ 2.3511307511464508 │ 0.9247747696835061 │ 22.382360255552598 │
│ 1030 │ 1.4638645348413444 │ 2.4454624172509147 │  2.558976437802299 │ 2.3513706822394744 │ 1.2445576093960709 │ 6.1476334646579165 │
│ 1031 │ 1.4510862949021326 │ 2.4520947378376943 │ 2.5573510300045132 │ 2.2548478324291232 │ 0.9166272782987794 │   6.04940973324174 │
│ 1032 │ 1.4528526827738566 │ 2.4515460572905483 │ 2.5583591600971096 │  2.260300897258687 │ 0.9246098635188117 │  7.020013469747157 │
│ 1033 │ 1.4653147433715972 │ 2.4548993343811247 │ 2.5584759530511785 │ 2.3465960120638014 │ 0.9292823336027185 │  7.146309256257276 │
│ 1034 │ 1.4523129861528221 │  2.453986136387049 │  2.558641393968214 │ 2.2502021085649218 │  0.946965923214369 │ 5.8949214033269195 │
│ 1035 │  1.464546550767119 │ 2.4530308308164304 │ 2.5590432787545936 │ 2.3450589554079104 │ 0.9193110638592383 │  6.431579709479529 │
│ 1036 │ 1.4650620333766335 │  2.445646191776606 │ 2.5589423842075694 │ 2.3460145530821825 │  0.929970907751645 │  6.837251149563763 │
│ 1037 │ 1.4645939832131403 │ 2.4455518669712424 │  2.558959471872023 │ 2.3456610537674556 │ 0.9227649544141436 │  5.905494440887469 │
│ 1038 │ 1.4593597286609086 │ 2.4344372990994736 │ 2.5495059262129693 │ 2.3420066135797026 │ 1.2334391538952136 │  6.004958958926798 │
│ 1039 │ 1.4630049855918557 │ 2.4546263754548288 │  2.556838551496205 │  2.335084278469622 │  0.957512004211583 │  6.004233947088257 │
│ 1040 │  1.463050130743714 │  2.465014323793477 │ 2.5568491928707133 │ 2.3388599380686492 │  0.920929909216241 │  5.995172482826057 │
│ 1041 │  1.464392925125857 │ 2.4637232724420546 │ 2.5581810350046847 │ 2.3389516182785974 │ 0.9146383022159846 │ 6.3829338956366515 │
│ 1042 │ 1.4631205087965642 │  2.461875187728442 │ 2.5575686094760073 │ 2.3368615010489804 │  0.948747162571823 │ 5.8752640337352044 │
│ 1043 │ 1.4634191755413013 │ 2.4472510192450025 │ 2.5588917364938704 │  2.337800539326161 │  0.933759181102324 │  5.970789244113779 │
│ 1044 │ 1.4634904820537336 │  2.459446874686412 │   2.55710376985874 │ 2.3377327682071884 │ 0.9254530978252374 │  6.318137768724167 │
│ 1045 │ 1.7860041054834286 │  2.467178794336835 │  2.561124335987429 │  2.345513493216436 │ 0.8736223022517141 │  5.917879978552798 │
│ 1046 │ 1.4637260209248606 │ 2.4622268678070762 │ 2.5592437615931507 │ 2.3482420414990544 │ 1.0470275378226601 │  6.847554793406015 │
│ 1047 │ 1.4634893986356143 │ 2.4508010534646734 │ 2.5575093850329687 │ 2.3456384126005947 │ 0.8608966299172266 │  8.837372511584036 │
│ 1048 │ 1.4639202303550392 │ 2.4616387423724624 │  2.556358986737475 │ 2.3410966279534273 │ 0.9422728217372728 │  9.797481359975357 │
│ 1049 │ 1.4630292329505217 │ 2.4542890586993704 │ 2.5572239652245856 │  2.344784748860405 │ 0.8748710327774221 │ 20.604989877494553 │
│ 1050 │  1.463151075881509 │ 2.4427694025683975 │ 2.5563609913156693 │ 2.3454162051647587 │ 0.8533824519798495 │   21.5905032888295 │
│ 1051 │ 1.4632256404538175 │  2.464302216704647 │  2.558230751436519 │ 2.3416139317061684 │ 0.9315586167729809 │ 11.692655094944906 │
│ 1052 │ 1.4642001899867962 │  2.456464786803328 │ 2.5597315474592643 │ 2.3510045519394325 │ 0.9505331386456543 │ 19.769258089395784 │
│ 1053 │ 1.4635455908577903 │  2.449856307871303 │ 2.5564491716996147 │ 2.3481485384732674 │  0.940296054969879 │ 22.743510171972147 │
│ 1054 │ 1.4637250546487306 │ 2.4548960792135937 │  2.559983885535248 │ 2.3625230256141783 │ 1.2179923592239215 │  23.91414215599374 │
└──────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┘

julia> filter2d_3x3_bench = benchmark_filter2d3x3(sizes)#512:-1:2)
┌──────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Size │              Julia │              Clang │           GFortran │                icc │              ifort │  LoopVectorization │
├──────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ 1021 │ 1.2551788220206912 │ 10.000534407792083 │  8.941824018374602 │  7.941065868086378 │  3.877127536219201 │ 10.566397520564575 │
│ 1022 │ 1.2547216989375067 │  11.91141551898785 │  9.049461019677073 │  8.126204430271414 │ 1.8370533116247376 │ 13.536970794795087 │
│ 1023 │ 1.6890159709904202 │  9.418720554905853 │   9.05658138480076 │  8.294455379812637 │ 3.6259080823481535 │ 10.029061233049072 │
│ 1024 │ 1.2551605048358676 │   11.3767539047816 │  8.544823457170645 │   8.58314033160025 │   2.36207136438046 │  17.23009589463479 │
│ 1025 │ 1.2551342929129348 │ 11.568496575559848 │   9.03972938446962 │ 10.769526923176647 │  4.056948155107192 │  20.83452414142251 │
│ 1026 │  1.689876403033365 │  8.270507243831604 │  8.274293017049338 │  8.176583766900436 │  4.483629377525495 │  8.380554276821837 │
│ 1027 │ 1.2546948912015137 │  8.323960224133849 │  5.205748262732101 │  4.952628716496256 │  3.720494099006548 │  5.073267353065921 │
│ 1028 │ 1.2544085388435715 │ 13.397522939063824 │  8.586060964938062 │  6.778118301943528 │  3.659501656879656 │  5.638452634724073 │
│ 1029 │  1.254407829271221 │  7.081539831824071 │  5.455868811555582 │  5.511921016699926 │ 3.9397467266169492 │  6.466429903439832 │
│ 1030 │ 1.2547484499053059 │   8.15050917803923 │  5.912941701036674 │  6.191081009642616 │ 3.4667122351858426 │  4.919933405369308 │
│ 1031 │ 1.2545391671348247 │  13.09927864697921 │  7.056674544585479 │  6.108494231856809 │  3.624378030621512 │  6.559113131520651 │
│ 1032 │ 1.2550465354945572 │  6.529047566672857 │  5.158078656861457 │  5.170342748447148 │  3.808002745990803 │  6.226929582425081 │
│ 1033 │ 1.2550524464251667 │ 6.6314387841195055 │  5.671450058729425 │  6.044335311847562 │ 3.3448031268415255 │ 5.1642089447035335 │
│ 1034 │ 1.2550358968820778 │  6.993354723702178 │  6.514939555716776 │ 6.7110776994918595 │  4.518903130623564 │   6.49836429435122 │
│ 1035 │ 1.2549032727717757 │  6.069524402148934 │  7.475827814977333 │  5.015869094226311 │ 3.4107407808350896 │  4.751227997214618 │
│ 1036 │  1.252568805009168 │ 6.3030567664470425 │  5.201525846657598 │  5.047339680994086 │  3.783478259968004 │  4.649607602449309 │
│ 1037 │   1.25508292067117 │  7.513384030074459 │  7.502213162922935 │ 6.9049278791223205 │  4.280397309344173 │  6.965812359874229 │
│ 1038 │ 1.2547469817716392 │  6.099449081313786 │  5.723073519734888 │  5.652116409871714 │ 3.4303831534945655 │  5.815474999857125 │
│ 1039 │ 1.2549037668262317 │  5.090376805097961 │  4.926670374942081 │  4.951396773149147 │   3.30836482722168 │  4.909219919625342 │
│ 1040 │ 1.2548438958961106 │  5.266619157716729 │    5.4767917257172 │  5.048793212334221 │  3.497152021277486 │  5.429944962863147 │
│ 1041 │  1.254694622328792 │  5.033385089985812 │  5.007599232819161 │   4.93410780669145 │  3.255869731115504 │  4.788728099897298 │
│ 1042 │ 1.2548135946612264 │  4.893691910900707 │  4.858145123296108 │  4.930489614124427 │ 3.3430898443754264 │  4.850206103674865 │
│ 1043 │ 1.2548398377926953 │  4.865586613990924 │  5.162621130881511 │  5.418022287990747 │  3.474721350519458 │ 6.0546939558590305 │
│ 1044 │ 1.2549365315662502 │  5.567524851008133 │ 5.5525790891917035 │  5.784679284231125 │  3.767785837295706 │ 12.775300095837615 │
│ 1045 │ 1.2547349716923968 │  5.434978632685068 │  4.993071612789222 │  5.358654155344301 │  3.519846748728815 │  5.966678451476047 │
│ 1046 │ 1.2549220850764238 │ 6.4222209178265235 │  6.271976228499903 │  6.216651292129574 │  3.670299281784494 │  8.182937318165841 │
│ 1047 │  1.251947851453187 │  6.778825157334282 │  6.415524661426301 │   6.30229310762313 │ 3.5772289976723264 │  16.07746719465797 │
│ 1048 │ 1.2551189280193653 │   6.78882934743324 │  6.234693649258144 │   6.19219407848372 │  3.812819625296996 │ 17.184897748431414 │
│ 1049 │ 1.6895351479493448 │  6.538035578417527 │  6.470540196985551 │  6.441164702884728 │ 3.5506589291774513 │ 13.623783409802638 │
│ 1050 │ 1.6890768434661723 │  6.436448378602248 │  6.080552354553382 │   6.54933940516588 │  3.655846946165764 │  20.28958051420839 │
│ 1051 │ 1.2546965049976808 │  6.396950769802026 │ 6.0965010617944575 │   7.17417243717705 │ 3.4782166649440636 │  14.73226546956298 │
│ 1052 │ 1.2550311900529638 │  6.731106609547211 │  6.176456413788485 │  8.033918961216868 │ 3.7766364833661306 │  20.46598207292664 │
│ 1053 │ 1.2547526793481112 │  6.481130722853248 │  6.132268419249654 │  9.096269226073481 │ 3.4578194613031914 │  16.86932767017661 │
│ 1054 │ 1.2545026046166026 │  6.119884534222356 │  5.855553740015088 │  9.576531843729526 │ 3.5280621428260517 │ 22.985242910816066 │
└──────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┘

julia> filter2d_unrolled_bench = benchmark_filter2dunrolled(sizes)#512:-1:2)
┌──────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Size │              Julia │              Clang │           GFortran │                icc │              ifort │  LoopVectorization │
├──────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
│ 1021 │   8.63374153385047 │  8.732223271774389 │  7.700053921894504 │  8.136815982074804 │  1.498925043500679 │  7.891565599777345 │
│ 1022 │  9.080040480219358 │  8.818789999672205 │  7.926675178343437 │ 6.8111954422808845 │ 2.0033002808826503 │ 5.4678408187737055 │
│ 1023 │   8.70731402920091 │  7.401857225769008 │  7.976392719694987 │  6.175789825199149 │  3.456483283813895 │  7.219485679920725 │
│ 1024 │ 14.105484554290449 │   9.30351330982637 │ 10.023995789265326 │ 10.069675101469006 │ 2.9662919403812626 │  7.993496072038267 │
│ 1025 │   8.74132077420138 │  9.735438098461845 │  8.574105571101379 │  7.187166005652955 │ 3.4777120726516535 │  5.954599600529027 │
│ 1026 │  7.353777876400755 │  6.293404925073791 │  6.419819927118482 │ 6.8445284863394305 │  3.654125638433443 │  5.975755029831204 │
│ 1027 │  9.370018791884581 │   9.21506293138449 │ 10.103394395640928 │ 10.227406796028454 │  3.552156922255231 │  7.919102815791042 │
│ 1028 │ 15.183472840123866 │ 14.408733786961555 │ 13.802039422483737 │  9.990256271947995 │  4.271127609405676 │  8.172005172855336 │
│ 1029 │  7.134498536262868 │ 5.8043099514864425 │  6.470369789910558 │   5.64975930853109 │  3.377146774568478 │  6.202471161569295 │
│ 1030 │  9.019503507970889 │  9.454494938639854 │  8.329696224010833 │   6.21913115682881 │ 3.8277602340211443 │   5.43159615496973 │
│ 1031 │  6.941791455878189 │  6.920162649511212 │  6.727455996854871 │  5.861960382061633 │ 2.6772829004078225 │  5.590749601275917 │
│ 1032 │ 10.205254869158109 │  8.030643945106318 │   7.95027512469641 │  6.134768734087403 │ 3.3765989528209954 │  7.161841712822355 │
│ 1033 │ 6.1933834411909805 │ 6.7792623967490355 │  6.153233222313693 │  5.723946053807357 │  2.623832344458078 │    6.3659857523863 │
│ 1034 │  6.881583460863652 │  6.603644063455201 │  6.667468323783392 │    5.7399381782674 │ 3.2785127936779195 │  6.867368756727797 │
│ 1035 │ 14.930528226520533 │   16.1870402879936 │ 13.418896723387983 │  9.743113920786348 │  3.167196392933698 │  7.958014091295831 │
│ 1036 │  6.372850572528682 │  7.030486580795161 │  6.081655903405634 │  5.554542092036707 │ 3.0294862169079106 │  7.808458163881472 │
│ 1037 │  8.927771640769143 │  8.891751917940876 │  9.430248873143098 │  8.954578272650862 │ 2.1160242487826766 │  7.997032817484811 │
│ 1038 │  7.503983350471749 │  5.882372134653347 │  5.380499123594733 │  6.721678857510249 │ 3.5956664838394192 │  5.480316063194686 │
│ 1039 │  4.880610538498934 │   5.48663206003532 │  4.956074181817396 │   6.15759947039937 │ 1.6887731232681473 │  7.436507164663673 │
│ 1040 │  6.562467320967082 │  5.383935392306798 │  4.769700448845539 │  5.221126662979045 │ 2.2168704331874682 │  4.918307288281304 │
│ 1041 │  6.081513563476409 │  6.524719099133877 │  6.429336918181904 │  4.990559896626967 │ 1.6893853854125565 │  5.097483905188994 │
│ 1042 │  5.122918483428633 │  5.470764674569613 │  5.811882912114815 │   5.75270041744349 │ 2.7841181777332853 │  5.903745032765198 │
│ 1043 │ 4.9466397045941175 │  5.280227832798889 │ 5.5140298405998776 │  6.137973895386389 │ 1.8113351489029939 │  6.980163257942189 │
│ 1044 │  5.803695163136322 │  6.788873068831418 │  6.725216141466496 │  6.560574333582837 │ 2.4200244028013853 │  7.515976748997068 │
│ 1045 │  5.641607207380234 │  6.347360985411729 │  6.165100630576657 │  6.329947255056869 │  1.918261944017462 │  7.396052660410471 │
│ 1046 │  7.346398620138879 │  7.235234556154645 │  7.805229847579577 │  6.758359576519098 │ 3.5208101964331107 │ 16.426992946097304 │
│ 1047 │  6.736138100649773 │  6.602714065436322 │  7.309349885273872 │  6.647696116376712 │  2.017634369811932 │  7.083461491971808 │
│ 1048 │  7.859965808972261 │  7.518943370635752 │  7.958005499929247 │  6.970927041696752 │ 2.4028292670748175 │  9.520083212661376 │
│ 1049 │  7.446960537928438 │  6.967811131622037 │  7.825466585512267 │  7.205646629044043 │   1.96674210456816 │  9.219081867647443 │
│ 1050 │  7.650502747323395 │  7.146147047631829 │  7.812399049465396 │  7.978278479697904 │  2.953033183001873 │ 11.212553034164412 │
│ 1051 │   7.69344606107734 │  7.369934406065428 │  8.094610322599557 │ 10.264242989397529 │  2.204027979493786 │ 16.345627956289135 │
│ 1052 │  7.602942578975529 │ 7.5701755338946395 │  7.790682304926185 │  8.956238887394647 │ 2.6695567285316986 │ 14.296458630321858 │
│ 1053 │  7.797251029684959 │  7.376104473617071 │  7.376491264319776 │ 11.522262402945826 │  2.467240635946479 │ 20.069518101664887 │
│ 1054 │  7.440068390614413 │  7.338711394728723 │  5.450686705114226 │ 14.847019404706407 │ 3.3401162965501894 │ 22.240822479214266 │
└──────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┘

Perhaps I’ll add StaticKernels to these benchmarks.

These look very different, and show much worse, performance than I get over the size range of 2-256.
That suggests either
a) Don’t worry about optimizations such as load-elimination discussed above.
b) If you’re doing more operations on these arrays than just convolutions, perform the whole series in on chunks that fit in your L2 cache, and iterate over those chunks of A and B.

stev47 · April 16, 2020, 2:16pm

Once you make the kernels statically sized, the compilers realize that
vectorizing the inner loop is not profitable, and therefore don’t. Most
of them will then decide to vectorize an outer loop instead, improving
performance. “Most” means “all of them I tested except Julia”, but once
you unroll the inner loop, Julia gets the same performance as the rest
– except ifort (which is slow for some reason) and LoopVectorization.

I see, there is the unroll-and-jam
pass in llvm that seemingly promises to do outer loop vectorization.
Julia doesn’t enable that pass
though. Maybe just nobody has touched the julia loop-code lately?

LoopVectorization performs better than the others, because it realizes
that there are two loops it can unroll to eliminate redundant loads.
Basically, because A[i,j+(k+1)] = A[i,(j+1)+k], it unrolls the j and
k loops so that it only has to load one of the above, and set the
other as equal to it.

That is quite neat, I might be able to do that transformation without too much
effort.

Elrod · April 16, 2020, 2:28pm

Ah, that explains why Julia is the only one that doesn’t vectorize when the kernel is statically sized.

I think it’s worth experimenting with, but as I showed with tables/plots that I’ve now edited into my previous post, it helps a lot for arrays fitting in the L2 cache, but doesn’t seem to help for arrays larger than that. So if you’re working with large arrays, I wouldn’t expect much or any benefit, unless you can chunk the arrays and perform a series of operations on those chunks while they’re in your L2 cache, before moving on to the next chunks.

Note that LoopVectorization got between 6-8 GFLOPS for most of those sizes in 1021:1054, but was around 50 GFLOPS for sizes under 256 on my computer. Cache bandwidth is obviously a lot more important than compute here (on my computer, and probably any other x86_64 computer).

Oscar_Smith · April 16, 2020, 2:35pm

That’s true for most simple operations. L2 blocking is one LoopVectorization has planned to add, but it’s pretty hard to do so generically.

thisrod · April 16, 2020, 11:30pm

That makes sense. In the worst case, every element of the kernel is reading a different region of memory, but the cache has more lines than there are kernel elements. Thanks.

I looked at the table again, and noticed that some times were in μs and others in ns. That makes it more impressive!

tbeason · April 25, 2020, 2:20am

I started messing around with this. I have a question about using other functions inside the kernel function. Suppose I want a 3 point centered moving average, but I want missing as my boundary condition.

using StaticKernels,Missings,Statistics
a = allowmissing(rand(100))

# this works
kf_manual(w) = (w[-1]+w[0]+w[1])/3
Kmanual = Kernel{(-1:1,)}(kf_manual,StaticKernels.ExtensionConstant(missing))
map(Kmanual,a)

# this does not
kf_mean(w) = mean(Tuple(w))
Kmean = Kernel{(-1:1,)}(kf_mean,StaticKernels.ExtensionConstant(missing))
map(Kmean,a)

The second way doesn’t respect the extension, but I can’t even get it to work at all without using Tuple, which doesn’t use bounds checking. The first method is fine if I know the window size in advance, but what if I don’t?

ChrisRackauckas · April 25, 2020, 5:10am

Awesome work. We’ll be sure to make use of all of this work in DiffEqOperators.

BTW, @Elrod one of the things I hope we can add to our fast convolution support is irregular stencils, since this is required for finite differences with irregular dx. Is this something that would also benefit from interesting caching styles, or are fallbacks to simple loops just as good?

Our route forward might be to utilize LoopVectorization on our fused kernels, but still fall back to conv on GPUs for CuDNN. Let me know if there are any better ideas. Essentially what I want to be doing with this whole thing is for someone to simply say “I want the 4th order central difference with these dx’s, dy’s, and dz’s”, and I send that kernel to whatever stencil library does well.

stev47 · April 25, 2020, 10:17am

Great question @tbeason. The quick and easy way around it would be to access the kernel axes from within the window:

kf_mean(w) = mean(w[i] for i in CartesianIndices(axes(w.kernel)))
Kmean = Kernel{(-1:1,)}(kf_mean, StaticKernels.ExtensionConstant(missing))
map(Kmean, a)

Tuple(w) currently gives you a tuple of only the inner values, but it makes sense to have it return data for the whole kernel when boundary conditions are active, I will think about that.
Currently you could define it like this:

@generated KTuple(w) = :( Base.@_inline_meta; @inbounds ($((:(w[$i]) for i in CartesianIndices(w.parameters[4].parameters[1]))...),) )

@ChrisRackauckas I looked at some of the work in DiffEqOperators and I suspect (tell me if I’m wrong) that it might be a bit more than just sending the kernel to whatever stencil library does well: you are currently applying the operators to an extended BoundaryPaddedArray which includes runtime checks for every getindex.
If you want to avoid runtime checks for boundaries you need some way to statically generate the loop to avoid that, which is one of the selling points of this package that might have been overshadowed by the performance debate against LoopVectorization for interior kernels.

ChrisRackauckas · April 25, 2020, 1:41pm

No, we dispatch on the BoundaryPaddedArray to avoid that. There’s no overhead in getindex because we just directly use the array (if we didn’t, then conv wouldn’t work…). We can change that dispatch to anything.

stev47 · April 25, 2020, 3:38pm

You’re right, I was wrong about getindex being used.
The dispatch seems to happen on mul! with DerivativeOperator in src/derivative_operators/derivative_operator_functions.jl where the boundary is handled separately from the interior:

cv = DenseConvDims(_M, W, padding=pad, flipkernel=true)
conv!(_x_temp, _M, W, cv)
# Now deal with boundaries
# [...]

If you want to leverage the boundary handling in this package which ensures instead that the boundaries are dealt with in memory order, then just switching out the conv! line won’t do, but instead something like defining RobinBoundaryCondition <: StaticKernels::Extension and overloading getindex_extension().

ChrisRackauckas · April 25, 2020, 4:34pm

Interesting. Do you plan to add LoopVectorization.jl under the hood? If you add that in to get the same speed with a higher level interface, that’s a pretty nice option for us to target.

Also, just an FYI, the thing that would make me not use StaticKernels.jl right now is that it’s a single contributor library with no CI, no CompatHelper, no TagBot, no code coverage, etc. so it doesn’t live up to what’s required to be a SciML dependency since it’s hard to see what the long-term maintenance plan of this repository looks like. Of course, all of those can be fixed up.

Topic		Replies	Views
ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses Performance	16	2171	March 17, 2021
Performance improvements for simple molecular surface caclulation Performance	15	644	September 14, 2020
Strange performance of a loop Performance	37	3238	July 21, 2018
[ANN] LoopVectorization Package Announcements	157	23303	May 27, 2020
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	2910	July 16, 2018

[ANN]: StaticKernels.jl - Fast Kernel Operations on Arrays

Introduction

Why a new package?

Benchmarks on Linear Filters

Related topics