# [ANN]: StaticKernels.jl - Fast Kernel Operations on Arrays

StaticKernels.jl is a new package that aims to enable easy and fast execution of kernel operations on arrays, e.g. finite differences, convolutions, image filters and morphological operations, etc.

• custom kernel functions in arbitrary dimensions
• custom boundary handling
• allocation-free execution
• small size (currently ~300 loc) and dependency free

# Introduction

You can think of a `Kernel` as a function that takes a neighbourhood `Window` view on the underlying data as an argument. The kernel function aggregates the values within the neighbourhood and outputs a new value. The well-known function `map(k, a)` applied to the kernel `k` and a data array `a` then applies the kernel as a filter.

``````using StaticKernels
a = rand(1000, 1000)

# Laplace
k = Kernel{(-1:1,-1:1)}(w -> w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1])
map(k, a, inner=true)

# Erosion
k = Kernel{(-1:1,-1:1)}(w -> minimum(Tuple(w)))
map(k, a)
``````

The window can be indexed using relative coordinates and implements a standard array interface. When accessed out of bounds (e.g. when positioned on the boundaries of the array) it evaluates to nothing.

``````# Laplace, zero boundary condition
k = Kernel{(-1:1,-1:1)}(w -> something(w[0,-1], 0.) + something(w[-1,0], 0.) - 4*w[0,0] + something(w[1,0], 0.) + something(w[0,1], 0.))
map(k, a)

# Forward-Gradient (non-skalar Kernel), neumann boundary condition
k = Kernel{(0:1, 0:1)}(w -> (something(w[1,0], w[0,0]) - w[0,0], something(w[0,1], w[0,0]) - w[0,0]))
map(k, a)
``````

# Why a new package?

Existing solutions didnβt meet at least one of:

• non-allocating, fast operations
• easily being able to write custom kernels
• custom and efficient handling of boundary conditions
• arbitrary types and dimensions.
• simple implementation and easy to extend

# Benchmarks on Linear Filters

Here a quick comparison for linear filtering (see `test/benchmark.jl`).
`LocalFilters` might be worse off than it should be, since it allocates within the loop.

``````Array: (10, 10), Kernel: (3, 3)
StaticKernels      206.048 ns (0 allocations: 0 bytes)
LoopVectorization  131.591 ns (0 allocations: 0 bytes)
NNlib              13.238 ΞΌs (35 allocations: 6.89 KiB)
ImageFiltering     8.997 ΞΌs (16 allocations: 3.73 KiB)
LocalFilters       45.291 ΞΌs (2353 allocations: 36.77 KiB)
Array: (10, 10), Kernel: (5, 5)
StaticKernels      567.505 ns (0 allocations: 0 bytes)
LoopVectorization  273.347 ns (0 allocations: 0 bytes)
NNlib              15.427 ΞΌs (35 allocations: 9.45 KiB)
ImageFiltering     15.772 ΞΌs (16 allocations: 5.41 KiB)
LocalFilters       93.978 ΞΌs (5809 allocations: 90.77 KiB)
Array: (10, 10), Kernel: (7, 7)
StaticKernels      625.294 ns (0 allocations: 0 bytes)
LoopVectorization  115.621 ns (0 allocations: 0 bytes)
NNlib              14.544 ΞΌs (35 allocations: 8.52 KiB)
ImageFiltering     119.226 ΞΌs (222 allocations: 27.36 KiB)
LocalFilters       152.574 ΞΌs (10093 allocations: 157.70 KiB)
Array: (100, 100), Kernel: (3, 3)
StaticKernels      26.029 ΞΌs (0 allocations: 0 bytes)
LoopVectorization  17.417 ΞΌs (0 allocations: 0 bytes)
NNlib              248.317 ΞΌs (36 allocations: 677.66 KiB)
ImageFiltering     169.551 ΞΌs (17 allocations: 81.06 KiB)
LocalFilters       5.208 ms (266413 allocations: 4.07 MiB)
Array: (100, 100), Kernel: (5, 5)
StaticKernels      141.370 ΞΌs (0 allocations: 0 bytes)
LoopVectorization  30.132 ΞΌs (0 allocations: 0 bytes)
NNlib              780.498 ΞΌs (36 allocations: 1.76 MiB)
ImageFiltering     312.253 ΞΌs (17 allocations: 82.73 KiB)
LocalFilters       12.308 ms (732109 allocations: 11.17 MiB)
Array: (100, 100), Kernel: (7, 7)
StaticKernels      323.081 ΞΌs (0 allocations: 0 bytes)
LoopVectorization  53.485 ΞΌs (0 allocations: 0 bytes)
NNlib              1.400 ms (36 allocations: 3.31 MiB)
ImageFiltering     839.447 ΞΌs (232 allocations: 731.58 KiB)
LocalFilters       22.810 ms (1420033 allocations: 21.67 MiB)
Array: (1000, 1000), Kernel: (3, 3)
StaticKernels      3.356 ms (0 allocations: 0 bytes)
LoopVectorization  2.709 ms (0 allocations: 0 bytes)
NNlib              60.562 ms (40 allocations: 68.39 MiB)
ImageFiltering     22.945 ms (17 allocations: 7.63 MiB)
LocalFilters       703.648 ms (26964013 allocations: 411.44 MiB)
Array: (1000, 1000), Kernel: (5, 5)
StaticKernels      15.446 ms (0 allocations: 0 bytes)
LoopVectorization  6.850 ms (0 allocations: 0 bytes)
NNlib              161.323 ms (40 allocations: 189.21 MiB)
ImageFiltering     39.113 ms (17 allocations: 7.63 MiB)
LocalFilters       1.866 s (74820109 allocations: 1.11 GiB)
Array: (1000, 1000), Kernel: (7, 7)
StaticKernels      39.406 ms (0 allocations: 0 bytes)
LoopVectorization  9.057 ms (0 allocations: 0 bytes)
NNlib              323.726 ms (40 allocations: 369.37 MiB)
ImageFiltering     88.782 ms (241 allocations: 68.77 MiB)
LocalFilters       3.652 s (146496433 allocations: 2.18 GiB)
``````
31 Likes

Oh, very nice - will this generalize to GPU arrays?

@tim.holy you were looking for something like this to speed up `Images` right?

1 Like

This looks great! You may want to harness some of @elrodβs work in LoopVectorization:

``````using StaticKernels, LoopVectorization, Images, OffsetArrays, BenchmarkTools

function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
@avx for J in CartesianIndices(out)
tmp = zero(eltype(out))
for I β CartesianIndices(kern)
tmp += A[I + J] * kern[I]
end
out[J] = tmp
end
out
end

# StaticKernels
k = StaticKernels.Kernel{(-1:1,-1:1)}(w -> w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1])
a = rand(1024, 1024)
b = similar(a, (size(a) .- 2))

# LoopVectorization
kern = convert(AbstractArray, Images.Kernel.Laplacian())
A = copy(a)
B = OffsetArray(similar(A, size(A).-2), 1, 1)
``````
``````@btime map!(\$k, \$b, \$a, inner=true) # 11.687 ms (0 allocations: 0 bytes)
@btime filter2davx!(\$B, \$A, \$kern)  # 1.848 ms (0 allocations: 0 bytes)
``````

** edit: with @inbounds @inline, your package edges out LoopVectorization:

``````@inline function f(w)
return @inbounds w[0,-1] + w[-1,0] + 4*w[0,0] + w[1,0] + w[0,1]
end
k = StaticKernels.Kernel{(-1:1,-1:1)}(f)

@btime map!(\$k, \$b, \$a, inner=true) #  1.492 ms (0 allocations: 0 bytes)
``````

*** edit 2: LoopVectorization is faster for fully-filled kernels:

``````kern = Images.Kernel.gaussian((1, 1), (3, 3))

const g11, g21, g31, g12, g22, g32, g13, g23, g33 = kern
@inline function f(w)
return @inbounds g11*w[-1, -1] + g21*w[0, -1] + g31*w[1, -1] +
g12*w[-1,  0] + g22*w[0,  0] + g32*w[1,  0] +
g13*w[-1,  1] + g23*w[0,  1] + g33*w[1,  1]
end
k = StaticKernels.Kernel{(-1:1,-1:1)}(f)

@btime map!(\$k, \$b, \$a, inner=true) # 2.886 ms (0 allocations: 0 bytes)
@btime filter2davx!(\$B, \$A, \$kern)  # 1.775 ms (0 allocations: 0 bytes)
``````
5 Likes

@oschulz With a few changes it should work on `GPUArrays`, though I doubt it will be ever be as fast as manually fine-tuned gpu kernels.

@stillyslalom Could you retry with `@inline` and `@inbounds`? I get the following

``````# ... continuing from your post
k = Kernel{(-1:1,-1:1)}(@inline function(w) @inbounds w[0,-1] + w[-1,0] - 4*w[0,0] + w[1,0] + w[0,1] end)
@btime filter2davx!(\$B, \$A, \$kern)  # 3.809 ms (0 allocations: 0 bytes)
@btime map!(\$k, \$b, \$a, inner=true) # 2.892 ms (0 allocations: 0 bytes)
``````
1 Like

On my computer, LoopVectorization still wins:

``````julia> @btime map!(\$k, \$b, \$a, inner=true);
950.944 ΞΌs (0 allocations: 0 bytes)

julia> @btime filter2davx!(\$B, \$A, \$kern);
800.251 ΞΌs (0 allocations: 0 bytes)
``````

We can make it a little faster if wemake `kern` statically sized, taking code from the unit tests:

``````using LoopVectorization: StaticUnitRange
struct SizedOffsetMatrix{T,LR,UR,LC,RC} <: AbstractMatrix{T}
data::Matrix{T}
end
Base.axes(::SizedOffsetMatrix{T,LR,UR,LC,UC}) where {T,LR,UR,LC,UC} = (StaticUnitRange{LR,UR}(),StaticUnitRange{LC,UC}())
@generated function LoopVectorization.stridedpointer(A::SizedOffsetMatrix{T,LR,UR,LC,RC}) where {T,LR,UR,LC,RC}
quote
\$(Expr(:meta,:inline))
LoopVectorization.OffsetStridedPointer(
LoopVectorization.StaticStridedPointer{\$T,Tuple{1,\$(UR-LR+1)}}(pointer(A.data)),
(\$(LR-2), \$(LC-2))
)
end
end
Base.getindex(A::SizedOffsetMatrix, i, j) = LoopVectorization.vload(LoopVectorization.stridedpointer(A), (i,j))
skern = SizedOffsetMatrix{eltype(kern),-1,1,-1,1}(parent(kern));
``````

Yielding

``````julia> @btime filter2davx!(\$B, \$A, \$skern);
771.416 ΞΌs (0 allocations: 0 bytes)
``````

Small but tangible improvement.

Of course, using the dense kernel

``````julia> kern = Images.Kernel.gaussian((1, 1), (3, 3))
3Γ3 OffsetArray(::Array{Float64,2}, -1:1, -1:1) with eltype Float64 with indices -1:1Γ-1:1:
0.0751136  0.123841  0.0751136
0.123841   0.20418   0.123841
0.0751136  0.123841  0.0751136

julia> const g11, g21, g31, g12, g22, g32, g13, g23, g33 = kern
3Γ3 OffsetArray(::Array{Float64,2}, -1:1, -1:1) with eltype Float64 with indices -1:1Γ-1:1:
0.0751136  0.123841  0.0751136
0.123841   0.20418   0.123841
0.0751136  0.123841  0.0751136

julia> @inline function f(w)
return @inbounds g11*w[-1, -1] + g21*w[0, -1] + g31*w[1, -1] +
g12*w[-1,  0] + g22*w[0,  0] + g32*w[1,  0] +
g13*w[-1,  1] + g23*w[0,  1] + g33*w[1,  1]
end
f (generic function with 1 method)

julia> k = StaticKernels.Kernel{(-1:1,-1:1)}(f)
Kernel{(-1:1, -1:1)} with window function

CodeInfo(
1 β       nothing
β         \$(Expr(:inbounds, true))
β   %3  = Base.getindex(w, -1, -1)
β   %4  = Main.g11 * %3
β   %5  = Base.getindex(w, 0, -1)
β   %6  = Main.g21 * %5
β   %7  = Base.getindex(w, 1, -1)
β   %8  = Main.g31 * %7
β   %9  = Base.getindex(w, -1, 0)
β   %10 = Main.g12 * %9
β   %11 = Base.getindex(w, 0, 0)
β   %12 = Main.g22 * %11
β   %13 = Base.getindex(w, 1, 0)
β   %14 = Main.g32 * %13
β   %15 = Base.getindex(w, -1, 1)
β   %16 = Main.g13 * %15
β   %17 = Base.getindex(w, 0, 1)
β   %18 = Main.g23 * %17
β   %19 = Base.getindex(w, 1, 1)
β   %20 = Main.g33 * %19
β         val = %4 + %6 + %8 + %10 + %12 + %14 + %16 + %18 + %20
β         \$(Expr(:inbounds, :pop))
βββ       return val
)

julia> @btime map!(\$k, \$b, \$a, inner=true);
2.263 ms (0 allocations: 0 bytes)

julia> @btime filter2davx!(\$B, \$A, \$kern);
795.978 ΞΌs (0 allocations: 0 bytes)

julia> skern = SizedOffsetMatrix{eltype(kern),-1,1,-1,1}(parent(kern));

julia> @btime filter2davx!(\$B, \$A, \$skern);
774.363 ΞΌs (0 allocations: 0 bytes)

julia> size.((B, A, b, a))
((1022, 1022), (1024, 1024), (1022, 1022), (1024, 1024))

julia> parent(B) β b
true
``````

Still, your benchmarks against the other libraries look really good for 300 loc. LoopVectorization has around 10 times that and several dependencies itself.

EDIT:
For the Laplacian kernel example, you could completely unroll the kernel for LoopVectorization as well.

``````julia> function filter2davx_laplacian!(out::AbstractMatrix, A::AbstractMatrix)
@avx for n in axes(out,2), m in axes(out,1)
out[m,n] = A[m,n-1] + A[m-1,n] + 4*A[m,n] + A[m+1,n] + A[m,n+1]
end
out
end
filter2davx_laplacian! (generic function with 2 methods)

julia> @btime map!(\$k, \$b, \$a, inner=true);
935.221 ΞΌs (0 allocations: 0 bytes)

julia> @btime filter2davx_laplacian!(\$B, \$A);
755.373 ΞΌs (0 allocations: 0 bytes)

julia> parent(B) β b
true

julia> axes(B)
(2:1023, 2:1023)
``````

Although, thatβs not much improvement over the 800ns I saw with the completely generic version on this machine. Iβd probably recommend preferring the `kern = convert(AbstractArray, Images.Kernel.Laplacian())` approach with the generic `filter2davx!`.

Regarding Images.jl, @tim.holy has been making PRs to LoopVectorization (e.g., the one that added CartesianIndexing support), so hopefully Images.jl will be able to take advantage of this fairly soon.

3 Likes

wow, thank you for the pointers! Iβve added LoopVectorization to the benchmark and it really does have quite an edge on dense and especially larger kernels, Iβm impressed.

I liked the idea of being able to use the kernel function for user-defined mutations as well which is why I didnβt want to make too many assumptions on the generated loops (currently they are not annotated and thus assumption-free).
But the idea of outer vectorization is very intriguing and Iβll experiment with it.
Do you think the llvm auto-vectorizer will ever become aware of this so everyone can profit?

2 Likes

The conventional wisdom on doing convolutions efficiently, for multidimensional arrays, is that itβs all about optimizing the sequence in which elements are looked up. Locality of memory access and cache optimization are supposed to be everything. I understand thatβs why DiffEqOperators uses NNlib. Have you thought about that?

1 Like

Locality of memory access and cache optimization are supposed to be
everything.

It is simply looping in the order the output array is laid out in memory which
leads to reasonable locality. It is thus cache-oblivious and the Julia
autovectorization takes care of the rest.

Boundary conditions are specialized and get spliced into these loops for every
dimension. This is contrary to other approaches like wrapping the array by
copying or using ghost nodes (like DiffEqOperators).

DiffEqOperators
uses
NNlib. Have you thought about that?

I have benchmarked against NNlib (see my first post).

1 Like

Outer vectorization is part of it. You can look at the benchmarks here, which show the same set of loops vectorized by a variety of compilers.
All but LoopVectorization will try vectorizing the inner loop by default.
Because the kernel is 3x3 in the example β too short to benefit from vectorization β all but LoopVectorization are slow.

LoopVectorization vectorizes an outer loop to avoid having to reduce a vector before storing into `out`.

Once you make the kernels statically sized, the compilers realize that vectorizing the inner loop is not profitable, and therefore donβt. Most of them will then decide to vectorize an outer loop instead, improving performance. βMostβ means βall of them I tested except Juliaβ, but once you unroll the inner loop, Julia gets the same performance as the rest β except ifort (which is slow for some reason) and LoopVectorization.

LoopVectorization performs better than the others, because it realizes that there are two loops it can unroll to eliminate redundant loads. Basically, because `A[i,j+(k+1)] = A[i,(j+1)+k]`, it unrolls the `j` and `k` loops so that it only has to load one of the above, and set the other as equal to it.
The actual amount to unroll these by is a function of the number of floating point registers available on the host CPU (32 for CPUs with AVX512, 16 for other x86_64 chips). The more you unroll, the more loads you can save, until all the registers are occupied. Once they are occupied, extra loads will βspillβ registers onto the stack, causing a lot of extra loads and stores as data is shuffled between the registers and stack, probably resulting in worse performance than not unrolling at all.

Although, maybe I should run these benchmarks at larger sizes. It looks like LoopVectorizationβs performance advantage is starting to disappear as `out` increases in size beyond 220 x 220 on my computer. By the time we get to 256 x 256, `A` and `out` no longer fit in the L2 cache:

``````julia> (256 * 256 + 258 * 258) * 8
1056800

julia> using VectorizationBase: CACHE_SIZE

julia> CACHE_SIZE[2]
1048576
``````

and thus memory bandwidth starts becoming more important than compute performance, and they should all be more or less equally fast. So maybe this becomes irrelevant by the time we get to `size(B) == (1022, 1022)`, like in the example benchmarks from this thread.
Iβll try benchmarking those other implementations shortly.

Running my benchmark script for `B` sizes 1021:1054:

``````julia> filter2d_dynamic_bench = benchmark_filter2ddynamic(sizes)#512:-1:2)
ββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ
β Size β              Julia β              Clang β           GFortran β                icc β              ifort β  LoopVectorization β
ββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββ€
β 1021 β 1.8178403910939909 β 2.4448306821853394 β 2.5599758758545472 β 2.3462737316727744 β 0.9316425099693625 β  6.388229132045654 β
β 1022 β 1.4650474013811996 β 2.4574988668329794 β 2.5565097150729046 β  2.368345881830715 β 0.9575968962523743 β  7.120309061395235 β
β 1023 β 1.4523050934125357 β  2.456864412268391 β 2.5581045546385814 β 2.2563586382703487 β 0.9439406425526551 β 6.7386551470755345 β
β 1024 β 1.4620156104343172 β 2.4406820133219034 β 2.5497492989037354 β 2.3508886689694903 β 0.8230009329853537 β 6.4987169022202576 β
β 1025 β 1.4653099596405736 β  2.453439172772119 β 2.5610229107458826 β 2.3523993502020546 β 0.9240179613286665 β 12.653379666036146 β
β 1026 β 1.4649694063529215 β  2.457283912915538 β  2.559811252904115 β  2.348638634181143 β 0.9489334085530406 β  8.997384570676422 β
β 1027 β 1.4644498501850818 β  2.454107686567634 β 2.5582532749347178 β  2.348745420177809 β 0.9334177468757741 β 10.073033793079814 β
β 1028 β 1.4645535198737467 β 2.4831666078310444 β  2.559519021516521 β   2.35243139281692 β 0.9290141677115838 β 16.591517193355756 β
β 1029 β  1.463922708108225 β  2.456257102559596 β 2.5594085077031945 β 2.3511307511464508 β 0.9247747696835061 β 22.382360255552598 β
β 1030 β 1.4638645348413444 β 2.4454624172509147 β  2.558976437802299 β 2.3513706822394744 β 1.2445576093960709 β 6.1476334646579165 β
β 1031 β 1.4510862949021326 β 2.4520947378376943 β 2.5573510300045132 β 2.2548478324291232 β 0.9166272782987794 β   6.04940973324174 β
β 1032 β 1.4528526827738566 β 2.4515460572905483 β 2.5583591600971096 β  2.260300897258687 β 0.9246098635188117 β  7.020013469747157 β
β 1033 β 1.4653147433715972 β 2.4548993343811247 β 2.5584759530511785 β 2.3465960120638014 β 0.9292823336027185 β  7.146309256257276 β
β 1034 β 1.4523129861528221 β  2.453986136387049 β  2.558641393968214 β 2.2502021085649218 β  0.946965923214369 β 5.8949214033269195 β
β 1035 β  1.464546550767119 β 2.4530308308164304 β 2.5590432787545936 β 2.3450589554079104 β 0.9193110638592383 β  6.431579709479529 β
β 1036 β 1.4650620333766335 β  2.445646191776606 β 2.5589423842075694 β 2.3460145530821825 β  0.929970907751645 β  6.837251149563763 β
β 1037 β 1.4645939832131403 β 2.4455518669712424 β  2.558959471872023 β 2.3456610537674556 β 0.9227649544141436 β  5.905494440887469 β
β 1038 β 1.4593597286609086 β 2.4344372990994736 β 2.5495059262129693 β 2.3420066135797026 β 1.2334391538952136 β  6.004958958926798 β
β 1039 β 1.4630049855918557 β 2.4546263754548288 β  2.556838551496205 β  2.335084278469622 β  0.957512004211583 β  6.004233947088257 β
β 1040 β  1.463050130743714 β  2.465014323793477 β 2.5568491928707133 β 2.3388599380686492 β  0.920929909216241 β  5.995172482826057 β
β 1041 β  1.464392925125857 β 2.4637232724420546 β 2.5581810350046847 β 2.3389516182785974 β 0.9146383022159846 β 6.3829338956366515 β
β 1042 β 1.4631205087965642 β  2.461875187728442 β 2.5575686094760073 β 2.3368615010489804 β  0.948747162571823 β 5.8752640337352044 β
β 1043 β 1.4634191755413013 β 2.4472510192450025 β 2.5588917364938704 β  2.337800539326161 β  0.933759181102324 β  5.970789244113779 β
β 1044 β 1.4634904820537336 β  2.459446874686412 β   2.55710376985874 β 2.3377327682071884 β 0.9254530978252374 β  6.318137768724167 β
β 1045 β 1.7860041054834286 β  2.467178794336835 β  2.561124335987429 β  2.345513493216436 β 0.8736223022517141 β  5.917879978552798 β
β 1046 β 1.4637260209248606 β 2.4622268678070762 β 2.5592437615931507 β 2.3482420414990544 β 1.0470275378226601 β  6.847554793406015 β
β 1047 β 1.4634893986356143 β 2.4508010534646734 β 2.5575093850329687 β 2.3456384126005947 β 0.8608966299172266 β  8.837372511584036 β
β 1048 β 1.4639202303550392 β 2.4616387423724624 β  2.556358986737475 β 2.3410966279534273 β 0.9422728217372728 β  9.797481359975357 β
β 1049 β 1.4630292329505217 β 2.4542890586993704 β 2.5572239652245856 β  2.344784748860405 β 0.8748710327774221 β 20.604989877494553 β
β 1050 β  1.463151075881509 β 2.4427694025683975 β 2.5563609913156693 β 2.3454162051647587 β 0.8533824519798495 β   21.5905032888295 β
β 1051 β 1.4632256404538175 β  2.464302216704647 β  2.558230751436519 β 2.3416139317061684 β 0.9315586167729809 β 11.692655094944906 β
β 1052 β 1.4642001899867962 β  2.456464786803328 β 2.5597315474592643 β 2.3510045519394325 β 0.9505331386456543 β 19.769258089395784 β
β 1053 β 1.4635455908577903 β  2.449856307871303 β 2.5564491716996147 β 2.3481485384732674 β  0.940296054969879 β 22.743510171972147 β
β 1054 β 1.4637250546487306 β 2.4548960792135937 β  2.559983885535248 β 2.3625230256141783 β 1.2179923592239215 β  23.91414215599374 β
ββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ
``````

``````julia> filter2d_3x3_bench = benchmark_filter2d3x3(sizes)#512:-1:2)
ββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ
β Size β              Julia β              Clang β           GFortran β                icc β              ifort β  LoopVectorization β
ββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββ€
β 1021 β 1.2551788220206912 β 10.000534407792083 β  8.941824018374602 β  7.941065868086378 β  3.877127536219201 β 10.566397520564575 β
β 1022 β 1.2547216989375067 β  11.91141551898785 β  9.049461019677073 β  8.126204430271414 β 1.8370533116247376 β 13.536970794795087 β
β 1023 β 1.6890159709904202 β  9.418720554905853 β   9.05658138480076 β  8.294455379812637 β 3.6259080823481535 β 10.029061233049072 β
β 1024 β 1.2551605048358676 β   11.3767539047816 β  8.544823457170645 β   8.58314033160025 β   2.36207136438046 β  17.23009589463479 β
β 1025 β 1.2551342929129348 β 11.568496575559848 β   9.03972938446962 β 10.769526923176647 β  4.056948155107192 β  20.83452414142251 β
β 1026 β  1.689876403033365 β  8.270507243831604 β  8.274293017049338 β  8.176583766900436 β  4.483629377525495 β  8.380554276821837 β
β 1027 β 1.2546948912015137 β  8.323960224133849 β  5.205748262732101 β  4.952628716496256 β  3.720494099006548 β  5.073267353065921 β
β 1028 β 1.2544085388435715 β 13.397522939063824 β  8.586060964938062 β  6.778118301943528 β  3.659501656879656 β  5.638452634724073 β
β 1029 β  1.254407829271221 β  7.081539831824071 β  5.455868811555582 β  5.511921016699926 β 3.9397467266169492 β  6.466429903439832 β
β 1030 β 1.2547484499053059 β   8.15050917803923 β  5.912941701036674 β  6.191081009642616 β 3.4667122351858426 β  4.919933405369308 β
β 1031 β 1.2545391671348247 β  13.09927864697921 β  7.056674544585479 β  6.108494231856809 β  3.624378030621512 β  6.559113131520651 β
β 1032 β 1.2550465354945572 β  6.529047566672857 β  5.158078656861457 β  5.170342748447148 β  3.808002745990803 β  6.226929582425081 β
β 1033 β 1.2550524464251667 β 6.6314387841195055 β  5.671450058729425 β  6.044335311847562 β 3.3448031268415255 β 5.1642089447035335 β
β 1034 β 1.2550358968820778 β  6.993354723702178 β  6.514939555716776 β 6.7110776994918595 β  4.518903130623564 β   6.49836429435122 β
β 1035 β 1.2549032727717757 β  6.069524402148934 β  7.475827814977333 β  5.015869094226311 β 3.4107407808350896 β  4.751227997214618 β
β 1036 β  1.252568805009168 β 6.3030567664470425 β  5.201525846657598 β  5.047339680994086 β  3.783478259968004 β  4.649607602449309 β
β 1037 β   1.25508292067117 β  7.513384030074459 β  7.502213162922935 β 6.9049278791223205 β  4.280397309344173 β  6.965812359874229 β
β 1038 β 1.2547469817716392 β  6.099449081313786 β  5.723073519734888 β  5.652116409871714 β 3.4303831534945655 β  5.815474999857125 β
β 1039 β 1.2549037668262317 β  5.090376805097961 β  4.926670374942081 β  4.951396773149147 β   3.30836482722168 β  4.909219919625342 β
β 1040 β 1.2548438958961106 β  5.266619157716729 β    5.4767917257172 β  5.048793212334221 β  3.497152021277486 β  5.429944962863147 β
β 1041 β  1.254694622328792 β  5.033385089985812 β  5.007599232819161 β   4.93410780669145 β  3.255869731115504 β  4.788728099897298 β
β 1042 β 1.2548135946612264 β  4.893691910900707 β  4.858145123296108 β  4.930489614124427 β 3.3430898443754264 β  4.850206103674865 β
β 1043 β 1.2548398377926953 β  4.865586613990924 β  5.162621130881511 β  5.418022287990747 β  3.474721350519458 β 6.0546939558590305 β
β 1044 β 1.2549365315662502 β  5.567524851008133 β 5.5525790891917035 β  5.784679284231125 β  3.767785837295706 β 12.775300095837615 β
β 1045 β 1.2547349716923968 β  5.434978632685068 β  4.993071612789222 β  5.358654155344301 β  3.519846748728815 β  5.966678451476047 β
β 1046 β 1.2549220850764238 β 6.4222209178265235 β  6.271976228499903 β  6.216651292129574 β  3.670299281784494 β  8.182937318165841 β
β 1047 β  1.251947851453187 β  6.778825157334282 β  6.415524661426301 β   6.30229310762313 β 3.5772289976723264 β  16.07746719465797 β
β 1048 β 1.2551189280193653 β   6.78882934743324 β  6.234693649258144 β   6.19219407848372 β  3.812819625296996 β 17.184897748431414 β
β 1049 β 1.6895351479493448 β  6.538035578417527 β  6.470540196985551 β  6.441164702884728 β 3.5506589291774513 β 13.623783409802638 β
β 1050 β 1.6890768434661723 β  6.436448378602248 β  6.080552354553382 β   6.54933940516588 β  3.655846946165764 β  20.28958051420839 β
β 1051 β 1.2546965049976808 β  6.396950769802026 β 6.0965010617944575 β   7.17417243717705 β 3.4782166649440636 β  14.73226546956298 β
β 1052 β 1.2550311900529638 β  6.731106609547211 β  6.176456413788485 β  8.033918961216868 β 3.7766364833661306 β  20.46598207292664 β
β 1053 β 1.2547526793481112 β  6.481130722853248 β  6.132268419249654 β  9.096269226073481 β 3.4578194613031914 β  16.86932767017661 β
β 1054 β 1.2545026046166026 β  6.119884534222356 β  5.855553740015088 β  9.576531843729526 β 3.5280621428260517 β 22.985242910816066 β
ββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ
``````

``````julia> filter2d_unrolled_bench = benchmark_filter2dunrolled(sizes)#512:-1:2)
ββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββ
β Size β              Julia β              Clang β           GFortran β                icc β              ifort β  LoopVectorization β
ββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββ€
β 1021 β   8.63374153385047 β  8.732223271774389 β  7.700053921894504 β  8.136815982074804 β  1.498925043500679 β  7.891565599777345 β
β 1022 β  9.080040480219358 β  8.818789999672205 β  7.926675178343437 β 6.8111954422808845 β 2.0033002808826503 β 5.4678408187737055 β
β 1023 β   8.70731402920091 β  7.401857225769008 β  7.976392719694987 β  6.175789825199149 β  3.456483283813895 β  7.219485679920725 β
β 1024 β 14.105484554290449 β   9.30351330982637 β 10.023995789265326 β 10.069675101469006 β 2.9662919403812626 β  7.993496072038267 β
β 1025 β   8.74132077420138 β  9.735438098461845 β  8.574105571101379 β  7.187166005652955 β 3.4777120726516535 β  5.954599600529027 β
β 1026 β  7.353777876400755 β  6.293404925073791 β  6.419819927118482 β 6.8445284863394305 β  3.654125638433443 β  5.975755029831204 β
β 1027 β  9.370018791884581 β   9.21506293138449 β 10.103394395640928 β 10.227406796028454 β  3.552156922255231 β  7.919102815791042 β
β 1028 β 15.183472840123866 β 14.408733786961555 β 13.802039422483737 β  9.990256271947995 β  4.271127609405676 β  8.172005172855336 β
β 1029 β  7.134498536262868 β 5.8043099514864425 β  6.470369789910558 β   5.64975930853109 β  3.377146774568478 β  6.202471161569295 β
β 1030 β  9.019503507970889 β  9.454494938639854 β  8.329696224010833 β   6.21913115682881 β 3.8277602340211443 β   5.43159615496973 β
β 1031 β  6.941791455878189 β  6.920162649511212 β  6.727455996854871 β  5.861960382061633 β 2.6772829004078225 β  5.590749601275917 β
β 1032 β 10.205254869158109 β  8.030643945106318 β   7.95027512469641 β  6.134768734087403 β 3.3765989528209954 β  7.161841712822355 β
β 1033 β 6.1933834411909805 β 6.7792623967490355 β  6.153233222313693 β  5.723946053807357 β  2.623832344458078 β    6.3659857523863 β
β 1034 β  6.881583460863652 β  6.603644063455201 β  6.667468323783392 β    5.7399381782674 β 3.2785127936779195 β  6.867368756727797 β
β 1035 β 14.930528226520533 β   16.1870402879936 β 13.418896723387983 β  9.743113920786348 β  3.167196392933698 β  7.958014091295831 β
β 1036 β  6.372850572528682 β  7.030486580795161 β  6.081655903405634 β  5.554542092036707 β 3.0294862169079106 β  7.808458163881472 β
β 1037 β  8.927771640769143 β  8.891751917940876 β  9.430248873143098 β  8.954578272650862 β 2.1160242487826766 β  7.997032817484811 β
β 1038 β  7.503983350471749 β  5.882372134653347 β  5.380499123594733 β  6.721678857510249 β 3.5956664838394192 β  5.480316063194686 β
β 1039 β  4.880610538498934 β   5.48663206003532 β  4.956074181817396 β   6.15759947039937 β 1.6887731232681473 β  7.436507164663673 β
β 1040 β  6.562467320967082 β  5.383935392306798 β  4.769700448845539 β  5.221126662979045 β 2.2168704331874682 β  4.918307288281304 β
β 1041 β  6.081513563476409 β  6.524719099133877 β  6.429336918181904 β  4.990559896626967 β 1.6893853854125565 β  5.097483905188994 β
β 1042 β  5.122918483428633 β  5.470764674569613 β  5.811882912114815 β   5.75270041744349 β 2.7841181777332853 β  5.903745032765198 β
β 1043 β 4.9466397045941175 β  5.280227832798889 β 5.5140298405998776 β  6.137973895386389 β 1.8113351489029939 β  6.980163257942189 β
β 1044 β  5.803695163136322 β  6.788873068831418 β  6.725216141466496 β  6.560574333582837 β 2.4200244028013853 β  7.515976748997068 β
β 1045 β  5.641607207380234 β  6.347360985411729 β  6.165100630576657 β  6.329947255056869 β  1.918261944017462 β  7.396052660410471 β
β 1046 β  7.346398620138879 β  7.235234556154645 β  7.805229847579577 β  6.758359576519098 β 3.5208101964331107 β 16.426992946097304 β
β 1047 β  6.736138100649773 β  6.602714065436322 β  7.309349885273872 β  6.647696116376712 β  2.017634369811932 β  7.083461491971808 β
β 1048 β  7.859965808972261 β  7.518943370635752 β  7.958005499929247 β  6.970927041696752 β 2.4028292670748175 β  9.520083212661376 β
β 1049 β  7.446960537928438 β  6.967811131622037 β  7.825466585512267 β  7.205646629044043 β   1.96674210456816 β  9.219081867647443 β
β 1050 β  7.650502747323395 β  7.146147047631829 β  7.812399049465396 β  7.978278479697904 β  2.953033183001873 β 11.212553034164412 β
β 1051 β   7.69344606107734 β  7.369934406065428 β  8.094610322599557 β 10.264242989397529 β  2.204027979493786 β 16.345627956289135 β
β 1052 β  7.602942578975529 β 7.5701755338946395 β  7.790682304926185 β  8.956238887394647 β 2.6695567285316986 β 14.296458630321858 β
β 1053 β  7.797251029684959 β  7.376104473617071 β  7.376491264319776 β 11.522262402945826 β  2.467240635946479 β 20.069518101664887 β
β 1054 β  7.440068390614413 β  7.338711394728723 β  5.450686705114226 β 14.847019404706407 β 3.3401162965501894 β 22.240822479214266 β
ββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββ
``````

Perhaps Iβll add `StaticKernels` to these benchmarks.

These look very different, and show much worse, performance than I get over the size range of 2-256.
That suggests either
b) If youβre doing more operations on these arrays than just convolutions, perform the whole series in on chunks that fit in your L2 cache, and iterate over those chunks of `A` and `B`.

Once you make the kernels statically sized, the compilers realize that
vectorizing the inner loop is not profitable, and therefore donβt. Most
of them will then decide to vectorize an outer loop instead, improving
performance. βMostβ means βall of them I tested except Juliaβ, but once
you unroll the inner loop, Julia gets the same performance as the rest
β except ifort (which is slow for some reason) and LoopVectorization.

I see, there is the unroll-and-jam
pass in llvm that seemingly promises to do outer loop vectorization.
Julia doesnβt enable that pass
though. Maybe just nobody has touched the julia loop-code lately?

LoopVectorization performs better than the others, because it realizes
that there are two loops it can unroll to eliminate redundant loads.
Basically, because `A[i,j+(k+1)] = A[i,(j+1)+k]`, it unrolls the `j` and
`k` loops so that it only has to load one of the above, and set the
other as equal to it.

That is quite neat, I might be able to do that transformation without too much
effort.

1 Like

Ah, that explains why Julia is the only one that doesnβt vectorize when the kernel is statically sized.

I think itβs worth experimenting with, but as I showed with tables/plots that Iβve now edited into my previous post, it helps a lot for arrays fitting in the L2 cache, but doesnβt seem to help for arrays larger than that. So if youβre working with large arrays, I wouldnβt expect much or any benefit, unless you can chunk the arrays and perform a series of operations on those chunks while theyβre in your L2 cache, before moving on to the next chunks.

Note that LoopVectorization got between 6-8 GFLOPS for most of those sizes in `1021:1054`, but was around 50 GFLOPS for sizes under 256 on my computer. Cache bandwidth is obviously a lot more important than compute here (on my computer, and probably any other x86_64 computer).

1 Like

Thatβs true for most simple operations. L2 blocking is one LoopVectorization has planned to add, but itβs pretty hard to do so generically.

That makes sense. In the worst case, every element of the kernel is reading a different region of memory, but the cache has more lines than there are kernel elements. Thanks.

I looked at the table again, and noticed that some times were in ΞΌs and others in ns. That makes it more impressive!

I started messing around with this. I have a question about using other functions inside the kernel function. Suppose I want a 3 point centered moving average, but I want `missing` as my boundary condition.

``````using StaticKernels,Missings,Statistics
a = allowmissing(rand(100))

# this works
kf_manual(w) = (w[-1]+w[0]+w[1])/3
Kmanual = Kernel{(-1:1,)}(kf_manual,StaticKernels.ExtensionConstant(missing))
map(Kmanual,a)

# this does not
kf_mean(w) = mean(Tuple(w))
Kmean = Kernel{(-1:1,)}(kf_mean,StaticKernels.ExtensionConstant(missing))
map(Kmean,a)
``````

The second way doesnβt respect the extension, but I canβt even get it to work at all without using `Tuple`, which doesnβt use bounds checking. The first method is fine if I know the window size in advance, but what if I donβt?

1 Like

Awesome work. Weβll be sure to make use of all of this work in DiffEqOperators.

BTW, @Elrod one of the things I hope we can add to our fast convolution support is irregular stencils, since this is required for finite differences with irregular `dx`. Is this something that would also benefit from interesting caching styles, or are fallbacks to simple loops just as good?

Our route forward might be to utilize LoopVectorization on our fused kernels, but still fall back to `conv` on GPUs for CuDNN. Let me know if there are any better ideas. Essentially what I want to be doing with this whole thing is for someone to simply say βI want the 4th order central difference with these dxβs, dyβs, and dzβsβ, and I send that kernel to whatever stencil library does well.

Great question @tbeason. The quick and easy way around it would be to access the kernel axes from within the window:

``````kf_mean(w) = mean(w[i] for i in CartesianIndices(axes(w.kernel)))
Kmean = Kernel{(-1:1,)}(kf_mean, StaticKernels.ExtensionConstant(missing))
map(Kmean, a)
``````

`Tuple(w)` currently gives you a tuple of only the inner values, but it makes sense to have it return data for the whole kernel when boundary conditions are active, I will think about that.
Currently you could define it like this:

``````@generated KTuple(w) = :( Base.@_inline_meta; @inbounds (\$((:(w[\$i]) for i in CartesianIndices(w.parameters[4].parameters[1]))...),) )
``````

@ChrisRackauckas I looked at some of the work in DiffEqOperators and I suspect (tell me if Iβm wrong) that it might be a bit more than just sending the kernel to whatever stencil library does well: you are currently applying the operators to an extended `BoundaryPaddedArray` which includes runtime checks for every `getindex`.
If you want to avoid runtime checks for boundaries you need some way to statically generate the loop to avoid that, which is one of the selling points of this package that might have been overshadowed by the performance debate against LoopVectorization for interior kernels.

1 Like

No, we dispatch on the `BoundaryPaddedArray` to avoid that. Thereβs no overhead in `getindex` because we just directly use the array (if we didnβt, then `conv` wouldnβt workβ¦). We can change that dispatch to anything.

Youβre right, I was wrong about `getindex` being used.
The dispatch seems to happen on `mul!` with `DerivativeOperator` in `src/derivative_operators/derivative_operator_functions.jl` where the boundary is handled separately from the interior:

``````cv = DenseConvDims(_M, W, padding=pad, flipkernel=true)
conv!(_x_temp, _M, W, cv)
# Now deal with boundaries
# [...]
``````

If you want to leverage the boundary handling in this package which ensures instead that the boundaries are dealt with in memory order, then just switching out the `conv!` line wonβt do, but instead something like defining `RobinBoundaryCondition <: StaticKernels::Extension` and overloading `getindex_extension()`.

Interesting. Do you plan to add LoopVectorization.jl under the hood? If you add that in to get the same speed with a higher level interface, thatβs a pretty nice option for us to target.

Also, just an FYI, the thing that would make me not use StaticKernels.jl right now is that itβs a single contributor library with no CI, no CompatHelper, no TagBot, no code coverage, etc. so it doesnβt live up to whatβs required to be a SciML dependency since itβs hard to see what the long-term maintenance plan of this repository looks like. Of course, all of those can be fixed up.

2 Likes