[ANN] Stencils.jl for fast small/direct stencils on CPU/GPU

Thought I should announce this as its getting kind of stable and useful:

The core idea of Stencils.jl is that a Stencil is a StaticVector of values extracted from an array at a specific position, following some pattern over N dimensions.

These can then be mapped over arrays on CPUs or GPUs (using KernelAbstractions.jl) to do things like convolutions or cellular automata.

Here’s a taste of some stencil patterns from the readme:

julia> Moore(1)
Moore{1, 2, 8, Nothing}
█▀█
▀▀▀


julia> Circle{3,2}()
Circle{3, 2, 37, Nothing}
 ▄███▄ 
███████
▀█████▀
  ▀▀▀  

julia> Window(3)
Window{3, 2, 49, Nothing}
███████
███████
███████
▀▀▀▀▀▀▀
# You can also define custom patterns, even with named positions
julia> ns = NamedStencil(n=(-1, 0), w=(0, -1), s=(1, 0), e=(0, 1))
NamedStencil{(:n, :w, :s, :e), ((-1, 0), (0, -1), (1, 0), (0, 1)), 1, 2, 4, Nothing}
▄▀▄
 ▀ 

# A StencilArray is used for stencil mapping and extraction
julia> A = StencilArray(
           [0 2 0 0 4
            0 0 6 0 0
            0 0 0 8 7
            0 0 3 0 1
            5 0 1 0 9],
           ns
       );

# You can retrieve a stencil at a specific location, and the named value:
julia> stencil(A, (4, 4)).n
8

julia> stencil(A, (4, 4)).w
3

And heres a benchmark of a mean blur on a laptop. Performance is pretty good for small kernels, they should be quite competitive with ImageFiltering.jl or DSP.jl even without GPUs. There are no special optimizations like splitting Gaussian kernels for convolutions, but its pretty easy to write that if you need it.

using Stencils, Statistics, BenchmarkTools
# Define a random array
r = rand(1000, 1000)
# Use a square 3x3/radius 1 stencil
stencil = Window(1)
# Wrap them both as a StencilArray
A = StencilArray(r, stencil)
# Map `mean` over all stencils in the array. You can use any function here - 
# `identity` would return an array of `Window` stencils.
@benchmark mapstencil(mean, A)

BenchmarkTools.Trial: 1058 samples with 1 evaluation.
 Range (min … max):  2.755 ms … 9.693 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.373 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.718 ms ± 1.326 ms  ┊ GC (mean ± σ):  2.92% ± 5.78%

    ▆▂                                   ▁█▅                 
  ▂▇██▄▁▂▂▁▂▂▄▅▂▂▁▂▁▁▁▂▁▁▁▂▂▁▂▂▂▂▂▂▂▂▂▅▄▂███▆▄▁▁▁▁▂▁▁▂▁▁▂▄▄ ▂
  2.75 ms        Histogram: frequency by time       6.82 ms <

 Memory estimate: 7.64 MiB, allocs estimate: 110.

Feedback appreciated!

15 Likes

This is great.
Could you compare performance with StaticKernels.jl?

How does it handle borders?
Is it like virtual extension or optimized loop over the boundaries?

I doubt I will have time do a comparison, but I would love to see it if you or someone else did :wink:

But essentially Stencils.jl performance is just a combination of KernelAbstractions.jl and StaticArrays.jl performance - stencils are built with generated functions and are regular static vectors with all the optimisations already written for them in StaticArrays.jl. Then KernelAbstractions.jl maps function on them over an array.

Within this paradigm there is not much room to improve performance (or much code at all really), and if there is its likely an actual implementation bug rather than just bad performance.

(there are various padding optimizations and only half of them are implemented iin Stencils.jl. A two-step interior/border conditional pad may help a little bit, but it also means an extra GPU kernel launch)

1 Like