[ANN] Stencils.jl for fast small/direct stencils on CPU/GPU

Raf · November 22, 2023, 3:22pm

Thought I should announce this as its getting kind of stable and useful:

The core idea of Stencils.jl is that a Stencil is a StaticVector of values extracted from an array at a specific position, following some pattern over N dimensions.

These can then be mapped over arrays on CPUs or GPUs (using KernelAbstractions.jl) to do things like convolutions or cellular automata.

Here’s a taste of some stencil patterns from the readme:

julia> Moore(1)
Moore{1, 2, 8, Nothing}
█▀█
▀▀▀


julia> Circle{3,2}()
Circle{3, 2, 37, Nothing}
 ▄███▄ 
███████
▀█████▀
  ▀▀▀  

julia> Window(3)
Window{3, 2, 49, Nothing}
███████
███████
███████
▀▀▀▀▀▀▀
# You can also define custom patterns, even with named positions
julia> ns = NamedStencil(n=(-1, 0), w=(0, -1), s=(1, 0), e=(0, 1))
NamedStencil{(:n, :w, :s, :e), ((-1, 0), (0, -1), (1, 0), (0, 1)), 1, 2, 4, Nothing}
▄▀▄
 ▀ 

# A StencilArray is used for stencil mapping and extraction
julia> A = StencilArray(
           [0 2 0 0 4
            0 0 6 0 0
            0 0 0 8 7
            0 0 3 0 1
            5 0 1 0 9],
           ns
       );

# You can retrieve a stencil at a specific location, and the named value:
julia> stencil(A, (4, 4)).n
8

julia> stencil(A, (4, 4)).w
3

And heres a benchmark of a mean blur on a laptop. Performance is pretty good for small kernels, they should be quite competitive with ImageFiltering.jl or DSP.jl even without GPUs. There are no special optimizations like splitting Gaussian kernels for convolutions, but its pretty easy to write that if you need it.

using Stencils, Statistics, BenchmarkTools
# Define a random array
r = rand(1000, 1000)
# Use a square 3x3/radius 1 stencil
stencil = Window(1)
# Wrap them both as a StencilArray
A = StencilArray(r, stencil)
# Map `mean` over all stencils in the array. You can use any function here - 
# `identity` would return an array of `Window` stencils.
@benchmark mapstencil(mean, A)

BenchmarkTools.Trial: 1058 samples with 1 evaluation.
 Range (min … max):  2.755 ms … 9.693 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.373 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.718 ms ± 1.326 ms  ┊ GC (mean ± σ):  2.92% ± 5.78%

    ▆▂                                   ▁█▅                 
  ▂▇██▄▁▂▂▁▂▂▄▅▂▂▁▂▁▁▁▂▁▁▁▂▂▁▂▂▂▂▂▂▂▂▂▅▄▂███▆▄▁▁▁▁▂▁▁▂▁▁▂▄▄ ▂
  2.75 ms        Histogram: frequency by time       6.82 ms <

 Memory estimate: 7.64 MiB, allocs estimate: 110.

Feedback appreciated!

RoyiAvital · November 22, 2023, 4:16pm

This is great.
Could you compare performance with StaticKernels.jl?

How does it handle borders?
Is it like virtual extension or optimized loop over the boundaries?

Raf · November 22, 2023, 5:01pm

I doubt I will have time do a comparison, but I would love to see it if you or someone else did

But essentially Stencils.jl performance is just a combination of KernelAbstractions.jl and StaticArrays.jl performance - stencils are built with generated functions and are regular static vectors with all the optimisations already written for them in StaticArrays.jl. Then KernelAbstractions.jl maps function on them over an array.

Within this paradigm there is not much room to improve performance (or much code at all really), and if there is its likely an actual implementation bug rather than just bad performance.

(there are various padding optimizations and only half of them are implemented iin Stencils.jl. A two-step interior/border conditional pad may help a little bit, but it also means an extra GPU kernel launch)

Topic		Replies	Views
Generic way to write array stencils/kernels for CPUs and GPUs? General Usage	5	610	September 28, 2018
Writing fast stencil computation kernels that work on both CPUs and GPUs GPU	3	2204	January 29, 2019
[ANN]: StaticKernels.jl - Fast Kernel Operations on Arrays Package Announcements	24	2713	April 27, 2020
Writing stencils for CuArray GPU	6	1144	July 31, 2019
Computationally efficient image filtering Performance package , dsp , methods	3	350	May 16, 2023

[ANN] Stencils.jl for fast small/direct stencils on CPU/GPU

Related topics