KernelForge.jl — High-performance portable GPU primitives for arbitrary types and operators

I’m happy to announce two related packages for high-performance GPU computing in Julia:

  • KernelForge.jl — high-level GPU primitives (mapreduce, scan, matvec, search, vectorized copy) with performance competitive with vendor-optimized libraries
  • KernelIntrinsics.jl — the low-level foundation, providing warp shuffles, vectorized memory access, and memory ordering primitives on top of KernelAbstractions.jl

Motivation

Julia already has excellent options for GPU computing:

  • CUDA.jl is mature and highly optimized, but is CUDA-specific and relies on vendor libraries (cuBLAS, CUB) for performance-critical primitives
  • AcceleratedKernels.jl takes a cross-architecture approach via KernelAbstractions.jl, trading some performance against vendor libraries for portability

KernelForge.jl aims to combine the best of both: cross-architecture portability (once the underlying intrinsics are available on other backends) with performance matching or exceeding vendor libraries (at least on an NVIDIA RTX 1000), achieved through pure Julia implementations using memory ordering and flags, warp-level reductions, and vectorized memory loads/stores.

What makes this different?

Two complementary goals drive the design:

  • Generality: support for arbitrary isbitstype structs, custom operators, and non-contiguous views — not just Float32 with +
  • Performance: match cuBLAS and CUB on concrete types like Float32

Both goals are achieved simultaneously, at least on a laptop GPU (NVIDIA RTX 1000) — see the performance section for benchmarks.

Available primitives

| Primitive | Description |
|---|---|
| `mapreduce` | General reduction over any dimension(s), arbitrary types and operators |
| `scan!` | 1D prefix scan (like `accumulate` in Julia Base), supports non-commutative operators |
| `vcopy!` | Vectorized copy with configurable load/store widths |
| `matvec`, `vecmat` | Matrix-vector and vector-matrix products with custom operators |
| `argmax`, `argmin` | Index of extremum |
| `findfirst`, `findlast` | Search on GPU arrays |

Moreover, unlike CUDA.jl or AcceleratedKernels.jl, none of these primitives require an `init` or neutral-element argument!

```julia
import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)

# Full reduction with custom function and operator
total = KF.mapreduce(abs2, +, src)

# Reduction over specific dimensions
B = CUDA.rand(Float32, 4, 8, 16)
result = KF.mapreduce(identity, +, B; dims=(1, 3))  # shape: (1, 8, 1)

# Views are fully supported
v = view(src, 1:2:10^6)
dst = similar(v)  # destination matching the view's length
KF.scan!(+, dst, v)

# Search
i = KF.findfirst(>(0.99f0), src)
j = KF.argmax(src)
```
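To illustrate the generality claim (arbitrary isbits structs with custom operators), here is a hedged sketch of a one-pass mean computed via `KF.mapreduce` over a user-defined struct. The `SumCount` type and `combine` operator are my own illustration, not part of the package; exact behavior on custom structs may differ from this sketch, so consult the package docs:

```julia
import KernelForge as KF
using CUDA

# An isbits struct: a running (sum, count) pair for a one-pass mean.
struct SumCount
    s::Float32
    n::Int32
end

# Custom associative operator combining two partial results.
combine(a::SumCount, b::SumCount) = SumCount(a.s + b.s, a.n + b.n)

x = CUDA.rand(Float32, 10^6)

# Map each element to a (value, 1) pair, then reduce with the custom operator.
# Note: no init/neutral-element argument is needed.
sc = KF.mapreduce(v -> SumCount(v, Int32(1)), combine, x)
mean_x = sc.s / sc.n
```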

Roadmap

Time permitting, the next planned primitives are sort, matrix-matrix product, and/or some graph algorithms.

Architecture

```
KernelForge.jl          ← GPU primitives (mapreduce, scan, matvec, search, ...)
       │
KernelIntrinsics.jl     ← Low-level intrinsics (warp shuffle, vectorized load/store, memory fences)
       │
KernelAbstractions.jl   ← Backend dispatch
       │
   CUDA.jl              ← (AMD/Intel planned)
```

KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for library developers rather than end users. Extending support to AMD or Intel GPUs would primarily require work in KernelIntrinsics.jl; KernelForge.jl itself is designed to need minimal adaptation.
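For context, the warp-shuffle reduction pattern that KernelIntrinsics.jl abstracts looks like this when written directly against CUDA.jl's `shfl_down_sync` (KernelIntrinsics.jl's portable equivalent will have its own names and signatures; this is only an illustration of the underlying technique):

```julia
using CUDA

# Device function: warp-level sum. Each of the 32 lanes contributes `val`;
# after the loop, lane 0 holds the sum over the whole warp.
# 0xffffffff is the mask selecting all 32 lanes.
function warpsum(val::Float32)
    offset = 16
    while offset > 0
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    return val
end
```

Building block-level and grid-level reductions on top of this, together with memory fences for inter-block communication, is what lets the higher-level primitives avoid multiple kernel launches.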

Current status

Both packages are experimental and currently CUDA-only, though an extensive test suite is already in place. Note that vectorized loads and stores in KernelIntrinsics.jl do not perform bounds checking. Correctness and performance have been validated on an NVIDIA RTX 1000.

Feedback, bug reports, and contributions are very welcome — especially from anyone with access to AMD or Intel hardware!
