KernelForge.jl — High-performance portable GPU primitives for arbitrary types and operators

I’m happy to announce two related packages for high-performance GPU computing in Julia:

  • KernelForge.jl — high-level GPU primitives (mapreduce, scan, matvec, search, vectorized copy) with performance competitive with vendor-optimized libraries
  • KernelIntrinsics.jl — the low-level foundation, providing warp shuffles, vectorized memory access, and memory ordering primitives on top of KernelAbstractions.jl

Motivation

Julia already has excellent options for GPU computing:

  • CUDA.jl is mature and highly optimized, but is CUDA-specific and relies on vendor libraries (cuBLAS, CUB) for performance-critical primitives
  • AcceleratedKernels.jl takes a cross-architecture approach via KernelAbstractions.jl, trading some performance against vendor libraries for portability

KernelForge.jl aims to combine the best of both: cross-architecture portability (once the underlying intrinsics are available on other backends) with performance matching or exceeding vendor libraries (at least on an NVIDIA RTX 1000), through pure Julia implementations using memory ordering and flags, warp-level reductions, and vectorized memory loads/stores.

What makes this different?

Two complementary goals drive the design:

  • Generality: support for arbitrary isbitstype structs, custom operators, and non-contiguous views — not just Float32 with +
  • Performance: match cuBLAS and CUB on concrete types like Float32

Both goals are achieved simultaneously, at least on a laptop GPU (NVIDIA RTX 1000) — see the performance section for benchmarks.

Available primitives

Primitive            Description
mapreduce            General reduction over any dimension(s); arbitrary types and operators
scan!                1D prefix scan (like accumulate in Julia Base); supports non-commutative operators
vcopy!               Vectorized copy with configurable load/store widths
matvec, vecmat       Matrix-vector and vector-matrix products with custom operators
argmax, argmin       Index of an extremum
findfirst, findlast  Search on GPU arrays

Moreover, none of these primitives require an init or neutral-element argument, unlike CUDA.jl or AcceleratedKernels.jl!

import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)
dst = similar(src)

# Full reduction with custom function and operator
total = KF.mapreduce(abs2, +, src)

# Reduction over specific dimensions
B = CUDA.rand(Float32, 4, 8, 16)
result = KF.mapreduce(identity, +, B; dims=(1, 3))  # shape: (1, 8, 1)

# Views are fully supported
v = view(src, 1:2:10^6)
out = similar(v)  # destination sized to match the view
KF.scan!(+, out, v)

# Search
i = KF.findfirst(>(0.99f0), src)
j = KF.argmax(src)

Roadmap

Time permitting, the next planned primitives are sort, matrix-matrix product, and possibly some graph algorithms.

Architecture

KernelForge.jl          ← GPU primitives (mapreduce, scan, matvec, search, ...)
       │
KernelIntrinsics.jl     ← Low-level intrinsics (warp shuffle, vectorized load/store, memory fences)
       │
KernelAbstractions.jl   ← Backend dispatch
       │
   CUDA.jl              ← (AMD/Intel planned)

KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for library developers rather than end users. Extending support to AMD or Intel GPUs would primarily require work in KernelIntrinsics.jl; KernelForge.jl itself is designed to need minimal adaptation.
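To give a flavor of the kind of intrinsic KernelIntrinsics.jl wraps, here is the classic warp-shuffle tree reduction written directly against CUDA.jl (a hedged sketch of the general technique; KernelIntrinsics.jl's own API may differ):

```julia
using CUDA

# Each step, every lane adds the value held by the lane `offset` positions
# above it; after log2(32) halving steps, lane 0 holds the warp-wide sum.
@inline function warp_sum(val::Float32)
    offset = 16
    while offset >= 1
        val += CUDA.shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    return val  # only meaningful on lane 0
end
```

Doing this once per warp, then once more over per-warp partial results, is the standard two-level pattern behind fast GPU reductions.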

Current status

Both packages are experimental and currently CUDA-only, though an extensive test suite is already in place. Note that vectorized loads and stores in KernelIntrinsics.jl do not perform bounds checking. Correctness and performance have been validated on an NVIDIA RTX 1000.

Feedback, bug reports, and contributions are very welcome — especially from anyone with access to AMD or Intel hardware!

26 Likes

Awesome packages! Not sure about the name of the second, though, since there is already a KernelIntrinsics module within KernelAbstractions.jl.

2 Likes

Oh, I didn’t know about that! I already changed the name once, since it was first “MemoryAccess”, but that felt too general. Maybe something like KernelTools? Or it might be worth integrating the features into an existing package rather than maintaining a separate one. That could be quite heavy, though, since it already comes with several features, each requiring an extension per backend.

I am also happy to take PRs to KernelAbstractions.jl or Atomix.jl/UnsafeAtomics.jl.

For KernelAbstractions.jl, we currently maintain the 0.9 release branch and the 0.10 development branch. The only “challenge” for KA is that functionality should work on all backends, or at least there should be a query function.

1 Like

Thanks! I agree that warp shuffle operations and vectorized loads/stores would be a natural fit for KernelAbstractions.jl (although I wouldn’t quite know how to handle them on CPU — maybe with warp size equal to 1?), and that memory ordering primitives would belong in UnsafeAtomics.jl. I went the separate package route to get a quick proof of concept out before committing to full multi-backend support — and honestly, writing my own package is also a great way to learn quickly. I’d be happy to contribute the relevant pieces upstream once they’re more mature!
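The warp-size-1 idea could look something like the following (a speculative sketch, not KernelAbstractions.jl API; `warp_reduce` and `shfl_down` are made-up names). With a single lane, the shuffle loop body never executes, so the warp reduction degenerates to the identity on a CPU backend:

```julia
# Speculative sketch: a warp reduction parameterized by lane count and a
# backend-provided shuffle. With lanes == 1 (CPU), the loop never runs.
function warp_reduce(op, val, lanes::Integer, shfl_down)
    offset = lanes >> 1
    while offset >= 1
        val = op(val, shfl_down(val, offset))
        offset >>= 1
    end
    return val
end

# Hypothetical CPU "backend": one lane, so the shuffle is never called.
cpu_shfl(val, offset) = val
warp_reduce(+, 3.0, 1, cpu_shfl)  # returns 3.0 unchanged
```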

2 Likes

This is a very interesting package, thank you for putting this together!

Do I understand correctly that one could use KernelForge.jl to, for example, run integer operations on a GPU?

1 Like

Yes, exactly. Actually, you can already do this with CUDA.jl and AcceleratedKernels.jl. The difference here is performance matching cuBLAS and CUB (see the benchmarks), even for the copy operation!

Also, I included many examples showing that KernelForge also works on more complex structures (see the UnitFloat8 type, or Quaternions for the scan function).

Types must be isbits, though, but that’s a general GPU requirement.
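For instance, a reduction over a custom isbits struct might look like this (a hedged sketch based on the API shown in the announcement; `Vec2` is a made-up type, not one of the package’s examples, and whether a `zero` method is needed for neutral-element inference is an assumption):

```julia
import KernelForge as KF
using CUDA

# Any isbits struct with a suitable operator can serve as a reduction element.
struct Vec2
    x::Float32
    y::Float32
end
Base.:+(a::Vec2, b::Vec2) = Vec2(a.x + b.x, a.y + b.y)
Base.zero(::Type{Vec2}) = Vec2(0f0, 0f0)  # possibly needed for neutral-element inference

@assert isbitstype(Vec2)

data = CuArray([Vec2(rand(Float32), rand(Float32)) for _ in 1:10^5])
total = KF.mapreduce(identity, +, data)  # elementwise sum over both fields
```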

1 Like

Nice work. Backend-agnostic warp operations would be useful for us in Molly.jl, where the force kernel relies on warp operations and we want to work on different GPUs.

There is a KernelAbstractions.jl issue open about warp operations: exposing warp-level semantics · Issue #420 · JuliaGPU/KernelAbstractions.jl · GitHub.