KernelForge.jl — High-performance portable GPU primitives for arbitrary types and operators

I’m happy to announce two related packages for high-performance GPU computing in Julia:

  • KernelForge.jl — high-level GPU primitives (mapreduce, scan, matvec, search, vectorized copy) with performance competitive with vendor-optimized libraries
  • KernelIntrinsics.jl — the low-level foundation, providing warp shuffles, vectorized memory access, and memory ordering primitives on top of KernelAbstractions.jl

Motivation

Julia already has excellent options for GPU computing:

  • CUDA.jl is mature and highly optimized, but is CUDA-specific and relies on vendor libraries (cuBLAS, CUB) for performance-critical primitives
  • AcceleratedKernels.jl takes a cross-architecture approach via KernelAbstractions.jl, trading some performance against vendor libraries for portability

KernelForge.jl aims to combine the best of both: cross-architecture portability (once the underlying intrinsics are available on other backends) with performance matching or exceeding vendor libraries (at least on an NVIDIA RTX 1000), through pure Julia implementations that use memory-ordering primitives and flags, warp-level reductions, and vectorized loads/stores.

What makes this different?

Two complementary goals drive the design:

  • Generality: support for arbitrary isbitstype structs, custom operators, and non-contiguous views — not just Float32 with +
  • Performance: match cuBLAS and CUB on concrete types like Float32

Both goals are achieved simultaneously, at least on a laptop GPU (NVIDIA RTX 1000) — see the performance section for benchmarks.
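To make the generality goal concrete, here is a CPU illustration using plain `Base.mapreduce` (not KernelForge's kernels): reducing an array with a custom isbits struct and a custom operator, which is the same shape of problem KernelForge targets on the GPU.

```julia
# A custom isbits struct: a running (min, max) summary of Float32 values.
struct MinMax
    lo::Float32
    hi::Float32
end

# Custom associative operator combining two summaries.
combine(a::MinMax, b::MinMax) = MinMax(min(a.lo, b.lo), max(a.hi, b.hi))

xs = Float32[3, 1, 4, 1, 5, 9, 2, 6]

# CPU reference semantics; on the GPU the equivalent call would be
# something like KF.mapreduce(x -> MinMax(x, x), combine, src) on a CuArray.
summary = mapreduce(x -> MinMax(x, x), combine, xs)
# summary == MinMax(1.0f0, 9.0f0)

isbitstype(MinMax)  # true, so this type is eligible for GPU kernels
```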

Available primitives

| Primitive | Description |
|---|---|
| `mapreduce` | General reduction over any dimension(s), arbitrary types and operators |
| `scan!` | 1D prefix scan (like `accumulate` in Julia Base); supports non-commutative operators |
| `vcopy!` | Vectorized copy with configurable load/store widths |
| `matvec`, `vecmat` | Matrix-vector and vector-matrix products with custom operators |
| `argmax`, `argmin` | Index of extremum |
| `findfirst`, `findlast` | Search on GPU arrays |
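
For readers unfamiliar with what a configurable load/store width buys, the effect can be pictured on the CPU by reinterpreting an array at a wider element type, so each copy step moves several elements at once. This is only a sketch of the idea, not KernelForge's implementation:

```julia
src = rand(Float32, 1024)
dst = similar(src)

# View both arrays as 16-byte elements (four Float32 per element),
# mimicking a 128-bit vectorized load/store width.
src4 = reinterpret(NTuple{4,Float32}, src)
dst4 = reinterpret(NTuple{4,Float32}, dst)
copyto!(dst4, src4)

dst == src  # true
```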

Moreover, unlike CUDA.jl or AcceleratedKernels.jl, none of these primitives require an `init` or neutral-element argument!

import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)
dst = similar(src)

# Full reduction with custom function and operator
total = KF.mapreduce(abs2, +, src)

# Reduction over specific dimensions
B = CUDA.rand(Float32, 4, 8, 16)
result = KF.mapreduce(identity, +, B; dims=(1, 3))  # shape: (1, 8, 1)

# Views are fully supported
v = view(src, 1:2:10^6)
KF.scan!(+, dst, v)

# Search
i = KF.findfirst(>(0.99f0), src)
j = KF.argmax(src)
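
As a CPU reference for what a non-commutative scan computes (via `Base.accumulate` here; the `KF.scan!` call in the comment is hypothetical usage based on the signature shown above):

```julia
# Affine maps x ↦ a*x + b encoded as isbits tuples (a, b).
# Composition is associative but NOT commutative.
compose((a1, b1), (a2, b2)) = (a2 * a1, a2 * b1 + b2)

maps = [(2.0f0, 1.0f0), (3.0f0, 0.0f0), (1.0f0, -4.0f0)]

# CPU reference semantics; on the GPU one would call something like
# KF.scan!(compose, dst, src) on a CuArray of such tuples.
prefix = accumulate(compose, maps)
# prefix[2] composes maps[1] then maps[2]: x ↦ 3(2x + 1) = 6x + 3,
# i.e. prefix[2] == (6.0f0, 3.0f0)
```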

Roadmap

Time permitting, the next planned primitives are sort, a matrix-matrix product, and/or some graph algorithms.

Architecture

KernelForge.jl          ← GPU primitives (mapreduce, scan, matvec, search, ...)
       │
KernelIntrinsics.jl     ← Low-level intrinsics (warp shuffle, vectorized load/store, memory fences)
       │
KernelAbstractions.jl   ← Backend dispatch
       │
   CUDA.jl              ← (AMD/Intel planned)

KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for library developers rather than end users. Extending support to AMD or Intel GPUs would primarily require work in KernelIntrinsics.jl; KernelForge.jl itself is designed to need minimal adaptation.
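For readers unfamiliar with the warp-shuffle reduction pattern that KernelIntrinsics exposes, here is a CPU simulation of the shuffle-down tree, with a plain array standing in for the registers of a 32-lane warp (this is an illustration of the pattern, not the actual GPU API):

```julia
# Simulate a shuffle-down tree reduction across a warp on the CPU.
# lanes[i] plays the role of the register held by lane i.
function warp_reduce_sim(op, lanes::Vector{T}) where T
    n = length(lanes)          # warp size, assumed a power of two (32 here)
    vals = copy(lanes)
    offset = n ÷ 2
    while offset ≥ 1
        for lane in 1:offset
            # shfl_down step: each lane combines with the lane `offset` above it
            vals[lane] = op(vals[lane], vals[lane + offset])
        end
        offset ÷= 2
    end
    return vals[1]             # lane 1 ends up holding the full reduction
end

warp_reduce_sim(+, collect(Float32, 1:32))  # 528.0f0 == sum(1:32)
```

On real hardware each step is a single register exchange within the warp, which is why warp-level reductions avoid shared-memory traffic for the last levels of the tree.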

Current status

Both packages are experimental and currently CUDA-only, though an extensive test suite is already in place. Note that vectorized loads and stores in KernelIntrinsics.jl do not perform bounds checking. Correctness and performance have been validated on an NVIDIA RTX 1000.

Feedback, bug reports, and contributions are very welcome — especially from anyone with access to AMD or Intel hardware!

30 Likes

Awesome packages! I'm not sure about the name of the second one, though, since there is already a KernelIntrinsics module within KernelAbstractions.jl.

2 Likes

Oh, I didn’t know about that! I already changed the name once (it was first “MemoryAccess”, but that felt too general). Maybe something like KernelTools? Or it might be worth integrating the features into an existing package rather than maintaining a separate one. It could be quite a heavy addition, though, since it already comes with several features, each requiring an extension per backend.

I am also happy to take PRs to KernelAbstractions.jl or Atomix.jl/UnsafeAtomics.jl.

For KernelAbstractions.jl, we currently maintain the 0.9 release branch and the 0.10 development branch. The only “challenge” for KA is that functionality should work on all backends, or at least there should be a query function.

1 Like

Thanks! I agree that warp shuffle operations and vectorized loads/stores would be a natural fit for KernelAbstractions.jl (although I wouldn’t quite know how to handle them on CPU — maybe with warp size equal to 1?), and that memory ordering primitives would belong in UnsafeAtomics.jl. I went the separate package route to get a quick proof of concept out before committing to full multi-backend support — and honestly, writing my own package is also a great way to learn quickly. I’d be happy to contribute the relevant pieces upstream once they’re more mature!

2 Likes

This is a very interesting package, thank you for putting this together!

Do I understand correctly that one could use KernelForge.jl to, for example, run integer operations on a GPU?

1 Like

Yes, exactly. Actually, you can already do this with CUDA.jl and AcceleratedKernels.jl.
The difference here is performance matching cuBLAS and CUB (see the benchmarks), even for the copy operation!

I also included a lot of examples showing that KernelForge works on more complex structures as well (check the UnitFloat8 type, or Quaternions for the scan function).

Types must be isbits, though, but that’s a general GPU requirement.
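
The requirement is easy to check up front with `isbitstype` (the `Quat` struct below is just a hypothetical example type):

```julia
struct Quat  # hypothetical quaternion-like struct; all fields are plain bits
    w::Float32; x::Float32; y::Float32; z::Float32
end

isbitstype(Quat)                  # true  → usable in GPU kernels
isbitstype(Tuple{Float32,Int32})  # true
isbitstype(String)                # false → heap-backed, not GPU-friendly
isbitstype(Vector{Float32})       # false
```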

1 Like

Nice work. Backend-agnostic warp operations would be useful for us in Molly.jl, where the force kernel relies on warp operations and we want to work on different GPUs.

There is a KernelAbstractions.jl issue open about warp operations: exposing warp-level semantics · Issue #420 · JuliaGPU/KernelAbstractions.jl · GitHub.

1 Like

Thanks! Great to hear this could be useful for Molly.jl.

For context on the current state of KernelForge.jl: I just landed ROCm backend support in KernelIntrinsics.jl and updated KernelForge.jl accordingly. Both packages now work on CUDA and ROCm, with all tests passing on both backends — this includes warp shuffles, vectorized memory accesses, and memory fences/ordering primitives. Performance is very close to vendor libraries (CUB/rocPRIM) on several primitives, tested on RTX 1000 and A40 (CUDA) and MI300X (ROCm).

Regarding #420: I wanted to move quickly without waiting on a KA PR. I also think warp operations are fundamental enough that they belong at this level — every GPU backend supports them, and one can always define a warp as a single thread for degenerate cases like CPUs.

Beyond just exposing backend warp ops though, I wanted something more abstract. At the hardware level, warp shuffles operate on 32-bit registers, and backends handle more complex types by recursing down to primitives — but they stop at complex numbers (quite surprisingly). Arbitrary structs and tuples aren’t supported. KernelIntrinsics.jl extends this recursion to handle general structs and tuples, which I think is important for use cases like Molly.jl where you’re naturally shuffling composite types (e.g. force/position vectors) across threads.
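
The decomposition idea can be sketched on the CPU: any isbits value can be serialized into 32-bit words (the unit the hardware shuffle moves), shuffled word by word, and reassembled on the receiving lane. A rough sketch of that recursion's endpoint, not KernelIntrinsics' actual implementation:

```julia
# Serialize an isbits value into UInt32 words (zero-padding the tail).
function to_words(x::T) where T
    @assert isbitstype(T)
    nwords = cld(sizeof(T), sizeof(UInt32))
    buf = zeros(UInt32, nwords)
    GC.@preserve buf unsafe_store!(Ptr{T}(pointer(buf)), x)
    return buf
end

# Reassemble the value from its words.
function from_words(::Type{T}, words::Vector{UInt32}) where T
    GC.@preserve words unsafe_load(Ptr{T}(pointer(words)))
end

struct Force  # stand-in for a composite type shuffled between lanes
    fx::Float32; fy::Float32; fz::Float32
end

f = Force(1.0f0, -2.0f0, 0.5f0)
from_words(Force, to_words(f)) == f  # round-trips exactly
```

In a kernel, each `UInt32` word would go through one hardware shuffle instruction; here the serialization just demonstrates that arbitrary isbits structs reduce cleanly to 32-bit units.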

4 Likes

Yes, that’s important: currently we do some sketchy struct disassembly and reassembly to allow warp shuffles with arbitrary atom types. This might provide a better way to do that.

Hello Emmanuel, thank you very much for this awesome package. It provides exactly the kind of warp-level control and vectorized loads I needed to port the RadiK GPU top-k algorithm (paper, CUDA C++ repo) to backend-agnostic Julia (see BitonicSort.jl and RadiK.jl).

I also forked your package to implement some Metal functionality; I could open a PR if you’re interested!

Cheers!

4 Likes

Thanks!

Yes, please open a PR! It would be interesting to have a Metal backend too :slight_smile: Ideally it should pass tests like the ones already present (mostly generated by Claude).
If memory-access orderings are not as developed as on CUDA, you can fall back to a strong thread fence, which should already exist.

By the way, I have also planned to implement a sorting algorithm that looks more like quicksort (using only one kernel). I don’t know if my idea is good, but I guess it’s worth trying!

Emmanuel

1 Like