KernelForge.jl — High-performance portable GPU primitives for arbitrary types and operators

I’m happy to announce two related packages for high-performance GPU computing in Julia:

  • KernelForge.jl — high-level GPU primitives (mapreduce, scan, matvec, search, vectorized copy) with performance competitive with vendor-optimized libraries
  • KernelIntrinsics.jl — the low-level foundation, providing warp shuffles, vectorized memory access, and memory ordering primitives on top of KernelAbstractions.jl

Motivation

Julia already has excellent options for GPU computing:

  • CUDA.jl is mature and highly optimized, but is CUDA-specific and relies on vendor libraries (cuBLAS, CUB) for performance-critical primitives
  • AcceleratedKernels.jl takes a cross-architecture approach via KernelAbstractions.jl, trading some performance against vendor libraries for portability

KernelForge.jl aims to combine the best of both: cross-architecture portability (once the underlying intrinsics are available on other backends) with performance matching or exceeding vendor libraries (at least on an NVIDIA RTX 1000), through pure Julia implementations that use memory-ordering primitives and flags, warp-level reductions, and vectorized loads/stores.

What makes this different?

Two complementary goals drive the design:

  • Generality: support for arbitrary isbitstype structs, custom operators, and non-contiguous views — not just Float32 with +
  • Performance: match cuBLAS and CUB on concrete types like Float32

Both goals are achieved simultaneously, at least on a laptop GPU (NVIDIA RTX 1000) — see the performance section for benchmarks.
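To make the generality goal concrete, here is a CPU illustration using plain `Base.mapreduce` (not KernelForge's kernels): reducing an array with a custom isbits struct and a custom operator, which is the same shape of problem KernelForge targets on the GPU.

```julia
# A custom isbits struct: a running (min, max) summary of Float32 values.
struct MinMax
    lo::Float32
    hi::Float32
end

# Custom associative operator combining two summaries.
combine(a::MinMax, b::MinMax) = MinMax(min(a.lo, b.lo), max(a.hi, b.hi))

xs = Float32[3, 1, 4, 1, 5, 9, 2, 6]

# CPU reference semantics; on the GPU the equivalent call would be
# something like KF.mapreduce(x -> MinMax(x, x), combine, src) on a CuArray.
summary = mapreduce(x -> MinMax(x, x), combine, xs)
# summary == MinMax(1.0f0, 9.0f0)

isbitstype(MinMax)  # true, so this type is eligible for GPU kernels
```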

Available primitives

| Primitive | Description |
|---|---|
| `mapreduce` | General reduction over any dimension(s), arbitrary types and operators |
| `scan!` | 1D prefix scan (like `accumulate` in Julia Base); supports non-commutative operators |
| `vcopy!` | Vectorized copy with configurable load/store widths |
| `matvec`, `vecmat` | Matrix-vector and vector-matrix products with custom operators |
| `argmax`, `argmin` | Index of extremum |
| `findfirst`, `findlast` | Search on GPU arrays |
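
For readers unfamiliar with what a configurable load/store width buys, the effect can be pictured on the CPU by reinterpreting an array at a wider element type, so each copy step moves several elements at once. This is only a sketch of the idea, not KernelForge's implementation:

```julia
src = rand(Float32, 1024)
dst = similar(src)

# View both arrays as 16-byte elements (four Float32 per element),
# mimicking a 128-bit vectorized load/store width.
src4 = reinterpret(NTuple{4,Float32}, src)
dst4 = reinterpret(NTuple{4,Float32}, dst)
copyto!(dst4, src4)

dst == src  # true
```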

Moreover, unlike CUDA.jl or AcceleratedKernels.jl, none of these primitives require an `init` or neutral-element argument!

import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)
dst = similar(src)

# Full reduction with custom function and operator
total = KF.mapreduce(abs2, +, src)

# Reduction over specific dimensions
B = CUDA.rand(Float32, 4, 8, 16)
result = KF.mapreduce(identity, +, B; dims=(1, 3))  # shape: (1, 8, 1)

# Views are fully supported
v = view(src, 1:2:10^6)
KF.scan!(+, dst, v)

# Search
i = KF.findfirst(>(0.99f0), src)
j = KF.argmax(src)
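
As a CPU reference for what a non-commutative scan computes (via `Base.accumulate` here; the `KF.scan!` call in the comment is hypothetical usage based on the signature shown above):

```julia
# Affine maps x ↦ a*x + b encoded as isbits tuples (a, b).
# Composition is associative but NOT commutative.
compose((a1, b1), (a2, b2)) = (a2 * a1, a2 * b1 + b2)

maps = [(2.0f0, 1.0f0), (3.0f0, 0.0f0), (1.0f0, -4.0f0)]

# CPU reference semantics; on the GPU one would call something like
# KF.scan!(compose, dst, src) on a CuArray of such tuples.
prefix = accumulate(compose, maps)
# prefix[2] composes maps[1] then maps[2]: x ↦ 3(2x + 1) = 6x + 3,
# i.e. prefix[2] == (6.0f0, 3.0f0)
```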

Roadmap

Time permitting, the next planned primitives are sort, a matrix-matrix product, and/or some graph algorithms.

Architecture

KernelForge.jl          ← GPU primitives (mapreduce, scan, matvec, search, ...)
       │
KernelIntrinsics.jl     ← Low-level intrinsics (warp shuffle, vectorized load/store, memory fences)
       │
KernelAbstractions.jl   ← Backend dispatch
       │
   CUDA.jl              ← (AMD/Intel planned)

KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for library developers rather than end users. Extending support to AMD or Intel GPUs would primarily require work in KernelIntrinsics.jl; KernelForge.jl itself is designed to need minimal adaptation.
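For readers unfamiliar with the warp-shuffle reduction pattern that KernelIntrinsics exposes, here is a CPU simulation of the shuffle-down tree, with a plain array standing in for the registers of a 32-lane warp (this is an illustration of the pattern, not the actual GPU API):

```julia
# Simulate a shuffle-down tree reduction across a warp on the CPU.
# lanes[i] plays the role of the register held by lane i.
function warp_reduce_sim(op, lanes::Vector{T}) where T
    n = length(lanes)          # warp size, assumed a power of two (32 here)
    vals = copy(lanes)
    offset = n ÷ 2
    while offset ≥ 1
        for lane in 1:offset
            # shfl_down step: each lane combines with the lane `offset` above it
            vals[lane] = op(vals[lane], vals[lane + offset])
        end
        offset ÷= 2
    end
    return vals[1]             # lane 1 ends up holding the full reduction
end

warp_reduce_sim(+, collect(Float32, 1:32))  # 528.0f0 == sum(1:32)
```

On real hardware each step is a single register exchange within the warp, which is why warp-level reductions avoid shared-memory traffic for the last levels of the tree.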

Current status

Both packages are experimental and currently CUDA-only, though an extensive test suite is already in place. Note that vectorized loads and stores in KernelIntrinsics.jl do not perform bounds checking. Correctness and performance have been validated on an NVIDIA RTX 1000.

Feedback, bug reports, and contributions are very welcome — especially from anyone with access to AMD or Intel hardware!

30 Likes

Awesome packages! I'm not sure about the name of the second one, though, since there is already a KernelIntrinsics module within KernelAbstractions.jl.

2 Likes

Oh, I didn’t know about that! I already changed the name once (it was first “MemoryAccess”, but that felt too general). Maybe something like KernelTools? Or it might be worth integrating the features into an existing package rather than maintaining a separate one. It could be quite a heavy addition, though, since it already comes with several features, each requiring an extension per backend.

I am also happy to take PRs to KernelAbstractions.jl or Atomix.jl/UnsafeAtomics.jl.

For KernelAbstractions.jl, we currently maintain the 0.9 release branch and the 0.10 development branch. The only “challenge” for KA is that functionality should work on all backends, or at least there should be a query function.

1 Like

Thanks! I agree that warp shuffle operations and vectorized loads/stores would be a natural fit for KernelAbstractions.jl (although I wouldn’t quite know how to handle them on CPU — maybe with warp size equal to 1?), and that memory ordering primitives would belong in UnsafeAtomics.jl. I went the separate package route to get a quick proof of concept out before committing to full multi-backend support — and honestly, writing my own package is also a great way to learn quickly. I’d be happy to contribute the relevant pieces upstream once they’re more mature!

2 Likes

This is a very interesting package, thank you for putting this together!

Do I understand correctly that one could use KernelForge.jl to, for example, run integer operations on a GPU?

1 Like

Yes, exactly. Actually, you can already do this with CUDA.jl and AcceleratedKernels.jl.
The difference here is performance matching cuBLAS and CUB (see the benchmarks), even for the copy operation!

I also included a lot of examples showing that KernelForge works on more complex structures as well (check the UnitFloat8 type, or Quaternions for the scan function).

Types must be isbits, though, but that’s a general GPU requirement.
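
The requirement is easy to check up front with `isbitstype` (the `Quat` struct below is just a hypothetical example type):

```julia
struct Quat  # hypothetical quaternion-like struct; all fields are plain bits
    w::Float32; x::Float32; y::Float32; z::Float32
end

isbitstype(Quat)                  # true  → usable in GPU kernels
isbitstype(Tuple{Float32,Int32})  # true
isbitstype(String)                # false → heap-backed, not GPU-friendly
isbitstype(Vector{Float32})       # false
```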

1 Like

Nice work. Backend-agnostic warp operations would be useful for us in Molly.jl, where the force kernel relies on warp operations and we want to work on different GPUs.

There is a KernelAbstractions.jl issue open about warp operations: exposing warp-level semantics · Issue #420 · JuliaGPU/KernelAbstractions.jl · GitHub.

1 Like

Thanks! Great to hear this could be useful for Molly.jl.

For context on the current state of KernelForge.jl: I just landed ROCm backend support in KernelIntrinsics.jl and updated KernelForge.jl accordingly. Both packages now work on CUDA and ROCm, with all tests passing on both backends — this includes warp shuffles, vectorized memory accesses, and memory fences/ordering primitives. Performance is very close to vendor libraries (CUB/rocPRIM) on several primitives, tested on RTX 1000 and A40 (CUDA) and MI300X (ROCm).

Regarding #420: I wanted to move quickly without waiting on a KA PR. I also think warp operations are fundamental enough that they belong at this level — every GPU backend supports them, and one can always define a warp as a single thread for degenerate cases like CPUs.

Beyond just exposing backend warp ops though, I wanted something more abstract. At the hardware level, warp shuffles operate on 32-bit registers, and backends handle more complex types by recursing down to primitives — but they stop at complex numbers (quite surprisingly). Arbitrary structs and tuples aren’t supported. KernelIntrinsics.jl extends this recursion to handle general structs and tuples, which I think is important for use cases like Molly.jl where you’re naturally shuffling composite types (e.g. force/position vectors) across threads.
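
The decomposition idea can be sketched on the CPU: any isbits value can be serialized into 32-bit words (the unit the hardware shuffle moves), shuffled word by word, and reassembled on the receiving lane. A rough sketch of that recursion's endpoint, not KernelIntrinsics' actual implementation:

```julia
# Serialize an isbits value into UInt32 words (zero-padding the tail).
function to_words(x::T) where T
    @assert isbitstype(T)
    nwords = cld(sizeof(T), sizeof(UInt32))
    buf = zeros(UInt32, nwords)
    GC.@preserve buf unsafe_store!(Ptr{T}(pointer(buf)), x)
    return buf
end

# Reassemble the value from its words.
function from_words(::Type{T}, words::Vector{UInt32}) where T
    GC.@preserve words unsafe_load(Ptr{T}(pointer(words)))
end

struct Force  # stand-in for a composite type shuffled between lanes
    fx::Float32; fy::Float32; fz::Float32
end

f = Force(1.0f0, -2.0f0, 0.5f0)
from_words(Force, to_words(f)) == f  # round-trips exactly
```

In a kernel, each `UInt32` word would go through one hardware shuffle instruction; here the serialization just demonstrates that arbitrary isbits structs reduce cleanly to 32-bit units.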

4 Likes

Yes, that’s important: currently we do some sketchy struct disassembly and reassembly to allow warp shuffles with arbitrary atom types. This might provide a better way to do that.

Hello Emmanuel, thank you very much for this awesome package. It provides exactly the kind of warp-level control and vectorized loads I needed to port the RadiK GPU top-k algorithm (paper, CUDA C++ repo) to backend-agnostic Julia (see BitonicSort.jl and RadiK.jl).

I also forked your package to implement some Metal functionality; I could open a PR if you’re interested!

Cheers!

4 Likes

Thanks!

Yes, please open a PR! It would be interesting to have a Metal backend too :slight_smile: Ideally it should pass tests like the ones already present (mostly generated by Claude).
If memory-access orderings are not as developed as on CUDA, you can fall back to a strong thread fence, which should already exist.

By the way, I have also planned to implement a sorting algorithm that looks more like quicksort (using only one kernel). I don’t know if my idea is good, but I guess it’s worth trying!

Emmanuel

1 Like