I’m happy to announce two related packages for high-performance GPU computing in Julia:
- KernelForge.jl — high-level GPU primitives (mapreduce, scan, matvec, search, vectorized copy) with performance competitive with vendor-optimized libraries
- KernelIntrinsics.jl — the low-level foundation, providing warp shuffles, vectorized memory access, and memory ordering primitives on top of KernelAbstractions.jl
Motivation
Julia already has excellent options for GPU computing:
- CUDA.jl is mature and highly optimized, but is CUDA-specific and relies on vendor libraries (cuBLAS, CUB) for performance-critical primitives
- AcceleratedKernels.jl takes a cross-architecture approach via KernelAbstractions.jl, trading some performance against vendor libraries for portability
KernelForge.jl aims to combine the best of both: cross-architecture portability (once the underlying intrinsics are available on other backends) with performance matching or exceeding vendor libraries (at least on an NVIDIA RTX 1000), through pure Julia implementations that use memory ordering and flags, warp-level reductions, and vectorized memory loads/stores.
What makes this different?
Two complementary goals drive the design:
- Generality: support for arbitrary `isbitstype` structs, custom operators, and non-contiguous views, not just `Float32` with `+`
- Performance: match cuBLAS and CUB on concrete types like `Float32`
Both goals are achieved simultaneously, at least on a laptop GPU (NVIDIA RTX 1000) — see the performance section for benchmarks.
Available primitives
| Primitive | Description |
|---|---|
| `mapreduce` | General reduction over any dimension(s), arbitrary types and operators |
| `scan!` | 1D prefix scan (like `accumulate` in Julia Base), supports non-commutative operators |
| `vcopy!` | Vectorized copy with configurable load/store widths |
| `matvec`, `vecmat` | Matrix-vector and vector-matrix products with custom operators |
| `argmax`, `argmin` | Index of extremum |
| `findfirst`, `findlast` | Search on GPU arrays |
Moreover, none of these primitives requires an `init` or neutral-element argument, unlike CUDA.jl or AcceleratedKernels.jl!
```julia
import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)

# Full reduction with custom function and operator
total = KF.mapreduce(abs2, +, src)

# Reduction over specific dimensions
B = CUDA.rand(Float32, 4, 8, 16)
result = KF.mapreduce(identity, +, B; dims=(1, 3))  # shape: (1, 8, 1)

# Views are fully supported
v = view(src, 1:2:10^6)
dst = similar(v)
KF.scan!(+, dst, v)

# Search
i = KF.findfirst(>(0.99f0), src)
j = KF.argmax(src)
```
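The remaining primitives follow the same pattern. Here is a minimal sketch of `vcopy!` and `matvec`; note that the exact signatures and keyword arguments shown are illustrative assumptions, not the packages' confirmed API:

```julia
import KernelForge as KF
using CUDA

src = CUDA.rand(Float32, 10^6)
dst = similar(src)

# Vectorized copy (hypothetical call shape; the keyword controlling
# load/store widths is an assumption)
KF.vcopy!(dst, src)

# Matrix-vector product; the custom-operator call shape below is an
# illustrative guess, not the confirmed signature
A = CUDA.rand(Float32, 512, 1024)
x = CUDA.rand(Float32, 1024)
y = KF.matvec(A, x)
```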
Roadmap
Time permitting, the next planned primitives are `sort`, a matrix-matrix product, and/or some graph algorithms.
Architecture
```
KernelForge.jl        ← GPU primitives (mapreduce, scan, matvec, search, ...)
        │
KernelIntrinsics.jl   ← Low-level intrinsics (warp shuffle, vectorized load/store, memory fences)
        │
KernelAbstractions.jl ← Backend dispatch
        │
CUDA.jl               ← (AMD/Intel planned)
```
KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for library developers rather than end users. Extending support to AMD or Intel GPUs would primarily require work in KernelIntrinsics.jl; KernelForge.jl itself is designed to need minimal adaptation.
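To illustrate the layering, here is a hedged sketch of a warp-level partial sum written against KernelAbstractions.jl. The KernelIntrinsics.jl names used here (the `KI` alias and `KI.shfl_down`) are assumptions chosen for illustration, not the package's confirmed API:

```julia
using KernelAbstractions
import KernelIntrinsics as KI  # hypothetical import alias

@kernel function warp_sum!(out, @Const(x))
    i = @index(Global, Linear)
    val = x[i]
    # Tree reduction across a 32-lane warp via shuffles;
    # `KI.shfl_down(val, offset)` is an assumed name for the shuffle intrinsic
    offset = 16
    while offset > 0
        val += KI.shfl_down(val, offset)
        offset >>= 1
    end
    # After the loop, lane 0 of each warp holds that warp's partial sum
    if (i - 1) % 32 == 0
        out[(i - 1) ÷ 32 + 1] = val
    end
end
```

Kernels like this are what KernelForge.jl builds its primitives from, which is why backend portability hinges almost entirely on KernelIntrinsics.jl.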
Current status
Both packages are experimental and currently CUDA-only, though an extensive test suite is already in place. Note that vectorized loads and stores in KernelIntrinsics.jl do not perform bounds checking. Correctness and performance have been validated on an NVIDIA RTX 1000.
Feedback, bug reports, and contributions are very welcome — especially from anyone with access to AMD or Intel hardware!