I’ve just tagged cuTile.jl v0.3, featuring:
- CUDA.jl integration. Launching a cuTile kernel is now just `@cuda backend=cuTile ...`.
- Better performance. We now match or outperform NVIDIA’s cuTile Python on every benchmark we ship.
- Much improved latency, with TTFX the same as with regular CUDA.jl kernels (~1.8s for a trivial kernel on my system).
- Random number generation, both host-level and in-kernel. Performance matches or beats cuRAND and the new GPUArrays.jl generator.
- Array slicing. `@view A[i:j, :]` now produces a sub-range TileArray you can pass to `ct.load`/`ct.store`.
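To give a flavor of how these pieces fit together, here is a hypothetical sketch combining the new launch syntax with slicing. Only `@cuda backend=cuTile`, `ct.load`/`ct.store`, and the `@view`-to-TileArray behavior come from this release; the kernel body and the exact load/store signatures are illustrative assumptions, not the real API:

```julia
using CUDA, cuTile
const ct = cuTile

# Hypothetical kernel: copy a tile of `src` into `dst`.
# The ct.load/ct.store signatures here are assumed for illustration.
function copy_tile(dst, src)
    tile = ct.load(src)    # load a tile from the (sliced) input
    ct.store(dst, tile)    # write it back to the output
    return
end

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.zeros(Float32, 512, 1024)

# The @view slice becomes a sub-range TileArray that the
# load/store intrinsics understand.
Av = @view A[1:512, :]

# Launch through CUDA.jl using the new cuTile backend.
@cuda backend=cuTile copy_tile(B, Av)
```

See the linked write-up for real, benchmarked examples.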
A full write-up with code samples and benchmark numbers is available on juliagpu.org: cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency ⋅ JuliaGPU
Upcoming webinar
If you’d like a guided tour, Andy Terrel (NVIDIA) and I are running a joint webinar on May 12, 2026 at 1 PM ET covering CUDA Tile’s design, how cuTile.jl is built on top of it, and several worked examples. Sign up here: cuTile.jl for High-Performance Computing in Julia - Event - JuliaHub