[ANN] cuTile.jl v0.3 + webinar

I’ve just tagged cuTile.jl v0.3, featuring:

  • CUDA.jl integration. Launching a cuTile kernel is now just @cuda backend=cuTile ....
  • Better performance. We now match or outperform NVIDIA’s cuTile Python implementation on every benchmark we ship.
  • Much improved latency: time-to-first-execution (TTFX) now matches regular CUDA.jl kernels (~1.8 s for a trivial kernel on my system).
  • Random number generation, both host-level and in-kernel. Performance matches or beats cuRAND and the new GPUArrays.jl generator.
  • Array slicing. @view A[i:j, :] now produces a sub-range TileArray you can pass to ct.load / ct.store.
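
To give a feel for how the new launch syntax and slicing fit together, here is a minimal sketch. The kernel body and helper names are illustrative assumptions on my part; only @cuda backend=cuTile, @view-based slicing, and ct.load / ct.store come from the release notes, so see the write-up for real, complete examples:

```julia
using CUDA
import cuTile as ct

# Hypothetical tile-copy kernel: load a tile from the source view
# and store it into the destination view.
function copy_tile(dst, src)
    t = ct.load(src)   # load a tile from the sub-range TileArray
    ct.store(dst, t)   # write it back out
    return
end

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.zeros(Float32, 1024, 1024)

# New in v0.3: @view slices produce sub-range TileArrays, and the
# kernel launches through the regular @cuda macro with backend=cuTile.
@cuda backend=cuTile copy_tile(@view(B[1:512, :]), @view(A[1:512, :]))
```
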

A full write-up with code samples and benchmark numbers is available on juliagpu.org: cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency ⋅ JuliaGPU

Upcoming webinar

If you’d like a guided tour, Andy Terrel (NVIDIA) and I are running a joint webinar on May 12, 2026 at 1 PM ET covering CUDA Tile’s design, how cuTile.jl is built on top of it, and several worked examples. Sign up here: cuTile.jl for High-Performance Computing in Julia - Event - JuliaHub
