[ANN] cuTile.jl: Tile-based GPU programming for CUDA GPUs

I’m happy to announce an initial release of cuTile.jl, a new JuliaGPU package for programming NVIDIA (Blackwell) GPUs using NVIDIA’s tile-based abstraction. This simplifies writing kernels, because you no longer have to think about threads or memory hierarchies. Everything is global memory, accessed by blocks of threads:

using CUDA
import cuTile as ct

# Define kernel
function vadd(a, b, c, tile_size::Int)
    pid = ct.bid(1)
    tile_a = ct.load(a, pid, (tile_size,))
    tile_b = ct.load(b, pid, (tile_size,))
    ct.store(c, pid, tile_a + tile_b)
    return
end

# Launch
vector_size = 2^20
tile_size = 16
a, b = CUDA.rand(Float32, vector_size), CUDA.rand(Float32, vector_size)
c = CUDA.zeros(Float32, vector_size)

ct.launch(vadd, (cld(vector_size, tile_size), 1, 1), a, b, c, ct.Constant(tile_size))

@assert c == a .+ b

Compare this to a CUDA.jl vector addition:

function vadd(a, b, c)
    i = (blockIdx().x-1i32) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return
end

# A single block cannot hold 2^20 threads, so split the work across blocks
@cuda threads=256 blocks=cld(vector_size, 256) vadd(a, b, c)
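
To make the tile indexing in the cuTile.jl version more concrete, here is a minimal CPU-only sketch in plain Julia that mimics what each block does. It assumes that tile index `pid` (1-based) covers elements `(pid-1)*tile_size+1 : pid*tile_size`; that mapping, and the helper names `tile_range` and `vadd_tiles!`, are illustrative assumptions and not part of the cuTile.jl API.

```julia
# CPU-only emulation of the tile-based vector addition, for illustration only.
# Assumption: tile `pid` (1-based) covers (pid-1)*tile_size+1 : pid*tile_size.

# Hypothetical helper: the index range covered by tile `pid`.
tile_range(pid::Int, tile_size::Int) = ((pid - 1) * tile_size + 1):(pid * tile_size)

function vadd_tiles!(c::Vector{T}, a::Vector{T}, b::Vector{T}, tile_size::Int) where {T}
    length(a) == length(b) == length(c) ||
        throw(DimensionMismatch("vectors must have equal length"))
    length(a) % tile_size == 0 ||
        throw(ArgumentError("this sketch requires length to be a multiple of tile_size"))
    num_tiles = cld(length(a), tile_size)   # same grid size as in the launch above
    for pid in 1:num_tiles                  # each iteration stands in for one block
        r = tile_range(pid, tile_size)
        tile_a = a[r]                       # "load" one tile of a
        tile_b = b[r]                       # "load" one tile of b
        c[r] = tile_a + tile_b              # "store" the element-wise sum
    end
    return c
end

a = rand(Float32, 2^10)
b = rand(Float32, 2^10)
c = vadd_tiles!(zeros(Float32, 2^10), a, b, 16)
@assert c == a .+ b
```

The loop over `pid` runs sequentially here, whereas on the GPU each tile would be handled by its own block.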

Of course, the real power of tile-based programming becomes obvious with more complicated kernels. For example, a full-blown matrix multiplication that delivers pretty good performance (75% of cuBLAS) is as simple as:

function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    M = size(A, 1)
    N = size(B, 2)
    K = ct.num_tiles(A, 2, (tm, tk))

    m, n = ct.bid(1), ct.bid(2)

    # K reduction loop - accumulate partial products
    acc = ct.full((tm, tn), zero(Float32), Float32)
    k = Int32(1)
    while k <= K
        a = ct.load(A, (m, k), (tm, tk); padding_mode=ct.PaddingMode.Zero)
        b = ct.load(B, (k, n), (tk, tn); padding_mode=ct.PaddingMode.Zero)
        if T === Float32
            # make use of tensor cores
            a = convert(ct.Tile{ct.TFloat32}, a)
            b = convert(ct.Tile{ct.TFloat32}, b)
        end
        acc = muladd(a, b, acc)
        k += Int32(1)
    end

    ct.store(C, (m, n), convert(ct.Tile{T}, acc))

    return nothing
end
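
As a rough illustration of what the K reduction loop computes, the following plain-Julia sketch runs the same tiling scheme on the CPU. It assumes that out-of-bounds parts of a tile read as zero (which is how I read `padding_mode=ct.PaddingMode.Zero`) and that stores outside the matrix are dropped. The helpers `load_tile_zero` and `matmul_tiles` are hypothetical names, and unlike the kernel above the accumulator uses the element type `T` rather than `Float32`.

```julia
# CPU-only emulation of the tiled matmul, for illustration only.

# Hypothetical helper: copy tile (i, j) of shape (th, tw) out of X,
# filling entries that fall outside X with zero.
function load_tile_zero(X::Matrix{T}, i::Int, j::Int, th::Int, tw::Int) where {T}
    tile = zeros(T, th, tw)
    rows = ((i - 1) * th + 1):min(i * th, size(X, 1))
    cols = ((j - 1) * tw + 1):min(j * tw, size(X, 2))
    tile[1:length(rows), 1:length(cols)] = X[rows, cols]
    return tile
end

function matmul_tiles(A::Matrix{T}, B::Matrix{T}, tm::Int, tn::Int, tk::Int) where {T}
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    M, N = size(A, 1), size(B, 2)
    K = cld(size(A, 2), tk)                     # number of tiles along the reduction axis
    C = zeros(T, M, N)
    for m in 1:cld(M, tm), n in 1:cld(N, tn)    # one (m, n) pair per block
        acc = zeros(T, tm, tn)
        for k in 1:K
            a = load_tile_zero(A, m, k, tm, tk)
            b = load_tile_zero(B, k, n, tk, tn)
            acc += a * b                        # accumulate the partial product
        end
        rows = ((m - 1) * tm + 1):min(m * tm, M)
        cols = ((n - 1) * tn + 1):min(n * tn, N)
        C[rows, cols] = acc[1:length(rows), 1:length(cols)]   # drop the padded part
    end
    return C
end

A = rand(Float64, 5, 7)
B = rand(Float64, 7, 3)
@assert matmul_tiles(A, B, 4, 2, 3) ≈ A * B
```

The tile sizes in the usage lines deliberately do not divide the matrix dimensions, so the zero-padded edge tiles are exercised.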

As should be obvious from the 0.1 version number, cuTile.jl is under heavy development, and many features are still missing. Notably, not all of the Julia language is currently supported, as cuTile.jl brings its own Julia-to-Tile IR compiler. So please try out the package, and file bugs or create PRs!

For more information, see the NVIDIA developer zone blog post, or the repository, which contains many more examples.

Hi!

Seems cool!

Could you comment a little on whether there is any benefit in switching from KernelAbstractions.jl to cuTile.jl for CUDA use cases?
Do we get some easy performance improvements just by switching the boilerplate?

There is currently no reason to switch. I envision cuTile.jl being relevant for implementing very high-performance kernels (think matmul, FFT, etc.), where the complexity of the Julia code and types used is low. CUDA.jl (and KernelAbstractions.jl) will remain the go-to solution for general-purpose kernels, at least for the foreseeable future.

I noticed Nvidia saying:

The release of NVIDIA CUDA 13.1 introduces tile-based programming for GPUs, making it one of the most fundamental additions to GPU programming since CUDA was invented.

The cuTile.jl name implies this is only for CUDA/NVIDIA, but it seems to be about working at a higher, simpler level. Might this also work on AMD etc.? I suppose AMD would need to emulate it first. The name would then be misleading if this could eventually work across GPU vendors.

The MLIR dialect is open source, so yes, other vendors could support Tile IR.