Block/Tile-Based GPU Programming (not Scratch)

I just came across what is, at least to me, a new GPU programming model called "block-based" or "tile-based" programming; whether it's "block" or "tile" seems to depend on who you ask. The three examples I've seen so far are OpenAI's Triton, NVIDIA adding tile primitives to its Warp Python framework, and CUDA getting cuTile in the near future (if a LinkedIn post and a GitHub issue are anything to go off of).

As far as I understand it (mostly from this NVIDIA dev blog), the point is to let the programmer think at a high level in terms of operations on whole arrays, instead of also having to manually manage host and device memory. That said, there also seems to be a claim that these "tile operations" are more performant than the traditional kernel-oriented, single-instruction multiple-thread (SIMT) model; I don't fully understand this part yet (something to do with easier communication between threads?). Anyway, I thought it was a cool idea, so I kept reading and got to the code, where it seemed…similar.

```python
import warp as wp

@wp.kernel
def compute(A: wp.array2d(dtype=float),
            B: wp.array2d(dtype=float),
            C: wp.array2d(dtype=float)):

    # cooperatively load input tiles
    a = wp.tile_load(A, shape=(16, 16))
    b = wp.tile_load(B, shape=(16, 16))

    # compute sum
    c = a + b

    # cooperatively store sum to global memory
    wp.tile_store(C, c)

A = wp.ones((16, 16), dtype=float)
B = wp.ones((16, 16), dtype=float)
C = wp.empty((16, 16), dtype=float)

wp.launch_tiled(compute, dim=1, inputs=[A, B, C], device="cuda:0", block_dim=64)
```
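For some intuition about why this reads like array programming, here's a rough CPU-side analogy in plain NumPy (my own sketch, nothing Warp-specific): the traditional SIMT mindset has you write the logic of one thread per element, while the tile mindset has you express the operation over a whole block of data and leave the thread mapping to the compiler/runtime.

```python
import numpy as np

A = np.ones((16, 16), dtype=np.float32)
B = np.ones((16, 16), dtype=np.float32)

# SIMT-style mindset: one logical "thread" per element,
# each computing a single output value
C_simt = np.empty((16, 16), dtype=np.float32)
for i in range(16):
    for j in range(16):
        C_simt[i, j] = A[i, j] + B[i, j]

# Tile-style mindset: one expression over the whole block of data;
# how it maps onto threads is someone else's problem
C_tile = A + B

assert np.array_equal(C_simt, C_tile)
```

Same result either way, of course; the pitch seems to be that the second form gives the compiler more freedom to schedule threads and shared memory than hand-written per-thread indexing does.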

I saw this and my first thought was "This just feels like Julia", or more specifically like using CuArrays. So, naturally, my next thought was "Could this be used in the backend of CUDA.jl?". I'm not at all familiar with the internals of that project (I guess I could go look, yay open source!), e.g. is it wrappers around CUDA C calls, or more specialized generation of LLVM IR? But at the same time, isn't this essentially what the JuliaGPU community has already implemented?

I’m curious to hear all your thoughts on this as it seems like a promising framework.
