Render Pipeline in Metal.jl

dchang10 · April 24, 2023, 8:44am

I’m currently trying to set up a render pipeline with Metal.jl by defining a vertex shader, rasterizer and fragment shader, but am not sure how to go about it. Is there an example or description of how this is done with Metal.jl?

maleadt · April 24, 2023, 10:25am

Metal.jl is currently focused on compute kernels; there’s no support for shaders yet.

dchang10 · April 24, 2023, 5:03pm

I see. I’m building a 2D projection of a 3D model that I want to use to generate a heatmap which I will use for template matching.

I hacked something together that almost does what I need it to do, but I’m having some concurrency issues during rasterization. My rasterization algorithm takes a 100x100 MtlMatrix{Float32}, where 100 is the number of pixels in a row/column, and a 100-element MtlVector{NTuple{3,NTuple{2,Float32}}} that represents a list of triangles. A triangle is simply three 2D vertices, with each vertex represented by an NTuple{2,Float32}. I’m doing a standard triangle rasterization algorithm like in this example here, where I check whether a pixel is contained within a triangle and increase its value by one when this condition is satisfied.

The evaluation of this essentially looks like this:

@metal threads=(100) groups=(100,100) gpu_intensity_map!(img::MtlMatrix{Float32}, field_of_view_x, field_of_view_y, triangles::MtlVector{NTuple{3,NTuple{2,Float32}}}),

so I have chosen to assign a group containing 100 threads to each pixel.
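The body of check_raster is not shown in this thread. As a rough CPU-only sketch of the point-in-triangle test described above, assuming an edge-function approach and using plain tuples in place of Vec2D (the Vertex and Triangle aliases, the edge helper, and this particular check_raster body are all hypothetical), it could look something like this:

```julia
# Hypothetical aliases; the thread stores triangles as NTuple{3,NTuple{2,Float32}}.
const Vertex = NTuple{2,Float32}
const Triangle = NTuple{3,Vertex}

# Twice the signed area of (a, b, p). The sign tells which side of the
# edge a -> b the point p lies on.
edge(a::Vertex, b::Vertex, p::Vertex) =
    (b[1] - a[1]) * (p[2] - a[2]) - (b[2] - a[2]) * (p[1] - a[1])

# Hypothetical stand-in for the thread's check_raster: returns 1f0 when p is
# inside the triangle or on its border, 0f0 otherwise. Works for either winding.
function check_raster(p::Vertex, tri::Triangle)
    a, b, c = tri
    w1 = edge(a, b, p)
    w2 = edge(b, c, p)
    w3 = edge(c, a, p)
    all_nonneg = w1 >= 0 && w2 >= 0 && w3 >= 0
    all_nonpos = w1 <= 0 && w2 <= 0 && w3 <= 0
    return (all_nonneg || all_nonpos) ? 1f0 : 0f0
end

tri = ((0f0, 0f0), (1f0, 0f0), (0f0, 1f0))
check_raster((0.25f0, 0.25f0), tri)  # 1.0f0, the point is inside
check_raster((1f0, 1f0), tri)        # 0.0f0, the point is outside
```

Returning a Float32 rather than a Bool keeps the result directly addable to a Float32 image, which matches how the value is used in the snippets in this thread.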

The problem I’m having is with assignment. The snippet of code that does the rasterization checking and assignment looks like this:

    i = threadgroup_position_in_grid_2d()
    i_x = i.x 
    i_y = i.y
    i_z = thread_position_in_threadgroup_1d()

    p = Vec2D(xmin+(i_x-1)*Δx, ymin+(i_y-1)*Δy)
    img[i_x, i_y] += check_raster(p, triangles[i_z])

Running this on the GPU results in images that look like this:

Whereas the equivalent CPU code produces this:

This, to me, seems to be the result of a concurrency issue with assignment, caused by the threads in each group all running simultaneously. I’ve also tried passing the vertices to the GPU as a Vector instead of as a MtlVector so that I could loop over it in the kernel, but am faced with the error:

KernelError: passing and using non-bitstype argument

when doing so.
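One way to see why that error appears is to check the types with isbitstype from Base. A minimal sketch, where the Triangle alias is a hypothetical name for the element type used above:

```julia
# Hypothetical alias for the triangle element type used in this thread.
const Triangle = NTuple{3,NTuple{2,Float32}}

isbitstype(Triangle)          # true: a fixed-size tuple of Float32 values
isbitstype(Vector{Triangle})  # false: a Vector is a reference to CPU heap memory
```

The elements themselves are fine to use on the GPU, but a plain Vector is not a bits type, which is what the error message is complaining about. Keeping the data in a MtlVector and indexing or looping over that inside the kernel avoids the problem.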

Is there a way to handle concurrency issues like this in Metal?

maleadt · April 24, 2023, 7:04pm

This will overwrite previous values, as threads & groups execute in parallel. Metal.jl doesn’t support atomics yet, so you’d need some sort of parallel reduction to combine the increments and have only a single thread write to memory.
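To illustrate the idea, here is a CPU-only sketch of a tree-style reduction in plain Julia. It is not Metal.jl code: tree_reduce is a hypothetical helper, each iteration of the inner loop stands in for one GPU thread, and the comment marks where a threadgroup barrier would be needed on the device.

```julia
# Sum `vals` the way a single threadgroup might: pad to a power of two with
# zeros, then repeatedly fold the upper half of the buffer onto the lower half.
function tree_reduce(vals::Vector{Float32})
    isempty(vals) && return 0f0
    n = nextpow(2, length(vals))      # e.g. 100 values -> 128 slots
    s = zeros(Float32, n)             # padding slots must be zero
    s[1:length(vals)] .= vals
    stride = n ÷ 2
    while stride > 0
        # On the GPU, a threadgroup barrier would go here so that every
        # write from the previous step is visible before it is read.
        for i in 1:stride             # each i plays the role of one thread
            s[i] += s[i + stride]
        end
        stride ÷= 2
    end
    return s[1]                       # only "thread 1" holds the total
end

tree_reduce(ones(Float32, 100))  # 100.0f0
```

Each slot is written by at most one "thread" per step, so no two threads ever update the same location at once, and only the final total is stored to the output.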

Support for atomics isn’t very hard, I just haven’t gotten to it yet, so alternatively you could take a stab at Support for atomics · Issue #79 · JuliaGPU/Metal.jl · GitHub

maleadt · April 27, 2023, 7:40am

FYI, I’ve started working on atomics: Implement atomics by maleadt · Pull Request #168 · JuliaGPU/Metal.jl · GitHub

dchang10 · April 27, 2023, 6:06pm

Thanks for the update. I’m attempting to do some sort of parallel reduction first because I’ve been advised that this should be faster than using atomics. I see in the Metal specification that it is possible to define memory that is shared among the threads in a group. Is there a way to do this in Metal.jl?

maleadt · April 28, 2023, 2:33pm

Yes, use MtlThreadGroupArray, e.g., Metal.jl/intrinsics.jl at e0b33e607241a928ce78794c9a77b34dc2080e79 · JuliaGPU/Metal.jl · GitHub. Do note that with Metal you need to specify the threadgroup memory scope flag to threadgroup_barrier (looks like the tests are wrong about that).

dchang10 · April 28, 2023, 7:01pm

I was told that the threadgroup_barrier would likely be slower than just using Metal’s default atomic implementation. Given that you’ve already started working on implementing this, is there anything that I could do that wouldn’t result in me stepping on your toes? I was in the middle of understanding how to write and compile Metal shaders with metal.cpp, so I would need some time to figure that out first.

maleadt · April 28, 2023, 8:08pm

That’s interesting; I would have expected synchronization within threadblocks to be cheaper than a global atomic.

Anyway, I don’t think it’s easy to work on this concurrently, at least not before the initial bits are ironed out. Lots of other things to improve though.

dchang10 · April 30, 2023, 5:07pm

Thanks for the help. I think I got something based off your test example that seems to work. I’m sure that there’s more optimization that could be done. I did a parallel reduction with a threadgroup_barrier(Metal.MemoryFlagThreadGroup) to synchronize each step. The final logic looks schematically like this:

and in code like this:

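    i = threadgroup_position_in_grid_2d()
    i_x = i.x
    i_y = i.y
    i_z = thread_position_in_threadgroup_1d()
    p = Vec2D(xmin+(i_x-1)*Δx, ymin+(i_y-1)*Δy)

    # Scratch space shared by all threads in this group (one group per pixel)
    s = MtlThreadGroupArray(Float32, 1024)
    s[i_z] = check_raster(p, triangles[i_z])

    # Only 100 threads store a value, but the reduction below reads slots 1..256,
    # and threadgroup memory is not guaranteed to start out zeroed: clear the rest
    j = i_z + 100
    while j <= 256
        s[j] = 0f0
        j += 100
    end

    # Tree reduction: each step folds the upper half onto the lower half
    stride = 128
    while stride > 0
        # Wait until every write from the previous step is visible
        threadgroup_barrier(Metal.MemoryFlagThreadGroup)
        if i_z <= stride
            s[i_z] += s[i_z+stride]
        end
        stride = stride ÷ 2
    end

    # Thread 1 holds the total and is the only one that writes to the image
    if i_z == 1
        img[i_x, i_y] = s[1]
    end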

The image renders properly now.
