I’m currently trying to set up a render pipeline with Metal.jl by defining a vertex shader, rasterizer and fragment shader, but am not sure how to go about it. Is there an example or description of how this is done with Metal.jl?
Metal.jl is currently focused on compute kernels; there’s no support for shaders yet.
I see. I’m building a 2D projection of a 3D model that I want to use to generate a heatmap which I will use for template matching.
I hacked something together that almost does what I need it to do, but I’m having some concurrency issues during rasterization. My rasterization algorithm takes a 100x100 MtlMatrix{Float32}, where 100 is the number of pixels in a row/column, and a 100-element MtlVector{NTuple{3,NTuple{2,Float32}}} that represents a list of triangles. A triangle is simply three 2D vertices, with each vertex represented by an NTuple{2,Float32}. I’m doing a standard triangle rasterization algorithm like in this example here, where I check whether a pixel is contained within a triangle, and increase its value by one when that condition is satisfied.
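Roughly, the containment test looks like this (just a sketch; my actual check_raster follows the linked example and may differ in the details, and Vec2D here is assumed to be a plain two-field struct):

struct Vec2D
    x::Float32
    y::Float32
end

# signed area of the parallelogram spanned by (b - a) and (c - a);
# its sign tells on which side of the edge a→b the point c lies
edge(a, b, c) = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x)

function check_raster(p::Vec2D, tri::NTuple{3,NTuple{2,Float32}})
    a = Vec2D(tri[1][1], tri[1][2])
    b = Vec2D(tri[2][1], tri[2][2])
    c = Vec2D(tri[3][1], tri[3][2])
    w0 = edge(a, b, p)
    w1 = edge(b, c, p)
    w2 = edge(c, a, p)
    # inside if p lies on the same side of all three edges (either winding order)
    inside = (w0 >= 0f0 && w1 >= 0f0 && w2 >= 0f0) || (w0 <= 0f0 && w1 <= 0f0 && w2 <= 0f0)
    return inside ? 1f0 : 0f0
end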
The evaluation of this essentially looks like this:
@metal threads=(100) groups=(100,100) gpu_intensity_map!(img::MtlMatrix{Float32}, field_of_view_x, field_of_view_y, triangles::MtlVector{NTuple{3,NTuple{2,Float32}}})

so I have chosen to assign a group containing 100 threads to each pixel.
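For context, the surrounding setup looks roughly like this (a simplified sketch: the field-of-view values and triangle data are placeholders, and the kernel body is reduced to a stub so the snippet stands on its own):

using Metal

# stub kernel with the same launch shape as above: each 100-thread group
# handles one pixel, and each thread in the group would test one triangle
function gpu_intensity_map!(img, field_of_view_x, field_of_view_y, triangles)
    g = threadgroup_position_in_grid_2d()     # which pixel this group handles
    t = thread_position_in_threadgroup_1d()   # which triangle this thread would check
    if t == 1
        img[g.x, g.y] = Float32(length(triangles))   # the real rasterization work goes here
    end
    return nothing
end

img = MtlArray(zeros(Float32, 100, 100))
triangles = MtlVector([((0f0, 0f0), (1f0, 0f0), (0f0, 1f0)) for _ in 1:100])

@metal threads=100 groups=(100, 100) gpu_intensity_map!(img, 1f0, 1f0, triangles)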
The problem I’m having is with the assignment. The snippet of code that does the rasterization check and assignment looks like this:
i = threadgroup_position_in_grid_2d()       # 2D group index → pixel coordinates
i_x = i.x
i_y = i.y
i_z = thread_position_in_threadgroup_1d()   # thread index within the group → triangle index
p = Vec2D(xmin + (i_x-1)*Δx, ymin + (i_y-1)*Δy)   # position of this pixel in the field of view
img[i_x, i_y] += check_raster(p, triangles[i_z])  # all 100 threads in the group update the same pixel
Running this on the GPU results in images that look like this:
Whereas the equivalent CPU code looks like this:
This, to me, seems to be the result of a concurrency issue with the assignment, caused by the threads in each group all running simultaneously. I’ve also tried passing the vertices to the GPU as a Vector instead of as a MtlVector, so that I could loop over it in the kernel, but then I’m faced with the error

KernelError: passing and using non-bitstype argument

when doing so.
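I assume this is because a plain Vector is itself not an isbits type, even though its element type is (a quick check on the CPU, just to confirm my understanding):

isbitstype(NTuple{3,NTuple{2,Float32}})          # true  → fine as the element type of an MtlVector
isbitstype(Vector{NTuple{3,NTuple{2,Float32}}})  # false → cannot be passed to a kernel as-is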
Is there a way to handle concurrency issues like this in Metal?
This will overwrite previous values, as threads & groups execute in parallel. Metal.jl doesn’t support atomics yet, so you’d need some sort of parallel reduction to combine the per-thread increments and have only a single thread assign to memory.
Support for atomics isn’t very hard to add; I just haven’t gotten to it yet. Alternatively, you could take a stab at Support for atomics · Issue #79 · JuliaGPU/Metal.jl · GitHub
FYI, I’ve started working on atomics: Implement atomics by maleadt · Pull Request #168 · JuliaGPU/Metal.jl · GitHub
Thanks for the update. I’m attempting to do some sort of parallel reduction first, because I’ve been advised that this should be faster than using atomics. I see in the Metal specification that it is possible to define shared memory amongst the threads in a group. Is there a way to do this in Metal.jl?
Yes, use MtlThreadGroupArray, e.g., Metal.jl/intrinsics.jl at e0b33e607241a928ce78794c9a77b34dc2080e79 · JuliaGPU/Metal.jl · GitHub. Do note that with Metal you need to pass the threadgroup memory scope flag to threadgroup_barrier (looks like the tests are wrong about that).
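Schematically the two go together like this inside a kernel (just a sketch to show the calls; the names group_sum!, out and vals are placeholders, and it uses a serial sum by one thread for brevity rather than the tree reduction you would actually want):

function group_sum!(out, vals)
    t = thread_position_in_threadgroup_1d()
    shmem = MtlThreadGroupArray(Float32, 256)          # memory shared within the threadgroup
    shmem[t] = vals[t]                                 # every thread stores its value
    threadgroup_barrier(Metal.MemoryFlagThreadGroup)   # wait until all stores are visible
    if t == 1                                          # a single thread combines the results
        acc = 0f0
        for k in 1:256
            acc += shmem[k]
        end
        out[1] = acc
    end
    return nothing
end

# launched with a single 256-thread group, e.g.:
# @metal threads=256 groups=1 group_sum!(out, vals)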
I was told that the threadgroup_barrier would likely be slower than just using Metal’s default atomic implementation. Given that you’ve already started working on implementing this, is there anything I could do that wouldn’t result in me stepping on your toes? I was in the middle of figuring out how to write and compile Metal shaders with metal.cpp, so I would need some time to work through that first.
That’s interesting; I would have expected synchronization within threadblocks to be cheaper than a global atomic.
Anyway, I don’t think it’s easy to work on this concurrently, at least not before the initial bits are ironed out. Lots of other things to improve, though.
Thanks for the help. I think I got something based on your test example that seems to work; I’m sure there’s more optimization that could be done. I did a parallel reduction with a threadgroup_barrier(Metal.MemoryFlagThreadGroup) to synchronize each step. The final logic looks schematically like this:

and in code like this:
i = threadgroup_position_in_grid_2d()       # 2D group index → pixel coordinates
i_x = i.x
i_y = i.y
i_z = thread_position_in_threadgroup_1d()   # thread index within the group → triangle index
p = Vec2D(xmin + (i_x-1)*Δx, ymin + (i_y-1)*Δy)

# threadgroup-shared scratch array for the per-triangle contributions
s = MtlThreadGroupArray(Float32, 1024)
s[i_z] = check_raster(p, triangles[i_z])
# the first reduction steps read s[101:228]; with 100 threads per group those
# slots are never written, so zero them to avoid reading uninitialized memory
s[i_z + 100] = 0f0          # zeros 101–200
if i_z <= 28
    s[i_z + 200] = 0f0      # zeros 201–228
end

# tree reduction: halve the stride each step, synchronizing the group in between
stride = 128
while stride > 0
    threadgroup_barrier(Metal.MemoryFlagThreadGroup)
    if i_z <= stride
        s[i_z] += s[i_z + stride]
    end
    stride = stride ÷ 2
end

# only the first thread writes the reduced sum back to the pixel
if i_z == 1
    img[i_x, i_y] = s[1]
end
The image renders properly now.