Scatter/gather operation with Metal

matthieu · June 29, 2023, 11:57am

Hey everyone.

I would like to convert the following scatter/gather operation from the CPU to the GPU, using Metal.jl.

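function gather!(y::AbstractVector, refs::AbstractVector, x::AbstractVector)
   # Iterate over the inputs (refs and x), not over the output y:
   # each x[i] is accumulated into the bin y[refs[i]].
   for i in eachindex(refs, x)
      y[refs[i]] += x[i]
   end
   return y
end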

I have written the following code:

using Metal
function gather!(y::MtlVector, refs::MtlVector, x::MtlVector; nthreads = 256)
   # Launch one thread per input element, so size the grid from refs, not y.
   nblocks = cld(length(refs), nthreads)
   Metal.@sync @metal threads=nthreads groups=nblocks gather_kernel!(y, refs, x)
   return y
end

function gather_kernel!(y, refs, x)
   i = thread_position_in_grid_1d()
   if i <= length(refs)
      Metal.atomic_fetch_add_explicit(pointer(y, refs[i]), x[i])
   end
   return nothing
end

This works, but it is much slower than the CPU version. Is there a way to speed it up? Tim suggested I look into threadgroup memory, but I don’t know what that means (I really don’t know much about GPUs!). See the GitHub thread for reference.

maleadt · June 30, 2023, 6:10am

You’ll have to educate yourself then. There’s no easy way to write kernels without knowing about parallel programming. Luckily, there are plenty of resources online, and you can mostly refer to CUDA material and substitute the intrinsics for Metal ones (shared memory → threadgroup memory, threadIdx() and blockIdx() → thread_position, etc). For example, see “scatter and gather with CUDA?” in the CUDA Programming and Performance category of the NVIDIA Developer Forums. In general, this kind of pattern requires an efficient parallel reduction; you can’t just hammer global memory and expect the kernel to perform well. See for example our mapreduce implementation: https://github.com/JuliaGPU/Metal.jl/blob/main/src/mapreduce.jl, where we perform several tricks to avoid using global memory (threadgroup memory, SIMD intrinsics, etc).
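To give a rough idea of what avoiding global memory means here, the following is a plain-CPU model of the “accumulate locally, then merge” pattern, not a Metal kernel. Each chunk of inputs stands in for one threadgroup, and the scratch buffer stands in for threadgroup memory. The names scatter_add_chunked! and chunk are hypothetical, and the sketch assumes y is small enough that a full copy of it would fit in threadgroup memory on a real device.

```julia
# CPU model of the "privatize, then merge" pattern; this is NOT a Metal kernel.
# scatter_add_chunked! and chunk are hypothetical names used for illustration.
function scatter_add_chunked!(y::AbstractVector, refs::AbstractVector,
                              x::AbstractVector; chunk::Int = 256)
    n = length(refs)
    # The scratch buffer plays the role of threadgroup memory.
    # It is reset at the start of every chunk.
    scratch = zeros(eltype(y), length(y))
    for start in 1:chunk:n
        fill!(scratch, zero(eltype(y)))
        stop = min(start + chunk - 1, n)
        # Phase 1: every element of the chunk accumulates into the local buffer.
        for i in start:stop
            scratch[refs[i]] += x[i]
        end
        # Phase 2: a single pass merges the chunk's partial sums into y,
        # so y is touched once per chunk rather than once per element.
        y .+= scratch
    end
    return y
end

y = scatter_add_chunked!(zeros(Float32, 3), [1, 3, 1, 2], Float32[1, 2, 3, 4]; chunk = 2)
```

On a GPU, phase 1 would still need atomics or a reduction inside the threadgroup, since its threads run concurrently. The point of the sketch is only that global memory gets far fewer writes.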

As an alternative, try to rephrase your problem in terms of existing array operations that we’ve implemented for you (like mapreducedim). But do know that for Metal.jl, these haven’t been as optimized as for CUDA.jl.
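One possible rephrasing, as a sketch only: build a bins × inputs matrix of contributions with broadcasting, then reduce along the input dimension with sum(...; dims=2). The function name scatter_add_dense! is hypothetical. The example runs on plain CPU arrays. The same generic broadcast and reduction calls are expected to dispatch to the GPU implementations for MtlVector inputs, but that is an assumption and is not tested here. Note that the intermediate matrix has length(y) × length(x) entries, so this only makes sense when the number of bins is small.

```julia
# Scatter-add written with broadcasting plus a dimension-wise reduction.
# scatter_add_dense! is a hypothetical name, not part of Metal.jl.
function scatter_add_dense!(y::AbstractVector, refs::AbstractVector, x::AbstractVector)
    bins = 1:length(y)                      # column of bin indices
    refs_row = reshape(refs, 1, :)          # 1 × n row of target bins
    x_row = reshape(x, 1, :)                # 1 × n row of values
    # contrib[k, i] == x[i] if input i targets bin k, and zero otherwise.
    contrib = ifelse.(bins .== refs_row, x_row, zero(eltype(x)))
    # Summing each row yields the total contribution for each bin.
    y .+= vec(sum(contrib; dims = 2))
    return y
end

y = scatter_add_dense!(zeros(Float32, 3), [1, 3, 1, 2], Float32[1, 2, 3, 4])
```

This trades memory and redundant work for the absence of atomics, so it is worth benchmarking against the atomic kernel before adopting it.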
