Why Atomix.@atomic b[] += a[i] works and b[] = b[] + a[i] does not

Hi all,

I am updating my lecture on GPU programming, and this year I would like to have kernels written in CUDA.jl and KernelAbstractions.jl side by side. The idea is to show students how similar GPU accelerators are, and that with the Julia ecosystem they can write fairly general code.

I am working on the classic reduction example, which @maleadt uses in his talks, but I have encountered some weird behavior.

Specifically, this kernel works:

using Metal, BenchmarkTools
using KernelAbstractions
import KernelAbstractions as KA
using Atomix

@kernel function reduce_atomic(op, a, b)
    i = @index(Global)
    # atomic read-modify-write: each work-item adds a[i] into the scalar b[]
    Atomix.@atomic b[] += a[i]
end

x = rand(Float32, 1024, 1024);
cx = MtlArray(x);
backend = KA.get_backend(cx);
cb = MtlArray([0f0]);  # single-element output buffer
reduce_atomic(backend, 64)(+, cx, cb, ndrange=size(cx))
Metal.GPUArraysCore.@allowscalar cb[]  # matches the CPU reference
sum(x)

while this one does not:

@kernel function reduce_atomic(op, a, b)
    i = @index(Global)
    # the same kernel, but written as an assignment instead of +=
    Atomix.@atomic b[] = b[] + a[i]
end

x = rand(Float32, 1024, 1024);
cx = MtlArray(x);
backend = KA.get_backend(cx);
cb = MtlArray([0f0]);
reduce_atomic(backend, 64)(+, cx, cb, ndrange=size(cx))
Metal.GPUArraysCore.@allowscalar cb[]  # does not match sum(x)
sum(x)

Can anyone please help me understand what is going on?

The @atomic macro only does a syntactic analysis of the operator separating the left-hand side from the right-hand side.

So it is semantically different to write +=, which we turn into an atomic increment (a single atomic read-modify-write), and =, which is just an atomic assignment.

The right-hand-side expression is evaluated independently, outside the atomic operation.
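
To see the difference in action, here is a minimal CPU sketch of the same race on a plain Array (this assumes Julia is started with several threads, e.g. julia -t 8; the function names are made up for illustration):

using Atomix
using Base.Threads: @threads

# Atomic read-modify-write: read, add, and write happen as one
# indivisible operation, so no increment is ever lost.
function sum_rmw(a)
    b = zeros(Int, 1)
    @threads for i in eachindex(a)
        Atomix.@atomic b[1] += a[i]
    end
    return b[1]
end

# Assignment form: the RHS reads b[1] non-atomically, then the result
# is stored atomically. Two threads can read the same old value, and
# one increment overwrites the other (a lost update).
function sum_store(a)
    b = zeros(Int, 1)
    @threads for i in eachindex(a)
        Atomix.@atomic b[1] = b[1] + a[i]
    end
    return b[1]
end

a = ones(Int, 1_000_000)
sum_rmw(a)    # always 1000000
sum_store(a)  # typically less than 1000000 with multiple threads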

That makes a lot of sense. Can you please recommend how to write it correctly?

@kernel function reduce_atomic(op, a, b)
    i = @index(Global)
    Atomix.@atomic b[] += a[i]
end

This is correct, no?

Perhaps the confusion is that CUDA.@atomic is different from Atomix.@atomic; IIRC the CUDA version might support the other variant as well, but that is not portable.
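
Since you want CUDA.jl and KernelAbstractions.jl side by side, the CUDA.jl counterpart of the working kernel could look roughly like this (an untested sketch; reduce_atomic_cuda! and the launch configuration are made up for illustration):

using CUDA

# plain CUDA.jl version of the same atomic reduction
function reduce_atomic_cuda!(a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        CUDA.@atomic b[1] += a[i]  # atomic read-modify-write, as with Atomix
    end
    return nothing
end

a = CUDA.rand(Float32, 1024 * 1024)
b = CUDA.zeros(Float32, 1)
@cuda threads=64 blocks=cld(length(a), 64) reduce_atomic_cuda!(a, b)
CUDA.@allowscalar b[1]  # compare with sum(a)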

This is correct, but it does not use op. What if op = max?

I would be fine knowing that CUDA.@atomic is more general and Atomix.@atomic supports only a subset, but in a portable way. I just do not want to rule out a possible solution.

Thanks a lot for help!

It’s less that CUDA.@atomic is more general. It was implemented first; then we added Atomix to support the CPU and other backends, and the Atomix design ended up influencing Base.@atomic.

So the general syntax should actually be:

@kernel function reduce_atomic(op, a, b)
    i = @index(Global)
    # three-argument form: apply op atomically, mirroring Base.@atomic
    Atomix.@atomic b[] op a[i]
end

But I am unsure whether that is currently supported by all backends.
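
If your backend does support it, a hypothetical call with op = max could look like this (untested sketch; note that the output buffer has to be initialized with the identity element of op, not with zero):

x = rand(Float32, 1024, 1024);
cx = MtlArray(x);
cb = MtlArray([typemin(Float32)]);  # identity element for max
reduce_atomic(KA.get_backend(cx), 64)(max, cx, cb, ndrange=size(cx))
Metal.GPUArraysCore.@allowscalar cb[]  # should match maximum(x)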

Thanks a lot, I understand now.