Transducers/FLoops equivalent for mapreduce with dims arg (on GPU)

Hi all,
Is it possible to write an equivalent of mapreduce(; dims) using either Transducers or FLoops? I’m interested in the NVidia GPU context.

I want to do something like:

julia> begin
       using CUDA, LazyArrays
       a = CUDA.rand(Float32, (100_000, 1))
       b = CUDA.rand(Float32, (1, 5_000))
       m = @~ a .+ b
       minimum(m, dims=2)
       end
100000×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.5155287
 0.61906624
 0.5843259
 0.98962003
 0.7108086
 0.050809365
 0.85040236
 0.033321425
 0.5592842
 0.98937845
 0.52907884
 0.49551845
 0.97432935
 ⋮
 0.5292897
 0.9853454
 0.7320634
 0.7673758
 0.07746087
 0.70310813
 0.03921157
 0.3690481
 0.31347668
 0.93635935
 0.2896975
 0.808064

I want to do a reduce operation over all elements of b for every element of a. It has to stay a lazy broadcast, since m would be prohibitively expensive (and just wasteful) to instantiate. CUDA.jl’s reducedim doesn’t exploit the fact that the values of b are shared across all values of a. FoldsCUDA.jl, however, does have warp shuffles, which suggests there could be a performance improvement. Note that the above is a MWE; I’m interested in reducing over a broadcasted function which takes many args (but still 2 dims).

Any suggestions are welcome; apparently I’m not human, since FLoops’ “folds for humans” logic is above me (:
A great starting point would probably be just a mapreduce with dims using Transducers, not even on GPU.
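On the CPU, a dims=2 reduction over a lazy broadcast is just an independent fold per row, so it can be sketched in plain Base without Transducers at all (sizes here are made up for illustration):

```julia
# Minimal CPU sketch: minimum over a lazy "a .+ b'" without materializing it.
a = rand(Float32, 8)   # stand-in for the 100_000 column values
b = rand(Float32, 5)   # stand-in for the 5_000 shared row values

# One mapreduce per element of a; nothing of size 8×5 is ever allocated.
c = [mapreduce(bj -> ai + bj, min, b) for ai in a]

# Same result as materializing the broadcast and reducing over dims=2.
@assert c == vec(minimum(a .+ reshape(b, 1, :), dims=2))
```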

One idea is @tullio (min) c[i] := a[i,1] + b[1,j]. I’m not certain, but I thought Transducers did not support reductions with dims.

The code in the OP and also the version using Tullio look good to me. The complicated warp-aware kernel in FoldsCUDA is for implementing mapreduce “rigorously” (= not assuming commutativity) while still trying to optimize the memory access pattern. But it is not optimal for computing minimum and similar reductions with commutative operations. On the other hand, CUDA.jl’s mapreduce (used for the Broadcasted of CuArrays, as in the OP) is optimized for commutative operations IIRC. (Of course, FoldsCUDA probably needs to have commutative reductions and also dims support, but this space is already covered by various libraries such as CUDA.jl itself and Tullio.)

Thanks both @mcabbott and @tkf .

“FoldsCUDA probably needs to have … dims support” - guess these aren’t the tools for my problem atm

@tullio (min) c[i] := a[i,1] + b[1,j] gives me a scalar access warning for CuArrays

well yeah, of course it does. It has a[i, 1] and b[1, j]. Those 1s are scalar.

Warning: Performing scalar indexing on task Task (runnable) @0x000000000afa0010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.

This is what I mean by scalar access. The Tullio example is logically equivalent to my original post, so yes, there are either implicit or explicit scalar 1s. Whether the resulting GPU code is efficient is what I care about here. If that’s an obvious property of how Tullio works, I missed it because I’m a new user.

Doing

julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, k] + b[k, j]
       end

works.

For possibly better performance though, you could also just

a = dropdims(a, dims=2)
b = dropdims(b, dims=1)

or just not create arrays with unnecessary singleton dimensions in the first place.
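For what it’s worth, the equivalence after dropping the singleton dimensions is easy to check on the CPU with plain Base (small made-up sizes):

```julia
# Dropping the trivial singleton dimensions changes nothing about the result.
a = rand(Float32, 6, 1)    # 6×1, like CUDA.rand(Float32, (100_000, 1))
b = rand(Float32, 1, 4)    # 1×4
av = dropdims(a, dims=2)   # 6-element Vector
bv = dropdims(b, dims=1)   # 4-element Vector
@assert vec(minimum(a .+ b, dims=2)) == vec(minimum(av .+ reshape(bv, 1, :), dims=2))
```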

1 Like

It should be no problem to have some fixed indices in there, although as Mason says not creating extra dimensions in the first place is cleaner.

I think you will get that warning if you have not loaded all the packages it needs. Not so obvious; maybe Tullio should warn about this. Without KernelAbstractions and CUDAKernels loaded, it generates only ordinary loops.

julia> using Tullio, CUDA, KernelAbstractions, CUDAKernels

julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, k] + b[k, j]  # with trivial loop over k
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"

julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, 1] + b[1, j]  # with trivial dimensions fixed
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"

(Note also that this operation is vec(a .+ minimum(b)), which will be quicker. But I guess it’s just an example.)
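(That identity holds because minimum distributes over a shared addend: x -> a_i + x is monotone, so min_j (a_i + b_j) == a_i + min_j b_j. A quick CPU check, with made-up small sizes:)

```julia
a = rand(Float32, 8, 1)
b = rand(Float32, 1, 5)
# min over each row of the broadcast equals adding the global minimum of b.
@assert vec(minimum(a .+ b, dims=2)) == vec(a .+ minimum(b))
```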

Ah I see, good to know!

1 Like

@mcabbott That works very well (even if it was unclear that I needed to import KernelAbstractions and CUDAKernels). It performs the same as a custom kernel I wrote for this problem; obviously I’d prefer to have a library take care of that, so I’m glad Tullio can fill in. I’m a convert.

1 Like