Hi all,
Is it possible to write an equivalent of mapreduce(; dims) using either Transducers or FLoops? I’m interested in the NVIDIA GPU context.
I want to do something like:
julia> begin
           using CUDA, LazyArrays
           a = CUDA.rand(Float32, (100_000, 1))
           b = CUDA.rand(Float32, (1, 5_000))
           m = @~ a .+ b
           minimum(m, dims=2)
       end
100000×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.5155287
0.61906624
0.5843259
0.98962003
0.7108086
0.050809365
0.85040236
0.033321425
0.5592842
0.98937845
0.52907884
0.49551845
0.97432935
⋮
0.5292897
0.9853454
0.7320634
0.7673758
0.07746087
0.70310813
0.03921157
0.3690481
0.31347668
0.93635935
0.2896975
0.808064
I want to do a reduce operation over all elements of b for every element of a. It has to stay a lazy broadcast, since m would be prohibitively large to instantiate (and just wasteful). CUDA.jl’s reducedim doesn’t exploit the fact that the values of b are shared across all values of a. However, FoldsCUDA.jl does have warp shuffles, which suggests there could be a performance improvement. Note that the above is an MWE; I’m interested in reducing over a broadcasted function which takes many arguments (but still 2 dims).
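To be concrete, here is a plain-CPU sketch of the computation I’m after, with a placeholder f standing in for the real many-argument function (illustrative only, not my actual code):

f(x, y) = x + y  # placeholder for the many-argument broadcasted function

a = rand(Float32, 100_000, 1)
b = rand(Float32, 1, 5_000)

# One reduction over all of b per element of a; the 100_000×5_000
# intermediate is never allocated.
c = [minimum(f(a[i, 1], b[1, j]) for j in axes(b, 2)) for i in axes(a, 1)]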
Any suggestions are welcome; apparently I’m not human, since FLoops’ “folds for humans” logic is above me (:
A great starting point would probably be just a mapreduce with dims using Transducers, not even on GPU.
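Something along those lines might look like this on the CPU (just a sketch; I’m assuming Transducers’ foldl(op, xf, itr) method and plain Vectors here):

using Transducers

a = rand(Float32, 100_000)
b = rand(Float32, 5_000)

# One fold over b per element of a; nothing of size
# length(a) × length(b) is ever materialized.
c = map(a) do ai
    foldl(min, Map(bj -> ai + bj), b)
end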
The code in the OP and also the version using Tullio look good to me. The complicated warp-aware kernel in FoldsCUDA is for implementing mapreduce “rigorously” (= not assuming commutativity) while still trying to optimize the memory access pattern. But it is not optimal for computing minimum and similar reductions with commutative operations. On the other hand, CUDA.jl’s mapreduce (used for the Broadcasted of CuArrays, as in the OP) is optimized for commutative operations, IIRC. (Of course, FoldsCUDA probably needs to have commutative reductions and also dims support, but this space is already covered by various libraries such as CUDA.jl itself and Tullio.)
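(For reference, minimum(m, dims=2) in the OP is just sugar for the following mapreduce call, which is what ends up hitting CUDA.jl’s implementation for the lazy Broadcasted:)

mapreduce(identity, min, m; dims=2)  # equivalent to minimum(m, dims=2)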
Warning: Performing scalar indexing on task Task (runnable) @0x000000000afa0010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
This is what I mean by scalar access. The Tullio example is logically equivalent to my original post, so yes, there are either implicit or explicit scalar 1s. Whether the resulting GPU code is efficient is what I care about here. If that’s an obvious property of how Tullio works, I missed it because I’m a new user.
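As an aside, CUDA.jl lets you turn scalar indexing into a hard error, so fallbacks like this fail loudly instead of running slowly:

using CUDA

# Make scalar getindex/setindex! on CuArrays throw instead of warn.
CUDA.allowscalar(false)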
It should be no problem to have some fixed indices in there, although as Mason says not creating extra dimensions in the first place is cleaner.
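For example, with 1-D inputs there are no trivial dimensions at all (a sketch, assuming the packages loaded below):

let a = CUDA.rand(Float32, 100_000), b = CUDA.rand(Float32, 5_000)
    @tullio (min) c[i] := a[i] + b[j]  # reduce over j with min
end |> summary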
I think you will get that error if you have not loaded all the packages it needs. Not so obvious; maybe it should have some kind of warning. Without KernelAbstractions and CUDAKernels, it generates only ordinary loops.
julia> using Tullio, CUDA, KernelAbstractions, CUDAKernels
julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, k] + b[k, j]  # with trivial loop over k
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"
julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, 1] + b[1, j]  # with trivial dimensions fixed
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"
(Note also that this operation is vec(a .+ minimum(b)), which will be quicker. But I guess it’s just an example.)
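A quick sanity check of that identity (it should hold exactly, since Float32 + is monotone, so the min over j of a[i] + b[j] is attained at minimum(b)):

let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
    @tullio (min) c[i] := a[i, 1] + b[1, j]
    Array(c) == vec(Array(a) .+ minimum(b))  # expected: true
end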
@mcabbott That works very well (even if it was unclear that I needed to import KernelAbstractions and CUDAKernels). It performs the same as a custom kernel I wrote for this problem; obviously I’d prefer to have a library take care of that, so I’m glad Tullio can fill in. I’m a convert.
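For anyone curious, this is roughly how I timed it (a sketch; CUDA.@sync ensures the kernel has actually finished before @time stops the clock):

using CUDA, Tullio, KernelAbstractions, CUDAKernels

a = CUDA.rand(Float32, (100_000, 1))
b = CUDA.rand(Float32, (1, 5_000))

@time CUDA.@sync @tullio (min) c[i] := a[i, 1] + b[1, j]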