Hi all,
Is it possible to write an equivalent of mapreduce(; dims) using either Transducers or FLoops? I’m interested in the NVIDIA GPU context.
I want to do something like:
julia> begin
           using CUDA, LazyArrays
           a = CUDA.rand(Float32, (100_000, 1))
           b = CUDA.rand(Float32, (1, 5_000))
           m = @~ a .+ b
           minimum(m, dims=2)
       end
100000×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.5155287
0.61906624
0.5843259
0.98962003
0.7108086
0.050809365
0.85040236
0.033321425
0.5592842
0.98937845
0.52907884
0.49551845
0.97432935
⋮
0.5292897
0.9853454
0.7320634
0.7673758
0.07746087
0.70310813
0.03921157
0.3690481
0.31347668
0.93635935
0.2896975
0.808064
I want to do a reduce operation over all elements of b for every element of a. It has to stay a lazy broadcast, since m would be prohibitively large to instantiate (and just wasteful). CUDA.jl’s reducedim doesn’t exploit the fact that the values of b are shared across all values of a. However, FoldsCUDA.jl does have warp shuffles, which suggests there could be a performance improvement. Note that the above is an MWE; I’m interested in reducing over a broadcasted function which takes many arguments (but still 2 dims).
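To be concrete, here is a plain-CPU sketch of the computation I’m after, with a placeholder f standing in for the real many-argument function (illustrative only, not my actual code):

f(x, y) = x + y  # placeholder for the many-argument broadcasted function

a = rand(Float32, 100_000, 1)
b = rand(Float32, 1, 5_000)

# One reduction over all of b per element of a; the 100_000×5_000
# intermediate is never allocated.
c = [minimum(f(a[i, 1], b[1, j]) for j in axes(b, 2)) for i in axes(a, 1)]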
Any suggestions are welcome; apparently I’m not human, since FLoops’ “folds for humans” logic is above me (:
A great starting point would probably be just a mapreduce with dims using Transducers, not even on GPU.
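Something along those lines might look like this on the CPU (just a sketch; I’m assuming Transducers’ foldl(op, xf, itr) method and plain Vectors here):

using Transducers

a = rand(Float32, 100_000)
b = rand(Float32, 5_000)

# One fold over b per element of a; nothing of size
# length(a) × length(b) is ever materialized.
c = map(a) do ai
    foldl(min, Map(bj -> ai + bj), b)
end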
The code in the OP and also the version using Tullio look good to me. The complicated warp-aware kernel in FoldsCUDA is for implementing mapreduce “rigorously” (= not assuming commutativity) while still trying to optimize the memory access pattern. But it is not optimal for computing minimum and similar reductions with commutative operations. On the other hand, CUDA.jl’s mapreduce (used for the Broadcasted of CuArrays, as in the OP) is optimized for commutative operations, IIRC. (Of course, FoldsCUDA probably needs to have commutative reductions and also dims support, but this space is already covered by various libraries such as CUDA.jl itself and Tullio.)
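(For reference, minimum(m, dims=2) in the OP is just sugar for the following mapreduce call, which is what ends up hitting CUDA.jl’s implementation for the lazy Broadcasted:)

mapreduce(identity, min, m; dims=2)  # equivalent to minimum(m, dims=2)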
Warning: Performing scalar indexing on task Task (runnable) @0x000000000afa0010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
This is what I mean by scalar access. The Tullio example is logically equivalent to my original post, so yes, there are either implicit or explicit scalar 1s. Whether the resulting GPU code is efficient is what I care about here. If that’s an obvious property of how Tullio works, I missed it because I’m a new user.
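As an aside, CUDA.jl lets you turn scalar indexing into a hard error, so fallbacks like this fail loudly instead of running slowly:

using CUDA

# Make scalar getindex/setindex! on CuArrays throw instead of warn.
CUDA.allowscalar(false)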
It should be no problem to have some fixed indices in there, although as Mason says not creating extra dimensions in the first place is cleaner.
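For example, with 1-D inputs there are no trivial dimensions at all (a sketch, assuming the packages loaded below):

let a = CUDA.rand(Float32, 100_000), b = CUDA.rand(Float32, 5_000)
    @tullio (min) c[i] := a[i] + b[j]  # reduce over j with min
end |> summary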
I think you will get that error if you have not loaded all the packages it needs. Not so obvious; maybe it should have some kind of warning. Without KernelAbstractions and CUDAKernels, it generates only ordinary loops.
julia> using Tullio, CUDA, KernelAbstractions, CUDAKernels
julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, k] + b[k, j]  # with trivial loop over k
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"
julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, 1] + b[1, j]  # with trivial dimensions fixed
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"
(Note also that this operation is vec(a .+ minimum(b)), which will be quicker. But I guess it’s just an example.)
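A quick sanity check of that identity (it should hold exactly, since Float32 + is monotone, so the min over j of a[i] + b[j] is attained at minimum(b)):

let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
    @tullio (min) c[i] := a[i, 1] + b[1, j]
    Array(c) == vec(Array(a) .+ minimum(b))  # expected: true
end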
@mcabbott That works very well (even if it was unclear that I needed to import KernelAbstractions and CUDAKernels). It performs the same as a custom kernel I wrote for this problem; obviously I’d prefer to have a library take care of that, so I’m glad Tullio can fill in. I’m a convert.
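For anyone curious, this is roughly how I timed it (a sketch; CUDA.@sync ensures the kernel has actually finished before @time stops the clock):

using CUDA, Tullio, KernelAbstractions, CUDAKernels

a = CUDA.rand(Float32, (100_000, 1))
b = CUDA.rand(Float32, (1, 5_000))

@time CUDA.@sync @tullio (min) c[i] := a[i, 1] + b[1, j]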