Transducers/FLoops equivalent for mapreduce with dims arg (on GPU)

It should be no problem to have some fixed indices in there, although, as Mason says, not creating the extra dimensions in the first place is cleaner.

I think you will get that error if you have not loaded all the packages Tullio needs. This is not so obvious, and maybe it should give some kind of warning. Without KernelAbstractions and CUDAKernels loaded, it generates only ordinary loops.

julia> using Tullio, CUDA, KernelAbstractions, CUDAKernels

julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, k] + b[k, j]  # with trivial loop over k
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"

julia> let a = CUDA.rand(Float32, (100_000, 1)), b = CUDA.rand(Float32, (1, 5_000))
           @tullio (min) c[i] := a[i, 1] + b[1, j]  # with trivial dimensions fixed
       end |> summary
"100000-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}"

(Note also that this operation is equivalent to vec(a .+ minimum(b)), which will be quicker because it reduces b once instead of looping over every (i, j) pair. But I guess it’s just an example.)
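
In case it helps, here is a rough sketch of how one might check that equivalence on the CPU with plain Base arrays, no Tullio or CUDA required. The sizes are shrunk and the names c_full and c_fast are made up for illustration. I would expect the same two expressions to work on CuArrays too, but I have not timed them here.

```julia
# CPU-only sanity check using Base arrays; sizes are smaller than in the GPU example.
a = rand(Float32, 1_000, 1)   # column with a trivial second dimension
b = rand(Float32, 1, 50)      # row with a trivial first dimension

# Direct version: broadcast to a 1_000×50 matrix, then take the minimum over j.
c_full = vec(minimum(a .+ b; dims=2))

# Shortcut: reduce b to a single number first, then add it to every a[i].
c_fast = vec(a .+ minimum(b))

@assert size(c_full) == (1_000,)
@assert c_full ≈ c_fast
```

The shortcut works because adding a[i] does not change which b[j] is smallest, so the reduction over j can be done once, before the broadcast.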