Argmax mapreduce on GPU

Hello! I am trying to quickly compute

\text{argmax}_{\substack{1 \leq s \leq k\\ k+1 \leq t \leq n}} A_{s,t}^2 + (1-\ell_s)(1 + \ell_t)

I do this on the CPU with the following code.

using Folds
using Base.Iterators: product

f = ((i, j),) -> (i, j, A[i, j]^2 + (1 - l[i])*(1 + l[j]))  # value together with its indices
op = (x, y) -> x[3] > y[3] ? x : y                          # keep the pair with the larger value
col = product(1:k, (k+1):n)
i, j, volf = Folds.mapreduce(f, op, col; init=(0, 0, -Inf))

I would like to convert this code to something that can utilize CUDA. I have done so as follows.

using CUDA
using CUDA: CUBLAS, @allowscalar

l1 = CuVector(l[1:k])      # device copies of the two halves of l
l2 = CuVector(l[k+1:n])
C = CuMatrix{Float64}(undef, k, n-k)
copyto!(C, view(A, :, k+1:n))
C .^= 2
CUBLAS.ger!(1.0, 1 .- l1, 1 .+ l2, C)  # rank-1 update: C .+= (1 .- l1) * (1 .+ l2)'

s = argmax(C)
I = CartesianIndices(C)[s]
i, j = I[1], I[2] + k      # shift the column index back into (k+1):n
@allowscalar volf = C[s]

However, I’d like to do this without materializing every element of C in VRAM, as in the CPU version. I would also prefer to avoid writing my own CUDA kernel.

I have looked into using FoldsCUDA.jl, but it seems to be deprecated and doesn’t support recent CUDA versions. It is also not maintained by JuliaFolds2.

Any suggestions?

JuliaGPU/AcceleratedKernels.jl (cross-architecture parallel algorithms for Julia’s CPU and GPU backends: multithreaded CPUs, plus GPUs via Intel oneAPI, AMD ROCm, Apple Metal, and Nvidia CUDA) might be a good place to look for code like this.

Good suggestion, but see here. In particular, things like this “… just go a little out of scope for AK.”

I think the reduction kernels in GPUArrays can act on lazy Broadcasted objects, so you can probably make them do this for you:

julia> begin
       n = 10
       A = randn(n, n)
       l = randn(n)
       f = ((i, j),) -> (i, j, A[i, j]^2 + (1 - l[i])*(1 + l[j]))
       op = (x, y) -> x[3] > y[3] ? x : y
       col = Iterators.product(1:n, 1:n) # simplified from product(1:k, (k+1):n), just make views of A, l as necc.
       i, j, volf = mapreduce(f, op, col; init=(0, 0, -Inf))
       end
(7, 3, 13.250911638285407)

julia> argmax(@. A^2 + (1 - l)*(1 + l'))
CartesianIndex(7, 3)

julia> Meta.@lower @. A^2 + (1 - l)*(1 + l')
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1  = +
β”‚   %2  = ^
β”‚   %3  = A
β”‚   %4  = Core.apply_type(Base.Val, 2)
β”‚   %5  = (%4)()
β”‚   %6  = Base.broadcasted(Base.literal_pow, %2, %3, %5)
β”‚   %7  = *
β”‚   %8  = Base.broadcasted(-, 1, l)
β”‚   %9  = +
β”‚   %10 = var"'"(l)
β”‚   %11 = Base.broadcasted(%9, 1, %10)
β”‚   %12 = Base.broadcasted(%7, %8, %11)
β”‚   %13 = Base.broadcasted(%1, %6, %12)
β”‚   %14 = Base.materialize(%13)
└──       return %14
))))

julia> function lazy(A, l)
       x6 = Base.broadcasted(Base.literal_pow, ^, A, Val(2))
       x8 = Base.broadcasted(-, 1, l)
       x11 = Base.broadcasted(+, 1, l')
       x12 = Base.broadcasted(*, x8, x11)
       x13 = Base.broadcasted(+, x6, x12)
       end
lazy (generic function with 1 method)

# eager

julia> argmax(Base.materialize(lazy(A, l)))
CartesianIndex(7, 3)

julia> using JLArrays

julia> argmax(Base.materialize(lazy(jl(A), jl(l))))
CartesianIndex(7, 3)

# lazy

julia> maximum(lazy(A, l))  # just iterating, I believe
13.250911638285407

julia> maximum(x for x in lazy(A, l))
13.250911638285407

julia> maximum(lazy(jl(A), jl(l)))  # using GPUArrays reduction, as iteration fails
13.250911638285407

julia> maximum(x for x in lazy(jl(A), jl(l)))
ERROR: Scalar indexing is disallowed.

# argmax

julia> argmax(lazy(A, l))
ERROR: MethodError: no method matching keys(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{…}, Nothing, typeof(+), Tuple{…}})

julia> Base.keys(bc::Base.Broadcast.Broadcasted) = CartesianIndices(axes(bc))

julia> argmax(lazy(A, l))
CartesianIndex(7, 3)

julia> argmax(lazy(jl(A), jl(l)))
ERROR: Scalar indexing is disallowed.

# better idea

julia> function lazy3(A, l)
         a1, a2 = axes(A)
         bc = lazy(A, l)
         x14 = Base.broadcasted(tuple, bc, a1, a2')
       end
lazy3 (generic function with 1 method)

julia> maximum(lazy3(A, l))
(13.250911638285407, 7, 3)

julia> maximum(lazy3(jl(A), jl(l)))
ERROR: MethodError: no method matching typemin(::Type{Tuple{Float64, Int64, Int64}})
Stacktrace:
 [1] neutral_element(::typeof(max), T::Type)
   @ GPUArrays ~/.julia/packages/GPUArrays/ouBUA/src/host/mapreduce.jl:25
 [2] _mapreduce(f::typeof(identity), op::typeof(max), As::Base.Broadcast.Broadcasted{…}; dims::Colon, init::Nothing)
   @ GPUArrays ~/.julia/packages/GPUArrays/ouBUA/src/host/mapreduce.jl:49
...

julia> Base.typemin(::Type{Tuple{T,I,J}}) where {T,I,J} = map(typemin, (T,I,J))  # piracy... could overload GPUArrays.neutral_element instead

julia> maximum(lazy3(jl(A), jl(l)))
(13.250911638285407, 7, 3)
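Putting the pieces together for the original k × (n−k) problem, here is one possible sketch (not a tested implementation): it assumes the `typemin` overload from above is in effect, that ranges and lazy views broadcast fine through the GPUArrays reduction kernels, and the helper name `argmax_block` is made up for illustration. It builds the whole tuple-carrying Broadcasted over views of A and l and reduces it in a single pass, with nothing materialized in VRAM:

using CUDA  # sketch; any GPUArrays backend (or plain Arrays) should behave the same

# pirated neutral element from the thread above, needed for GPU reductions over tuples
Base.typemin(::Type{Tuple{T,I,J}}) where {T,I,J} = map(typemin, (T,I,J))

function argmax_block(A, l, k)   # hypothetical helper, not a library function
    n = size(A, 2)
    Av = view(A, 1:k, (k+1):n)             # the A[s,t] block, taken lazily
    l1 = view(l, 1:k)
    l2 = view(l, (k+1):n)
    bc6  = Base.broadcasted(Base.literal_pow, ^, Av, Val(2))
    bc8  = Base.broadcasted(-, 1, l1)
    bc11 = Base.broadcasted(+, 1, l2')
    bc12 = Base.broadcasted(*, bc8, bc11)
    bc13 = Base.broadcasted(+, bc6, bc12)  # A[s,t]^2 + (1 - l[s])*(1 + l[t])
    tup  = Base.broadcasted(tuple, bc13, 1:k, ((k+1):n)')  # carry (value, s, t)
    volf, i, j = maximum(tup)              # one fused reduction; tuples compare lexicographically
    return i, j, volf
end

Since Julia compares tuples lexicographically, the value must come first in each tuple; ties in the value are then broken by the smaller indices, which may differ from the CPU version’s tie-breaking.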