Finding index and value of largest two elements in a CuArray

I need to find the indices and values of the two largest elements in a 2D array. I whipped up a foldl-based solution for the CPU that’s much faster than Base.findmax and Base.partialsortperm, but it runs like molasses on the GPU compared to CUDA.findmax.

function findmax2(prev, curr)
    i, v = curr
    # New overall maximum: demote the previous first place to second.
    v > prev.first.v && return (first=(; i, v), second=prev.first)
    # New runner-up: keep first place, replace second.
    v > prev.second.v && return (first=prev.first, second=(; i, v))
    return prev
end
julia> dA = CUDA.rand(512, 512); cA = Array(dA);

CPU:

julia> @btime foldl(findmax2, enumerate($cA), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
  284.799 μs (0 allocations: 0 bytes)
(first = (i = 20521, v = 0.99999803f0), second = (i = 23437, v = 0.9999958f0))

julia> @btime findmax($cA)
  798.600 μs (0 allocations: 0 bytes)
(0.99999803f0, CartesianIndex(41, 41))

julia> @btime partialsortperm(vec($cA), 1:2, rev=true)
  3.654 ms (7 allocations: 2.00 MiB)
2-element view(::Array{Int64,1}, 1:2) with eltype Int64:
 20521
 23437

GPU:

GPU benchmark wrappers
function run_findmax(A)
    CUDA.@sync begin
        ans = findmax(A)
    end
    return ans
end

function run_findmax2(A)
    CUDA.@sync begin
        ans = foldl(findmax2, enumerate(A), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
    end
    return ans
end
julia> @btime run_findmax2($dA)
  2.548 s (262148 allocations: 24.00 MiB)
(first = (i = 20521, v = 0.99999803f0), second = (i = 23437, v = 0.9999958f0))

julia> @btime run_findmax($dA)
  120.799 μs (123 allocations: 4.80 KiB)
(0.99999803f0, CartesianIndex(41, 41))

I tried to get something working with @tkf’s FoldsCUDA.jl, and it produces fast but slightly incorrect results on the GPU, throws an error with threaded CPU execution, and produces slow but correct results with single-threaded CPU execution.

FLoops.jl / FoldsCUDA.jl
function folds_findmax2(xs, ex = xs isa CuArray ? CUDAEx() : ThreadedEx())
    xtypemin = typemin(eltype(xs))
    @floop ex for (i, x) in zip(eachindex(xs), xs)
        @reduce() do (xmax1=xtypemin; x), (imax1=-1; i)
            if isless(xmax1, x)
                imax2, xmax2 = imax1, xmax1
                imax1, xmax1 = i, x
            end
        end
        i == imax1 && continue
        @reduce() do (xmax2=xtypemin; x), (imax2=-1; i)
            if isless(xmax2, x)
                imax2, xmax2 = i, x
            end
        end
    end
    return ((imax1, xmax1), (imax2, xmax2))
end

function run_folds_findmax2(dA)
    CUDA.@sync begin
        ans = folds_findmax2(dA)
    end
    return ans
end
julia> folds_findmax2(cA)
ERROR: ArgumentError: `halve(zip(...))` requires collections with identical `size`
Stacktrace:
 [1] shape at C:\Users\alexa\.julia\packages\SplittablesBase\dZ3Qr\src\implementations.jl:226 [inlined]
...

julia> @btime run_folds_findmax2($dA)
  173.600 μs (137 allocations: 5.67 KiB)
((20521, 0.99999803f0), (98901, 0.9993288f0))

julia> @btime folds_findmax2($cA, SequentialEx(simd = Val(true)))
  1.118 ms (0 allocations: 0 bytes)
((20521, 0.99999803f0), (23437, 0.9999958f0))

Any insights on where I went wrong with FoldsCUDA, or how to avert (invisible?) scalar indexing for my original foldl-based solution?
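
For reference, one way to surface the hidden scalar indexing is CUDA.jl's allowscalar switch, which turns the silent slowdown into an explicit error (the exact error text varies between versions):

julia> CUDA.allowscalar(false)

julia> foldl(findmax2, enumerate(dA), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
ERROR: scalar getindex is disallowed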

(edited to include single-threaded FLoops.jl results)

The best strategy to parallelize something like this is to come up with a pair of functions:

  1. f: a function that creates a singleton solution from a single element
    • Imagine you have an array with a single element. What is the answer you want?
  2. op: an associative function that merges two solutions
    • How do you merge two outputs of f? How do you make the output of op feedable back into both of its argument positions?

and then feed them to mapreduce(f, op, xs).
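
As a warm-up, here is that pattern for the single largest element (a sketch; single and pickmax are made-up names):

single(p) = p                               # f: one index => value pair is already a solution
pickmax(a, b) = last(a) >= last(b) ? a : b  # op: associative merge of two solutions

mapreduce(single, pickmax, pairs([3, 0, 5, 6, 1]))  # 4 => 6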

The associativity of the second function op is the most important ingredient in data-parallel programming. Unfortunately, the definition of findmax2(prev, curr) in the OP requires different types for the first and second arguments (a named-tuple accumulator and a bare (index, value) pair), so its output cannot be fed back into the second argument position; it's not an associative function.

There are multiple ways to implement this. Probably the easiest way to do it is to use tuples as the input of op. The “merge” operation is then simply concatenation, sorting and truncation:

using Baselet

function merge_max2(a, b)
    # Concatenate the two candidate tuples, sort by value (descending),
    # and keep the two largest.
    x, y = Baselet.sort((a..., b...); by = last, rev = true)
    return (x, y)
end

mapreduce(tuple, merge_max2, pairs([3, 0, 5, 6, 1]))  # (4 => 6, 3 => 5)

This approach is rather nice since it’s possible to generalize to n max/min elements (for small n).
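
For instance, a sketch of that generalization (merge_maxn and init3 are made-up names; passing an n-tuple of sentinel pairs as init keeps every partial result at exactly n candidates, so the Val(n) truncation is always valid, on the CPU and in a tree reduction alike):

function merge_maxn(a, b, ::Val{n}) where {n}
    s = Baselet.sort((a..., b...); by = last, rev = true)
    return ntuple(i -> s[i], Val(n))  # keep only the n best
end

init3 = ntuple(_ -> 0 => typemin(Int), Val(3))
mapreduce(tuple, (a, b) -> merge_maxn(a, b, Val(3)), pairs([3, 0, 5, 6, 1]); init = init3)
# (4 => 6, 3 => 5, 1 => 3)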

(Note: Baselet.sort is not the only function that implements sorting for "small" containers. For example, StaticArrays.jl also implements sorting. See also TupleTools.jl. It might be better to use partialsort instead, if available.)
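
For example, a drop-in variant of merge_max2 using TupleTools (assuming TupleTools.sort accepts the usual by/rev keywords; check the package README):

using TupleTools

function merge_max2_tt(a, b)
    # Same concatenate-sort-truncate merge, with TupleTools' tuple sort.
    x, y = TupleTools.sort((a..., b...); by = last, rev = true)
    return (x, y)
end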

You can also use @floop to do this, since @floop is "just" syntactic sugar for (a generalization of) mapreduce:

function findmax2_floop(xs; ex)
    @floop ex for (i, x) in zip(eachindex(xs), xs)
        b = tuple(i => x)
        @reduce() do (a; b)
            x, y = Baselet.sort((a..., b...); by = last, rev = true)
            a = (x, y)
        end
    end
    return a
end

On GPU, you might want to set the initial value. Something like init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs)))) might work.


BTW, this @floop usage

    @floop ex for (i, x) in zip(eachindex(xs), xs)
        @reduce() do (xmax1=xtypemin; x), (imax1=-1; i)
            if isless(xmax1, x)
                imax2, xmax2 = imax1, xmax1
                imax1, xmax1 = i, x
            end
        end
        i == imax1 && continue

is not correct; nothing should "leak out" from inside @reduce. The accumulator variables defined there (imax1, xmax1, …) are local to each parallel base case, so reading imax1 in the loop body only observes a partial state. I should document this clearly and ideally throw an error with a helpful message.

Hmm, still getting funky results here for CUDAEx(), although your approach works with ThreadedEx()!

using Baselet
function findmax2_floop(xs; ex)
    xtypemin = typemin(eltype(xs))
    @floop ex for (i, x) in zip(eachindex(xs), vec(xs))
        b = tuple(i => x)
        @reduce() do (a = (-1 => xtypemin, -1 => xtypemin); b)
            x, y = Baselet.sort((a..., b...); by = last, rev = true)
            a = (x, y)
        end
    end
    return a
end
julia> findmax2_floop(A, ex=ThreadedEx())
(1 => 1.0f0, 262144 => 0.5f0)

julia> findmax2_floop(dA, ex=CUDAEx())
(1 => 1.0f0, 1 => 1.0f0)

This works for me:

julia> function findmax2_floop(xs; ex)
           init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs))))
           @floop ex for (x, i) in zip(xs, eachindex(xs))
               b = tuple(i => x)
               @reduce() do (a = init; b)
                   x, y = Baselet.sort((a..., b...); by = last, rev = true)
                   a = (x, y)
               end
           end
           return a
       end
findmax2_floop (generic function with 1 method)

julia> xs = CuArray([3, 0, 5, 6, 1])
5-element CuArray{Int64,1}:
 3
 0
 5
 6
 1

julia> findmax2_floop(xs; ex = CUDAEx())
(4 => 6, 3 => 5)

julia> xs = CUDA.rand(1000_0000);

julia> xs[200] = 2
2

julia> xs[300] = 3
3

julia> findmax2_floop(xs; ex = CUDAEx())
(300 => 3.0f0, 200 => 2.0f0)

You can also try bare CUDA.jl:

julia> function merge_max2(a, b)
           x, y = Baselet.sort((a..., b...); by = last, rev = true)
           return (x, y)
       end
merge_max2 (generic function with 1 method)

julia> singleton_max2(x, i) = ((i => x),)
singleton_max2 (generic function with 1 method)

julia> xs = CuArray([3, 0, 5, 6, 1]);

julia> mapreduce(singleton_max2, merge_max2, xs, eachindex(xs); init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs)))))
(4 => 6, 3 => 5)

I can confirm that findmax2_floop works for your small test array, but it fails for my test input:

julia> xs_TKF = CuArray([3, 0, 5, 6, 1])
5-element CuArray{Int64,1}:
 3
 0
 5
 6
 1

julia> findmax2_floop(xs_TKF, ex=CUDAEx())
(4 => 6, 3 => 5)
julia> xs_SS = CUDA.zeros(512, 512); xs_SS[1] = 1; xs_SS[end] = 0.5;

julia> findmax2_floop(xs_SS, ex=CUDAEx())
(1 => 1.0f0, 1 => 1.0f0)

I also tested it with CUDA.rand(1000_0000). I can't reproduce your example; both the FLoops and CUDA versions work for me. Here is a MWE script with Manifest.toml: https://gist.github.com/81f674b14177abaf8f82229f23a10a77. I checked it with Julia 1.5.2.

Try running it in a clean environment and check if you still see the test failure. In a UNIX-y OS, it’d be something like JULIA_LOAD_PATH=@ JULIA_PROJECT=$PWD julia --startup-file=no mwe.jl (after instantiation).

Hmm, I’m still seeing the error. Maybe an arcane problem with my CUDA installation? I don’t yet have enough experience with CUDA debugging tools to track down the root cause.

The Base.mapreduce solution produces correct results on the GPU, but it’s ~3x slower than on the CPU, and ~2x slower than an extremely stupid brute-force GPU implementation:

function bruteforcemax2(xs)
    xmax1, imax1 = findmax(xs)
    # Temporarily knock out the maximum, find the runner-up, then restore.
    CUDA.@allowscalar xs[imax1] = typemin(eltype(xs))
    xmax2, imax2 = findmax(xs)
    CUDA.@allowscalar xs[imax1] = xmax1
    (imax1 => xmax1, imax2 => xmax2)
end

function run_bruteforcemax2(xs)
    CUDA.@sync begin
        out = bruteforcemax2(xs)
    end
    out
end
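
(findmax2_cuda and its run_findmax2_cuda wrapper below aren't shown in this thread; presumably findmax2_cuda applies @tkf's bare mapreduce snippet to the flattened array, roughly like this sketch:)

function findmax2_cuda(xs)
    v = vec(xs)
    sentinel = firstindex(v) - 1 => typemin(eltype(v))
    # singleton_max2 / merge_max2 as defined by @tkf above
    mapreduce(singleton_max2, merge_max2, v, eachindex(v); init = (sentinel, sentinel))
end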
julia> @btime run_bruteforcemax2($dA)
  234.699 μs (333 allocations: 11.11 KiB)
(CartesianIndex(1, 1) => 1.0f0, CartesianIndex(512, 512) => 0.5f0)

julia> @btime run_findmax2_cuda($dA)
  435.999 μs (94 allocations: 2.00 MiB)
(1 => 1.0f0, 262144 => 0.5f0)
Test results
$ JULIA_LOAD_PATH=@ JULIA_PROJECT=$PWD julia --startup-file=no mwe.jl
Test Summary:    | Pass  Total
(4 => 6, 3 => 5) |    2      2
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\alexa\.julia\packages\GPUArrays\jhRU7\src\host\indexing.jl:43
Test Summary:        | Pass  Total
(300 => 3, 200 => 2) |    2      2
(1 => 1.0f0, 262144 => 0.5f0): Test Failed at C:\Users\alexa\Downloads\81f674b14177abaf8f82229f23a10a77-85c8bdce34b36bdd288ef4937e1ccff81cba4736\mwe.jl:51
  Expression: findmax2_floop(xs) == (1 => 1.0f0, 262144 => 0.5f0)
   Evaluated: (1 => 1.0f0, 1 => 1.0f0) == (1 => 1.0f0, 262144 => 0.5f0)
Stacktrace:
 [1] top-level scope at C:\Users\alexa\Downloads\81f674b14177abaf8f82229f23a10a77-85c8bdce34b36bdd288ef4937e1ccff81cba4736\mwe.jl:51
 [2] top-level scope at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:1115
 [3] top-level scope at C:\Users\alexa\Downloads\81f674b14177abaf8f82229f23a10a77-85c8bdce34b36bdd288ef4937e1ccff81cba4736\mwe.jl:48
Test Summary:                 | Pass  Fail  Total
(1 => 1.0f0, 262144 => 0.5f0) |    1     1      2
ERROR: LoadError: Some tests did not pass: 1 passed, 1 failed, 0 errored, 0 broken.
in expression starting at C:\Users\alexa\Downloads\81f674b14177abaf8f82229f23a10a77-85c8bdce34b36bdd288ef4937e1ccff81cba4736\mwe.jl:47

I think it might be better to focus on the performance issue of mapreduce on your machine, since what FoldsCUDA does is quite similar to how mapreduce is implemented in CUDA.jl. It may be worth asking the CUDA.jl devs for help (cc @maleadt).

FYI, findmax2_cuda(::CuArray) is comparable to findmax(::CuArray) and much faster than findmax2_cuda(::Array) for me:

julia> xs = CUDA.rand(1000_0000);

julia> xs0 = Array(xs);

julia> @btime findmax2_cuda(xs0);
  222.814 ms (13 allocations: 152.59 MiB)

julia> @btime CUDA.@sync findmax2_cuda(xs);
  147.533 μs (77 allocations: 2.13 KiB)

julia> @btime CUDA.@sync findmax(xs);
  194.248 μs (129 allocations: 3.48 KiB)

And with the new findmin/max implementation using single-pass reduction (JuliaGPU/CUDA.jl PR #576) it should be even faster 🙂

Yep, looks like a 2x speedup with the single-pass reduction!

Can the new findmin/max be backported to a Julia v1.5-compatible patch release? The LazyArtifacts.jl dependency prevents CUDA#master from being installed on versions lower than v1.6.

I’ve created such a release here, https://github.com/JuliaGPU/CUDA.jl/pull/623, please test.

I’m getting a variety of test errors, but I get similar errors for v2.3.0, which leads me to suspect some flaw in my CUDA installation. Your speedy new findmin/max works, though!