# Finding index and value of largest two elements in a CuArray

I need to find the indices and values of the two largest elements in a 2D array. I whipped up a `foldl`-based solution for the CPU that’s much faster than `Base.findmax` and `Base.partialsortperm`, but it runs like molasses on the GPU compared to `CUDA.findmax`.

```julia
function findmax2(prev, curr)
    i, v = curr
    v > prev.first.v && return (first=(; i, v), second=prev.first)
    v > prev.second.v && return (first=prev.first, second=(; i, v))
    return prev
end
```
```julia
julia> dA = CUDA.rand(512, 512); cA = Array(dA);
```

CPU:

```julia
julia> @btime foldl(findmax2, enumerate($cA), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
  284.799 μs (0 allocations: 0 bytes)
(first = (i = 20521, v = 0.99999803f0), second = (i = 23437, v = 0.9999958f0))

julia> @btime findmax($cA)
  798.600 μs (0 allocations: 0 bytes)
(0.99999803f0, CartesianIndex(41, 41))

julia> @btime partialsortperm(vec($cA), 1:2, rev=true)
  3.654 ms (7 allocations: 2.00 MiB)
2-element view(::Array{Int64,1}, 1:2) with eltype Int64:
 20521
 23437
```

GPU:

GPU benchmark wrappers
```julia
function run_findmax(A)
    CUDA.@sync begin
        ans = findmax(A)
    end
    return ans
end

function run_findmax2(A)
    CUDA.@sync begin
        ans = foldl(findmax2, enumerate(A), init=(first=(i=0, v=0.0f0), second=(i=0, v=0.0f0)))
    end
    return ans
end
```
```julia
julia> @btime run_findmax2($dA)
  2.548 s (262148 allocations: 24.00 MiB)
(first = (i = 20521, v = 0.99999803f0), second = (i = 23437, v = 0.9999958f0))

julia> @btime run_findmax($dA)
  120.799 μs (123 allocations: 4.80 KiB)
(0.99999803f0, CartesianIndex(41, 41))
```

I tried to get something working with @tkf’s FoldsCUDA.jl, and it produces fast but slightly incorrect results on the GPU, throws an error with threaded CPU execution, and produces slow but correct results with single-threaded CPU execution.

FLoops.jl / FoldsCUDA.jl
```julia
function folds_findmax2(xs, ex = xs isa CuArray ? CUDAEx() : ThreadedEx())
    xtypemin = typemin(eltype(xs))
    @floop ex for (i, x) in zip(eachindex(xs), xs)
        @reduce() do (xmax1 = xtypemin; x), (imax1 = -1; i)
            if isless(xmax1, x)
                imax2, xmax2 = imax1, xmax1
                imax1, xmax1 = i, x
            end
        end
        i == imax1 && continue
        @reduce() do (xmax2 = xtypemin; x), (imax2 = -1; i)
            if isless(xmax2, x)
                imax2, xmax2 = i, x
            end
        end
    end
    return ((imax1, xmax1), (imax2, xmax2))
end

function run_folds_findmax2(dA)
    CUDA.@sync begin
        ans = folds_findmax2(dA)
    end
    return ans
end
```
```julia
julia> folds_findmax2(cA)
ERROR: ArgumentError: `halve(zip(...))` requires collections with identical `size`
Stacktrace:
 [1] shape at C:\Users\alexa\.julia\packages\SplittablesBase\dZ3Qr\src\implementations.jl:226 [inlined]
 ...

julia> @btime run_folds_findmax2($dA)
  173.600 μs (137 allocations: 5.67 KiB)
((20521, 0.99999803f0), (98901, 0.9993288f0))

julia> @btime folds_findmax2($cA, SequentialEx(simd = Val(true)))
  1.118 ms (0 allocations: 0 bytes)
((20521, 0.99999803f0), (23437, 0.9999958f0))
```

Any insights on where I went wrong with FoldsCUDA, or how to avert (invisible?) scalar indexing for my original `foldl`-based solution?

(edited to include single-threaded FLoops.jl results)

The best strategy to parallelize something like this is to come up with a pair of functions:

1. `f`: a function that creates a singleton solution given a single element
• Imagine you have an array with a single element. What is the answer you want?
2. `op`: an associative function that merges two solutions
• How to merge two outputs of `f`? How to make the output of `op` re-feedable to both of the argument positions?

and then feed them to `mapreduce(f, op, xs)`.
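As a minimal illustration of this recipe (my own example, not from the original posts): finding a single maximum with its index fits the same mold, where `f` wraps each element into an index-value pair and `op` keeps the larger of two candidate solutions:

```julia
# f: build a singleton solution (an index => value pair) from one element.
singleton_max(x, i) = i => x

# op: associative merge of two candidate solutions -- keep the larger value.
merge_max(a, b) = last(a) >= last(b) ? a : b

mapreduce(singleton_max, merge_max, [3, 0, 5, 6, 1], 1:5)  # → 4 => 6
```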

The associativity of the second function `op` is the most important ingredient in data-parallel programming. Unfortunately, the definition of `findmax2(prev, curr)` in the OP requires different types for the first and second arguments, so it is not an associative function.
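To make that asymmetry concrete (my own sketch, reusing `findmax2` from the original post): the accumulator is a nested named tuple while each element is a plain `(index, value)` tuple, so two "elements" cannot be merged with each other, and the reduction cannot be re-grouped across parallel chunks:

```julia
function findmax2(prev, curr)  # definition from the original post
    i, v = curr
    v > prev.first.v && return (first=(; i, v), second=prev.first)
    v > prev.second.v && return (first=prev.first, second=(; i, v))
    return prev
end

acc = (first=(i=0, v=0.0f0), second=(i=0, v=0.0f0))
findmax2(acc, (1, 0.5f0))          # OK: accumulator on the left, element on the right
# findmax2((1, 0.5f0), (2, 0.7f0)) # errors: a plain tuple has no `first.v`
```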

There are multiple ways to implement this. Probably the easiest way to do it is to use tuples as the input of `op`. The “merge” operation is then simply concatenation, sorting and truncation:

```julia
using Baselet

function merge_max2(a, b)
    x, y = Baselet.sort((a..., b...); by = last, rev = true)
    return (x, y)
end

mapreduce(tuple, merge_max2, pairs([3, 0, 5, 6, 1]))  # (4 => 6, 3 => 5)
```

This approach is rather nice since it’s possible to generalize to `n` max/min elements (for small `n`).

(Note: `Baselet.sort` is not the only function that implements sorting for "small" containers. For example, StaticArrays.jl also implements sorting. See also TupleTools.jl. It might be better to use `partialsort` instead, if available.)
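A hedged sketch of that generalization (my own code, not from the thread; `merge_maxn` and `singleton_maxn` are made-up names): the merge concatenates two candidate tuples, sorts by value, and truncates back to `n`. A plain `sort!` on a collected vector keeps it dependency-free on the CPU; a tuple-level sort such as `Baselet.sort` would be needed for GPU execution:

```julia
# Merge two candidate top-n tuples of (index => value) pairs.
function merge_maxn(a, b, n)
    merged = sort!(collect((a..., b...)); by = last, rev = true)
    return ntuple(i -> merged[i], n)  # truncate back to the n best
end

singleton_maxn(x, i) = (i => x,)

# Top-3 of a small vector via mapreduce:
xs = [3, 0, 5, 6, 1]
init = ntuple(_ -> 0 => typemin(Int), 3)
mapreduce(singleton_maxn, (a, b) -> merge_maxn(a, b, 3), xs, eachindex(xs); init)
# → (4 => 6, 3 => 5, 1 => 3)
```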

You can also use `@floop` to do this since `@floop` is “just” a syntax sugar to (a generalization of) `mapreduce`:

```julia
function findmax2_floop(xs; ex)
    @floop ex for (i, x) in zip(eachindex(xs), xs)
        b = tuple(i => x)
        @reduce() do (a; b)
            x, y = Baselet.sort((a..., b...); by = last, rev = true)
            a = (x, y)
        end
    end
    return a
end
```

On the GPU, you might want to set the initial value. Something like `init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs))))` might work.

BTW, this `@floop` usage

```julia
@floop ex for (i, x) in zip(eachindex(xs), xs)
    @reduce() do (xmax1 = xtypemin; x), (imax1 = -1; i)
        if isless(xmax1, x)
            imax2, xmax2 = imax1, xmax1
            imax1, xmax1 = i, x
        end
    end
    i == imax1 && continue
```

is not correct; nothing should "leak out" from inside `@reduce`. I should document this clearly and ideally throw an error with a helpful message.


Hmm, still getting funky results here for `CUDAEx()`, although your approach works with `ThreadedEx()`!

```julia
using Baselet

function findmax2_floop(xs; ex)
    xtypemin = typemin(eltype(xs))
    @floop ex for (i, x) in zip(eachindex(xs), vec(xs))
        b = tuple(i => x)
        @reduce() do (a = (-1 => xtypemin, -1 => xtypemin); b)
            x, y = Baselet.sort((a..., b...); by = last, rev = true)
            a = (x, y)
        end
    end
    return a
end
```
```julia
julia> findmax2_floop(A, ex=ThreadedEx())
(1 => 1.0f0, 262144 => 0.5f0)

julia> findmax2_floop(dA, ex=CUDAEx())
(1 => 1.0f0, 1 => 1.0f0)
```

This works for me:

```julia
julia> function findmax2_floop(xs; ex)
           init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs))))
           @floop ex for (x, i) in zip(xs, eachindex(xs))
               b = tuple(i => x)
               @reduce() do (a = init; b)
                   x, y = Baselet.sort((a..., b...); by = last, rev = true)
                   a = (x, y)
               end
           end
           return a
       end
findmax2_floop (generic function with 1 method)

julia> xs = CuArray([3, 0, 5, 6, 1])
5-element CuArray{Int64,1}:
 3
 0
 5
 6
 1

julia> findmax2_floop(xs; ex = CUDAEx())
(4 => 6, 3 => 5)

julia> xs = CUDA.rand(1000_0000);

julia> xs[200] = 2
2

julia> xs[300] = 3
3

julia> findmax2_floop(xs; ex = CUDAEx())
(300 => 3.0f0, 200 => 2.0f0)
```

You can also try bare CUDA.jl:

```julia
julia> function merge_max2(a, b)
           x, y = Baselet.sort((a..., b...); by = last, rev = true)
           return (x, y)
       end
merge_max2 (generic function with 1 method)

julia> singleton_max2(x, i) = ((i => x),)
singleton_max2 (generic function with 1 method)

julia> mapreduce(singleton_max2, merge_max2, xs, eachindex(xs); init = ((firstindex(xs)-1 => typemin(eltype(xs))), (firstindex(xs)-1 => typemin(eltype(xs)))))
(4 => 6, 3 => 5)
```

I can confirm that `findmax2_floop` works for your small test array, but it fails for my test input:

```julia
julia> xs_TKF = CuArray([3, 0, 5, 6, 1])
5-element CuArray{Int64,1}:
 3
 0
 5
 6
 1

julia> findmax2_floop(xs_TKF, ex=CUDAEx())
(4 => 6, 3 => 5)
```
```julia
julia> xs_SS = CUDA.zeros(512, 512); xs_SS[1] = 1; xs_SS[end] = 0.5;

julia> findmax2_floop(xs_SS, ex=CUDAEx())
(1 => 1.0f0, 1 => 1.0f0)
```

I also tested it with `CUDA.rand(1000_0000)`. I can’t reproduce your example. Both the FLoops and CUDA versions work for me. Here is a MWE script with Manifest.toml: https://gist.github.com/81f674b14177abaf8f82229f23a10a77. I checked it with Julia 1.5.2.

Try running it in a clean environment and check if you still see the test failure. In a UNIX-y OS, it’d be something like `JULIA_LOAD_PATH=@ JULIA_PROJECT=$PWD julia --startup-file=no mwe.jl` (after instantiation).

Hmm, I’m still seeing the error. Maybe an arcane problem with my CUDA installation? I don’t yet have enough experience with CUDA debugging tools to track down the root cause.

The `Base.mapreduce` solution produces correct results on the GPU, but it’s ~3x slower than on the CPU, and ~2x slower than an extremely stupid brute-force GPU implementation:

```julia
function bruteforcemax2(xs)
    xmax1, imax1 = findmax(xs)
    CUDA.@allowscalar xs[imax1] = typemin(eltype(xs))
    xmax2, imax2 = findmax(xs)
    CUDA.@allowscalar xs[imax1] = xmax1
    (imax1 => xmax1, imax2 => xmax2)
end

function run_bruteforcemax2(xs)
    CUDA.@sync begin
        out = bruteforcemax2(xs)
    end
    out
end
```
```julia
julia> @btime run_bruteforcemax2($dA)
  234.699 μs (333 allocations: 11.11 KiB)
(CartesianIndex(1, 1) => 1.0f0, CartesianIndex(512, 512) => 0.5f0)

julia> @btime run_findmax2_cuda($dA)
  435.999 μs (94 allocations: 2.00 MiB)
(1 => 1.0f0, 262144 => 0.5f0)
```
Test results
```
$ JULIA_LOAD_PATH=@ JULIA_PROJECT=$PWD julia --startup-file=no mwe.jl
Test Summary:    | Pass  Total
(4 => 6, 3 => 5) |    2      2
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\alexa\.julia\packages\GPUArrays\jhRU7\src\host\indexing.jl:43
Test Summary:        | Pass  Total
(300 => 3, 200 => 2) |    2      2
Expression: findmax2_floop(xs) == (1 => 1.0f0, 262144 => 0.5f0)
Evaluated: (1 => 1.0f0, 1 => 1.0f0) == (1 => 1.0f0, 262144 => 0.5f0)
Stacktrace:
 [2] top-level scope at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:1115
Test Summary:                 | Pass  Fail  Total
(1 => 1.0f0, 262144 => 0.5f0) |    1     1      2
ERROR: LoadError: Some tests did not pass: 1 passed, 1 failed, 0 errored, 0 broken.
```

I think it might be better to focus on the performance issue of `mapreduce` on your machine, since what FoldsCUDA does is quite similar to how `mapreduce` is implemented in CUDA.jl. It might be best to ask the CUDA.jl devs for help (cc @maleadt).

FYI, `findmax2_cuda(::CuArray)` is comparable to `findmax(::CuArray)` and much faster than `findmax2_cuda(::Array)` for me:

```julia
julia> xs = CUDA.rand(1000_0000);

julia> xs0 = Array(xs);

julia> @btime findmax2_cuda(xs0);
  222.814 ms (13 allocations: 152.59 MiB)

julia> @btime CUDA.@sync findmax2_cuda(xs);
  147.533 μs (77 allocations: 2.13 KiB)

julia> @btime CUDA.@sync findmax(xs);
  194.248 μs (129 allocations: 3.48 KiB)
```

And with the new findmin/max implementation using a single-pass reduction (JuliaGPU/CUDA.jl PR #576), it should be even faster.


Yep, looks like a 2x speedup with the single-pass reduction!

Can the new `findmin/max` be backported to a Julia v1.5-compatible patch release? The LazyArtifacts.jl dependency prevents `CUDA#master` from being installed on versions lower than v1.6.


I’ve created such a release in JuliaGPU/CUDA.jl PR #623 (1.5 compatibility release); please test.

I’m getting a variety of test errors, but I get similar errors for v2.3.0, which leads me to suspect some flaw in my CUDA installation. Your speedy new `findmin/max` works, though!