Memory usage problem when using findmax/min

You can do this with mapreduce, whose GPU implementation is efficient: it makes a single pass over the data and should allocate almost nothing.

For CPU arrays, the answers above suggesting iterators, zip, or simply writing a loop will be better: with more than one array, the CPU mapreduce has only a fallback implementation, which calls map and then reduce, so it materialises an intermediate array.
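To illustrate that fallback behaviour (a quick sketch, checking observed results rather than the internals), on the CPU a two-array mapreduce gives the same answer as an explicit map followed by reduce:

```julia
# CPU-only sketch: with two arrays, mapreduce(f, op, a, b) behaves like
# reduce(op, map(f, a, b)), i.e. an intermediate array is materialised.
a, b = rand(10), rand(10)
f(x, y) = x + sqrt(y)
v1 = mapreduce(f, max, a, b)
v2 = reduce(max, map(f, a, b))  # explicit map-then-reduce, same answer
v1 == v2
```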

julia> using CUDA, GPUArrays, BenchmarkTools

julia> begin
         test1(a, b) = min(findmax(a.+sqrt.(b))[1], findmax(b.+sqrt.(a))[1])

         max2((x,y), (x2,y2)) = max(x,x2), max(y, y2)
         function test2(as, bs)
           xy = mapreduce(max2, as, bs) do a, b
             a+sqrt(b), sqrt(a)+b
           end
           min(xy...)
         end

       end;  # no real need for begin end

julia> let n = 1000
         as = rand(n)
         bs = rand(n)
         r1 = @btime test1($as, $bs)
         # on CPU, mapreduce(f, op, x, y) is just map then reduce
         r2 = @btime test2($as, $bs)
         r1 ≈ r2
       end
  16.336 μs (2 allocations: 15.88 KiB)
  12.779 μs (1 allocation: 15.75 KiB)   # hence this allocation
true
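
For completeness, the hand-written CPU loop mentioned above might look like this (a sketch, not benchmarked here); it makes one pass over both arrays and allocates nothing:

```julia
# Single-pass loop version, for CPU arrays: track both running maxima at once.
function test_loop(as, bs)
    m1 = m2 = -Inf  # running maxima of a+sqrt(b) and sqrt(a)+b
    for (a, b) in zip(as, bs)
        m1 = max(m1, a + sqrt(b))
        m2 = max(m2, sqrt(a) + b)
    end
    min(m1, m2)
end
```

This should agree with test1 up to floating-point rounding.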

# First attempt on GPU tells you what to fix (it needs to reduce in parallel):
ERROR: GPUArrays.jl needs to know the neutral element for your operator `max2`.
Please pass it as an explicit argument to `GPUArrays.mapreducedim!`,
or register it globally by defining `GPUArrays.neutral_element(::typeof(max2), T)`.

julia> GPUArrays.neutral_element(::typeof(max2), ::Type{Tuple{T,T}}) where T = (T(-Inf), T(-Inf))
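
As a quick sanity check (my addition, not part of the timed runs above), (-Inf, -Inf) really is an identity element for max2, which is what the parallel reduction needs:

```julia
# The neutral element e must satisfy max2(e, t) == t for every tuple t.
max2((x, y), (x2, y2)) = max(x, x2), max(y, y2)
e = (-Inf, -Inf)
t = (0.3, 0.7)
max2(e, t) == t && max2(t, e) == t
```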

julia> let n = 10^8
         as = cu(rand(n))
         bs = cu(rand(n))
         # sometimes you need @sync for an honest minimum time, perhaps not here
         r1 = @btime CUDA.@sync test1($as, $bs)
         r2 = @btime CUDA.@sync test2($as, $bs)
         r1 ≈ r2
       end
  4.875 ms (315 allocations: 15.09 KiB)
  1.234 ms (173 allocations: 8.53 KiB)  # only CPU allocations, irrelevant
true

julia> let n = 10^8
         as = cu(rand(n))
         bs = cu(rand(n))
         r1 = CUDA.@time test1(as, bs)
         # CUDA.@time shows GPU allocations, mapreduce saves almost 1GB
         r2 = CUDA.@time test2(as, bs)
         r1 ≈ r2
       end
  0.077483 seconds (14.10 k CPU allocations: 979.863 KiB) (6 GPU allocations: 762.944 MiB, 7.13% memmgmt time)
  0.001444 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.89% memmgmt time)
true

PS. Because I initially misread the problem, test2 above is perhaps more complicated than necessary. A simpler version, which makes two passes (and does not require you to supply a neutral element), is this:

julia> test3(as, bs) = min(mapreduce((a,b) -> a+sqrt(b), max, as, bs),
                           mapreduce((a,b) -> sqrt(a)+b, max, as, bs));

julia> let n = 10^9  # timings are not from the first run after defining test3, which includes compilation
         as = cu(rand(n))
         bs = cu(rand(n))
         r1 = CUDA.@time test1(as, bs)
         r2 = CUDA.@time test2(as, bs)
         r3 = CUDA.@time test3(as, bs)
         r1 ≈ r2 ≈ r3
       end
  1.651169 seconds (493 CPU allocations: 21.875 KiB, 94.79% gc time) (6 GPU allocations: 7.451 GiB, 97.06% memmgmt time)
  0.010663 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.17% memmgmt time)
  0.021164 seconds (347 CPU allocations: 17.359 KiB) (4 GPU allocations: 1.258 KiB, 0.25% memmgmt time)
true