You can do this with mapreduce, which has an efficient GPU implementation: it does the whole computation in one pass and should allocate almost nothing.
For CPU arrays, the above answers suggesting iterators, zip, or just writing a loop will be better. The CPU method of mapreduce with more than one array is only a fallback implementation, which calls map and then reduce.
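For reference, a plain-loop CPU version of the same reduction might look like this (my sketch, not from the thread; the name test_loop is made up, and it assumes Float64 input so that -Inf is a suitable starting value):

julia> function test_loop(as, bs)
           m1 = m2 = -Inf  # running maxima of a+sqrt(b) and sqrt(a)+b
           for (a, b) in zip(as, bs)
               m1 = max(m1, a + sqrt(b))
               m2 = max(m2, sqrt(a) + b)
           end
           min(m1, m2)  # same answer as test1/test2 below, allocation-free
       end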
julia> using CUDA, GPUArrays, BenchmarkTools
julia> begin
           test1(a, b) = min(findmax(a .+ sqrt.(b))[1], findmax(b .+ sqrt.(a))[1])
           max2((x, y), (x2, y2)) = max(x, x2), max(y, y2)
           function test2(as, bs)
               xy = mapreduce(max2, as, bs) do a, b
                   a + sqrt(b), sqrt(a) + b
               end
               min(xy...)
           end
       end;  # no real need for begin end
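To see what max2 does: it combines two tuples elementwise, so a single reduction can track both running maxima at once. For example:

julia> max2((1, 5), (3, 2))
(3, 5)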
julia> let n = 1000
           as = rand(n)
           bs = rand(n)
           r1 = @btime test1($as, $bs)
           # on the CPU, mapreduce(f, op, x, y) is just map then reduce
           r2 = @btime test2($as, $bs)
           r1 ≈ r2
       end
16.336 μs (2 allocations: 15.88 KiB)
12.779 μs (1 allocation: 15.75 KiB) # hence this allocation
true
# The first attempt on the GPU fails, but the error tells you what to fix
# (to reduce in parallel, it needs a neutral element for the operator):
ERROR: GPUArrays.jl needs to know the neutral element for your operator `max2`.
Please pass it as an explicit argument to `GPUArrays.mapreducedim!`,
or register it globally by defining `GPUArrays.neutral_element(::typeof(max2), T)`.
julia> GPUArrays.neutral_element(::typeof(max2), ::Type{Tuple{T,T}}) where T = (T(-Inf), T(-Inf))
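To check that this really is a neutral element, i.e. that max2(x, x0) == x for any pair x (which holds because max(x, -Inf) == x):

julia> max2((0.5f0, 0.25f0), (-Inf32, -Inf32))
(0.5f0, 0.25f0)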
julia> let n = 10^8
           as = cu(rand(n))
           bs = cu(rand(n))
           # sometimes you need @sync for an honest minimum time; perhaps not here,
           # as returning a scalar result to the CPU should already force a sync
           r1 = @btime CUDA.@sync test1($as, $bs)
           r2 = @btime CUDA.@sync test2($as, $bs)
           r1 ≈ r2
       end
4.875 ms (315 allocations: 15.09 KiB)
1.234 ms (173 allocations: 8.53 KiB) # only CPU allocations, irrelevant
true
julia> let n = 10^8
           as = cu(rand(n))
           bs = cu(rand(n))
           r1 = CUDA.@time test1(as, bs)
           # CUDA.@time shows GPU allocations too: mapreduce saves almost 1GB
           r2 = CUDA.@time test2(as, bs)
           r1 ≈ r2
       end
0.077483 seconds (14.10 k CPU allocations: 979.863 KiB) (6 GPU allocations: 762.944 MiB, 7.13% memmgmt time)
0.001444 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.89% memmgmt time)
true
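(The 762.944 MiB is almost exactly the two broadcast intermediates a .+ sqrt.(b) and b .+ sqrt.(a): 2 arrays × 10^8 Float32 values × 4 bytes ≈ 762.94 MiB. Remember that cu converts Float64 to Float32 by default.)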
PS. Because I initially misread the problem, the above test2 is perhaps more complicated than necessary. A simpler version, which makes two passes over the data (and does not require you to supply a neutral element), is this:
julia> test3(as, bs) = min(mapreduce((a, b) -> a + sqrt(b), max, as, bs),
                           mapreduce((a, b) -> sqrt(a) + b, max, as, bs));
julia> let n = 10^9  # timings are from a second run, to exclude compilation of test3
           as = cu(rand(n))
           bs = cu(rand(n))
           r1 = CUDA.@time test1(as, bs)
           r2 = CUDA.@time test2(as, bs)
           r3 = CUDA.@time test3(as, bs)
           r1 ≈ r2 ≈ r3
       end
1.651169 seconds (493 CPU allocations: 21.875 KiB, 94.79% gc time) (6 GPU allocations: 7.451 GiB, 97.06% memmgmt time)
0.010663 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.17% memmgmt time)
0.021164 seconds (347 CPU allocations: 17.359 KiB) (4 GPU allocations: 1.258 KiB, 0.25% memmgmt time)
true
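As you might expect for a memory-bound reduction, the two-pass test3 takes roughly twice as long as the one-pass test2, and both beat test1 by avoiding the giant intermediate arrays.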