You can do this with `mapreduce`, which has an efficient GPU implementation that does this in one pass and should allocate almost nothing. For CPU arrays, the above answers suggesting iteration with `zip`, or just writing a loop, will be better: the CPU `mapreduce` with more than one array only has a fallback implementation, which calls `map` and then `reduce`.
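For the CPU path, the hand-written loop mentioned above might look something like this (a sketch assuming floating-point input; `test_loop` is just a name for illustration):

julia> function test_loop(as, bs)
           m1 = m2 = -Inf  # safe starting values for a running maximum
           for (a, b) in zip(as, bs)
               m1 = max(m1, a + sqrt(b))
               m2 = max(m2, sqrt(a) + b)
           end
           min(m1, m2)
       end;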
julia> using CUDA, GPUArrays, BenchmarkTools

julia> begin
           test1(a, b) = min(findmax(a .+ sqrt.(b))[1], findmax(b .+ sqrt.(a))[1])
           max2((x, y), (x2, y2)) = max(x, x2), max(y, y2)
           function test2(as, bs)
               xy = mapreduce(max2, as, bs) do a, b
                   a + sqrt(b), sqrt(a) + b
               end
               min(xy...)
           end
       end;  # no real need for begin end
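The do block here is the mapping function, i.e. the first argument of `mapreduce(f, op, xs...)`, so `test2` could equivalently be written without it (`test2b` is just a name for this sketch):

julia> test2b(as, bs) = min(mapreduce((a, b) -> (a + sqrt(b), sqrt(a) + b), max2, as, bs)...);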
julia> let n = 1000
           as = rand(n)
           bs = rand(n)
           r1 = @btime test1($as, $bs)
           # on CPU, mapreduce(f, op, x, y) is just map then reduce
           r2 = @btime test2($as, $bs)
           r1 ≈ r2
       end
16.336 μs (2 allocations: 15.88 KiB)
12.779 μs (1 allocation: 15.75 KiB)  # hence this allocation
true
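That single allocation in `test2` is the intermediate tuple array from the `map` step, roughly:

julia> Base.format_bytes(1000 * sizeof(Tuple{Float64,Float64}))  # n tuples of two Float64s
"15.625 KiB"

(The reported 15.75 KiB presumably includes a little per-array overhead.)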
# First attempt on GPU tells you what to fix (it needs to reduce in parallel):
ERROR: GPUArrays.jl needs to know the neutral element for your operator `max2`.
Please pass it as an explicit argument to `GPUArrays.mapreducedim!`,
or register it globally by defining `GPUArrays.neutral_element(::typeof(max2), T)`.
julia> GPUArrays.neutral_element(::typeof(max2), ::Type{Tuple{T,T}}) where T = (T(-Inf), T(-Inf))
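A neutral element must satisfy `max2(neutral, xy) == xy` for any `xy`, which is easy to sanity-check:

julia> max2((-Inf32, -Inf32), (0.5f0, 0.7f0))
(0.5f0, 0.7f0)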
julia> let n = 10^8
           as = cu(rand(n))
           bs = cu(rand(n))
           # sometimes you need @sync for an honest minimum time, perhaps not here
           r1 = @btime CUDA.@sync test1($as, $bs)
           r2 = @btime CUDA.@sync test2($as, $bs)
           r1 ≈ r2
       end
4.875 ms (315 allocations: 15.09 KiB)
1.234 ms (173 allocations: 8.53 KiB) # only CPU allocations, irrelevant
true
julia> let n = 10^8
           as = cu(rand(n))
           bs = cu(rand(n))
           r1 = CUDA.@time test1(as, bs)
           # CUDA.@time shows GPU allocations, mapreduce saves almost 1GB
           r2 = CUDA.@time test2(as, bs)
           r1 ≈ r2
       end
0.077483 seconds (14.10 k CPU allocations: 979.863 KiB) (6 GPU allocations: 762.944 MiB, 7.13% memmgmt time)
0.001444 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.89% memmgmt time)
true
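Those 762.944 MiB are no mystery: `cu(rand(n))` converts to `Float32`, and `test1` materializes two fused-broadcast temporaries of 10^8 elements each (the other four GPU allocations are presumably the small reduction outputs):

julia> 2 * 10^8 * sizeof(Float32) / 2^20  # two n-element Float32 temporaries, in MiB
762.939453125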
P.S. Because I initially mis-read the problem, the above `test2` is perhaps more complicated than necessary. A simpler one which makes two passes (and does not require you to supply a neutral element) is this:
julia> test3(as, bs) = min(mapreduce((a, b) -> a + sqrt(b), max, as, bs),
                           mapreduce((a, b) -> sqrt(a) + b, max, as, bs));

julia> let n = 10^9  # not the first run after defining test3, so compile time is excluded
           as = cu(rand(n))
           bs = cu(rand(n))
           r1 = CUDA.@time test1(as, bs)
           r2 = CUDA.@time test2(as, bs)
           r3 = CUDA.@time test3(as, bs)
           r1 ≈ r2 ≈ r3
       end
1.651169 seconds (493 CPU allocations: 21.875 KiB, 94.79% gc time) (6 GPU allocations: 7.451 GiB, 97.06% memmgmt time)
0.010663 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.17% memmgmt time)
0.021164 seconds (347 CPU allocations: 17.359 KiB) (4 GPU allocations: 1.258 KiB, 0.25% memmgmt time)
true