Memory usage problem when using findmax/min

Hello,

In my case I need to repeatedly get the max/min values from an array, and my code looks like this:

using BenchmarkTools
using CUDA

n = 1000000
a = CUDA.rand(n)
b = CUDA.rand(n)

function test(a, b)
    rst = Float32(0)
    for i in 1:100000
        rst = min(findmax(a.+sqrt.(b))[1], findmax(b.+sqrt.(a))[1])
    end
    return rst
end

@benchmark test($a, $b)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 16.959 s (2.05% GC) to evaluate,
 with a memory estimate of 1.23 GiB, over 25003359 allocations.

I found that even when running the test function only once, GPU memory usage reaches about 98% (of 24 GB total). I’m not sure whether such a large memory overhead is normal.

I think the problem is not GPU specific.

This will allocate new arrays in every iteration of the loop. Just replace the findmax calls with a loop over the elements of a and b (see the sketch below). That will not allocate unwanted storage and will be as fast or even faster.

Also: At least in this MWE the outer loop seems to be useless since your result won’t change.
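Something like this, assuming plain CPU arrays (a minimal sketch; the name test_loop is just illustrative):

function test_loop(a, b)
    m1 = typemin(eltype(a))   # running maximum of a[i] + sqrt(b[i])
    m2 = typemin(eltype(a))   # running maximum of b[i] + sqrt(a[i])
    for i in eachindex(a, b)
        m1 = max(m1, a[i] + sqrt(b[i]))
        m2 = max(m2, b[i] + sqrt(a[i]))
    end
    return min(m1, m2)
end

(A scalar loop like this is only for ordinary Arrays; on a CuArray it would trigger the scalar-indexing warning shown later in the thread.)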

2 Likes

I think you want:

maximum(x -> x[1] + sqrt(x[2]), zip(a,b))

there. (Although since you are computing another maximum and then the minimum of those, a single loop that computes that directly will probably be faster than two maximum calls.)

1 Like

These CuArray collections don’t like to be zipped:

julia> maximum(x->x[1]+sqrt(x[2]), zip(a,b))
┌ Warning: Performing scalar indexing on task Task (runnable) @0x000000000a650010.
│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArraysCore C:\Users\unime\.julia\packages\GPUArraysCore\lojQM\src\GPUArraysCore.jl:90

That said, most of the OP’s trouble comes from writing a long loop and not even using its result.

julia> @time test(a, b)
 30.462494 seconds (26.44 M allocations: 1.284 GiB, 3.10% gc time, 0.48% compilation time)
1.9989039f0

julia> function test(a, b)
           local f(x,y) = x+√y
           min(findmax(f.(a,b))[1], findmax(f.(b,a))[1])
       end
test (generic function with 1 method)

julia> @btime test($a, $b)
  258.100 μs (240 allocations: 12.50 KiB)
1.9989039f0

Using map works too:

julia> function test2(a, b)
           local f(x,y) = x+√y
           minimum((maximum(map(f, a, b)), maximum(map(f, b, a))))
       end
test2 (generic function with 1 method)

julia> @btime test2($a, $b)
  249.800 μs (242 allocations: 12.16 KiB)
1.9989039f0

Spitballing

Is there a reason we don’t have eachmap and eachfilter for producing generators?

Also, is there a reason findmin and findmax don’t return NamedTuples? I always forget which one is [1] and which is [2]; it could be nice if I could access .index and .value instead.
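For what it’s worth, a small wrapper gives you that today (just a sketch; findmax_nt is a made-up name):

function findmax_nt(itr)
    value, index = findmax(itr)
    return (value = value, index = index)
end

findmax_nt([3, 1, 4]).value   # 4
findmax_nt([3, 1, 4]).index   # 3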

1 Like

This is probably only tangentially related, but using findmax (instead of maximum) runs into the problem discussed in https://discourse.julialang.org/t/findmin-max-and-generators/88473/1,
which was fixed/hacked there using:

Base.pairs(itr::Base.Iterators.Zip) = 
  Iterators.map(((i,v),) -> i=>v, enumerate(itr))

and then:

julia> a = rand(1_000); b = rand(1_000);

julia> @btime first(findmax(((x,y),)-> x+√y,zip($a,$b)))
  2.795 μs (0 allocations: 0 bytes)
1.9773820790088965

works.

PS: if CUDA vectors don’t like zip, maybe they can map; something like

first(findmax(Iterators.map((x, y) -> x + √y, a, b)))

might work.

You should not write dummy loops like this for benchmarking, for two reasons:

  • BenchmarkTools.jl already does it for you, and does it correctly.
  • A dummy loop can sometimes be changed or removed by the compiler if it can prove it will not affect the output value, and then your benchmarking results can be off by a factor of 100000 in your case.
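In other words, benchmark the expression itself and let BenchmarkTools handle the repetition, e.g. (a sketch using the Float32 arrays from the original post; CUDA.@sync waits for the GPU so the timing is honest):

using BenchmarkTools, CUDA

a = CUDA.rand(1_000_000)
b = CUDA.rand(1_000_000)

@btime CUDA.@sync min(findmax($a .+ sqrt.($b))[1], findmax($b .+ sqrt.($a))[1])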
1 Like

You can do this with mapreduce, which has an efficient GPU implementation that can do this in one pass and should allocate almost nothing.

For CPU arrays, the above answers suggesting iteration and zip, or just writing a loop, will be better. The CPU mapreduce with more than one array only has a fallback implementation, which calls map and then reduce.

julia> using CUDA, GPUArrays, BenchmarkTools

julia> begin
         test1(a, b) =  min(findmax(a.+sqrt.(b))[1], findmax(b.+sqrt.(a))[1])

         max2((x,y), (x2,y2)) = max(x,x2), max(y, y2)
         function test2(as, bs)
           xy = mapreduce(max2, as, bs) do a, b
             a+sqrt(b), sqrt(a)+b
           end
           min(xy...)
         end

       end;  # no real need for begin end

julia> let n = 1000
         as = rand(n)
         bs = rand(n)
         r1 = @btime test1($as, $bs)
         # on CPU, mapreduce(f, op, x, y) is just map then reduce
         r2 = @btime test2($as, $bs)
         r1 ≈ r2
       end
  16.336 μs (2 allocations: 15.88 KiB)
  12.779 μs (1 allocation: 15.75 KiB)   # hence this allocation
true

# First attempt on GPU tells you what to fix (it needs to reduce in parallel):
ERROR: GPUArrays.jl needs to know the neutral element for your operator `max2`.
Please pass it as an explicit argument to `GPUArrays.mapreducedim!`,
or register it globally by defining `GPUArrays.neutral_element(::typeof(max2), T)`.

julia> GPUArrays.neutral_element(::typeof(max2), ::Type{Tuple{T,T}}) where T = (T(-Inf), T(-Inf))

julia> let n = 10^8
         as = cu(rand(n))
         bs = cu(rand(n))
         # sometimes you need @sync for an honest minimum time, perhaps not here
         r1 = @btime CUDA.@sync test1($as, $bs)
         r2 = @btime CUDA.@sync test2($as, $bs)
         r1 ≈ r2
       end
  4.875 ms (315 allocations: 15.09 KiB)
  1.234 ms (173 allocations: 8.53 KiB)  # only CPU allocations, irrelevant
true

julia> let n = 10^8
         as = cu(rand(n))
         bs = cu(rand(n))
         r1 = CUDA.@time test1(as, bs)
         # CUDA.@time shows GPU allocations, mapreduce saves almost 1GB
         r2 = CUDA.@time test2(as, bs)
         r1 ≈ r2
       end
  0.077483 seconds (14.10 k CPU allocations: 979.863 KiB) (6 GPU allocations: 762.944 MiB, 7.13% memmgmt time)
  0.001444 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.89% memmgmt time)
true

PS: because I initially misread the problem, the above test2 is perhaps more complicated than necessary. A simpler version, which makes two passes (and does not require you to supply a neutral element), is this:

julia> test3(as, bs) = min(mapreduce((a,b) -> a+sqrt(b), max, as, bs),
                           mapreduce((a,b) -> sqrt(a)+b, max, as, bs));

julia> let n = 10^9  # not timed on the first run after defining test3
         as = cu(rand(n))
         bs = cu(rand(n))
         r1 = CUDA.@time test1(as, bs)
         r2 = CUDA.@time test2(as, bs)
         r3 = CUDA.@time test3(as, bs)
         r1 ≈ r2 ≈ r3
       end
  1.651169 seconds (493 CPU allocations: 21.875 KiB, 94.79% gc time) (6 GPU allocations: 7.451 GiB, 97.06% memmgmt time)
  0.010663 seconds (176 CPU allocations: 8.625 KiB) (2 GPU allocations: 1.258 KiB, 0.17% memmgmt time)
  0.021164 seconds (347 CPU allocations: 17.359 KiB) (4 GPU allocations: 1.258 KiB, 0.25% memmgmt time)
true
2 Likes

Thank you all! :grinning:

New example

The old MWE may look a bit strange, so I modified it to a more reasonable one:

  1. Following @fatteneder 's hint, I used @views and the arrays tmp_a, tmp_b to avoid extra memory allocation.
  2. Following @mcabbott 's suggestion, I added the mapreduce implementation.
using BenchmarkTools
using CUDA

@inbounds @views function test1(a, b, tmp_a, tmp_b)
    rst = Float64(0)
    for i in 1:100000
        a    .+= rst
        b    .+= rst
        tmp_a .= (a./b).*sqrt.(b)
        tmp_b .= (b./a).*sqrt.(a)
        rst    = 0.01*min(maximum(tmp_a), maximum(tmp_b))
    end
    return rst
end

@inbounds @views function test2(a, b, tmp_a, tmp_b)
    rst = Float64(0)
    for i in 1:100000
        a    .+= rst
        b    .+= rst
        tmp_a .= (a./b).*sqrt.(b)
        tmp_b .= (b./a).*sqrt.(a)
        rst    = 0.01*min(findmax(tmp_a)[1], findmax(tmp_b)[1])
    end
    return rst
end

@inbounds @views function test3(a, b)
    rst = Float64(0)
    for i in 1:100000
        a .+= rst
        b .+= rst
        rst = 0.01*min(mapreduce((a, b) -> (a/b)*sqrt(b), max, a, b),
                       mapreduce((a, b) -> (b/a)*sqrt(a), max, a, b))
    end
    return rst
end

n     = 1000000
ah1   = rand(n)
bh1   = rand(n)
tmp   = zeros(n)

a_1   = CuArray(ah1)
b_1   = CuArray(bh1)
tmpa1 = CuArray(tmp)
tmpb1 = CuArray(tmp)

a_2   = CuArray(ah1)
b_2   = CuArray(bh1)
tmpa2 = CuArray(tmp)
tmpb2 = CuArray(tmp)

a_3   = CuArray(ah1)
b_3   = CuArray(bh1)

rst1 = test1(a_1, b_1, tmpa1, tmpb1)
rst2 = test2(a_2, b_2, tmpa2, tmpb2)
rst3 = test3(a_3, b_3)

rst1 ≈ rst2 ≈ rst3 # true

Benchmarks

@benchmark test1($a_1, $b_1, $tmpa1, $tmpb1)
#=
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 35.882 s (0.99% GC) to evaluate,
 with a memory estimate of 1.78 GiB, over 32511624 allocations.

 Memory used ≈ 2%
=#

@benchmark test2($a_2, $b_2, $tmpa2, $tmpb2)
#=
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 36.149 s (1.01% GC) to evaluate,
 with a memory estimate of 1.82 GiB, over 32695654 allocations.

 Memory used ≈ 2%
=#

@benchmark test3($a_3, $b_3)
#=
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 30.385 s (0.93% GC) to evaluate,
 with a memory estimate of 1.13 GiB, over 23585991 allocations.

 Memory used ≈ 2%
=#

Conclusions

Name    Time (s)   Memory used (GPU, 24 GB total)
test1   35.882     ≈ 2%
test2   36.149     ≈ 2%
test3   30.385     ≈ 2%
  1. The GPU memory usage is currently stable at 2% for all three methods when benchmarking.
  2. According to the results, the mapreduce (test3) solution is the fastest.

Unlike the first example, this makes Float64 GPU arrays. Depending on your GPU, this can be a medium-sized or an enormous slow-down compared to Float32. If you do need this precision, you should check whether it’s faster on the CPU (I believe my laptop can do this faster than the times shown).
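If Float32 precision is enough, the change is small; for example (cu converts Float64 host arrays to Float32 by default):

a_1 = CUDA.rand(Float32, n)   # generate Float32 directly on the GPU
a_1 = cu(ah1)                 # or convert an existing Float64 host array; cu defaults to Float32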

@inbounds @views is doing nothing.

a .+= rst does nothing except waste time. Why?

The version with two mapreduce operations takes about twice as long as the version with one.

I don’t think you actually know this: @benchmark does not measure GPU allocations. (It’s also completely unnecessary here; its cleverness is all about timing short-running things, not something that takes 30 s.)

With temporary arrays there shouldn’t be large GPU allocations. Note that you don’t need two of them, you can re-use the same one. But I don’t know if that’s quicker.
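For example, a variant of test1 with a single temporary might look like this (just a sketch of the idea, not benchmarked):

function test1b(a, b, tmp)
    rst = 0.0
    for i in 1:100000
        a   .+= rst
        b   .+= rst
        tmp  .= (a ./ b) .* sqrt.(b)
        m1    = maximum(tmp)
        tmp  .= (b ./ a) .* sqrt.(a)   # re-use the same buffer
        rst   = 0.01 * min(m1, maximum(tmp))
    end
    return rst
end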

Hello!

My code just looks like this; it is not the actual case, so we don’t have to care about the purpose of the code 😂.

In my case, getting the maximum value from a CuArray is just the final step of each iteration, and I think transferring the array from GPU to CPU and then taking the maximum would waste more time. Thus, I want to find the fastest way to get the maximum value from a CuArray.

And yes, @benchmark doesn’t measure GPU allocations; the “≈ 2%” figure comes from another tool. As far as I know, extra memory allocations on the CPU decrease performance, which is why I also keep an eye on memory allocations on the GPU.
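For example, this is the kind of comparison I have in mind (a quick sketch; the exact numbers will depend on the GPU):

using CUDA, BenchmarkTools

a = CUDA.rand(1_000_000)

@btime CUDA.@sync maximum($a)   # reduction stays on the GPU
@btime maximum(Array($a))       # copies to the CPU first, then reduces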