CuArray local scope memory issue

Hi, I've recently become interested in GPU programming, though not for deep learning. In my case, I have quite large boolean vectors and need to compute their intersection. Since .== and .&& are very simple operations, I thought CUDA.jl might be a faster way to do this. But I've run into memory trouble.

Issue

Full example:

x = rand(Bool, 5000000);
y = rand(Bool, 5000000);
@time for k in 1:100
    bit_x = (x .== 1)
    bit_y = (y .== 1)
    bit_z = bit_x .&& bit_y
end

using CUDA
a = CuArray(x);
b = CuArray(y);
@time for k in 1:100
    bit_a = (a .== 1)
    bit_b = (b .== 1)
    bit_c = bit_a .&& bit_b
end

The CPU part:

julia> x = rand(Bool, 5000000);

julia> y = rand(Bool, 5000000);

julia> @time for k in 1:100       
           bit_x = (x .== 1)      
           bit_y = (y .== 1)      
           bit_z = bit_x .&& bit_y
       end
  1.442580 seconds (763.85 k allocations: 215.714 MiB, 13.73% gc time, 12.20% compilation time)

julia> @time for k in 1:100       
           bit_x = (x .== 1)      
           bit_y = (y .== 1)      
           bit_z = bit_x .&& bit_y
       end
  1.077367 seconds (1.70 k allocations: 180.086 MiB, 1.74% gc time)

I have to run the above calculation many times, far more than 100 iterations.

The GPU part:

julia> using CUDA

julia> a = CuArray(x);

julia> b = CuArray(y);

julia> @time for k in 1:100       
           bit_a = (a .== 1)      
           bit_b = (b .== 1)      
           bit_c = bit_a .&& bit_b
       end
 15.573386 seconds (28.00 M allocations: 1.421 GiB, 4.25% gc time, 35.52% compilation time)

julia> @time for k in 1:100       
           bit_a = (a .== 1)      
           bit_b = (b .== 1)      
           bit_c = bit_a .&& bit_b
       end
  0.017204 seconds (9.20 k allocations: 503.125 KiB)

julia> @time for k in 1:100       
           bit_a = (a .== 1)      
           bit_b = (b .== 1)      
           bit_c = bit_a .&& bit_b
       end
  0.020961 seconds (9.20 k allocations: 503.125 KiB)

The first run's 15.573386 seconds is fine; I understand why Julia is slow the first time, since it includes compilation. The second and third runs are roughly 60x faster than the CPU loop, which is exactly what I wanted. It's cool. But on the fourth run, I got a StackOverflowError:

[Screen recording (Honeycam 2023-01-03 01-19-49): a GPU memory monitor with dedicated GPU memory usage rising during the runs, then dropping after Ctrl+C]

Sorry for the Korean: "전용 GPU 메모리 사용량" in the third row means dedicated GPU memory usage, i.e. VRAM. The recording shows the memory being freed when I interrupt with Ctrl+C in the REPL. But why? Why did the GPU memory usage keep increasing in the first place?

Question

Why does a CuArray operation leak memory in local scope?

Here is my question: I expected bit_c = bit_a .&& bit_b in the GPU part to be local to the loop body, and therefore to be freed once the for-loop finished executing. In the CPU part, bit_z = bit_x .&& bit_y behaves exactly as I expect, which I believe matches Julia's scoping rules.

How can I fix it, and can I understand why it happens? I have already read the documentation below, and GC.gc(true) doesn't help.
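For reference, these are the manual-release calls that the memory-management docs describe (a sketch; GC.gc(true) is the one I tried):

# Manual cleanup per the CUDA.jl memory-management docs (sketch):
# a full Julia GC pass, then asking CUDA.jl to return cached device
# memory to the driver. GC.gc(true) did not help in my case.
GC.gc(true)
CUDA.reclaim()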

My test environment

  • Windows 10
  • Julia 1.8.3
  • CUDA.jl v3.12.0
  • CUDA v"11.7.0"
  • GTX 1060 6GB

Counting the one above, I tested three different environments; the other two were (Windows 11, Julia 1.8.3, CUDA version unknown, GTX 1060 6GB) and (Windows 10, Julia 1.8.4, CUDA v12.0.0, GTX 1060 3GB). I also have access to a better server with 24GB of VRAM, but I don't think that's a real solution. I want to understand how CPU and GPU memory differ in Julia and how to handle them.

Thank you for reading.

According to Memory management · CUDA.jl, CUDA.unsafe_free! may be helpful in some cases (it does keep the runs below stable), but the local-scope behavior is still not understandable to me. Is this some kind of computer-science thing?

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
  0.002051 seconds (9.50 k allocations: 512.500 KiB)

(Four further identical runs: 0.001994, 0.003385, 0.001997, and 0.002219 seconds, each with 9.50 k allocations: 512.500 KiB.)

Performance Tips · The Julia Language applies here: using these arrays at the "toplevel" (outside of a function) can cause issues with freeing of memory. Try putting these for loops inside a function and calling that function with @time, as in the sketch below.
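A minimal sketch of that, reusing the loop from the original post unchanged (run_intersections is a hypothetical name):

# Wrapping the loop in a function makes the temporaries plainly local,
# so the GC can free the GPU buffers they reference.
function run_intersections(a, b)
    for k in 1:100
        bit_a = (a .== 1)
        bit_b = (b .== 1)
        bit_c = bit_a .&& bit_b
    end
end

@time run_intersections(a, b)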


As @jpsamaroo said, you should put the code inside a function. You can go further by preallocating the output and doing the operation in place with .= broadcasting, so each iteration writes into the same buffer instead of allocating a new array. This function should work for both GPU arrays and normal arrays:

function f!(c, a, b)
    c .= (a.==true) .&& (b.==true)
end

This can be benchmarked:

using CUDA
using BenchmarkTools
n = 1024
a = CUDA.rand(Bool, n)
b = CUDA.rand(Bool, n)
c = similar(a)
@btime CUDA.@sync f!($c, $a, $b)

This shouldn’t have memory problems or require the GC.
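Connecting that back to the original problem size, the same pattern would look roughly like this (a sketch, with the sizes from the original post):

using CUDA
a = CuArray(rand(Bool, 5_000_000))
b = CuArray(rand(Bool, 5_000_000))
c = similar(a)        # one preallocated output buffer, reused every iteration
@time for k in 1:100
    f!(c, a, b)       # in-place: the loop allocates no new GPU arrays
end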

Also, as a PS, since the arrays are already Bool, you could simplify to the following:

function f!(c, a, b)
    c .= a .&& b
end

Wow, .= broadcasting was the key. If I change c .= (a.==true) .&& (b.==true) back to c = (a.==true) .&& (b.==true), the code causes the same issue. (The 'toplevel' point and wrapping in a function made little difference when I tested.) Thank you for your detailed and kind answer. Thank you!!
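In other words, the difference between the two assignments (a sketch, using the a, b, c from above):

# in-place: writes the result into the existing buffer c, no new GPU allocation
c .= (a .== true) .&& (b .== true)

# rebinding: allocates a fresh GPU array every iteration and leaves the old
# one for the GC, which is what made VRAM climb in my loop
c = (a .== true) .&& (b .== true)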
