# CuArray local scope memory issue

Hi, I’ve recently become interested in GPU programming, though not for deep learning. In my case, I have quite large boolean vectors and need to compute their intersection. Since `.==` and `.&&` are very simple operations, I thought `CUDA.jl` might be a faster way. But I’m running into memory trouble.

## Issue

Full example:

``````
x = rand(Bool, 5000000);
y = rand(Bool, 5000000);
@time for k in 1:100
    bit_x = (x .== 1)
    bit_y = (y .== 1)
    bit_z = bit_x .&& bit_y
end

using CUDA
a = CuArray(x);
b = CuArray(y);
@time for k in 1:100
    bit_a = (a .== 1)
    bit_b = (b .== 1)
    bit_c = bit_a .&& bit_b
end
``````

The CPU part:

``````
julia> x = rand(Bool, 5000000);

julia> y = rand(Bool, 5000000);

julia> @time for k in 1:100
           bit_x = (x .== 1)
           bit_y = (y .== 1)
           bit_z = bit_x .&& bit_y
       end
1.442580 seconds (763.85 k allocations: 215.714 MiB, 13.73% gc time, 12.20% compilation time)

julia> @time for k in 1:100
           bit_x = (x .== 1)
           bit_y = (y .== 1)
           bit_z = bit_x .&& bit_y
       end
1.077367 seconds (1.70 k allocations: 180.086 MiB, 1.74% gc time)
``````

I have to run this calculation many times, far more than 100.

The GPU part:

``````
julia> using CUDA

julia> a = CuArray(x);

julia> b = CuArray(y);

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
       end
15.573386 seconds (28.00 M allocations: 1.421 GiB, 4.25% gc time, 35.52% compilation time)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
       end
0.017204 seconds (9.20 k allocations: 503.125 KiB)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
       end
0.020961 seconds (9.20 k allocations: 503.125 KiB)
``````

The first `15.573386 sec` is OK; I understand why Julia is slow on the first run. The second and third runs take about 0.02 s, dramatically faster, which is exactly what I want. But on the fourth run I got a `StackOverflowError`.

Sorry for the Korean in the screenshot: “전용 GPU 메모리 사용량” in the third row means dedicated GPU memory usage (VRAM). The screenshot shows that the memory was freed when I interrupted with `Ctrl+C` in the REPL. But why? Why did the GPU memory usage keep increasing?

## Question

Why does a CuArray operation leak memory in local scope?

Here is my question: I expected `bit_c = bit_a .&& bit_b` in the GPU part to have local scope and therefore to be freed once the for-loop finished executing. In the CPU part, `bit_z = bit_x .&& bit_y` behaves as I expect, and I think that is consistent with Julia's scoping rules.

How can I fix it? And why does it happen? I have already read the documentation below, and `GC.gc(true)` doesn’t help.

https://cuda.juliagpu.org/stable/usage/memory/

## My test environment

• Windows 10
• Julia 1.8.3
• CUDA.jl v3.12.0
• CUDA v"11.7.0"
• GTX 1060 6GB

I tested 3 different environments: the one above, plus (Windows 11, Julia 1.8.3, CUDA ? and GTX 1060 6GB) and (Windows 10, Julia 1.8.4, CUDA v12.0.0 and GTX 1060 3GB). I have access to a better server with 24 GB of VRAM, but I don’t think more memory is a real solution for me. I want to know what the difference is between CPU and GPU arrays in Julia and how I should handle them.

According to the Memory management page of the CUDA.jl docs, `CUDA.unsafe_free!` may be helpful in some cases, and it does keep the loop running, but I still don’t understand the local scope behavior. Is this some kind of computer science thing?

``````
julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
0.002051 seconds (9.50 k allocations: 512.500 KiB)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
0.001994 seconds (9.50 k allocations: 512.500 KiB)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
0.003385 seconds (9.50 k allocations: 512.500 KiB)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
0.001997 seconds (9.50 k allocations: 512.500 KiB)

julia> @time for k in 1:100
           bit_a = (a .== 1)
           bit_b = (b .== 1)
           bit_c = bit_a .&& bit_b
           CUDA.unsafe_free!(bit_a)
           CUDA.unsafe_free!(bit_b)
           CUDA.unsafe_free!(bit_c)
       end
0.002219 seconds (9.50 k allocations: 512.500 KiB)
``````

The Performance Tips section of the Julia manual applies here: using these arrays at the “toplevel” (outside of a function) can cause issues with freeing memory. Try putting these `for` loops inside a function and calling that function with `@time`.
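
A minimal sketch of that suggestion, using the loop from the question. The name `intersect_repeatedly` is hypothetical, and plain `Array`s are used so the snippet runs without a GPU; passing `CuArray`s should go through the same broadcasting code, though that is an assumption I have not timed here.

```julia
# Hypothetical helper: the loop from the question wrapped in a function,
# so the whole loop is compiled once and its temporaries are function locals.
function intersect_repeatedly(a, b, n)
    count_true = 0
    for k in 1:n
        bit_a = (a .== 1)
        bit_b = (b .== 1)
        bit_c = bit_a .&& bit_b
        count_true = count(bit_c)   # use the result so the work is not skipped
    end
    return count_true
end

x = rand(Bool, 5_000_000)
y = rand(Bool, 5_000_000)

intersect_repeatedly(x, y, 1)          # warm-up call, triggers compilation
@time intersect_repeatedly(x, y, 100)  # times only the compiled function
```

Note that this version still allocates three temporaries per iteration; it only changes where the loop lives, not how much memory each pass requests.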

As @jpsamaroo said, you should put the code inside a function. You can also preallocate the output and do the operation in-place with `.=` broadcasting. This function should work for both GPU arrays and normal arrays:

``````
function f!(c, a, b)
    c .= (a .== true) .&& (b .== true)
end
``````

This can be benchmarked:

``````
using CUDA
using BenchmarkTools
n = 1024
a = CUDA.rand(Bool, n)
b = CUDA.rand(Bool, n)
c = similar(a)
# CUDA.@sync waits for the GPU to finish, so the kernel is timed, not just its launch
@btime CUDA.@sync f!($c, $a, $b)
``````

This shouldn’t have memory problems or require the GC.
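
To make the reasoning concrete, here is a rough CPU-only comparison of the two styles using `@allocated`. `f_alloc` is a hypothetical name for the out-of-place version from the question. The byte counts apply to ordinary `Array`s; on a GPU the same temporaries would be device buffers instead, which is an assumption this snippet does not measure.

```julia
# Out-of-place: every call builds three fresh arrays (two comparisons
# plus the result), like the loop body in the question.
function f_alloc(a, b)
    bit_a = (a .== true)
    bit_b = (b .== true)
    return bit_a .&& bit_b
end

# In-place: one fused broadcast that writes into the preallocated `c`.
function f!(c, a, b)
    c .= (a .== true) .&& (b .== true)
    return c
end

n = 1_000_000
a = rand(Bool, n)
b = rand(Bool, n)
c = similar(a)

# Warm up so compilation is not counted.
f_alloc(a, b)
f!(c, a, b)

bytes_alloc   = @allocated f_alloc(a, b)
bytes_inplace = @allocated f!(c, a, b)
println("out-of-place: ", bytes_alloc, " bytes; in-place: ", bytes_inplace, " bytes")
```

Because the dotted operations fuse into a single loop over `c`, the in-place call has no intermediate arrays to free, so there is nothing for the GC to fall behind on.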

Also, as a PS: since the inputs are already `Bool` arrays, could you simplify to the following?

``````
function f!(c, a, b)
    c .= a .&& b
end
``````
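
A quick check of that simplification on CPU arrays, assuming the inputs really are `Bool` arrays as in the question. For a `Bool` element, `x == true` has the same value as `x`, so the comparison is redundant; elementwise `.&` is shown as well since it gives the same values for `Bool` inputs.

```julia
a = rand(Bool, 1024)
b = rand(Bool, 1024)

long_form  = (a .== true) .&& (b .== true)
short_form = a .&& b
bitwise    = a .& b   # elementwise AND; same values for Bool inputs

println(long_form == short_form)
println(short_form == bitwise)
```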

Wow, `.=` broadcasting was the key. If I replace `c .= (a.==true) .&& (b.==true)` with `c = (a.==true) .&& (b.==true)`, the code causes the same issue again. (The “toplevel” point and wrapping the code in a function didn’t make much difference when I tested.) Thank you for your detailed and kind answer. Thank you!!
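
The difference between the two spellings can be seen on small CPU arrays. This is only an illustrative sketch; `original` is a hypothetical extra reference kept so the buffer's identity can be checked.

```julia
a = rand(Bool, 8)
b = rand(Bool, 8)
c = similar(a)
original = c                 # second reference to the preallocated buffer

c .= a .&& b                 # in-place: writes into the existing array
println(c === original)      # still the same object

c = a .&& b                  # rebinding: `c` now names a freshly allocated array
println(c === original)      # a different object; the old buffer is left to the GC
```

With `=`, every loop iteration requests a new array. As far as I understand the CUDA.jl memory documentation, GPU buffers are released through the Julia GC, so a fast loop can request device memory sooner than it is returned.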
