Hi, I've recently become interested in GPU programming, though not for deep learning. In my case, I have quite large boolean vectors and need to compute their intersection. Since .== and .&& are very simple operations, I thought CUDA.jl could be a faster way to do this. But I'm running into memory trouble.
Issue
Full example:
x = rand(Bool, 5000000);
y = rand(Bool, 5000000);
@time for k in 1:100
    bit_x = (x .== 1)
    bit_y = (y .== 1)
    bit_z = bit_x .&& bit_y
end
using CUDA
a = CuArray(x);
b = CuArray(y);
@time for k in 1:100
    bit_a = (a .== 1)
    bit_b = (b .== 1)
    bit_c = bit_a .&& bit_b
end
The CPU part:
julia> x = rand(Bool, 5000000);
julia> y = rand(Bool, 5000000);
julia> @time for k in 1:100
bit_x = (x .== 1)
bit_y = (y .== 1)
bit_z = bit_x .&& bit_y
end
1.442580 seconds (763.85 k allocations: 215.714 MiB, 13.73% gc time, 12.20% compilation time)
julia> @time for k in 1:100
bit_x = (x .== 1)
bit_y = (y .== 1)
bit_z = bit_x .&& bit_y
end
1.077367 seconds (1.70 k allocations: 180.086 MiB, 1.74% gc time)
I have to do the above calculation many, many times, far more than 100.
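(For context: I know I could cut most of these CPU allocations by preallocating the output and fusing the broadcast in place. A rough sketch of what I mean, with bit_z just an illustrative buffer name; I haven't benchmarked this variant here:)

bit_z = similar(x)                        # Vector{Bool}, allocated once and reused
@time for k in 1:100
    bit_z .= (x .== 1) .&& (y .== 1)      # fused broadcast writes in place, no large temporaries
end

That's a side note, though; my real question is about the GPU behavior below.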
The GPU part:
julia> using CUDA
julia> a = CuArray(x);
julia> b = CuArray(y);
julia> @time for k in 1:100
bit_a = (a .== 1)
bit_b = (b .== 1)
bit_c = bit_a .&& bit_b
end
15.573386 seconds (28.00 M allocations: 1.421 GiB, 4.25% gc time, 35.52% compilation time)
julia> @time for k in 1:100
bit_a = (a .== 1)
bit_b = (b .== 1)
bit_c = bit_a .&& bit_b
end
0.017204 seconds (9.20 k allocations: 503.125 KiB)
julia> @time for k in 1:100
bit_a = (a .== 1)
bit_b = (b .== 1)
bit_c = bit_a .&& bit_b
end
0.020961 seconds (9.20 k allocations: 503.125 KiB)
The first 15.573386 seconds is OK; I understand why Julia is slow on the first run. The second and third runs are about 1000x faster, which is exactly what I want, so that's cool. But on the fourth run I got a StackOverflowError:
Sorry for the Korean; “전용 GPU 메모리 사용량” (“dedicated GPU memory usage”) in the third row of the screenshot means VRAM. The screenshot shows that the memory is freed when I interrupt with Ctrl+C in the REPL. But why? Why did the GPU memory usage keep increasing?
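(For what it's worth, I assume I could also watch this from inside Julia with the call below; that's my reading of the CUDA.jl API, not something I've verified against the screenshot:)

CUDA.memory_status()    # should report how much of the ~6 GB of device memory / pool is in use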
Question
Why does a CuArray operation leak memory in local scope?
Here is my question: I expected bit_c = bit_a .&& bit_b in the GPU part to be local in scope and hence to be freed once the for-loop finished executing. In the CPU part, bit_z = bit_x .&& bit_y behaves exactly as I expect, and I think that matches Julia's scoping rules.
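If the answer is that I have to free things by hand, I guess it would look roughly like the sketch below. CUDA.unsafe_free! is what I found in the memory-management docs; using it this way is my assumption, not a confirmed fix:

for k in 1:100
    bit_a = (a .== 1)
    bit_b = (b .== 1)
    bit_c = bit_a .&& bit_b
    # ... use bit_c here ...
    CUDA.unsafe_free!(bit_a)    # hand the temporaries back to the pool right away
    CUDA.unsafe_free!(bit_b)
    CUDA.unsafe_free!(bit_c)
end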
How can I fix it, and can I understand why it happens? I have already read the documentation below, and GC.gc(true) doesn't work.
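To be concrete, GC.gc(true) below is what I tried (straight from that documentation); CUDA.reclaim() is my assumption about the companion call that returns cached pool memory to the driver, and I haven't verified that it helps here:

GC.gc(true)       # full collection so the no-longer-referenced CuArrays become garbage
CUDA.reclaim()    # ask CUDA.jl to hand cached pool memory back to the driver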
My test environment
- Windows 10
- Julia 1.8.3
- CUDA.jl v3.12.0
- CUDA v"11.7.0"
- GTX 1060 6GB
I tested 3 different environments in total; the other two were (Windows 11, Julia 1.8.3, CUDA ?, GTX 1060 6GB) and (Windows 10, Julia 1.8.4, CUDA v12.0.0, GTX 1060 3GB). I also have access to a better server where I could use 24 GB of VRAM, but I don't think that's a real solution for me. I want to understand the difference between how the CPU and GPU behave in Julia and how to handle them.
Thank you for reading.