Hi all.
I’ve been wondering about this for a long time, so I’m asking here.
I show two benchmark functions below (the full code is at the end of this post). The only difference is that the former's closure does not receive A as an argument, while the latter's does.
The results are:
julia> @benchmark bench1!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 139.146 μs … 478.847 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 142.050 μs ┊ GC (median): 0.00%
Time (mean ± σ): 151.921 μs ± 31.933 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
██▅▄▄▃▂ ▁ ▁ ▁ ▁ ▂
████████▇▇▇▇▇▇▇▆▇█▇█▇▆▇▆▆▇▇██▃▁▁▁▁▁▅█▅▆█▅▁▄▅▇▇▆▆▆█▇▅▇▇▆▄▅▇█▇▄ █
139 μs Histogram: log(frequency) by time 294 μs <
Memory estimate: 1.25 KiB, allocs estimate: 29.
julia> @benchmark bench2!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 138.757 μs … 511.451 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 141.719 μs ┊ GC (median): 0.00%
Time (mean ± σ): 153.494 μs ± 37.518 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆▄▄▂ ▁ ▁ ▁
██████▇▇▇▆▆▇▇█▇▇▇▆▆██▄▄▃▁▆███▄▅▆▇▆▆█▇▇▇▅▆██▇▆▇▄▄▅▄▃▃▁▃▁▁▁▃▁▁▇ █
139 μs Histogram: log(frequency) by time 356 μs <
Memory estimate: 560 bytes, allocs estimate: 18.
julia> CUDA.registers(bench1!(A))
8
julia> CUDA.registers(bench2!(A))
8
So, my questions are:
- Why does this allocation occur? (A sketch that measures the launch and the synchronization separately is below.)
- If it is because the GPU compiler is not yet perfect, will this be improved in the future?
- Should I care about this allocation?
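To try to narrow this down myself, here is a minimal sketch, assuming the same setup as the full code at the end of this post. launch_only! is a hypothetical helper (not part of the benchmarks above) that launches the kernel without CUDA.@sync, so the launch and the synchronization can be measured separately with @allocated:

# Hypothetical helper: same kernel as bench2!, but launch only, no CUDA.@sync
function launch_only!(A)
    function inner(A)
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@cuda threads = threads blocks = blocks inner(A)
end

launch_only!(A)                        # warm up, so compilation is excluded
@allocated launch_only!(A)             # host-side bytes from the launch alone
@allocated CUDA.@sync launch_only!(A)  # launch plus synchronization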
Off topic, but I would also like to ask about memory allocation when parallelization is used (a minimal CPU example of what I mean is sketched below):
- Why do @threads and @cuda cause memory allocations?
- Why does Polyester's @batch not cause any memory allocation?
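For reference, here is a minimal CPU-only sketch of what I mean, assuming the Polyester package is installed (the function names are just for illustration):

using BenchmarkTools
using Polyester

function fill_threads!(x)
    # Threads.@threads spawns tasks on each call, which shows up as allocations
    Threads.@threads for i in eachindex(x)
        x[i] = 1
    end
    return nothing
end

function fill_batch!(x)
    # Polyester's @batch reuses its worker tasks, so it typically reports zero allocations
    @batch for i in eachindex(x)
        x[i] = 1
    end
    return nothing
end

x = zeros(2^20)
@benchmark fill_threads!($x)   # a handful of allocations per call
@benchmark fill_batch!($x)     # typically 0 allocations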
Thank you in advance.
using CUDA
using BenchmarkTools
function bench1!(A)
    # inner() is a closure that captures A from the enclosing scope
    function inner()
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner()
end
A1 = CUDA.zeros(100)
bench1!(A1)
all(A1 .== 1)
function bench2!(A)
    # inner(A) takes A explicitly instead of capturing it
    function inner(A)
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner(A)
end
A2 = CUDA.zeros(100)
bench2!(A2)
all(A2 .== 1)
A = CUDA.zeros(256, 256, 256)
@benchmark bench1!($A)
@benchmark bench2!($A)