Hi all.
I’ve been wondering about this for a long time, so I’m asking here.
I show two benchmark functions below (the full code is at the end of this post). The only difference is that the former's closure does not receive A as an argument, while the latter's does.
The results are:
julia> @benchmark bench1!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 139.146 μs … 478.847 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 142.050 μs ┊ GC (median): 0.00%
Time (mean ± σ): 151.921 μs ± 31.933 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
██▅▄▄▃▂ ▁ ▁ ▁ ▁ ▂
████████▇▇▇▇▇▇▇▆▇█▇█▇▆▇▆▆▇▇██▃▁▁▁▁▁▅█▅▆█▅▁▄▅▇▇▆▆▆█▇▅▇▇▆▄▅▇█▇▄ █
139 μs Histogram: log(frequency) by time 294 μs <
Memory estimate: 1.25 KiB, allocs estimate: 29.
julia> @benchmark bench2!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 138.757 μs … 511.451 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 141.719 μs ┊ GC (median): 0.00%
Time (mean ± σ): 153.494 μs ± 37.518 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▆▄▄▂ ▁ ▁ ▁
██████▇▇▇▆▆▇▇█▇▇▇▆▆██▄▄▃▁▆███▄▅▆▇▆▆█▇▇▇▅▆██▇▆▇▄▄▅▄▃▃▁▃▁▁▁▃▁▁▇ █
139 μs Histogram: log(frequency) by time 356 μs <
Memory estimate: 560 bytes, allocs estimate: 18.
julia> CUDA.registers(bench1!(A))
8
julia> CUDA.registers(bench2!(A))
8
So, my questions are:
- Why does this allocation occur? (A sketch that measures the launch and the synchronization separately is below.)
- If it is because the GPU compiler is not yet perfect, will this be improved in the future?
- Should I care about this allocation?
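To try to narrow this down myself, here is a minimal sketch, assuming the same setup as the full code at the end of this post. launch_only! is a hypothetical helper (not part of the benchmarks above) that launches the kernel without CUDA.@sync, so the launch and the synchronization can be measured separately with @allocated:

# Hypothetical helper: same kernel as bench2!, but launch only, no CUDA.@sync
function launch_only!(A)
    function inner(A)
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@cuda threads = threads blocks = blocks inner(A)
end

launch_only!(A)                        # warm up, so compilation is excluded
@allocated launch_only!(A)             # host-side bytes from the launch alone
@allocated CUDA.@sync launch_only!(A)  # launch plus synchronization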
Off topic, but I would also like to ask about memory allocation when parallelization is used (a minimal CPU example of what I mean is sketched below):
- Why do @threads and @cuda cause memory allocations?
- Why does Polyester's @batch not cause any memory allocation?
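For reference, here is a minimal CPU-only sketch of what I mean, assuming the Polyester package is installed (the function names are just for illustration):

using BenchmarkTools
using Polyester

function fill_threads!(x)
    # Threads.@threads spawns tasks on each call, which shows up as allocations
    Threads.@threads for i in eachindex(x)
        x[i] = 1
    end
    return nothing
end

function fill_batch!(x)
    # Polyester's @batch reuses its worker tasks, so it typically reports zero allocations
    @batch for i in eachindex(x)
        x[i] = 1
    end
    return nothing
end

x = zeros(2^20)
@benchmark fill_threads!($x)   # a handful of allocations per call
@benchmark fill_batch!($x)     # typically 0 allocations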
Thank you in advance.
using CUDA
using BenchmarkTools
function bench1!(A)
    # inner() is a closure that captures A from the enclosing scope
    function inner()
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner()
end
A1 = CUDA.zeros(100)
bench1!(A1)
all(A1 .== 1)
function bench2!(A)
    # inner(A) takes A explicitly instead of capturing it
    function inner(A)
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end
    threads = 256
    blocks = cld(length(A), threads)
    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner(A)
end
A2 = CUDA.zeros(100)
bench2!(A2)
all(A2 .== 1)
A = CUDA.zeros(256, 256, 256)
@benchmark bench1!($A)
@benchmark bench2!($A)