Extra memory allocation when using closure with CUDA

Hi all.
I’ve been wondering about this for a long time, so I’m asking here.
Below I show two benchmark functions (full code at the end of the post).
The only difference is that the former’s closure does not receive A as an argument, while the latter’s does.
The results are:

julia> @benchmark bench1!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  139.146 μs … 478.847 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     142.050 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   151.921 μs ±  31.933 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▄▄▃▂                    ▁        ▁  ▁                  ▁   ▂
  ████████▇▇▇▇▇▇▇▆▇█▇█▇▆▇▆▆▇▇██▃▁▁▁▁▁▅█▅▆█▅▁▄▅▇▇▆▆▆█▇▅▇▇▆▄▅▇█▇▄ █
  139 μs        Histogram: log(frequency) by time        294 μs <

 Memory estimate: 1.25 KiB, allocs estimate: 29.

julia> @benchmark bench2!($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  138.757 μs … 511.451 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     141.719 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   153.494 μs ±  37.518 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▄▂              ▁        ▁                                 ▁
  ██████▇▇▇▆▆▇▇█▇▇▇▆▆██▄▄▃▁▆███▄▅▆▇▆▆█▇▇▇▅▆██▇▆▇▄▄▅▄▃▃▁▃▁▁▁▃▁▁▇ █
  139 μs        Histogram: log(frequency) by time        356 μs <

 Memory estimate: 560 bytes, allocs estimate: 18.

julia> CUDA.registers(bench1!(A))
8

julia> CUDA.registers(bench2!(A))
8

So, my questions are:

  • Why does this allocation occur?
  • If it is because the GPU compiler is not yet perfect, will this be improved in the future?
  • Should I care about this allocation?

Off topic, but I would also like to ask about memory allocation when parallelization is used.

  • Why do @threads and @cuda cause memory allocations?
  • Why does Polyester’s @batch not cause memory allocations?

Thank you in advance.

using CUDA
using BenchmarkTools

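# bench1!: the inner kernel captures A as a closure variable (no explicit argument).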
function bench1!(A)
    function inner()
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end

    threads = 256
    blocks = cld(length(A), threads)

    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner()
end

A1 = CUDA.zeros(100)
bench1!(A1)
all(A1 .== 1)

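# bench2!: the inner kernel receives A as an explicit argument instead of capturing it.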
function bench2!(A)
    function inner(A)
        i = threadIdx().x + (blockIdx().x - Int32(1)) * blockDim().x
        if i <= length(A)
            A[i] = 1
        end
        return nothing
    end

    threads = 256
    blocks = cld(length(A), threads)

    CUDA.@sync CUDA.@cuda threads = threads blocks = blocks inner(A)
end

A2 = CUDA.zeros(100)
bench2!(A2)
all(A2 .== 1)

A = CUDA.zeros(256, 256, 256)
@benchmark bench1!($A)
@benchmark bench2!($A)

Those are CPU allocations, likely caused by the additional boxing that happens when invoking the kernel (passing a CuDeviceArray directly vs. additionally wrapping it in a closure object). You don’t have to care about them: the generated GPU code is going to be near identical, and can’t ever allocate anyway. Meanwhile, allocating such tiny objects on the CPU is really fast, and you can’t escape allocating some of them, since the kernel’s arguments, be they regular arguments or captured ones, have to be boxed when they are sent to the CUDA driver.
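
For what it’s worth, here is a rough sketch of how to see that only the host side differs (assuming the definitions above have already been run, so both kernels are compiled): Base’s @allocated reports the per-launch CPU allocations, and cudaconvert shows the lightweight device-side object that actually gets boxed and handed to the driver.

# Per-launch CPU allocations (both kernels are already compiled at this point);
# the closure version allocates a little more because the captured array is
# wrapped in an extra closure object before being boxed for the driver.
@allocated bench1!(A)
@allocated bench2!(A)

# What the kernel actually receives: the CuArray is converted to a lightweight
# CuDeviceArray via cudaconvert before being passed to the CUDA driver.
typeof(cudaconvert(A))

In both cases the device code ends up effectively the same, which matches the identical CUDA.registers output you already posted.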


Thank you!!