using StaticArrays
using BenchmarkTools
S = rand(SVector{3,Float32},10^6)
function MyResizeAndFill!(S, N)
    resize!(S, N)
    fill!(S, zero(eltype(S)))
end
@benchmark MyResizeAndFill!($S,$(10^8))
BenchmarkTools.Trial: 34 samples with 1 evaluation.
Range (min … max): 98.789 ms … 200.757 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 151.433 ms ┊ GC (median): 0.00%
Time (mean ± σ): 149.898 ms ± 24.283 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# And on GPU which is what I care about
using CUDA
SCU = CuArray(S)
@benchmark MyResizeAndFill!($SCU,$(10^8))
BenchmarkTools.Trial: 182 samples with 1 evaluation.
Range (min … max): 10.508 ms … 61.047 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.079 ms ┊ GC (median): 0.00%
Time (mean ± σ): 27.514 ms ± 11.648 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃
▃▁▆██▅▅▁▃▃█▁▃▃▃▃▁▃▄▄▃▃▅▃▄▅▃▅▃▄▄▁▅▅▃▃▅▁▁▁▃▁▆▃▃▇▅▅▆▅▇▅▅▅▄▃▅▅▃ ▃
10.5 ms Histogram: frequency by time 45.4 ms <
Memory estimate: 4.77 KiB, allocs estimate: 81.
Is there any way to get this number down? I need to call this function 15 times in a simulation to reset arrays, and it takes way too long like this for me.
Your benchmark seems flawed; the resize only happens once, and you’re not synchronizing.
That said, resize in CUDA.jl currently will never be fast. It’s essentially a new allocation and a copy, even when shrinking. The implementation isn’t complex, so you could take a look at optimizing it for your use case (e.g., keeping the excess space when shrinking, or adding a sizehint!).
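A user-side sketch of the “keep the excess space” idea (the helper below is my own, not CUDA.jl API): only grow the buffer and never shrink it, so repeated resizes to a smaller size stop allocating and copying. The caller then has to track the logical size N itself, since the buffer may stay larger than N:

function ResizeKeepCapacityAndFill!(S, N)
    length(S) < N && resize!(S, N)   # allocation + copy only happens when growing
    fill!(S, zero(eltype(S)))        # reset every element, including any excess capacity
    return S
end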
@benchmark @CUDA.sync MyResizeAndFill!($SCU,$(10^8))
BenchmarkTools.Trial: 380 samples with 1 evaluation.
Range (min … max): 12.184 ms … 40.047 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.849 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.146 ms ± 2.098 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▁▆█▆▆▃▂
▅████████▆▄▅▃▃▂▃▃▂▂▁▁▂▁▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂ ▃
12.2 ms Histogram: frequency by time 19.6 ms <
Memory estimate: 8.19 KiB, allocs estimate: 132.
And so it takes ~10 ms to call the function and issue the resize, and ~2 ms to perform the actual resize and fill, since the synchronized version took ~12 ms? Is that correctly understood?
And thank you for the hints - I got the idea of preallocating “a zeroth array” and using copyto! instead:
BCU = deepcopy(SCU)
@benchmark @CUDA.sync copyto!($SCU,$BCU)
BenchmarkTools.Trial: 738 samples with 1 evaluation.
Range (min … max): 6.411 ms … 12.122 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.670 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.762 ms ± 308.416 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄██▆ ▁
▂▂▂▃▄▆▇███████▇▇▅▄▃▅▅▇▆▆▅▄▄▃▄▄▃▃▄▄▃▃▃▃▃▂▁▂▂▂▂▂▂▂▂▁▃▂▂▁▁▂▂▁▂ ▃
6.41 ms Histogram: frequency by time 7.63 ms <
Memory estimate: 3.42 KiB, allocs estimate: 51.
The nice thing about copyto! is that one can specify the indices to copy to as well. This doubles the speed and reduces CPU allocations, so perhaps I can find a way to improve this aspect of my code by defining an initial zero array.
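For instance (a hedged sketch using Base’s five-argument copyto!, which CUDA.jl also provides for CuArrays; the range below is illustrative):

N = 10^6                      # illustrative: only the first N elements need resetting
copyto!(SCU, 1, BCU, 1, N)    # copies BCU[1:N] into SCU[1:N] on the GPU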
And in these steps I would mainly update a particle neighbour list (part of Stage 0), while the cleanup is pure fill! commands - and that is what is taking the longest time.
I will see if I can use @maleadt's great tips to reduce this; I just wanted to make others aware.
I must admit that Julia is so cool for being able to monitor and track the performance of things this easily.
The total timing is now 153 seconds, compared to 682 seconds before.
I never thought one of my big challenges would be resetting a GPU array to all zeros, but I am learning a lot in this process. If anyone knows how to set / fill the values more efficiently, please do share.
No: at some point the command queue fills up, causing a non-synchronizing measurement to include the full execution time. But that doesn’t always happen, so essentially you’re measuring ‘badly’ when not including a synchronization.
Unfortunately I am not good enough (yet!) to go and change source files / do general-purpose package development, @maleadt, but I hope these small studies can help improve packages such as CUDA.jl.
I still think this is way too slow, but now it is at least 397 μs on average (for this simulation), which is on par with one full calculation loop, instead of being 10 times slower in total.
I was thinking of fill! on a CuArray with supported element types, i.e., not the SArray eltype you’re using. That uses direct CUDA API calls (CUDA.jl/array.jl at e9833ed71977b423586734a5f81151925e00d960 · JuliaGPU/CUDA.jl · GitHub), so it should be the fastest, but we can’t generalize that to all element types: even for setting to zero, the bit representation may not be all zeros. In your case it may be valid to reinterpret the array to a supported element type and use the optimized implementation, though.
Okay, so what you are in a sense saying is that for my memory layout it would probably be smarter to let go of StaticArrays and give it a shot using more “common” data types, such as Floats packed in conventional arrays and matrices?
I will give it a test tonight with a 2D array of the same size, and see if the speed is significantly faster - if yes, I will of course have to switch over.
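A minimal sketch of that test, assuming a 3×N CuMatrix{Float32} holding the same number of values (the names below are mine, not from the thread):

M = CUDA.rand(Float32, 3, 10^6)   # same number of values as 10^6 SVector{3,Float32}
CUDA.@sync fill!(M, 0f0)          # plain Float32 eltype, so fill! can take the fast path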
If in your case it’s fine to write all zeroes, you could try reinterpreting your complex element type to something that’s compatible with the optimized fill! method. I think that should work:
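Something along these lines (a sketch, not the original post’s code; it assumes that an all-zeros bit pattern is the zero you want):

flat = reinterpret(Float32, SCU)   # view the SVector{3,Float32} buffer as plain Float32
fill!(flat, 0f0)                   # should hit the optimized fill! for supported eltypes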