using StaticArrays
using BenchmarkTools
S = rand(SVector{3,Float32},10^6)
function MyResizeAndFill!(S, N)
    resize!(S, N)
    fill!(S, zero(eltype(S)))
end
@benchmark MyResizeAndFill!($S,$(10^8))
BenchmarkTools.Trial: 34 samples with 1 evaluation.
Range (min … max): 98.789 ms … 200.757 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 151.433 ms ┊ GC (median): 0.00%
Time (mean ± σ): 149.898 ms ± 24.283 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# And on GPU which is what I care about
using CUDA
SCU = CuArray(S)
@benchmark MyResizeAndFill!($SCU,$(10^8))
BenchmarkTools.Trial: 182 samples with 1 evaluation.
Range (min … max): 10.508 ms … 61.047 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 27.079 ms ┊ GC (median): 0.00%
Time (mean ± σ): 27.514 ms ± 11.648 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃
▃▁▆██▅▅▁▃▃█▁▃▃▃▃▁▃▄▄▃▃▅▃▄▅▃▅▃▄▄▁▅▅▃▃▅▁▁▁▃▁▆▃▃▇▅▅▆▅▇▅▅▅▄▃▅▅▃ ▃
10.5 ms Histogram: frequency by time 45.4 ms <
Memory estimate: 4.77 KiB, allocs estimate: 81.
Is there any way to get this number down? I need to call this function 15 times in a simulation to reset arrays, and it takes way too long like this for me.
Your benchmark seems flawed; the resize only happens once, and you’re not synchronizing.
That said, resize in CUDA.jl currently will never be fast. It’s essentially a new allocation and a copy, even when shrinking. The implementation isn’t complex, so you could take a look at optimizing it for your use case (e.g., keeping the excess space when shrinking, or adding a sizehint!).
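A user-side sketch of the “keep the excess space” idea (the helper below is my own, not CUDA.jl API): only grow the buffer and never shrink it, so repeated resizes to a smaller size stop allocating and copying. The caller then has to track the logical size N itself, since the buffer may stay larger than N:

function ResizeKeepCapacityAndFill!(S, N)
    length(S) < N && resize!(S, N)   # allocation + copy only happens when growing
    fill!(S, zero(eltype(S)))        # reset every element, including any excess capacity
    return S
end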
@benchmark @CUDA.sync MyResizeAndFill!($SCU,$(10^8))
BenchmarkTools.Trial: 380 samples with 1 evaluation.
Range (min … max): 12.184 ms … 40.047 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 12.849 ms ┊ GC (median): 0.00%
Time (mean ± σ): 13.146 ms ± 2.098 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▁▆█▆▆▃▂
▅████████▆▄▅▃▃▂▃▃▂▂▁▁▂▁▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂ ▃
12.2 ms Histogram: frequency by time 19.6 ms <
Memory estimate: 8.19 KiB, allocs estimate: 132.
And so it takes ~10 ms to call the function and issue the resize, and ~2 ms to perform the actual resize and fill, since the synchronized version took ~12 ms? Is that correctly understood?
And thank you for the hints - I got the idea of preallocating “a zeroth array” and using copyto! instead:
BCU = deepcopy(SCU)
@benchmark @CUDA.sync copyto!($SCU,$BCU)
BenchmarkTools.Trial: 738 samples with 1 evaluation.
Range (min … max): 6.411 ms … 12.122 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.670 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.762 ms ± 308.416 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄██▆ ▁
▂▂▂▃▄▆▇███████▇▇▅▄▃▅▅▇▆▆▅▄▄▃▄▄▃▃▄▄▃▃▃▃▃▂▁▂▂▂▂▂▂▂▂▁▃▂▂▁▁▂▂▁▂ ▃
6.41 ms Histogram: frequency by time 7.63 ms <
Memory estimate: 3.42 KiB, allocs estimate: 51.
The nice thing about copyto! is that one can specify the indices to copy to as well. This doubles the speed and reduces CPU allocations, so perhaps I can find a way to improve this aspect of my code by defining an initial zero array.
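For instance (a hedged sketch using Base’s five-argument copyto!, which CUDA.jl also provides for CuArrays; the range below is illustrative):

N = 10^6                      # illustrative: only the first N elements need resetting
copyto!(SCU, 1, BCU, 1, N)    # copies BCU[1:N] into SCU[1:N] on the GPU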
And in these steps I would mainly update a particle neighbour list (part of Stage 0), while the cleanup is pure fill! commands - and that is what is taking the longest time.
I will see if I can use @maleadt's great tips to reduce this; I just wanted to make others aware.
I must admit that Julia is so cool for being able to monitor and track the performance of things this easily.
The total timing is now 153 seconds, compared to 682 seconds before.
I never thought one of my big challenges would be resetting a GPU array to all zeros, but I am learning a lot in this process. If anyone knows how to set / fill the values more efficiently, please do share.
No: at some point the command queue fills up, causing a non-synchronizing measurement to include the full execution time. But that doesn’t always happen, so essentially you’re measuring ‘badly’ when not including a synchronization.
Unfortunately I am not good enough (yet!) to go and change source files / do general-purpose package development, @maleadt, but I hope these small studies can help improve packages such as CUDA.jl.
I still think this is way too slow, but now it is at least 397 μs on average (for this simulation), which is on par with one full calculation loop, instead of being 10 times slower in total.
I was thinking of fill! on a CuArray with supported element types, i.e., not the SArray eltype you’re using. That uses direct CUDA API calls (CUDA.jl/array.jl at e9833ed71977b423586734a5f81151925e00d960 · JuliaGPU/CUDA.jl · GitHub), so it should be the fastest, but we can’t generalize that to all element types: even for setting to zero, the bit representation may not be all zeros. In your case it may be valid to reinterpret the array to a supported element type and use the optimized implementation, though.
Okay, so what you are in a sense saying is that for my memory layout it would probably be smarter to let go of StaticArrays and give it a shot using more “common” data types, such as Floats packed in conventional arrays and matrices?
I will give it a test tonight with a 2D array of the same size, and see if the speed is significantly faster - if yes, I will of course have to switch over.
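A minimal sketch of that test, assuming a 3×N CuMatrix{Float32} holding the same number of values (the names below are mine, not from the thread):

M = CUDA.rand(Float32, 3, 10^6)   # same number of values as 10^6 SVector{3,Float32}
CUDA.@sync fill!(M, 0f0)          # plain Float32 eltype, so fill! can take the fast path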
If in your case it’s fine to write all zeroes, you could try reinterpreting your complex element type to something that’s compatible with the optimized fill! method. I think that should work:
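Something along these lines (a sketch, not the original post’s code; it assumes that an all-zeros bit pattern is the zero you want):

flat = reinterpret(Float32, SCU)   # view the SVector{3,Float32} buffer as plain Float32
fill!(flat, 0f0)                   # should hit the optimized fill! for supported eltypes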