How to efficiently split StaticArrays into columns?

Ahmed_Salih · January 15, 2023, 10:25am

Hi!

Suppose I have:

using StaticArrays

Values = rand(SVector{3,Float64},10^6)

I know that I can split this into components as such:

ValuesX = getindex.(Values,1)

@benchmark getindex.($Values,1)
BenchmarkTools.Trial: 1104 samples with 1 evaluation.
 Range (min … max):  3.079 ms … 24.049 ms  ┊ GC (min … max):  0.00% … 79.77%
 Time  (median):     3.742 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   4.519 ms ±  3.131 ms  ┊ GC (mean ± σ):  14.44% ± 16.67%

  ▇▆█▇▃ ▃▃▃                                               ▁
  ██████████▇▁▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▄███▇ █
  3.08 ms      Histogram: log(frequency) by time     17.5 ms <

And so on for ValuesY and ValuesZ.

How do I perform this operation more efficiently? Using @views does not seem to do anything.

I need this because my input data comes in the format/type of Values, but for the code to work on the GPU I have to split it into columns.

Kind regards

rafael.guerra · January 15, 2023, 11:07am

The following provides views over the different columns but I do not know anything about the GPU requirements:

view(transpose(reshape(reinterpret(Float64, Values), 3, :)), :, 1)
view(transpose(reshape(reinterpret(Float64, Values), 3, :)), :, 2)
view(transpose(reshape(reinterpret(Float64, Values), 3, :)), :, 3)
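Since the three lines above wrap the same memory, the reinterpreted matrix can be built once and reused. A minimal CPU-only sketch, assuming a small `Values` for illustration; the names `A`, `X`, `Y`, `Z` are mine and not part of the original suggestion:

```julia
using StaticArrays

Values = rand(SVector{3,Float64}, 10)

# Build the lazy N×3 wrapper once: reinterpret -> flat Float64 vector,
# reshape -> 3×N matrix, transpose -> N×3. No data is copied.
A = transpose(reshape(reinterpret(Float64, Values), 3, :))

# One view per component, all sharing the memory of Values.
X = view(A, :, 1)
Y = view(A, :, 2)
Z = view(A, :, 3)

# The views agree with the broadcasted getindex from the first post.
@assert X == getindex.(Values, 1)
@assert Y == getindex.(Values, 2)
@assert Z == getindex.(Values, 3)
```

I have only checked the logic for plain `Array`s; whether the same wrappers behave well on a `CuArray` is a separate question.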

Ahmed_Salih · January 15, 2023, 11:48am

Magnificent!

Annoyingly fast. Compared to the 3 ms I got, benchmarking your approach like this gives:

@benchmark view(transpose(reshape(reinterpret($(Float64), $(Values)), 3, :)), :, $(1))
BenchmarkTools.Trial: 10000 samples with 886 evaluations.
 Range (min … max):  127.201 ns …  40.439 μs  ┊ GC (min … max): 0.00% … 99.42%
 Time  (median):     135.779 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   152.298 ns ± 605.257 ns  ┊ GC (mean ± σ):  6.81% ±  1.72%

  ▆█▇█▅▄▃▃▂▂▂▄▄▄▃▂                                              ▂
  █████████████████████▇▇▇▆▆▅▄▅▅▅▅▅▄▄▃▅▅▃▄▅▆▆▆▃▅▅▄▁▁▁▁▄▃▁▄▃▃▁▃▅ █
  127 ns        Histogram: log(frequency) by time        290 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.

Roughly 25,000 times faster (about 3.7 ms vs. 136 ns median)? At least in that order of magnitude, I think.

Unfortunately I cannot get it to work on the GPU; one can test that by just converting Values → CuArray(Values).

It says something about scalar indexing, but I don't understand why that happens.

Kind regards

DNF · January 15, 2023, 12:12pm

This is basically a lazy construction, so you don’t really see the performance implications before you start iterating and accessing.

Your function actually collects the data; that’s the real work, if you need it.
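To make the lazy/eager distinction concrete, here is a small sketch (my own illustration, CPU only; the names `lazyX` and `denseX` are hypothetical). Constructing the view does no per-element work, while `collect` walks every element and fills a new array, which is comparable to what `getindex.(Values, 1)` does:

```julia
using StaticArrays

Values = rand(SVector{3,Float64}, 1000)

# Lazy: only wraps the existing memory, so construction is nearly free.
lazyX = view(transpose(reshape(reinterpret(Float64, Values), 3, :)), :, 1)

# Eager: reads every element and writes a new dense Vector{Float64}.
denseX = collect(lazyX)

@assert lazyX isa SubArray
@assert denseX isa Vector{Float64}
@assert denseX == getindex.(Values, 1)
```

A fairer comparison with the original 3 ms figure would therefore be something like `@benchmark collect($lazyX)` rather than timing the construction of the view alone.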

Ahmed_Salih · January 15, 2023, 12:15pm

I still think it has potential though because doing:

@benchmark @CUDA.sync CuArray(view(transpose(reshape(reinterpret(eltype(eltype($ValuesCu)), Array($ValuesCu)), 3, :)), :, $(1)))
BenchmarkTools.Trial: 254 samples with 1 evaluation.
 Range (min … max):  14.121 ms … 77.067 ms  ┊ GC (min … max):  0.00% … 35.87%
 Time  (median):     16.216 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   19.746 ms ±  8.772 ms  ┊ GC (mean ± σ):  14.05% ± 18.84%

  ▃▄▅█       
  ████▇▆▄▄▅▄▅▃▃▁▁▁▂▁▁▁▁▂▁▁▁▁▁▂▁▁▂▁▂▁▂▃▄▄▃▂▃▂▃▁▃▁▁▁▁▁▁▁▁▁▁▁▂▁▃ ▃
  14.1 ms         Histogram: frequency by time          49 ms <

 Memory estimate: 30.52 MiB, allocs estimate: 9.

This is still 15 times faster than what I have now. In the end it becomes really slow, though, because it mixes “CuArray” and “Array” operations and therefore constantly transfers data between CPU and GPU. I have to do this 12 times per loop, so I need to find out how to remove the Array call, i.e. avoid the transfer to the CPU.

Kind regards

DNF · January 15, 2023, 12:27pm

I think you can simplify it a bit, though

@view reinterpret(Float64, Values)[begin:3:end]

Untested.

Ahmed_Salih · January 15, 2023, 1:35pm

Thank you very much to you and @rafael.guerra !

Benchmarking the version you produced DNF takes ~550 ns for me on a GPU array. Really nice.

My question, though: I want this for x, y and z, so basically:

@view reinterpret(Float64, Values)[begin+0:3:end]  #x
@view reinterpret(Float64, Values)[begin+1:3:end]  #y
@view reinterpret(Float64, Values)[begin+2:3:end]  #z

Is this the best way to do it, or is there a smarter way?

Kind regards
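One way to avoid repeating the three lines is to wrap them in a small helper that reinterprets once and returns all three strided views. This is only a sketch under my own assumptions: `split_components` is a hypothetical name, not a StaticArrays or CUDA.jl function, and I have only reasoned about it for CPU arrays.

```julia
using StaticArrays

# Hypothetical helper (not from any library): returns lazy x, y, z views
# into the memory of `v` without copying.
function split_components(v::AbstractVector{SVector{3,T}}) where {T}
    flat = reinterpret(T, v)  # flat vector of length 3N, no copy
    first_i, last_i = firstindex(flat), lastindex(flat)
    # Component i sits at offsets i, i+3, i+6, ... in the flat vector.
    return ntuple(i -> view(flat, (first_i + i - 1):3:last_i), 3)
end

Values = rand(SVector{3,Float64}, 10)
X, Y, Z = split_components(Values)

@assert X == getindex.(Values, 1)
@assert Y == getindex.(Values, 2)
@assert Z == getindex.(Values, 3)
```

If dense copies are needed afterwards, `collect(X)` materializes one component; that step does the real copying work, so it is the part worth benchmarking.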
