Calculating statistics of SubArray of CuArray

Hello everyone!

I am working with CuArrays of complex numbers. After some previous processing, I would like to calculate the variance of only the real part of the array along the first dimension. In order to avoid copying I use

using CUDA, Statistics

complex_array = CUDA.rand(ComplexF32, 100, 100000) # Example array

array_reinterpreted = reinterpret(reshape, Float32, complex_array) # 2×100×100000 view: slice 1 holds the real parts, slice 2 the imaginary parts
array_real = selectdim(array_reinterpreted, 1, 1) # 100×100000 view of the real parts, no copy
variances = var(array_real, dims=1)

However, this seems to call the generic CPU implementation in Statistics.jl, which throws an error because that function tries to access the array with first(A) (scalar indexing, which is disallowed on GPU arrays).
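
A quick way to confirm that the call ends up at the generic fallback (just a sketch; the exact output depends on your CUDA.jl and Statistics versions):

using CUDA, Statistics

A = selectdim(reinterpret(reshape, Float32, CUDA.rand(ComplexF32, 100, 1000)), 1, 1)

# A SubArray wrapping a ReinterpretArray wrapping a CuArray: the CuArray-specific
# var methods don't match this wrapper type.
typeof(A)

# Points at the generic method in Statistics.jl rather than a CUDA.jl/GPUArrays
# one; that generic code accesses elements one at a time (e.g. via first(A)),
# which is what fails on a GPU array.
@which var(A, dims=1)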

Following MD with Subarrays and CuArrays - #6 by maleadt, I tried to implement the version from Statistics functions on GPU? - #2 by maleadt, replacing CuArray with AnyCuArray or StridedCuArray (a rough sketch of the kind of method I mean follows the timings below). In both cases the result ended up being slower than computing the variance of both the real and imaginary parts and then selecting only the real section of the reinterpreted array:

using Statistics
using CUDA
using BenchmarkTools

complex_array = CUDA.rand(ComplexF32, 100, 100000)
array_reinterpreted = reinterpret(reshape, Float32, complex_array)

@btime begin
variances = selectdim(var(array_reinterpreted, dims=2), 1, 1)
end

gave 15.163 μs (86 allocations: 2.39 KiB), whereas using

using Statistics
using CUDA
using BenchmarkTools
# File with implementation for AnyCuArray or StridedCuArray
include("src/statistics.jl")

complex_array = CUDA.rand(ComplexF32, 100, 100000)
array_reinterpreted = reinterpret(reshape, Float32, complex_array)

@btime begin
variances = var(selectdim(array_reinterpreted, 1, 1), dims=1)
end

gave 58.351 μs (370 allocations: 11.39 KiB).
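
For context, the kind of method I mean is essentially the two-pass variance written with sum and broadcasting, dispatched on AnyCuArray so that wrapped arrays (views, reinterpreted arrays) stay on the GPU. This is only a rough sketch of the idea (with a hypothetical name, gpu_var), not the actual contents of my src/statistics.jl:

using CUDA, Statistics

# Rough sketch: two-pass variance along `dims` for any CuArray wrapper.
# `AnyCuArray` is the union CUDA.jl uses for CuArrays plus wrapped CuArrays
# (views, reshapes, reinterprets); corrected=true matches Statistics.var.
function gpu_var(A::CUDA.AnyCuArray; dims, corrected::Bool=true)
    n = prod(size(A, d) for d in dims)  # number of elements reduced over
    μ = sum(A; dims) ./ n               # mean along `dims`
    s = sum(abs2, A .- μ; dims)         # sum of squared deviations
    s ./ (n - corrected)
end

Everything stays as GPU reductions and broadcasts, at the cost of one temporary array for the deviations.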

I was wondering if there is a faster way of calculating these variances that avoids copying the real part of the complex array or unnecessarily computing the variance of the imaginary part.
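
For example, one direction would be to apply the Σx² − n·μ² formulation directly to the complex array, extracting the real part inside the reductions, so that neither a real-part copy nor the imaginary variance is ever computed. This is only an untested sketch (hypothetical real_var), and that formulation is less numerically stable than the two-pass one:

using CUDA, Statistics

# Sketch: variance of the real parts along `dims`, computed straight from the
# complex CuArray. Uses Var = (Σ Re(x)² − (Σ Re(x))²/n) / (n − corrected) so no
# intermediate real-part array is materialized.
function real_var(A::CuArray{<:Complex}; dims=1, corrected::Bool=true)
    n  = prod(size(A, d) for d in dims)
    s1 = sum(real, A; dims)             # Σ Re(x)
    s2 = sum(abs2 ∘ real, A; dims)      # Σ Re(x)²
    (s2 .- s1 .^ 2 ./ n) ./ (n - corrected)
end

real_variances = real_var(complex_array, dims=1)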

You forgot to synchronize:

julia> @btime CUDA.@sync selectdim(var(array_reinterpreted, dims=2), 1, 1);
  274.745 μs (171 allocations: 4.67 KiB)

julia> @btime CUDA.@sync var(selectdim(array_reinterpreted, 1, 1), dims=1);
  45.089 μs (169 allocations: 4.52 KiB)

Also see CUDA.@profile, which does that for you:

julia> CUDA.@profile selectdim(var(array_reinterpreted, dims=2), 1, 1)
Profiler ran for 774.15 µs, capturing 230 events.

Host-side activity: calling CUDA APIs took 574.35 µs (74.19% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   67.20% │  520.23 µs │     4 │ 130.06 µs ± 148.73 (  1.19 ‥ 276.57) │ cuMemAllocFromPoolAsync │
│    4.43% │   34.33 µs │     4 │   8.58 µs ± 5.52   (  4.29 ‥ 16.69)  │ cuLaunchKernel          │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 269.41 µs (34.80% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                          ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────
│   18.11% │  140.19 µs │     1 │                                    │ partial_mapreduce_grid(identity, add_sum, Flo ⋯
│   16.20% │  125.41 µs │     1 │                                    │ partial_mapreduce_grid(ComposedFunction<Fix1< ⋯
│    0.49% │    3.81 µs │     2 │   1.91 µs ± 0.0    (  1.91 ‥ 1.91) │ partial_mapreduce_grid(identity, add_sum, Flo ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────


julia> CUDA.@profile var(selectdim(array_reinterpreted, 1, 1), dims=1)
Profiler ran for 134.23 µs, capturing 192 events.

Host-side activity: calling CUDA APIs took 57.94 µs (43.16% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   26.47% │   35.52 µs │     4 │   8.88 µs ± 6.08   (  4.53 ‥ 17.88) │ cuLaunchKernel          │
│    6.75% │    9.06 µs │     4 │   2.26 µs ± 1.37   (  1.43 ‥ 4.29)  │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 10.49 µs (7.82% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                          ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────
│    2.84% │    3.81 µs │     2 │   1.91 µs ± 0.34   (  1.67 ‥ 2.15) │ partial_mapreduce_grid(identity, add_sum, Flo ⋯
│    2.49% │    3.34 µs │     1 │                                    │ partial_mapreduce_grid(ComposedFunction<Fix1< ⋯
│    2.49% │    3.34 µs │     1 │                                    │ partial_mapreduce_grid(identity, add_sum, Flo ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────
                                                                                                      1 column omitted

Ah, yes! I forgot to synchronize. I should get used to using CUDA.@profile.
Now I see that, for my GPU, the calculation over only the real part takes 8.591 ms (371 allocations: 11.41 KiB), compared to 10.446 ms (386 allocations: 12.39 KiB) for the calculation over both the real and imaginary parts. I guess the difference would grow for larger arrays.

Thank you very much!