I’m trying to benchmark different ways of calling the same in-place function on subsets of a large data block using CuArrays. I’m using `fft` as the example function because I can baseline it against planning an FFT over one dimension of a large array. As a minimum working example, I built a testbench with the following methods:

- Create a list of 1D CuArrays and `map` f(x) over each element of the list.
- Create a 2D CuArray and apply f(x) to each row of the array with `mapslices`.
- Baseline: apply f(x) to the full 2D array in one call.

```
using FFTW
using BenchmarkTools
using CuArrays
CuArrays.allowscalar(false)
N = 100000
fftsz = 1024
# list of random vectors
A1_d = [CuArray(rand(ComplexF32, fftsz)) for i in 1:N]
# random 2D array
A2_d = CuArray(rand(ComplexF32, N, fftsz))
p1 = plan_fft!(A1_d[1])  # plan for a single length-fftsz vector
p2 = plan_fft!(A2_d, 2)  # plan along dim 2 (length fftsz), i.e. one FFT per row
# apply a precomputed in-place FFT plan
function onefft!(data, plan)
    plan * data
end
# 1: map over the list of vectors
@btime CuArrays.@sync map(x -> onefft!(x, $p1), $A1_d)
# 2: mapslices over the rows of the 2D array
@btime CuArrays.@sync mapslices(x -> onefft!(x, $p1), $A2_d, dims=2)
# 3: one batched FFT over the full 2D array
@btime CuArrays.@sync onefft!($A2_d, $p2)
```

The results are:

```
629.991 ms (200011 allocations: 3.81 MiB)
3.618 s (13198542 allocations: 440.96 MiB)
20.603 ms (10 allocations: 176 bytes)
```
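For scale (my own arithmetic, not a separate measurement): dividing the timings by N gives the implied cost per length-1024 FFT:

```
N = 100_000
per_fft_map     = 629.991e-3 / N   # method 1
per_fft_batched = 20.603e-3 / N    # method 3
println(round(per_fft_map * 1e6, digits=2), " μs vs ",
        round(per_fft_batched * 1e6, digits=3), " μs per FFT")
```

That is roughly 6.3 μs per FFT for `map` versus about 0.2 μs in the batched case, a ~30x gap.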

I would expect the three approaches to perform roughly the same. However, the strangest things I notice are:

- Why do (1) and (2) allocate memory? All the data is already on the GPU and every operation should be in-place.
- Why are (1) and (2) so different from one another? Even if `map` carries some penalty, it seems like (1) and (2) should be nearly identical.
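One check I haven’t run yet (a sketch, assuming per-launch overhead is the culprit): time a single synchronized in-place FFT on one vector, since method (1) issues N such launches back to back:

```
using FFTW, BenchmarkTools, CuArrays
CuArrays.allowscalar(false)
v = CuArray(rand(ComplexF32, 1024))
p = plan_fft!(v)
# one plan application plus sync; N of these is a rough floor for method (1)
@btime CuArrays.@sync ($p * $v)
```

If this single call takes on the order of microseconds, 100000 of them would already account for most of the 630 ms.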