I’m trying to benchmark different methods of calling the same in-place function on subsets of a large data block using CuArrays. I’m using `fft` as the example function because I can baseline against planning an FFT over one dimension of a large array. Using this as a minimal working example, I created a testbench with the following methods:
- Create a list of 1D CuArrays and `map` f(x) over each element of the list
- Create a 2D CuArray and `mapslices` f(x) over each length-`fftsz` slice of the array
- Baseline taking f(x) over the full 2D array
```julia
using FFTW
using BenchmarkTools
using CuArrays

CuArrays.allowscalar(false)

N = 100000
fftsz = 1024

# list of random vectors
A1_d = [CuArray(rand(ComplexF32, fftsz)) for i in 1:N]
# random 2D array
A2_d = CuArray(rand(ComplexF32, N, fftsz))

p1 = plan_fft!(A1_d)
p2 = plan_fft!(A2_d, 1)

function onefft!(data, plan)
    plan * data
end

#1
@btime CuArrays.@sync map(x -> onefft!(x, p1), $A1_d)
#2
@btime CuArrays.@sync mapslices(x -> onefft!(x, p1), $A2_d, dims=2)
#3
@btime CuArrays.@sync onefft!($A2_d, p2)
```
The results are:
```
629.991 ms (200011 allocations: 3.81 MiB)
3.618 s (13198542 allocations: 440.96 MiB)
20.603 ms (10 allocations: 176 bytes)
```
I would expect the three approaches to perform roughly the same. The strangest things I notice are:
- Why do (1) and (2) allocate memory? All of the data is already on the GPU and every operation should be in place.
- Why are (1) and (2) so different from one another? Even if there is some penalty to `map`, it seems like (1) and (2) should be identical.
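On the first point, one variant I tried in order to isolate the cost of collecting results (a sketch, assuming `p1`, `A1_d`, and `onefft!` are defined as in the benchmark above) was to swap `map` for `foreach`:

```julia
# Sketch: `map` collects one result per element into a new output
# Vector, even when the mapped function mutates its argument in place.
# `foreach` applies the function only for its side effect and returns
# `nothing`, so the per-element output collection should go away.
# (Assumes the same p1 plan and A1_d list from the benchmark above.)
@btime CuArrays.@sync foreach(x -> onefft!(x, p1), $A1_d)
```

I’m not sure this accounts for all of the allocations in (1), but it should at least remove the N-element output Vector that `map` builds.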