# Calculate FFT on GPU for every row of a 2D array

Hello,

I have a 2D array and I want to calculate FFT for every raw of this array. I try to do it on GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please, find the minimal working example below:

``````using CuArrays

function main()
CuArrays.allowscalar(false)   # disable slow fallback methods

Nr = 500
Nt = 2048

# CPU:
E = rand(Complex64, (Nr, Nt))
S = zeros(Complex64, (Nr, Nt))
P = plan_fft(zeros(Complex64, Nt))

Et = zeros(Complex64, Nt)
St = zeros(Complex64, Nt)
@time for i=1:Nr
@views @. St = S[i, :]
A_mul_B!(St, P, Et)
@. E[i, :] = Et
end

# GPU:
E_gpu = CuArray(E)
S_gpu = CuArray(S)
P_gpu = plan_fft(CuArray(zeros(Complex64, Nt)))

Et_gpu = CuArray(zeros(Complex64, Nt))
St_gpu = CuArray(zeros(Complex64, Nt))
@time for i=1:Nr
# @views St_gpu = S_gpu[i, :]   # A_mul_B! LoadError: don't know how to handle argument of type SubArray
# St_gpu[:] = S_gpu[i, :]   # more allocations with preallocated St_gpu
St_gpu = S_gpu[i, :]   # Bottleneck 1
A_mul_B!(St_gpu, P_gpu, Et_gpu)
E_gpu[i, :] = Et_gpu   # Bottleneck 2
end
end

main()
``````

On my computer the results are the following:

``````  0.009219 seconds
3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)
``````

As you can see, the GPU version of the code suffers from large amount of memory allocations. As I understand, it happens mainly due to the lines commented with “Bottleneck 1” and “Bottleneck 2” (though inplace A_mul_B! also allocates some memory). Unfortunately, A_mul_B! can not work with views, so I can not use them for “Bottleneck 1”. Concerning “Bottleneck 2” I have no ideas at all. Here I can not use element-wise operations (like in case of CPU) because on GPU they are too slow.

Can you, please, suggest any workaround?

Or you can use `ArrayFire.jl`, it has support for batched FFT that can do 1D FFT on every row of a N-D array.

Thank you for the advice.
However, I do not want to install external libraries.

I guess I found a slightly better solution which utilizes custom CUDA kernels:

``````using CuArrays
using CUDAnative

function kernel1(b, i, a)
j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
if j <= length(b)
b[j] = a[i, j]
end
return nothing
end

function kernel2(b, i, a)
j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
if j <= length(a)
b[i, j] = a[j]
end
return nothing
end

function main()
CuArrays.allowscalar(false)   # disable slow fallback methods

Nr = 500
Nt = 2048

# CPU ---------------------------------------------------------------------
E = rand(Complex64, (Nr, Nt))
S = zeros(Complex64, (Nr, Nt))
P = plan_fft(zeros(Complex64, Nt))

Et = zeros(Complex64, Nt)
St = zeros(Complex64, Nt)

@time for i=1:Nr
@views @. St = S[i, :]
A_mul_B!(St, P, Et)
@. E[i, :] = Et
end

# GPU ---------------------------------------------------------------------
E_gpu = CuArrays.cu(E)
S_gpu = CuArrays.cu(S)
P_gpu = plan_fft(CuArrays.cuzeros(Complex64, Nt))

Et_gpu = CuArrays.cuzeros(Complex64, Nt)
St_gpu = CuArrays.cuzeros(Complex64, Nt)

# One run before the timing to skip the initializations:
Et_gpu = E_gpu[1, :]
A_mul_B!(St_gpu, P_gpu, Et_gpu)
S_gpu[1, :] = St_gpu

@time for i=1:Nr
Et_gpu = E_gpu[i, :]
A_mul_B!(St_gpu, P_gpu, Et_gpu)
S_gpu[i, :] = St_gpu
end

# GPU with custom kernels -------------------------------------------------
dev = CuDevice(0)
blocks = Int(ceil(Nt / threads))

Et_gpu = CuArrays.cuzeros(Complex64, Nt)
St_gpu = CuArrays.cuzeros(Complex64, Nt)

# One run before the timing to skip the initializations:
@cuda (blocks, threads) kernel1(Et_gpu, 1, E_gpu)
A_mul_B!(St_gpu, P_gpu, Et_gpu)
@cuda (blocks, threads) kernel2(S_gpu, 1, St_gpu)

@time for i=1:Nr
@cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
A_mul_B!(St_gpu, P_gpu, Et_gpu)
@cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
end
end

main()

``````

The results are the following:

``````  0.009238 seconds
0.010801 seconds (41.03 k allocations: 1.274 MiB)
0.005430 seconds (16.00 k allocations: 531.250 KiB)
``````

The solution is not so elegant and still suffers from allocations, but at least now the GPU code is faster than the CPU one.