I have a 2D array and I want to calculate the FFT of every row of this array. I am trying to do it on the GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please find a minimal working example below:
using CuArrays

function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU:
    E = rand(Complex64, (Nr, Nt))    # input rows
    S = zeros(Complex64, (Nr, Nt))   # FFT of every row
    P = plan_fft(zeros(Complex64, Nt))
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    @time for i=1:Nr
        @views @. Et = E[i, :]       # copy row i into the buffer
        A_mul_B!(St, P, Et)          # St = P * Et
        @. S[i, :] = St              # store the result
    end

    # GPU:
    E_gpu = CuArray(E)
    S_gpu = CuArray(S)
    P_gpu = plan_fft(CuArray(zeros(Complex64, Nt)))
    Et_gpu = CuArray(zeros(Complex64, Nt))
    St_gpu = CuArray(zeros(Complex64, Nt))
    @time for i=1:Nr
        # @views Et_gpu = E_gpu[i, :]   # A_mul_B! LoadError: don't know how to handle argument of type SubArray
        # Et_gpu[:] = E_gpu[i, :]       # more allocations with preallocated Et_gpu
        Et_gpu = E_gpu[i, :]            # Bottleneck 1
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu            # Bottleneck 2
    end
end

main()
On my computer the results are the following:
0.009219 seconds
3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)
As you can see, the GPU version of the code suffers from a large number of memory allocations. As I understand it, this happens mainly because of the lines marked “Bottleneck 1” and “Bottleneck 2” (though the in-place A_mul_B! also allocates some memory). Unfortunately, A_mul_B! cannot work with views, so I cannot use them for “Bottleneck 1”. As for “Bottleneck 2”, I have no ideas at all. I cannot use element-wise operations there (as in the CPU case) because they are too slow on the GPU.
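For comparison, on the CPU the per-row loop and both copies can be dropped by planning the transform along the second dimension of the whole matrix. The following is only a minimal sketch, assuming Julia 0.6 (where plan_fft, fft and A_mul_B! are in Base); the name P_rows is made up for illustration, and whether a CuArrays plan accepts the same dims argument is an assumption that would need to be checked:

```julia
# Julia 0.6, CPU only: one plan that transforms every row at once.
Nr, Nt = 500, 2048
E = rand(Complex64, (Nr, Nt))    # input rows
S = zeros(Complex64, (Nr, Nt))   # FFT of every row

# P_rows is a hypothetical name; the second argument selects dimension 2,
# so each row is transformed independently.
P_rows = plan_fft(E, 2)

# Out-of-place application into the preallocated S (S = P_rows * E).
A_mul_B!(S, P_rows, E)

# Sanity check against the row-by-row transform.
@assert S[1, :] ≈ fft(E[1, :])
@assert S[Nr, :] ≈ fft(E[Nr, :])
```

With a single plan there is no temporary row buffer at all, so neither of the two copies marked as bottlenecks is needed on the CPU side.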
Could you please suggest a workaround?