I have a 2D array and I want to calculate FFT for every raw of this array. I try to do it on GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please, find the minimal working example below:
using CuArrays function main() CuArrays.allowscalar(false) # disable slow fallback methods Nr = 500 Nt = 2048 # CPU: E = rand(Complex64, (Nr, Nt)) S = zeros(Complex64, (Nr, Nt)) P = plan_fft(zeros(Complex64, Nt)) Et = zeros(Complex64, Nt) St = zeros(Complex64, Nt) @time for i=1:Nr @views @. St = S[i, :] A_mul_B!(St, P, Et) @. E[i, :] = Et end # GPU: E_gpu = CuArray(E) S_gpu = CuArray(S) P_gpu = plan_fft(CuArray(zeros(Complex64, Nt))) Et_gpu = CuArray(zeros(Complex64, Nt)) St_gpu = CuArray(zeros(Complex64, Nt)) @time for i=1:Nr # @views St_gpu = S_gpu[i, :] # A_mul_B! LoadError: don't know how to handle argument of type SubArray # St_gpu[:] = S_gpu[i, :] # more allocations with preallocated St_gpu St_gpu = S_gpu[i, :] # Bottleneck 1 A_mul_B!(St_gpu, P_gpu, Et_gpu) E_gpu[i, :] = Et_gpu # Bottleneck 2 end end main()
On my computer the results are the following:
0.009219 seconds 3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)
As you can see, the GPU version of the code suffers from large amount of memory allocations. As I understand, it happens mainly due to the lines commented with “Bottleneck 1” and “Bottleneck 2” (though inplace A_mul_B! also allocates some memory). Unfortunately, A_mul_B! can not work with views, so I can not use them for “Bottleneck 1”. Concerning “Bottleneck 2” I have no ideas at all. Here I can not use element-wise operations (like in case of CPU) because on GPU they are too slow.
Can you, please, suggest any workaround?