Calculate FFT on GPU for every row of a 2D array

Hello,

I have a 2D array and I want to calculate the FFT of every row of this array. I am trying to do it on the GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please find a minimal working example below:

using CuArrays


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU:
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    @time for i=1:Nr
        @views @. Et = E[i, :]   # copy the i-th row into a contiguous buffer
        A_mul_B!(St, P, Et)      # St = P * Et
        @. S[i, :] = St          # write the transformed row back
    end

    # GPU:
    E_gpu = CuArray(E)
    S_gpu = CuArray(S)
    P_gpu = plan_fft(CuArray(zeros(Complex64, Nt)))
    
    Et_gpu = CuArray(zeros(Complex64, Nt))
    St_gpu = CuArray(zeros(Complex64, Nt))
    @time for i=1:Nr
        # @views Et_gpu = E_gpu[i, :]   # A_mul_B! LoadError: don't know how to handle argument of type SubArray
        # Et_gpu[:] = E_gpu[i, :]   # even more allocations with a preallocated buffer
        Et_gpu = E_gpu[i, :]   # Bottleneck 1: allocates a new row copy on every iteration
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu   # Bottleneck 2: writes the result back row by row
    end
end


main()

On my computer the results are the following:

  0.009219 seconds
  3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)

As you can see, the GPU version of the code suffers from a large number of memory allocations. As far as I understand, they come mainly from the lines commented with “Bottleneck 1” and “Bottleneck 2” (though the in-place A_mul_B! also allocates some memory). Unfortunately, A_mul_B! cannot work with views, so I cannot use them for “Bottleneck 1”. As for “Bottleneck 2”, I have no ideas at all: I cannot use element-wise operations there (as in the CPU case) because on the GPU they are too slow.
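
For reference, this is the kind of single batched call I would ideally like to write instead of the loop, using a plan along the second dimension so that every row is transformed in one call. It is only a sketch: I know plan_fft accepts a dims argument on the CPU, but I have not verified that the CuArrays CUFFT wrapper supports it on my setup.

using CuArrays

E_gpu = CuArray(rand(Complex64, (500, 2048)))
P_batch = plan_fft(E_gpu, 2)   # assumption: CUFFT plan along the 2nd dimension, i.e. one FFT per row
S_gpu = P_batch * E_gpu        # all 500 rows transformed in a single call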

Could you please suggest a workaround?

Alternatively, you can use ArrayFire.jl: it supports batched FFTs and can do a 1D FFT on every row of an N-D array.
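
I have not tested it on your exact case, so treat the following as a rough sketch from memory rather than working code (AFArray, fft and Array are the ArrayFire.jl names as I remember them; ArrayFire batches the 1D transform along the first dimension, hence the transposes):

using ArrayFire

E = rand(Complex64, (500, 2048))
E_af = AFArray(E.')    # transpose so that the rows become columns
S_af = fft(E_af)       # batched 1D FFT along the first dimension
S = Array(S_af).'      # copy back to the host and restore the row layout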

Thank you for the advice.
However, I do not want to install external libraries.

I think I have found a slightly better solution, which uses custom CUDA kernels:

using CuArrays
using CUDAnative
using CUDAdrv


function kernel1(b, i, a)
    # copy the i-th row of matrix a into vector b
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(b)
        b[j] = a[i, j]
    end
    return nothing
end


function kernel2(b, i, a)
    # copy vector a into the i-th row of matrix b
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(a)
        b[i, j] = a[j]
    end
    return nothing
end


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU ---------------------------------------------------------------------
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    
    @time for i=1:Nr
        @views @. Et = E[i, :]
        A_mul_B!(St, P, Et)
        @. S[i, :] = St
    end

    # GPU ---------------------------------------------------------------------
    E_gpu = CuArrays.cu(E)
    S_gpu = CuArrays.cu(S)
    P_gpu = plan_fft(CuArrays.cuzeros(Complex64, Nt))
    
    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One warm-up run before timing, to exclude compilation and initialization:
    Et_gpu = E_gpu[1, :]
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    S_gpu[1, :] = St_gpu
    
    @time for i=1:Nr
        Et_gpu = E_gpu[i, :]
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu
    end
    
    # GPU with custom kernels -------------------------------------------------
    dev = CuDevice(0)
    MAX_THREADS = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
    threads = min(Nt, MAX_THREADS)
    blocks = Int(ceil(Nt / threads))

    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One warm-up run before timing, to exclude compilation and initialization:
    @cuda (blocks, threads) kernel1(Et_gpu, 1, E_gpu)
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    @cuda (blocks, threads) kernel2(S_gpu, 1, St_gpu)
    
    @time for i=1:Nr
        @cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        @cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
    end
end


main()

The results are the following:

  0.009238 seconds
  0.010801 seconds (41.03 k allocations: 1.274 MiB)
  0.005430 seconds (16.00 k allocations: 531.250 KiB)

The solution is not very elegant and still allocates some memory, but at least the GPU code is now faster than the CPU one.
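
One thing I am not sure about (an assumption on my side): both the @cuda launches and the plan application are asynchronous, so the timed loop may return before the GPU has actually finished the work. To be safe, I would synchronize the device inside the timed region, so the last loop above would become roughly:

@time begin
    for i=1:Nr
        @cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        @cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
    end
    CUDAdrv.synchronize()   # wait for all queued GPU work before stopping the timer
end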