Calculate FFT on GPU for every row of a 2D array

gpu

#1

Hello,

I have a 2D array and I want to calculate FFT for every raw of this array. I try to do it on GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please, find the minimal working example below:

using CuArrays


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU:
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    @time for i=1:Nr
        @views @. St = S[i, :]
        A_mul_B!(St, P, Et)
        @. E[i, :] = Et
    end

    # GPU:
    E_gpu = CuArray(E)
    S_gpu = CuArray(S)
    P_gpu = plan_fft(CuArray(zeros(Complex64, Nt)))
    
    Et_gpu = CuArray(zeros(Complex64, Nt))
    St_gpu = CuArray(zeros(Complex64, Nt))
    @time for i=1:Nr
        # @views St_gpu = S_gpu[i, :]   # A_mul_B! LoadError: don't know how to handle argument of type SubArray
        # St_gpu[:] = S_gpu[i, :]   # more allocations with preallocated St_gpu
        St_gpu = S_gpu[i, :]   # Bottleneck 1
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        E_gpu[i, :] = Et_gpu   # Bottleneck 2
    end
end


main()

On my computer the results are the following:

  0.009219 seconds
  3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)

As you can see, the GPU version of the code suffers from large amount of memory allocations. As I understand, it happens mainly due to the lines commented with “Bottleneck 1” and “Bottleneck 2” (though inplace A_mul_B! also allocates some memory). Unfortunately, A_mul_B! can not work with views, so I can not use them for “Bottleneck 1”. Concerning “Bottleneck 2” I have no ideas at all. Here I can not use element-wise operations (like in case of CPU) because on GPU they are too slow.

Can you, please, suggest any workaround?


#2

Or you can use ArrayFire.jl, it has support for batched FFT that can do 1D FFT on every row of a N-D array.


#3

Thank you for the advice.
However, I do not want to install external libraries.

I guess I found a slightly better solution which utilizes custom CUDA kernels:

using CuArrays
using CUDAnative
using CUDAdrv


function kernel1(b, i, a)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(b)
        b[j] = a[i, j]
    end
    return nothing
end


function kernel2(b, i, a)
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(a)
        b[i, j] = a[j]
    end
    return nothing
end


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU ---------------------------------------------------------------------
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    
    @time for i=1:Nr
        @views @. St = S[i, :]
        A_mul_B!(St, P, Et)
        @. E[i, :] = Et
    end

    # GPU ---------------------------------------------------------------------
    E_gpu = CuArrays.cu(E)
    S_gpu = CuArrays.cu(S)
    P_gpu = plan_fft(CuArrays.cuzeros(Complex64, Nt))
    
    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One run before the timing to skip the initializations:
    Et_gpu = E_gpu[1, :]
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    S_gpu[1, :] = St_gpu
    
    @time for i=1:Nr
        Et_gpu = E_gpu[i, :]
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu
    end
    
    # GPU with custom kernels -------------------------------------------------
    dev = CuDevice(0)
    MAX_THREADS = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
    threads = min(Nt, MAX_THREADS)
    blocks = Int(ceil(Nt / threads))

    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One run before the timing to skip the initializations:
    @cuda (blocks, threads) kernel1(Et_gpu, 1, E_gpu)
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    @cuda (blocks, threads) kernel2(S_gpu, 1, St_gpu)
    
    @time for i=1:Nr
        @cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        @cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
    end
end


main()

The results are the following:

  0.009238 seconds
  0.010801 seconds (41.03 k allocations: 1.274 MiB)
  0.005430 seconds (16.00 k allocations: 531.250 KiB)

The solution is not so elegant and still suffers from allocations, but at least now the GPU code is faster than the CPU one.