Calculate FFT on GPU for every row of a 2D array

Hello,

I have a 2D array and I want to calculate the FFT of every row of this array. I am trying to do it on the GPU using CuArrays, but my GPU version of the code is too slow because of multiple memory allocations that I do not know how to avoid. Please find a minimal working example below:

using CuArrays


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU:
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    @time for i=1:Nr
        @views @. Et = E[i, :]   # copy the i-th row into a contiguous buffer
        A_mul_B!(St, P, Et)      # St = P * Et
        @. S[i, :] = St          # write the transformed row back
    end

    # GPU:
    E_gpu = CuArray(E)
    S_gpu = CuArray(S)
    P_gpu = plan_fft(CuArray(zeros(Complex64, Nt)))
    
    Et_gpu = CuArray(zeros(Complex64, Nt))
    St_gpu = CuArray(zeros(Complex64, Nt))
    @time for i=1:Nr
        # @views Et_gpu = E_gpu[i, :]   # A_mul_B! LoadError: don't know how to handle argument of type SubArray
        # Et_gpu[:] = E_gpu[i, :]   # even more allocations with a preallocated buffer
        Et_gpu = E_gpu[i, :]   # Bottleneck 1: allocates a new row copy on every iteration
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu   # Bottleneck 2: writes the result back row by row
    end
end


main()

On my computer the results are the following:

  0.009219 seconds
  3.814788 seconds (990.56 k allocations: 56.016 MiB, 0.26% gc time)

As you can see, the GPU version of the code suffers from a large number of memory allocations. As far as I understand, they come mainly from the lines commented with “Bottleneck 1” and “Bottleneck 2” (though the in-place A_mul_B! also allocates some memory). Unfortunately, A_mul_B! cannot work with views, so I cannot use them for “Bottleneck 1”. As for “Bottleneck 2”, I have no ideas at all: I cannot use element-wise operations there (as in the CPU case) because on the GPU they are too slow.
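
For reference, this is the kind of single batched call I would ideally like to write instead of the loop, using a plan along the second dimension so that every row is transformed in one call. It is only a sketch: I know plan_fft accepts a dims argument on the CPU, but I have not verified that the CuArrays CUFFT wrapper supports it on my setup.

using CuArrays

E_gpu = CuArray(rand(Complex64, (500, 2048)))
P_batch = plan_fft(E_gpu, 2)   # assumption: CUFFT plan along the 2nd dimension, i.e. one FFT per row
S_gpu = P_batch * E_gpu        # all 500 rows transformed in a single call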

Could you please suggest a workaround?

Alternatively, you can use ArrayFire.jl: it supports batched FFTs and can do a 1D FFT on every row of an N-D array.
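
I have not tested it on your exact case, so treat the following as a rough sketch from memory rather than working code (AFArray, fft and Array are the ArrayFire.jl names as I remember them; ArrayFire batches the 1D transform along the first dimension, hence the transposes):

using ArrayFire

E = rand(Complex64, (500, 2048))
E_af = AFArray(E.')    # transpose so that the rows become columns
S_af = fft(E_af)       # batched 1D FFT along the first dimension
S = Array(S_af).'      # copy back to the host and restore the row layout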

Thank you for the advice.
However, I do not want to install external libraries.

I think I have found a slightly better solution, which uses custom CUDA kernels:

using CuArrays
using CUDAnative
using CUDAdrv


function kernel1(b, i, a)
    # copy the i-th row of matrix a into vector b
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(b)
        b[j] = a[i, j]
    end
    return nothing
end


function kernel2(b, i, a)
    # copy vector a into the i-th row of matrix b
    j = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if j <= length(a)
        b[i, j] = a[j]
    end
    return nothing
end


function main()
    CuArrays.allowscalar(false)   # disable slow fallback methods

    Nr = 500
    Nt = 2048

    # CPU ---------------------------------------------------------------------
    E = rand(Complex64, (Nr, Nt))
    S = zeros(Complex64, (Nr, Nt))
    P = plan_fft(zeros(Complex64, Nt))
    
    Et = zeros(Complex64, Nt)
    St = zeros(Complex64, Nt)
    
    @time for i=1:Nr
        @views @. Et = E[i, :]
        A_mul_B!(St, P, Et)
        @. S[i, :] = St
    end

    # GPU ---------------------------------------------------------------------
    E_gpu = CuArrays.cu(E)
    S_gpu = CuArrays.cu(S)
    P_gpu = plan_fft(CuArrays.cuzeros(Complex64, Nt))
    
    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One warm-up run before timing, to exclude compilation and initialization:
    Et_gpu = E_gpu[1, :]
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    S_gpu[1, :] = St_gpu
    
    @time for i=1:Nr
        Et_gpu = E_gpu[i, :]
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        S_gpu[i, :] = St_gpu
    end
    
    # GPU with custom kernels -------------------------------------------------
    dev = CuDevice(0)
    MAX_THREADS = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)
    threads = min(Nt, MAX_THREADS)
    blocks = Int(ceil(Nt / threads))

    Et_gpu = CuArrays.cuzeros(Complex64, Nt)
    St_gpu = CuArrays.cuzeros(Complex64, Nt)
    
    # One warm-up run before timing, to exclude compilation and initialization:
    @cuda (blocks, threads) kernel1(Et_gpu, 1, E_gpu)
    A_mul_B!(St_gpu, P_gpu, Et_gpu)
    @cuda (blocks, threads) kernel2(S_gpu, 1, St_gpu)
    
    @time for i=1:Nr
        @cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        @cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
    end
end


main()

The results are the following:

  0.009238 seconds
  0.010801 seconds (41.03 k allocations: 1.274 MiB)
  0.005430 seconds (16.00 k allocations: 531.250 KiB)

The solution is not very elegant and still allocates some memory, but at least the GPU code is now faster than the CPU one.
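
One thing I am not sure about (an assumption on my side): both the @cuda launches and the plan application are asynchronous, so the timed loop may return before the GPU has actually finished the work. To be safe, I would synchronize the device inside the timed region, so the last loop above would become roughly:

@time begin
    for i=1:Nr
        @cuda (blocks, threads) kernel1(Et_gpu, i, E_gpu)
        A_mul_B!(St_gpu, P_gpu, Et_gpu)
        @cuda (blocks, threads) kernel2(S_gpu, i, St_gpu)
    end
    CUDAdrv.synchronize()   # wait for all queued GPU work before stopping the timer
end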