CUDA backward real complex FFT

FerreolS · May 23, 2026, 7:42am

I’m trying to perform a convolution using CUDA fft and something puzzle me when comparing CPU and GPU result:

using Adapt, CUDA, FFTW, AbstractFFTs
T = Float32
sz = (2048,2048)
x = randn(T,sz);
x_gpu = adapt(CuArray,x);
d = randn(Complex{T},(1025,2048))
d_gpu = adapt(CuArray,d)

GPU and CPU output are not equal when the backward brfft gives a (2048,2048) output but they are equal for a (2049,2048) output:


julia> adapt(Array, brfft(d_gpu .* rfft(x_gpu), 2048)) ≈ brfft(d .* rfft(x),2048)
false

julia> adapt(Array, brfft(d_gpu .* rfft(x_gpu), 2049)) ≈ brfft(d .* rfft(x),2049)
true

julia> adapt(Array, brfft(rfft(x_gpu), 2048)) ≈ brfft(rfft(x),2048)
true

julia> adapt(Array, d_gpu) ≈ d
true

julia> adapt(Array, d_gpu .* rfft(x_gpu)) ≈ (d .* rfft(x))
true

and it is not just rounding error:

julia> adapt(Array, brfft(d_gpu .* rfft(x_gpu), 2048)) .- brfft(d .* rfft(x),2048)
2048×2048 Matrix{Float32}:
   -8465.38        13274.8  -119964.0             1.24876f5   7778.25       -1.24958f5  …   33341.4  18242.2  -75873.0  -30010.5   -1052.38  228162.0             -1.99176f5
      -1.00414f5  -51067.1        1.1154f5   -54599.9        25838.5    -96901.0           -18502.0  26938.0  -22623.0  -31004.0  -36856.9    -5554.5          48698.5
   -8464.75        13277.2       -1.19968f5       1.24872f5   7778.88  -124956.0            33340.8  18242.8  -75874.0  -30015.0   -1050.0        2.28162f5       -1.99173f5
      -1.00412f5  -51064.5        1.11541f5  -54601.5        25842.0    -96899.5           -18504.0  26935.0  -22622.5  -31004.0  -36852.8    -5557.5          48695.5
   -8466.75        13277.8       -1.19967f5       1.24874f5   7777.0        -1.24954f5      33341.5  18242.5  -75875.1  -30013.0   -1050.0        2.2816f5   -199175.0
      -1.00414f5  -51066.6   111541.0        -54601.5        25840.0    -96897.4        …  -18502.2  26937.6  -22622.8  -31005.4  -36857.1    -5553.75         48695.0
       ⋮                                                                     ⋮          ⋱                                              ⋮                     
   -8465.0         13274.6       -1.19964f5       1.24875f5   7781.88       -1.24954f5      33339.8  18240.2  -75877.8  -30018.4   -1054.5        2.2816f5        -1.99176f5
 -100414.0        -51064.0        1.11544f5  -54599.1        25837.2    -96902.5           -18507.0  26936.2  -22623.2  -31008.0  -36857.5    -5552.5          48692.8
   -8463.5         13273.8       -1.19966f5       1.24875f5   7776.12       -1.24952f5      33341.0  18241.5  -75874.0  -30012.0   -1052.75       2.28162f5  -199176.0
      -1.00414f5  -51064.0        1.11542f5  -54601.0        25840.9    -96903.0        …  -18503.0  26940.1  -22624.0  -31004.8  -36853.8    -5556.0          48693.8
   -8462.0         13276.5  -119968.0             1.24873f5   7778.75       -1.24956f5      33337.2  18241.0  -75875.8  -30011.4   -1052.75       2.28161f5       -1.99174f5
 -100413.0        -51062.0        1.1154f5   -54601.5        25843.5    -96899.5           -18498.5  26935.5  -22623.8  -31006.8  -36856.8    -5554.5          48695.5

I can’t see where is my issue. Any hint?

stevengj · May 23, 2026, 1:54pm

The input to a complex-to-real FFT is not arbitrary — it has some conjugate-symmetries implied by the real-to-complex FFT (because the output has slightly more data than the input). When you give it an input that does not have these symmetries, the FFT algorithm performs some kind of projection, but exactly what projection it performs might differ between implementations.

Try setting d = rfft(randn(T, sz)) to ensure that d has the correct symmetries.

(For example, if you have a 1d real array x of length 8 and perform an rfft, the output is a complex array z of length 5 in which the first and last elements are purely real. If you perform a brfft(z, 8), it expects z to be a length-5 array with the first and last elements purely real. If you give it a randn(ComplexF64, 5) array, most likely it just ignores the imaginary parts of the first and last elements, but this may be implementation-dependent. For multi-dimensional transforms, the “edges” of the array have some more complicated conjugate symmetries.)

Topic		Replies	Views
CUDA fft against CPU fft GPU	4	2844	March 19, 2019
Fastest Julia implementation for a cyclic convolution of real vectors Performance fftw , fft	17	620	December 12, 2024
ANN: InplaceRealFFT.jl : inplace real-to-complex and complex-to-real FFT Numerics fftw	25	3526	January 5, 2018
CuPy CuFFT ~2x faster than CUDA.jl CuFFT GPU performance , cuda , fft	15	3095	February 27, 2023
Unreasonably fast FFT on CUDA Performance question	8	1655	February 3, 2024

CUDA backward real complex FFT

Related topics