CUDA fft against CPU fft

Ferran_Mazzanti · March 19, 2019, 9:53am

Dear all,

in my attempts to play with CUDA in Julia, I’ve come accross something I can’t really understand -hopefully because I’m doing something wrong. The fact is that in my calculations I need to perform Fourier transforms, which I do wiht the fft() function. But sadly I find that the result of performing the fft() on the CPU, and on the same array transferred to the GPU, is different. My code is simply

using CuArrays 
using CUDAnative 
using CUDAdrv

using FFTW

N = 64;
A = rand(Float32,N,N,N);
B = fft(A);
Ad = cu(A);
Bd = fft(Ad);
BB = Array(Bd);

maximum(abs.(B-BB))
> 0.015625f0

…and the difference worsens with increasing N. For instance with N=256 I get a difference (in a single run) of 0.2275149f0. Of course the value of the FT also grows, so at this point I see small differences that are neither negligible nor impossible to live with.
So is this an expected behaviour? Or am I doing something weird?

Best regards,

Ferran.

maleadt · March 19, 2019, 10:19am

The values at those indices is probably huge:

julia> findmax(abs.(B-BB))
(1.0f0, CartesianIndex(1, 1, 1))

julia> (B-BB)[1,1,1]
-1.0f0 + 0.0f0im

julia> B[1,1,1]
8.388863f6 + 0.0f0im

julia> BB[1,1,1]
8.388864f6 + 0.0f0im

Relative accuracy seems fine:

julia> isapprox(B, BB)
true

The FFT stuff is tested for accuracy: CuArrays.jl/fft.jl at master · JuliaGPU/CuArrays.jl · GitHub
What kind of accuracy do you expect?

Ferran_Mazzanti · March 19, 2019, 1:39pm

Hi,
the question you pose me is an interesting one
The FT is part of an iterated solver for an integro-differential equation… so in short, I don’t think it is going to be a problem. I was just curious to notice the differences between the fft() in CPU and the fft() in GPU, both performed on numbers with the same precision (float32). Naively I would have thought that the precision would be the same…
Best,
Ferran.

alandion · March 19, 2019, 4:38pm

2 things :

are you sure the CPU version is more precise? Did you try running using float64, for instance, to compare to the float32 results?
One should not expect identical answer between CPU and GPU, even though they both use ieee floating point. The reason is that the order in which operations are performed, which is not even well-defined in a parallel environment, will change the results. Floating point addition is not associative

if relative differences get much bigger than ε√N, with ε the machine precision, then start to worry

stevengj · March 19, 2019, 6:50pm

Actually, a properly implemented FFT has errors that grow far more slowly than this, proportional to ε√log(N).

But in any case the key here is to look at the relative errors, not absolute errors. I would compare to the same transform performed in double precision (which you can treat as the “correct” result), e.g. try looking at:

B = fft(A)
Bc = Array(fft(cu(A))
Bd = fft(Float64.(A))
norm(B - Bd) / norm(Bd),  norm(Bc - Bd) / norm(Bd)

i.e. the relative (L2 norm) errors for both FFTs. If they are not both similar in magnitude to eps(Float32) (≈ 1.2e-7) then something is wrong.

Topic		Replies	Views
Different results when running on CPU or GPU General Usage cuda	6	2248	June 28, 2020
Is this a fair GPU benchmark? GPU	1	1143	June 1, 2019
Using CUDA fft General Usage	1	1170	March 13, 2019
Calculate FFT on GPU for every row of a 2D array Performance gpu	2	1042	August 14, 2018
Parallel FFT not that much faster General Usage fftw , parallel , multithreading	10	7085	December 19, 2016

CUDA fft against CPU fft

Related topics