CUDA fft against CPU fft

Dear all,

in my attempts to play with CUDA in Julia, I’ve come accross something I can’t really understand -hopefully because I’m doing something wrong. The fact is that in my calculations I need to perform Fourier transforms, which I do wiht the fft() function. But sadly I find that the result of performing the fft() on the CPU, and on the same array transferred to the GPU, is different. My code is simply

using CuArrays 
using CUDAnative 
using CUDAdrv

using FFTW

N = 64;
A = rand(Float32,N,N,N);
B = fft(A);
Ad = cu(A);
Bd = fft(Ad);
BB = Array(Bd);

maximum(abs.(B-BB))
> 0.015625f0

…and the difference worsens with increasing N. For instance with N=256 I get a difference (in a single run) of 0.2275149f0. Of course the value of the FT also grows, so at this point I see small differences that are neither negligible nor impossible to live with.
So is this an expected behaviour? Or am I doing something weird?

Best regards,

Ferran.

The values at those indices is probably huge:

julia> findmax(abs.(B-BB))
(1.0f0, CartesianIndex(1, 1, 1))

julia> (B-BB)[1,1,1]
-1.0f0 + 0.0f0im

julia> B[1,1,1]
8.388863f6 + 0.0f0im

julia> BB[1,1,1]
8.388864f6 + 0.0f0im

Relative accuracy seems fine:

julia> isapprox(B, BB)
true

The FFT stuff is tested for accuracy: CuArrays.jl/fft.jl at master · JuliaGPU/CuArrays.jl · GitHub
What kind of accuracy do you expect?

Hi,
the question you pose me is an interesting one :slight_smile:
The FT is part of an iterated solver for an integro-differential equation… so in short, I don’t think it is going to be a problem. I was just curious to notice the differences between the fft() in CPU and the fft() in GPU, both performed on numbers with the same precision (float32). Naively I would have thought that the precision would be the same…
Best,
Ferran.

2 things :

  1. are you sure the CPU version is more precise? Did you try running using float64, for instance, to compare to the float32 results?
  2. One should not expect identical answer between CPU and GPU, even though they both use ieee floating point. The reason is that the order in which operations are performed, which is not even well-defined in a parallel environment, will change the results. Floating point addition is not associative

if relative differences get much bigger than ε√N, with ε the machine precision, then start to worry

Actually, a properly implemented FFT has errors that grow far more slowly than this, proportional to ε√log(N).

But in any case the key here is to look at the relative errors, not absolute errors. I would compare to the same transform performed in double precision (which you can treat as the “correct” result), e.g. try looking at:

B = fft(A)
Bc = Array(fft(cu(A))
Bd = fft(Float64.(A))
norm(B - Bd) / norm(Bd),  norm(Bc - Bd) / norm(Bd)

i.e. the relative (L2 norm) errors for both FFTs. If they are not both similar in magnitude to eps(Float32) (≈ 1.2e-7) then something is wrong.

1 Like