Optimization of differentiation using Fourier transforms

In my Bachelor's thesis I need to solve a nonlinear wave equation with the pseudospectral method, which relies on the basic property of the Fourier transform

\mathcal{F}\left(\frac{d^n f}{dx^n}\right) = (ik)^n \mathcal{F}(f)

where i is the imaginary unit and k is the wave number. I started by writing the code in Python, but since I need 2^{12} points in space and there are actually two coupled PDEs, I end up with 2^{13} coupled ODEs, so solving was very slow. While optimizing I found Julia, but no equivalent of the diff function in scipy.fftpack.
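For context, scipy.fftpack.diff does the whole k-space multiplication in a single call. A minimal usage sketch (the grid and variable names here are mine, not from my solver):

```python
import numpy as np
from scipy.fftpack import diff

# Periodic grid: endpoint excluded so the last point wraps around cleanly
period = 2 * np.pi
N = 64
x = np.linspace(0, period, N, endpoint=False)

u = np.sin(x)
du = diff(u, order=1, period=period)  # spectral first derivative of u
# du should match cos(x) to near machine precision on this grid
```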

My implementation of the diff function:

module DiffFFT

export diff, DiffPlan

using FFTW

@doc raw"""
    diff(v, n, period)

Differentiate a vector `v` that is assumed to have periodic boundary conditions, using the identity
```math
\mathcal{F}\left(\frac{\partial^n}{\partial x^n} v(x)\right) = (ik)^n \mathcal{F}(v(x)), \qquad
\frac{\partial^n}{\partial x^n} v(x) = \mathcal{F}^{-1}\left((ik)^n \mathcal{F}(v)\right)
```

- `v`: the vector to differentiate
- `n`: order of the derivative
- `period`: the period of the domain
"""
function diff(v, n, period)
    N = length(v)                   # number of points
    k = fftfreq(N, 2*pi/period)*N   # wave number
    d = ifft((1im*k).^n .* fft(v))  # calculate the derivative
    return real.(d)
end

struct DiffPlan
    N::Int64
    period::Float64
    k::Vector{Float64}
    fft::FFTW.cFFTWPlan
    ifft::FFTW.ScaledPlan

    function DiffPlan(N, period)
        new(N,
            period,
            fftfreq(N, 2π/period)*N,
            plan_fft(collect(Float64, 1:N), flags=FFTW.PATIENT),
            plan_ifft(fft(collect(Float64, 1:N)), flags=FFTW.PATIENT)
        )
    end
end

function diff(plan::DiffPlan, v, n)
    return real.(plan.ifft*((1im*plan.k).^n .* (plan.fft*v)))
end
function diff(d, plan::DiffPlan, v, n)
    d[:] = real.(plan.ifft*((1im*plan.k).^n .* (plan.fft*v)))
    nothing
end

end # module DiffFFT

The speed is in the same range as calling scipy’s diff from Julia. The initial conditions I used:

N = 2^12
period = 48π
X = collect(range(-period/2, period/2, N))
U = sech.(0.2.*(X)).^2
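As a quick correctness check, the analytic derivative of this profile is d/dx sech(0.2x)² = −0.4 sech(0.2x)² tanh(0.2x). The sketch below uses an inline copy of the spectral derivative (named `specdiff` here to avoid clashing with `Base.diff`), and drops the duplicated endpoint so the grid is exactly periodic; note that `collect(range(-period/2, period/2, N))` above includes both ±period/2:

```julia
using FFTW

# Inline copy of the spectral derivative from the post, for a standalone check
specdiff(v, n, period) =
    real.(ifft((im .* fftfreq(length(v), 2π / period) .* length(v)) .^ n .* fft(v)))

N, period = 2^12, 48π
X = range(-period / 2, period / 2, length = N + 1)[1:N]  # drop duplicate endpoint
U = sech.(0.2 .* X) .^ 2

dU = specdiff(U, 1, period)
dU_exact = @. -0.4 * sech(0.2 * X)^2 * tanh(0.2 * X)
# maximum(abs.(dU .- dU_exact)) should sit near machine precision
```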

Taking the first derivative with both methods:

plan = DiffPlan(N, period)
@benchmark DiffFFT.diff(plan, U, 1)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):   67.700 μs …   4.218 ms  ┊ GC (min … max):  0.00% … 94.62%
#  Time  (median):      92.100 μs               ┊ GC (median):     0.00%
#  Time  (mean ± σ):   147.815 μs ± 194.474 μs  ┊ GC (mean ± σ):  10.39% ±  7.96%
#  Memory estimate: 352.41 KiB, allocs estimate: 17.

fftpack = pyimport("scipy.fftpack")
@benchmark fftpack.diff(U, 1, period)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):   84.600 μs …  11.058 ms  ┊ GC (min … max): 0.00% … 51.68%
#  Time  (median):     117.300 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   127.163 μs ± 251.314 μs  ┊ GC (mean ± σ):  2.28% ±  1.15%
#  Memory estimate: 33.67 KiB, allocs estimate: 42.

However, the fft operation itself outperforms scipy’s by a factor of 3, so I wonder if it could be made even faster?

fftplan = plan_fft(U, flags=FFTW.PATIENT)
@benchmark fftplan*U
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  21.600 μs …   4.617 ms  ┊ GC (min … max):  0.00% … 97.69%
#  Time  (median):     30.400 μs               ┊ GC (median):     0.00%
#  Time  (mean ± σ):   44.879 μs ± 119.376 μs  ┊ GC (mean ± σ):  12.13% ±  5.00%
#  Memory estimate: 128.09 KiB, allocs estimate: 4.

spfft = pyimport("scipy.fft")
@benchmark spfft.fft(U)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):   66.700 μs …  12.680 ms  ┊ GC (min … max): 0.00% … 17.22%
#  Time  (median):     108.700 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   124.106 μs ± 295.472 μs  ┊ GC (mean ± σ):  2.99% ±  1.32%
#  Memory estimate: 65.48 KiB, allocs estimate: 35.

I am aware of the @code_warntype macro, which does show that my diff function has some red Any types the compiler can’t resolve, but I am not competent enough to fix these problems yet.
The optimization is not crucial for my thesis; the Python code works well. This is mostly out of curiosity and an opportunity for me to learn about Julia.


You are using abstract types such as Real or Integer. You should use concrete types such as `Float32` or `Int64` for efficiency. One way to do that and keep your code generic is to use type parameters.
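To make that concrete, here is a sketch of how type parameters could look for the plan struct (the names are mine, not the original code): the fields take on the concrete types of whatever plan_fft/plan_ifft return, so field access stays type-stable without naming FFTW's internal plan types.

```julia
using FFTW

# Sketch: F and B are filled in with concrete plan types at construction,
# so `p.fwd * v` can be fully inferred by the compiler.
struct TypedDiffPlan{F,B}
    k::Vector{Float64}
    fwd::F   # concrete forward-plan type, captured at construction
    bwd::B   # concrete inverse-plan type
end

function TypedDiffPlan(N::Integer, period::Real)
    buf = zeros(ComplexF64, N)
    TypedDiffPlan(fftfreq(N, 2π / period) * N, plan_fft(buf), plan_ifft(buf))
end
```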


Thanks for the suggestion, I changed my structure to

...
    N::Int64
    period::Float64
    k::Vector{Float64}
...

But the performance is very similar

@benchmark DiffFFT.diff(U, 1, period)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  112.000 μs …   5.906 ms  ┊ GC (min … max): 0.00% … 81.28%
#  Time  (median):     241.100 μs               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   231.927 μs ± 273.673 μs  ┊ GC (mean ± σ):  7.47% ±  6.25%
#  Memory estimate: 288.83 KiB, allocs estimate: 18.

Welcome to the community and congrats on the very nice MWE – I like it 🙂

Your diff functions generate a lot of temporary variables which could be avoided. E.g. consider

d = ifft((1im*k).^n .* fft(v))  # calculate the derivative

This allocates three new vectors:

  1. when calling fft(v)
  2. when multiplying out (1im*k).^n .* fft(v)
  3. when calling ifft(...)

You could avoid this by using in-place transforms like fft! and ifft!. E.g.

d .= v # copy data into d
fft!(d)
@. d *= (1im*k)^n
ifft!(d)

The same also goes for your usage of plans: applying a plan with plan * v allocates a new vector. You can work around this with mul!(d, plan, v) (I think).
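Combining both ideas, a fully preallocated version might look like the sketch below. All names here (make_buffers, diff_noalloc!) are made up for illustration, and mul! comes from LinearAlgebra:

```julia
using FFTW, LinearAlgebra

# Sketch: preallocate two complex work buffers once and reuse them on every call.
make_buffers(N) = (zeros(ComplexF64, N), zeros(ComplexF64, N))

function diff_noalloc!(d, fwd, bwd, k, buf1, buf2, v, n)
    copyto!(buf1, v)           # real input into the complex work buffer
    mul!(buf2, fwd, buf1)      # buf2 = fft(v), written in place, no allocation
    @. buf2 *= (im * k)^n      # multiply by (ik)^n in k-space
    mul!(buf1, bwd, buf2)      # buf1 = ifft(...), again no allocation
    @. d = real(buf1)          # keep the real part
    return d
end
```

With fwd = plan_fft(buf), bwd = plan_ifft(buf), and a preallocated output d, repeated calls should report essentially zero allocations.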


Furthermore, you can also apply an algorithmic improvement to speed things up.
When computing derivatives with a DFT to solve PDEs one is often only interested in the real part of the result, which I guess is the case for you since you use real in some places.

You can look up the functions FFTW.r2r and FFTW.r2r! to see how to use them appropriately and save roughly a factor of two: API · FFTW.jl
To apply them correctly you need to decide on the kind of real transform you want, and this is determined by the boundary conditions of the solution you want to approximate. This can be a bit tricky to figure out at first, because in general one needs a different kind for the forward and backward transform.
The relevant options are described on this page: Real even/odd DFTs (cosine/sine transforms) (FFTW 3.3.10).
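If the r2r machinery feels heavy at first, the real-input transforms rfft/irfft already exploit realness for periodic data and give roughly the same factor-of-two saving. A sketch (my own function name; for odd n the Nyquist mode would conventionally be zeroed, which I skip here for brevity):

```julia
using FFTW

# Sketch: spectral derivative of real periodic data via the real-input FFT.
# rfft stores only the non-negative frequencies, so the transformed arrays
# are about half the size of the complex-fft version.
function rdiff(v::AbstractVector{<:Real}, n::Integer, period::Real)
    N = length(v)
    k = rfftfreq(N, 2π * N / period)   # wave numbers 2π*m/period, m = 0..N÷2
    V = rfft(v)
    @. V *= (im * k)^n                 # multiply by (ik)^n in k-space
    return irfft(V, N)                 # back to a real vector of length N
end
```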


What was the purpose of the plan line? The benchmark only tests the diff(v, n, period) method that doesn’t use DiffPlan at all. That would be why the performance didn’t change when you edited DiffPlan’s field types.


You are completely right, thank you for noticing! I copied the wrong line of code, which did not use the plans. I have edited my original post and rephrased the problem.

I have had this issue in mind for many months now and wonder about it from time to time.

The diff version using plans is as fast as calling scipy’s implementation

Specifying the types actually improved the speed by a factor of 2. I re-ran the test using the correct diff function, the one with a precomputed plan, thanks to @Benny who pointed that out.

The new benchmark is in the original edited post. Here is the benchmark for using the old version with abstract types

@benchmark DiffFFT.diff(plan, U, 1)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation.
#  Range (min … max):  199.300 μs …   5.045 ms  ┊ GC (min … max):  0.00% … 91.02%
#  Time  (median):     224.000 μs               ┊ GC (median):     0.00%
#  Time  (mean ± σ):   304.339 μs ± 336.109 μs  ┊ GC (mean ± σ):  11.94% ± 10.18%
#  Memory estimate: 608.73 KiB, allocs estimate: 8217.

Oh, I should also point out: when you use BenchmarkTools, generally put $ in front of arguments taken from global scope. Interpolation prevents the type instability of untyped globals (and, in the other direction, unwanted constant folding), both of which can throw off the timings.
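The difference is easy to see on a toy case (a sketch; exact numbers vary by machine):

```julia
using BenchmarkTools

v = rand(1024)   # a non-const global: type-unstable inside the benchmark body

# Without interpolation, `v` is resolved as an untyped global on every call:
b_plain = @benchmark sum(v)

# With $v, the value is spliced in as a concretely typed local:
b_interp = @benchmark sum($v)
```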

At this point it’s apparent that you measure smaller times when you build the plans before benchmarking. That makes the comparison unfair to scipy, because plan reuse is still not implemented on that side. It’s still fast despite that, so it must be doing its own optimizations: when I looked into the diff implementation, it performs a convolution based on r2r_fftpack, not your fft-ifft algorithm. No idea what the logic there is. If that source code helps you implement an unplanned diff in Julia better, go for it, but otherwise enjoy the precomputed plans in Julia.
