[ANN] RustFFT.jl: Compute forward and inverse FFTs with RustFFT

RustFFT is a high-performance, SIMD-accelerated FFT library written in pure Rust. It can compute FFTs of any size, including prime-number sizes, in O(nlogn) time. You can now use it from Julia!

Usage

Forward FFT:

using RustFFT

planner64 = RustFFT.FftPlanner64()
instance = RustFFT.plan_fft_forward(planner64, UInt(1))
data = complex([1.0])
RustFFT.fft!(instance, data)
@assert data[1] β‰ˆ 1.0

Inverse FFT:

using RustFFT

planner64 = RustFFT.FftPlanner64()
instance = RustFFT.plan_fft_inverse(planner64, UInt(1))
data = complex([1.0])
RustFFT.fft!(instance, data)
@assert data[1] β‰ˆ 1.0

Note that RustFFT does not normalize outputs:

Callers must manually normalize the results by scaling each element by 1/len().sqrt(). Multiple normalization steps can be merged into one via pairwise multiplication, so when doing a forward FFT followed by an inverse callers can normalize once by scaling each element by 1/len()

A few other limitations apply. It’s currently not possible to choose the specific algorithm that will be used to compute the transform. It’s also not possible to compute the FFT of an array with a rank not equal to 1. The interface provided by AbstractFFTs is not used yet, either.

Documentation
GitHub

10 Likes

It would be great if you added some performance comparison with other implementations in the eco system.

4 Likes

Good point! I’ve opened an issue for it

julia> using BenchmarkTools, RustFFT, FFTW, LinearAlgebra

julia> v = randn(ComplexF64, 1<<16);

julia> rust_planner64 = RustFFT.FftPlanner64();

julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));

julia> fftw_plan! = FFTW.plan_fft!(v);

julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  337.667 ΞΌs … 569.458 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     385.834 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   384.350 ΞΌs Β±  13.164 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                  β–β–†β–ˆβ–‡β–…β–
  β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–„β–‡β–‡β–‡β–†β–†β–…β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  338 ΞΌs           Histogram: frequency by time          423 ΞΌs <

 Memory estimate: 64 bytes, allocs estimate: 2.

julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  314.833 ΞΌs … 478.750 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     316.834 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   320.026 ΞΌs Β±   8.955 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–‚β–‡β–ˆβ–†β–ƒβ–…β–…β–„β–β–‚β–‚β–      ▁▁▁▁                                        β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–†β–‡β–ˆβ–‡β–‡β–†β–‡β–†β–†β–…β–†β–…β–…β–…β–…β–„β–…β–„β–…β–…β–„β–…β–„β–…β–† β–ˆ
  315 ΞΌs        Histogram: log(frequency) by time        357 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Seems to be in the same ballpark as FFTW, but RustFFT.fft! isn’t fully in-place, there are a couple of allocations.

Feedback: API could be simplified a bit, for example a method

RustFFT.plan_fft_forward(v::Vector{ComplexF32}) =
    RustFFT.plan_fft_forward(RustFFT.FftPlanner32(), UInt(length(v)))

RustFFT.plan_fft_forward(v::Vector{ComplexF64}) =
    RustFFT.plan_fft_forward(RustFFT.FftPlanner64(), UInt(length(v)))

or something like that.

Edit: for the record, my platform is

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 Γ— Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores
9 Likes

Some more β€˜difficult’ lengths would be interesting.

Just slightly off a power of 2:

julia> v = randn(ComplexF64, 1<<16 + 7);

julia> rust_planner64 = RustFFT.FftPlanner64();

julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));

julia> fftw_plan! = FFTW.plan_fft!(v);

julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 1487 samples with 1 evaluation.
 Range (min … max):  3.024 ms …  3.692 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.272 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.274 ms Β± 62.298 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                                   β–„β–†β–ˆβ–†
  β–‚β–ƒβ–„β–ƒβ–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–…β–„β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–ƒβ–‚β–β–‚β–β–‚ β–ƒ
  3.02 ms        Histogram: frequency by time        3.42 ms <

 Memory estimate: 64 bytes, allocs estimate: 2.

julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 2228 samples with 1 evaluation.
 Range (min … max):  2.101 ms …  3.082 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     2.130 ms              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   2.153 ms Β± 56.421 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–‡β–ˆ
  β–ƒβ–‡β–ˆβ–ˆβ–†β–„β–ƒβ–„β–„β–‚β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–„β–…β–†β–‡β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚ β–ƒ
  2.1 ms         Histogram: frequency by time         2.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Using a multiple of either (2,3,5,7), which is still an optimal choice for FFTW:

julia> v = randn(ComplexF64, nextprod((2,3,5,7), 1<<16+1));

julia> rust_planner64 = RustFFT.FftPlanner64();

julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));

julia> fftw_plan! = FFTW.plan_fft!(v);

julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 5304 samples with 1 evaluation.
 Range (min … max):  841.375 ΞΌs …  1.117 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     890.855 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   890.873 ΞΌs Β± 15.784 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                             β–„β–ˆβ–‡β–ˆβ–†β–ƒ
  β–β–‚β–‚β–‚β–‚β–β–β–β–β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–‚β–ƒβ–„β–…β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  841 ΞΌs          Histogram: frequency by time          941 ΞΌs <

 Memory estimate: 64 bytes, allocs estimate: 2.

julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 8733 samples with 1 evaluation.
 Range (min … max):  467.458 ΞΌs … 893.875 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     468.583 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   475.325 ΞΌs Β±  46.382 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ƒ                                                            ▁
  β–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–„β–…β–ƒβ–„β–„β–„β–„β–ƒβ–β–β–β–β–ƒβ–β–β–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–…β–†β–‡ β–ˆ
  467 ΞΌs        Histogram: log(frequency) by time        822 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.
1 Like

If you care about performance you should generally pass a planner flag to enable MEASURE or PATIENT mode. Also for 1d transforms FFTW may be faster for out-of-place with preallocated output.

3 Likes

I had tried PATIENT for the 2^16 vector and didn’t find any significant difference.

Side note, I did my benchmarks on aarch64 Darwin, I don’t know how much RustFFT is optimised on this platform, or if we need to turn on special flags in our build on Yggdrasil.

Julia’s FFTW build for aarch64 is missing planner support due to a missing configure flag if I recall correctly so that’s why MEASURE made no difference.

1 Like

On x86_64

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 Γ— Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, haswell)
  Threads: 1 on 8 virtual cores

Code:

using BenchmarkTools, RustFFT, FFTW, LinearAlgebra

function bench(N)
    v = randn(ComplexF64, N)
    rust_planner64 = RustFFT.FftPlanner64()
    rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)))
    fftw_plan! = FFTW.plan_fft!(copy(v); flags=FFTW.PATIENT)
    println("RustFFT:")
    display(@benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy($(v))) evals=1)
    println("FFTW:")
    display(@benchmark mul!(data, $(fftw_plan!), data) setup=(data=copy($(v))) evals=1)
end

Benchmarks:

julia> bench(1<<16)
RustFFT:
BenchmarkTools.Trial: 7155 samples with 1 evaluation.
 Range (min … max):  540.918 ΞΌs …  1.741 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     569.428 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   579.482 ΞΌs Β± 48.960 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

    β–„β–ˆβ–…β–„β–‡β–ƒβ–                                                     
  β–‚β–„β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–†β–…β–„β–ƒβ–ƒβ–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  541 ΞΌs          Histogram: frequency by time          801 ΞΌs <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 7151 samples with 1 evaluation.
 Range (min … max):  506.168 ΞΌs …  1.013 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     573.886 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   583.459 ΞΌs Β± 33.517 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                β–„β–ˆβ–†β–„β–                                           
  β–‚β–‚β–‚β–‚β–β–β–β–‚β–β–‚β–‚β–ƒβ–„β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–‡β–†β–‡β–ˆβ–†β–ˆβ–‡β–…β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  506 ΞΌs          Histogram: frequency by time          715 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> bench(1<<16 + 7)
RustFFT:
BenchmarkTools.Trial: 1458 samples with 1 evaluation.
 Range (min … max):  3.055 ms …   4.598 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.263 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.296 ms Β± 153.412 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  ▁▂▁       β–ƒβ–„β–†β–ˆβ–ˆβ–†β–…β–„β–„β–ƒβ–‚β–‚β–   ▁                                 ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–†β–‡β–‡β–…β–…β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–…β–…β–…β–„β–„β–„β–…β–…β–…β–†β–‡β–‡β–‡β–†β–β–„β–…β–β–β–„β–β–…β–„β–…β–„β–„β–„β–„ β–ˆ
  3.06 ms      Histogram: log(frequency) by time      3.91 ms <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 1297 samples with 1 evaluation.
 Range (min … max):  3.619 ms …   6.162 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     3.699 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   3.729 ms Β± 134.475 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–ˆβ–‡β–β–                                                   
  β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–…β–…β–„β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–β–β–‚β–β–‚β–‚β–β–β–‚β–ƒβ–‚β–ƒβ–β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–β–β–β–β–‚ β–ƒ
  3.62 ms         Histogram: frequency by time        4.14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> bench(nextprod((2, 3, 5, 7), 1<<16 + 1))
RustFFT:
BenchmarkTools.Trial: 6508 samples with 1 evaluation.
 Range (min … max):  615.043 ΞΌs …  1.369 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     650.530 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   656.128 ΞΌs Β± 36.265 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

           β–‚β–„β–ˆβ–…                                                 
  β–‚β–‚β–‚β–ƒβ–ƒβ–„β–„β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–„β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–β–β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  615 ΞΌs          Histogram: frequency by time          800 ΞΌs <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 5800 samples with 1 evaluation.
 Range (min … max):  710.186 ΞΌs …  1.572 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     730.624 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   763.972 ΞΌs Β± 79.095 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„β–ˆβ–†β–†β–„β–‚β–‚β–‚β–β–ƒβ–„β–„β–ƒβ–ƒβ–ƒβ–‚β–β–                                           β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–ˆβ–‡β–‡β–†β–†β–†β–†β–‡β–‡β–‡β–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–†β–†β–…β–…β–β–„β–ƒβ–ƒβ–„β–…β–„β–„β–…β–ƒβ–ƒβ–…β–ƒβ–†β–‡β–… β–ˆ
  710 ΞΌs        Histogram: log(frequency) by time      1.14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

Here RustFFT seems to beat FFTW on non-powers-of-2 sizes.

2 Likes

the main advantage is it’s not GPL?

1 Like

It should be, the library is always compiled with the release flag, and the neon feature should be enabled by default:

On AArch64, the neon feature enables compilation of Neon-accelerated code. This requires rustc 1.61 or newer, and is enabled by default. If this feature is disabled, rustc 1.37 or newer is required.

On other platforms than AArch64, this feature does nothing and RustFFT will behave like it is not set.

There is some additional overhead, in particular the array and planner are tracked when the FFT is computed to avoid creating multiple mutable references, the returned data is boxed and requires building the DataType. That should be a reasonably constant amount of overhead per call, though.

FFTW also seems a lot more general. It looks like RustFFT doesn’t currently have multidimensional transforms, real-input transforms, DCTs/DSTs, or multi-threading?

5 Likes

Yeah, RustFFT does not support those directly as far as I’m aware. There is RealFFT, written by a contributor to RustFFT, it might be nice add support for that as well.

For this first release I decided I wanted to only release a PoC to show that a binding library could be written using jlrs and that most of the necessary glue code could be generated automatically.

8 Likes