RustFFT is a high-performance, SIMD-accelerated FFT library written in pure Rust. It can compute FFTs of any size, including prime-number sizes, in O(n log n) time. You can now use it from Julia!
Usage
Forward FFT:
using RustFFT
planner64 = RustFFT.FftPlanner64()
instance = RustFFT.plan_fft_forward(planner64, UInt(1))
data = complex([1.0])
RustFFT.fft!(instance, data)
@assert data[1] ≈ 1.0
Inverse FFT:
using RustFFT
planner64 = RustFFT.FftPlanner64()
instance = RustFFT.plan_fft_inverse(planner64, UInt(1))
data = complex([1.0])
RustFFT.fft!(instance, data)
@assert data[1] ≈ 1.0
Note that RustFFT does not normalize outputs:
Callers must manually normalize the results by scaling each element by 1/len().sqrt(). Multiple normalization steps can be merged into one via pairwise multiplication, so when doing a forward FFT followed by an inverse FFT, callers can normalize once by scaling each element by 1/len().
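For example, here is a minimal round-trip sketch based on the usage above (the prime length 17 and the variable names are just for illustration), normalizing once at the end by 1/length:
using RustFFT
planner64 = RustFFT.FftPlanner64()
n = UInt(17)                        # prime-sized transform
fwd = RustFFT.plan_fft_forward(planner64, n)
inv = RustFFT.plan_fft_inverse(planner64, n)
data = complex(randn(17))
original = copy(data)
RustFFT.fft!(fwd, data)             # unnormalized forward FFT
RustFFT.fft!(inv, data)             # unnormalized inverse FFT
data ./= length(data)               # single merged normalization by 1/len()
@assert data ≈ original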
A few other limitations apply. It's currently not possible to choose the specific algorithm that will be used to compute the transform. It's also not possible to compute the FFT of an array with a rank other than 1. The interface provided by AbstractFFTs is not used yet, either.
Documentation
GitHub
It would be great if you added some performance comparison with other implementations in the ecosystem.
Good point! I've opened an issue for it.
julia> using BenchmarkTools, RustFFT, FFTW, LinearAlgebra
julia> v = randn(ComplexF64, 1<<16);
julia> rust_planner64 = RustFFT.FftPlanner64();
julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));
julia> fftw_plan! = FFTW.plan_fft!(v);
julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  337.667 μs … 569.458 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     385.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   384.350 μs ±  13.164 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  338 μs          Histogram: frequency by time          423 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.
julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  314.833 μs … 478.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     316.834 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   320.026 μs ±   8.955 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  315 μs          Histogram: log(frequency) by time          357 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Seems to be in the same ballpark as FFTW, but RustFFT.fft! isn't fully in-place; there are a couple of allocations.
Feedback: the API could be simplified a bit, for example with methods like
RustFFT.plan_fft_forward(v::Vector{ComplexF32}) =
RustFFT.plan_fft_forward(RustFFT.FftPlanner32(), UInt(length(v)))
RustFFT.plan_fft_forward(v::Vector{ComplexF64}) =
RustFFT.plan_fft_forward(RustFFT.FftPlanner64(), UInt(length(v)))
or something like that.
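A hypothetical call with such a convenience method defined (not part of the current RustFFT.jl API):
v = randn(ComplexF64, 1024)
instance = RustFFT.plan_fft_forward(v)  # one-argument method suggested above
RustFFT.fft!(instance, v)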
Edit: for the record, my platform is
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 1 on 4 virtual cores
DNF
May 29, 2023, 7:36pm
Some more "difficult" lengths would be interesting.
Just slightly off a power of 2:
julia> v = randn(ComplexF64, 1<<16 + 7);
julia> rust_planner64 = RustFFT.FftPlanner64();
julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));
julia> fftw_plan! = FFTW.plan_fft!(v);
julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 1487 samples with 1 evaluation.
 Range (min … max):  3.024 ms … 3.692 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.272 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.274 ms ± 62.298 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.02 ms          Histogram: frequency by time          3.42 ms <

 Memory estimate: 64 bytes, allocs estimate: 2.
julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 2228 samples with 1 evaluation.
 Range (min … max):  2.101 ms … 3.082 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.130 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.153 ms ± 56.421 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  2.1 ms          Histogram: frequency by time          2.3 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
Using a length that is a product of small primes (2, 3, 5, 7), which is still an optimal choice for FFTW:
julia> v = randn(ComplexF64, nextprod((2,3,5,7), 1<<16+1));
julia> rust_planner64 = RustFFT.FftPlanner64();
julia> rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)));
julia> fftw_plan! = FFTW.plan_fft!(v);
julia> @benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 5304 samples with 1 evaluation.
 Range (min … max):  841.375 μs … 1.117 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     890.855 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   890.873 μs ±  15.784 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  841 μs          Histogram: frequency by time          941 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.
julia> @benchmark mul!(data, fftw_plan!, data) setup=(data=copy(v)) evals=1
BenchmarkTools.Trial: 8733 samples with 1 evaluation.
 Range (min … max):  467.458 μs … 893.875 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     468.583 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   475.325 μs ±  46.382 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  467 μs          Histogram: log(frequency) by time          822 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
giordano:
FFTW.plan_fft!(v);
If you care about performance you should generally pass a planner flag to enable MEASURE or PATIENT mode. Also for 1d transforms FFTW may be faster for out-of-place with preallocated output.
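For instance, a sketch of both suggestions using the standard FFTW.jl API (the variable names are illustrative):
using FFTW, LinearAlgebra

v = randn(ComplexF64, 1 << 16)
out = similar(v)

# In-place plan with a more thorough planner mode.
p_inplace = FFTW.plan_fft!(copy(v); flags=FFTW.MEASURE)
p_inplace * copy(v)

# Out-of-place plan writing into a preallocated output buffer.
p_oop = FFTW.plan_fft(v; flags=FFTW.PATIENT)
mul!(out, p_oop, v)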
I had tried PATIENT for the 2^16 vector and didn't find any significant difference.
Side note: I did my benchmarks on aarch64 Darwin; I don't know how much RustFFT is optimised on this platform, or whether we need to turn on special flags in our build on Yggdrasil.
Julia's FFTW build for aarch64 is missing planner support due to a missing configure flag, if I recall correctly, so that's why MEASURE made no difference.
On x86_64
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, haswell)
Threads: 1 on 8 virtual cores
Code:
using BenchmarkTools, RustFFT, FFTW, LinearAlgebra
function bench(N)
    v = randn(ComplexF64, N)
    rust_planner64 = RustFFT.FftPlanner64()
    rust_instance = RustFFT.plan_fft_forward(rust_planner64, UInt(length(v)))
    fftw_plan! = FFTW.plan_fft!(copy(v); flags=FFTW.PATIENT)
    println("RustFFT:")
    display(@benchmark RustFFT.fft!($(rust_instance), data) setup=(data=copy($(v))) evals=1)
    println("FFTW:")
    display(@benchmark mul!(data, $(fftw_plan!), data) setup=(data=copy($(v))) evals=1)
end
Benchmarks:
julia> bench(1<<16)
RustFFT:
BenchmarkTools.Trial: 7155 samples with 1 evaluation.
 Range (min … max):  540.918 μs … 1.741 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     569.428 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   579.482 μs ±  48.960 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  541 μs          Histogram: frequency by time          801 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 7151 samples with 1 evaluation.
 Range (min … max):  506.168 μs … 1.013 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     573.886 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   583.459 μs ±  33.517 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  506 μs          Histogram: frequency by time          715 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> bench(1<<16 + 7)
RustFFT:
BenchmarkTools.Trial: 1458 samples with 1 evaluation.
 Range (min … max):  3.055 ms … 4.598 ms     ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.263 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.296 ms ± 153.412 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.06 ms         Histogram: log(frequency) by time         3.91 ms <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 1297 samples with 1 evaluation.
 Range (min … max):  3.619 ms … 6.162 ms     ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.699 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.729 ms ± 134.475 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  3.62 ms          Histogram: frequency by time          4.14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
julia> bench(nextprod((2, 3, 5, 7), 1<<16 + 1))
RustFFT:
BenchmarkTools.Trial: 6508 samples with 1 evaluation.
 Range (min … max):  615.043 μs … 1.369 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     650.530 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   656.128 μs ±  36.265 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  615 μs          Histogram: frequency by time          800 μs <

 Memory estimate: 64 bytes, allocs estimate: 2.
FFTW:
BenchmarkTools.Trial: 5800 samples with 1 evaluation.
 Range (min … max):  710.186 μs … 1.572 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     730.624 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   763.972 μs ±  79.095 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  710 μs          Histogram: log(frequency) by time          1.14 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
Here RustFFT seems to beat FFTW on non-power-of-2 sizes.
jling
May 29, 2023, 10:06pm
the main advantage is it's not GPL?
It should be; the library is always compiled with the release flag, and the neon feature should be enabled by default:
On AArch64, the neon feature enables compilation of Neon-accelerated code. This requires rustc 1.61 or newer, and is enabled by default. If this feature is disabled, rustc 1.37 or newer is required. On other platforms than AArch64, this feature does nothing and RustFFT will behave like it is not set.
There is some additional overhead: in particular, the array and planner are tracked when the FFT is computed to avoid creating multiple mutable references, and the returned data is boxed and requires building the DataType. That should be a reasonably constant amount of overhead per call, though.
FFTW also seems a lot more general. It looks like RustFFT doesn't currently have multidimensional transforms, real-input transforms, DCTs/DSTs, or multi-threading?
Yeah, RustFFT does not support those directly as far as I'm aware. There is RealFFT, written by a contributor to RustFFT; it might be nice to add support for that as well.
For this first release I decided I wanted to only release a PoC to show that a binding library could be written using jlrs and that most of the necessary glue code could be generated automatically.