Fastest Julia implementation for a cyclic convolution of real vectors

Hi,

I am looking for the fastest way to implement the cyclic convolution of two real vectors. In other words, given a positive integer n (not necessarily a power of two), I want to multiply two degree-(n-1) polynomials a(x) and b(x) modulo x^n - 1.

Currently I am using the FFTW.jl package for this. My approach is as follows:

  1. If n is a product of powers of 2, 3, 5, and 7 (sizes that I believe are quite well optimized in the FFTW library), I compute the real FFT (rfft) of length n for both a(x) and b(x), multiply the results pointwise, and apply the inverse real FFT (irfft) to obtain a(x) * b(x) mod x^n - 1 directly.

  2. Otherwise, I pad the vectors to length m, the smallest product of powers of 2, 3, 5, and 7 that is at least 2n - 1. Then I compute the real FFT of length m for the padded a(x) and b(x), multiply pointwise, and apply the inverse real FFT (irfft) to obtain the full product a(x) * b(x). Finally, I reduce the result modulo x^n - 1. (A sketch of both cases follows the list.)
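
For concreteness, here is a minimal sketch of this approach (the helper names are mine, and it uses the un-planned rfft/irfft interface for brevity):

using FFTW

# true if n factors entirely into 2, 3, 5, and 7
function issmooth7(n)
    for p in (2, 3, 5, 7)
        while n % p == 0
            n ÷= p
        end
    end
    return n == 1
end

# smallest such "7-smooth" integer >= m
function nextsmooth7(m)
    while !issmooth7(m)
        m += 1
    end
    return m
end

function cyclic_conv(a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    if issmooth7(n)
        # case 1: convolve directly at length n
        return irfft(rfft(a) .* rfft(b), n)
    else
        # case 2: zero-pad to a 7-smooth length m >= 2n - 1 ...
        m = nextsmooth7(2n - 1)
        c = irfft(rfft([a; zeros(m - n)]) .* rfft([b; zeros(m - n)]), m)
        # ... and reduce modulo x^n - 1 by wrapping the tail back around
        return c[1:n] .+ [c[n+1:2n-1]; 0.0]
    end
end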

However, I would like to optimize this further. Is there a more efficient method or a more performant library that provides the real FFT (or possibly the convolution directly)?

Thanks in advance.

1 Like

How big is n? If it’s less than 100 or so, you may have better luck just doing the multiplication and reduction the obvious way.
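
The β€œobvious way” here might look something like this O(n^2) sketch (for reference only; the function name is mine):

function cyclic_conv_direct(a, b)
    n = length(a)
    c = zeros(promote_type(eltype(a), eltype(b)), n)
    # c[k] collects every a[i]*b[j] with (i-1)+(j-1) congruent to k-1 (mod n)
    for i in 1:n, j in 1:n
        c[mod1(i + j - 1, n)] += a[i] * b[j]
    end
    return c
end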

It depends, but usually between 1024 and 4096. It is also worth mentioning that I perform a substantial number of convolutions of the same length n.

In that case, make sure you create a plan that you re-use for all convolutions. You may also benefit from performing the transformations into a pre-allocated buffer.

1 Like

Wait, if you’re doing cyclic convolution, why do you need to pad past 2n-1? Isn’t the smallest such product greater than or equal to n enough?

If you want a cyclic convolution of length n, then if you zero-pad to a cyclic convolution of length less than 2n-1 the β€œwrap-around” terms will be wrong and you won’t be able to disentangle them in post-processing. Whereas if you zero-pad to at least 2n-1, then the wrap-around terms only fall onto the zero padding β€” you effectively have a linear (non-cyclic) convolution, and you can convert it back to the desired length-n cyclic convolution by wrapping it back to the original length (adding the overlap terms).

1 Like

I would use brfft instead of irfft, which omits an additional pass over the data to multiply by the 1/n normalization β€” you can instead include this factor in one of your other passes over the data, e.g. for your element-wise multiplication stage.

And, as @DNF wrote, if you care about performance always re-use a precomputed plan (plan_rfft / plan_brfft) with a preallocated output array via mul!. (This is an FFTW FAQ.)

For real-data FFTs, even values of n are the most efficient (in addition to n having small prime factors).
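
Putting these pieces together, a sketch of an allocation-free pipeline for many same-length convolutions might look like the following (the buffer and function names are mine, and n = 2048 is just an example size):

using FFTW, LinearAlgebra

n  = 2048                                    # even, highly composite length
p  = plan_rfft(zeros(n), flags=FFTW.MEASURE)
bp = plan_brfft(zeros(ComplexF64, n ÷ 2 + 1), n, flags=FFTW.MEASURE)

A = Vector{ComplexF64}(undef, n ÷ 2 + 1)     # preallocated spectra
B = similar(A)
c = Vector{Float64}(undef, n)                # preallocated result

function cyclic_conv!(c, A, B, p, bp, a, b)
    n = length(a)
    mul!(A, p, a)        # A = rfft(a)
    mul!(B, p, b)        # B = rfft(b)
    @. A = A * B / n     # pointwise product, folding in the 1/n normalization
    mul!(c, bp, A)       # c = brfft(A), the unnormalized inverse
    return c
end

Each call then performs no allocations, and the planning and buffer-setup cost is paid only once.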

2 Likes

I’m confused now. Every time I’ve implemented cyclic convolution, I have simply used ifft(fft(x) .* fft(y)), and this is also how every online source I can find describes it. Padding to 2N-1 is necessary to make the cyclic and linear convolutions equivalent, but cyclic convolution in and of itself is non-padded. Or what am I missing?

I think the point was to use an FFT of a performant size. For example, a length 1024 (edit: or, more saliently, even 2048) FFT is probably a bit faster than a length 1021 (prime length) FFT. It has nothing to do with correctness and everything to do with speed.

It’s easy to make the result of a linear convolution circular by wrapping the trailing elements back to the front. If the longer FFT makes it faster then there can be a performance benefit even with this extra step.


All this said, FFTW claims it β€œemploys O(n \log n) algorithms for all lengths, including prime numbers.” The constants will be worse at sizes with large prime factors, but doubling the input size isn’t great either. Here’s a benchmark:

julia> using BenchmarkTools, FFTW

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> fft1021 = plan_fft(x1021); fft1023 = plan_fft(x1023); fft2048 = plan_fft(x2048);

julia> @benchmark *($fft1021,$x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.186 ΞΌs …  3.402 ms  β”Š GC (min … max): 0.00% … 95.81%
 Time  (median):     16.603 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   17.347 ΞΌs Β± 33.956 ΞΌs  β”Š GC (mean Β± Οƒ):  1.88% Β±  0.96%

             β–…β–‡β–†β–ˆβ–…β–‚β–‚β–ƒβ–ƒβ–β–‚β–ƒβ–ƒ
  β–β–β–β–‚β–‚β–‚β–‚β–ƒβ–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  13.2 ΞΌs         Histogram: frequency by time        24.6 ΞΌs <

 Memory estimate: 32.12 KiB, allocs estimate: 6.

julia> @benchmark *($fft1023,$x1023)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.545 ΞΌs …  3.461 ms  β”Š GC (min … max): 0.00% … 95.55%
 Time  (median):     13.380 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   14.222 ΞΌs Β± 34.556 ΞΌs  β”Š GC (mean Β± Οƒ):  2.33% Β±  0.96%

   β–ƒβ–ˆβ–‡β–†β–…β–ƒ
  β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  12.5 ΞΌs         Histogram: frequency by time        21.9 ΞΌs <

 Memory estimate: 32.12 KiB, allocs estimate: 6.

julia> @benchmark *($fft2048,$x2048)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   6.293 ΞΌs … 737.873 ΞΌs  β”Š GC (min … max): 0.00% … 96.86%
 Time  (median):      8.300 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   10.200 ΞΌs Β±  17.177 ΞΌs  β”Š GC (mean Β± Οƒ):  7.43% Β±  5.36%

   β–β–ƒβ–„β–…β–†β–‡β–ˆβ–‡β–†β–…β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–β–                       ▁▃▂▁             β–‚
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–†β–†β–…β–…β–…β–…β–…β–ƒβ–ƒβ–ƒβ–„β–β–β–β–ƒβ–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–„β–…β–„β–…β–†β–…β–…β–… β–ˆ
  6.29 ΞΌs       Histogram: log(frequency) by time      21.9 ΞΌs <

 Memory estimate: 64.12 KiB, allocs estimate: 6.

So for this very bad case 1021 (prime), the obvious padded length of 2048 is only 1.5-2x faster. And the less-bad case of 1023 (3 * 11 * 31) shows a smaller gap. I’ll remark that my benchmarks here were surprisingly inconsistent, so the precise performance numbers are debatable.

So there’s some room for improvement here in the bad cases, but not a ton. And the padding may perform worse in some cases.

1 Like

That’s correct if you want a cyclic convolution of length n and length(x) == length(y) == n.

However, suppose you want to compute a cyclic convolution of length n, but you want to compute it using a cyclic convolution of a longer length m > n β€” typically because n has large prime factors so you want to zero-pad to a length m that is highly composite for use with FFTs. The question is, given the length-m cyclic convolution of the zero-padded data [x; zeros(m-n)] and [y; zeros(m-n)], can you quickly recover the desired cyclic convolution of length n? The answer is yes, but only provided that m \ge 2n - 1. In this case, you can take the length-m convolution c[1:m] and simply compute c[1:n] + [c[n+1:2n-1]; 0] to obtain the convolution of the original x and y. However, this doesn’t work if m < 2n-1: there isn’t enough β€œroom” for the cyclic-wraparound piece c[n+1:2n-1].
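
A quick numerical check of that recovery step (toy sizes n = 5 and m = 16, chosen only for illustration):

julia> using FFTW

julia> n, m = 5, 16;

julia> x, y = randn(n), randn(n);

julia> direct = irfft(rfft(x) .* rfft(y), n);  # length-n cyclic convolution

julia> c = irfft(rfft([x; zeros(m - n)]) .* rfft([y; zeros(m - n)]), m);

julia> c[1:n] .+ [c[n+1:2n-1]; 0.0] β‰ˆ direct
true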

2 Likes

You’re benchmarking a lot of copying overhead: you’re giving the (complex-FFT) plans real inputs, so the data must first be copied to complex arrays. You’re also not preallocating the outputs. Here’s the same benchmark, but with correctly typed inputs, preallocated outputs, and FFTW.MEASURE plans for better performance:

julia> using BenchmarkTools, FFTW, LinearAlgebra

julia> x1021 = randn(ComplexF64, 1021); x1023 = randn(ComplexF64, 1023); x2048 = randn(ComplexF64, 2048);

julia> y1021 = randn(ComplexF64, 1021); y1023 = randn(ComplexF64, 1023); y2048 = randn(ComplexF64, 2048);

julia> fft1021 = plan_fft(x1021, flags=FFTW.MEASURE); fft1023 = plan_fft(x1023, flags=FFTW.MEASURE); fft2048 = plan_fft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, fft1021, x1021); @btime mul!(y1023, fft1023, x1023); @btime mul!(y2048, fft2048, x2048);
  14.210 ΞΌs (0 allocations: 0 bytes)
  11.446 ΞΌs (0 allocations: 0 bytes)
  3.935 ΞΌs (0 allocations: 0 bytes)

Notice that the length-2048 complex FFT is now about 3.6x faster than the length-1021 FFT. (This is on Intel.)

If you’re going to zero-pad at all, you must pad to a length \ge 2n-1 or you can’t easily recover the desired cyclic convolution of length n, as I explained above.

3 Likes

And the difference is even bigger if you are looking at rfft (real-to-complex transforms), as is the case for the OP:

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> y1021 = randn(ComplexF64, 1021Γ·2+1); y1023 = randn(ComplexF64, 1023Γ·2+1); y2048 = randn(ComplexF64, 2048Γ·2+1);

julia> rfft1021 = plan_rfft(x1021, flags=FFTW.MEASURE); rfft1023 = plan_rfft(x1023, flags=FFTW.MEASURE); rfft2048 = plan_rfft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.423 ΞΌs (0 allocations: 0 bytes)
  8.792 ΞΌs (0 allocations: 0 bytes)
  2.366 ΞΌs (0 allocations: 0 bytes)

The size-2048 rfft is 15x faster than the size-1021 rfft.

Prime-size rffts in FFTW still use an O(n \log n) algorithm (based on Rader’s algorithm), but the constant factors are worse than in the complex case, especially compared to highly composite even lengths like 2048.

3 Likes

Do note, however, that the 1021 RFFT is 2.5x slower than the (more general) 1021 FFT. If you compare a 2048 RFFT to the more performant 1021 FFT, the gap is only 6x. People have killed for less, but the 15x figure is more a reflection of a lackluster implementation than anything fundamental. I’d even call it something of a β€œperformance bug” unless one wants to justify the slower algorithm by a smaller memory footprint.

I will also point out that I commonly see 1.5-2x reported performance differences among repeated @btime calls to FFTW (even non-allocating). I don’t know why it’s so variable, but in any case the relative numbers in your benchmarks happened to come out as roughly reflective of the best I saw from repeated trials.

1 Like

I can’t reproduce.

For example:

julia> using BenchmarkTools, FFTW, LinearAlgebra

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> y1021 = randn(ComplexF64, 1021Γ·2+1); y1023 = randn(ComplexF64, 1023Γ·2+1); y2048 = randn(ComplexF64, 2048Γ·2+1);

julia> rfft1021 = plan_rfft(x1021, flags=FFTW.MEASURE); rfft1023 = plan_rfft(x1023, flags=FFTW.MEASURE); rfft2048 = plan_rfft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  26.591 ΞΌs (0 allocations: 0 bytes)
  11.763 ΞΌs (0 allocations: 0 bytes)
  1.375 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  18.691 ΞΌs (0 allocations: 0 bytes)
  6.456 ΞΌs (0 allocations: 0 bytes)
  1.379 ΞΌs (0 allocations: 0 bytes)

julia> @benchmark mul!(y1021, rfft1021, x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.617 ΞΌs … 64.019 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     19.422 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   20.057 ΞΌs Β±  1.869 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ˆβ–‡β–†β–…β–…β–†β–†β–†β–…β–…β–„β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–  ▁▁    ▁▁▁             β–ƒ
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–‡β–…β–†β–†β–…β–‡β–‡ β–ˆ
  18.6 ΞΌs      Histogram: log(frequency) by time      26.5 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(y1021, rfft1021, x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.312 ΞΌs … 71.018 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     28.750 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.565 ΞΌs Β±  2.305 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                β–ƒβ–ƒβ–„β–ˆβ–†β–…β–„β–ƒβ–ƒβ–ƒβ–β–β–             ▁   ▁▁              β–‚
  β–ƒβ–β–β–β–β–„β–β–β–„β–ƒβ–†β–†β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–‡β–†β–…β–…β–„ β–ˆ
  24.3 ΞΌs      Histogram: log(frequency) by time        39 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I’m not used to discrepancies being even this big for calls this long. And the range of runtimes seems somewhat unusually large for a deterministic and nonallocating function (unless it multithreads? but that seems unlikely to be profitable at this size).

It could just be a quirk of my specific machine. It’s plugged in so shouldn’t be throttling for power, but may still throttle for heat or other reasons.

I’m getting a much smaller variance:

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  36.669 ΞΌs (0 allocations: 0 bytes)
  8.814 ΞΌs (0 allocations: 0 bytes)
  2.472 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.679 ΞΌs (0 allocations: 0 bytes)
  8.594 ΞΌs (0 allocations: 0 bytes)
  2.417 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.712 ΞΌs (0 allocations: 0 bytes)
  8.608 ΞΌs (0 allocations: 0 bytes)
  2.463 ΞΌs (0 allocations: 0 bytes)

Note that it should be single threaded if Threads.nthreads() == 1.
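
One way to rule threading out is to force single-threaded plans before creating them, via the standard FFTW.jl call:

julia> FFTW.set_num_threads(1)  # plans created after this call are single-threaded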

Similar results to @stevengj’s, on an Intel laptop:

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.400 ΞΌs (0 allocations: 0 bytes)
  9.700 ΞΌs (0 allocations: 0 bytes)
  2.267 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  34.600 ΞΌs (0 allocations: 0 bytes)
  9.300 ΞΌs (0 allocations: 0 bytes)
  2.167 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  34.600 ΞΌs (0 allocations: 0 bytes)
  9.300 ΞΌs (0 allocations: 0 bytes)
  2.222 ΞΌs (0 allocations: 0 bytes)

Thanks! It turned out that I was already using brfft and folding the scaling into the convolved vector (which is constantly reused) :sweat_smile: Also, I’m using plans, of course.
It seems that what I’ve been doing is already pretty close to optimal.