Fastest Julia implementation for a cyclic convolution of real vectors

Hi,

I am looking for the fastest way to implement the cyclic convolution of two real vectors. In other words, given a positive integer n (not necessarily a power of two), I want to multiply two degree-(n-1) polynomials a(x) and b(x) modulo x^n - 1.

Currently I am using the FFTW.jl package for this. My approach is as follows:

  1. If n is a product of powers of 2, 3, 5, and 7 (sizes that I believe are quite well optimized in the FFTW library), I compute the real FFT (rfft) of length n for both a(x) and b(x), multiply the results pointwise, and apply the inverse real FFT (irfft) to obtain a(x) * b(x) mod x^n - 1 directly.

  2. Otherwise, I pad the vectors to length m, the smallest product of powers of 2, 3, 5, and 7 that is at least 2n - 1. Then I compute the real FFT of length m for the padded a(x) and b(x), multiply pointwise, and apply the inverse real FFT (irfft) to obtain the full product a(x) * b(x). Finally, I reduce the result modulo x^n - 1. (A sketch of both cases follows the list.)
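
For concreteness, here is a minimal sketch of this approach (the helper names are mine, and it uses the un-planned rfft/irfft interface for brevity):

using FFTW

# true if n factors entirely into 2, 3, 5, and 7
function issmooth7(n)
    for p in (2, 3, 5, 7)
        while n % p == 0
            n ÷= p
        end
    end
    return n == 1
end

# smallest such "7-smooth" integer >= m
function nextsmooth7(m)
    while !issmooth7(m)
        m += 1
    end
    return m
end

function cyclic_conv(a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    if issmooth7(n)
        # case 1: convolve directly at length n
        return irfft(rfft(a) .* rfft(b), n)
    else
        # case 2: zero-pad to a 7-smooth length m >= 2n - 1 ...
        m = nextsmooth7(2n - 1)
        c = irfft(rfft([a; zeros(m - n)]) .* rfft([b; zeros(m - n)]), m)
        # ... and reduce modulo x^n - 1 by wrapping the tail back around
        return c[1:n] .+ [c[n+1:2n-1]; 0.0]
    end
end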

However, I would like to optimize this further. Is there a more efficient method or a more performant library that provides the real FFT (or possibly the convolution directly)?

Thanks in advance.

1 Like

How big is n? If it’s less than 100 or so, you may have better luck just doing the multiplication and reduction the obvious way.
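
The β€œobvious way” here might look something like this O(n^2) sketch (for reference only; the function name is mine):

function cyclic_conv_direct(a, b)
    n = length(a)
    c = zeros(promote_type(eltype(a), eltype(b)), n)
    # c[k] collects every a[i]*b[j] with (i-1)+(j-1) congruent to k-1 (mod n)
    for i in 1:n, j in 1:n
        c[mod1(i + j - 1, n)] += a[i] * b[j]
    end
    return c
end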

It depends, but usually between 1024 and 4096. It is also worth mentioning that I perform a substantial number of convolutions of the same length n.

In that case, make sure you create a plan that you re-use for all convolutions. You may also benefit from performing the transformations into a pre-allocated buffer.

1 Like

Wait, if you’re doing cyclic convolution, why do you need to pad past 2n-1? Isn’t the smallest such product greater than or equal to n enough?

If you want a cyclic convolution of length n, then if you zero-pad to a cyclic convolution of length less than 2n-1 the β€œwrap-around” terms will be wrong and you won’t be able to disentangle them in post-processing. Whereas if you zero-pad to at least 2n-1, then the wrap-around terms only fall onto the zero padding β€” you effectively have a linear (non-cyclic) convolution, and you can convert it back to the desired length-n cyclic convolution by wrapping it back to the original length (adding the overlap terms).

1 Like

I would use brfft instead of irfft, which omits an additional pass over the data to multiply by the 1/n normalization β€” you can instead include this factor in one of your other passes over the data, e.g. for your element-wise multiplication stage.

And, as @DNF wrote, if you care about performance always re-use a precomputed plan (plan_rfft / plan_brfft) with a preallocated output array via mul!. (This is an FFTW FAQ.)

For real-data FFTs, even values of n are the most efficient (in addition to n having small prime factors).
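
Putting these pieces together, a sketch of an allocation-free pipeline for many same-length convolutions might look like the following (the buffer and function names are mine, and n = 2048 is just an example size):

using FFTW, LinearAlgebra

n  = 2048                                    # even, highly composite length
p  = plan_rfft(zeros(n), flags=FFTW.MEASURE)
bp = plan_brfft(zeros(ComplexF64, n ÷ 2 + 1), n, flags=FFTW.MEASURE)

A = Vector{ComplexF64}(undef, n ÷ 2 + 1)     # preallocated spectra
B = similar(A)
c = Vector{Float64}(undef, n)                # preallocated result

function cyclic_conv!(c, A, B, p, bp, a, b)
    n = length(a)
    mul!(A, p, a)        # A = rfft(a)
    mul!(B, p, b)        # B = rfft(b)
    @. A = A * B / n     # pointwise product, folding in the 1/n normalization
    mul!(c, bp, A)       # c = brfft(A), the unnormalized inverse
    return c
end

Each call then performs no allocations, and the planning and buffer-setup cost is paid only once.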

2 Likes

I’m confused now. Every time I’ve implemented cyclic convolution, I have simply used ifft(fft(x) .* fft(y)), and this is also how every online source I can find describes it. Padding to 2N-1 is necessary to make the cyclic and linear convolutions equivalent, but cyclic convolution in and of itself is non-padded. Or what am I missing?

I think the point was to use an FFT of a performant size. For example, a length 1024 (edit: or, more saliently, even 2048) FFT is probably a bit faster than a length 1021 (prime length) FFT. It has nothing to do with correctness and everything to do with speed.

It’s easy to make the result of a linear convolution circular by wrapping the trailing elements back to the front. If the longer FFT makes it faster then there can be a performance benefit even with this extra step.


All this said, FFTW claims it β€œemploys O(n \log n) algorithms for all lengths, including prime numbers.” The constants will be worse at sizes with large prime factors, but doubling the input size isn’t great either. Here’s a benchmark:

julia> using BenchmarkTools, FFTW

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> fft1021 = plan_fft(x1021); fft1023 = plan_fft(x1023); fft2048 = plan_fft(x2048);

julia> @benchmark *($fft1021,$x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  13.186 ΞΌs …  3.402 ms  β”Š GC (min … max): 0.00% … 95.81%
 Time  (median):     16.603 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   17.347 ΞΌs Β± 33.956 ΞΌs  β”Š GC (mean Β± Οƒ):  1.88% Β±  0.96%

             β–…β–‡β–†β–ˆβ–…β–‚β–‚β–ƒβ–ƒβ–β–‚β–ƒβ–ƒ
  β–β–β–β–‚β–‚β–‚β–‚β–ƒβ–„β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–…β–…β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  13.2 ΞΌs         Histogram: frequency by time        24.6 ΞΌs <

 Memory estimate: 32.12 KiB, allocs estimate: 6.

julia> @benchmark *($fft1023,$x1023)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.545 ΞΌs …  3.461 ms  β”Š GC (min … max): 0.00% … 95.55%
 Time  (median):     13.380 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   14.222 ΞΌs Β± 34.556 ΞΌs  β”Š GC (mean Β± Οƒ):  2.33% Β±  0.96%

   β–ƒβ–ˆβ–‡β–†β–…β–ƒ
  β–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–…β–„β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–‚
  12.5 ΞΌs         Histogram: frequency by time        21.9 ΞΌs <

 Memory estimate: 32.12 KiB, allocs estimate: 6.

julia> @benchmark *($fft2048,$x2048)
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   6.293 ΞΌs … 737.873 ΞΌs  β”Š GC (min … max): 0.00% … 96.86%
 Time  (median):      8.300 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   10.200 ΞΌs Β±  17.177 ΞΌs  β”Š GC (mean Β± Οƒ):  7.43% Β±  5.36%

   β–β–ƒβ–„β–…β–†β–‡β–ˆβ–‡β–†β–…β–…β–„β–ƒβ–ƒβ–ƒβ–ƒβ–‚β–‚β–β–                       ▁▃▂▁             β–‚
  β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–†β–†β–…β–…β–…β–…β–…β–ƒβ–ƒβ–ƒβ–„β–β–β–β–ƒβ–ƒβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–„β–…β–„β–…β–†β–…β–…β–… β–ˆ
  6.29 ΞΌs       Histogram: log(frequency) by time      21.9 ΞΌs <

 Memory estimate: 64.12 KiB, allocs estimate: 6.

So for this very bad case 1021 (prime), the obvious padded length of 2048 is only 1.5-2x faster. And the less-bad case of 1023 (3 * 11 * 31) shows a smaller gap. I’ll remark that my benchmarks here were surprisingly inconsistent, so the precise performance numbers are debatable.

So there’s some room for improvement here in the bad cases, but not a ton. And the padding may perform worse in some cases.

1 Like

That’s correct if you want a cyclic convolution of length n and length(x) == length(y) == n.

However, suppose you want to compute a cyclic convolution of length n, but you want to compute it using a cyclic convolution of a longer length m > n β€” typically because n has large prime factors so you want to zero-pad to a length m that is highly composite for use with FFTs. The question is, given the length-m cyclic convolution of the zero-padded data [x; zeros(m-n)] and [y; zeros(m-n)], can you quickly recover the desired cyclic convolution of length n? The answer is yes, but only provided that m \ge 2n - 1. In this case, you can take the length-m convolution c[1:m] and simply compute c[1:n] + [c[n+1:2n-1]; 0] to obtain the convolution of the original x and y. However, this doesn’t work if m < 2n-1: there isn’t enough β€œroom” for the cyclic-wraparound piece c[n+1:2n-1].
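
A quick numerical check of that recovery step (toy sizes n = 5 and m = 16, chosen only for illustration):

julia> using FFTW

julia> n, m = 5, 16;

julia> x, y = randn(n), randn(n);

julia> direct = irfft(rfft(x) .* rfft(y), n);  # length-n cyclic convolution

julia> c = irfft(rfft([x; zeros(m - n)]) .* rfft([y; zeros(m - n)]), m);

julia> c[1:n] .+ [c[n+1:2n-1]; 0.0] β‰ˆ direct
true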

2 Likes

You’re benchmarking a lot of copying overhead: you’re giving the (complex-FFT) plans real inputs, so the data must first be copied to complex arrays. You’re also not preallocating the outputs. Here’s the same benchmark, but with correctly typed inputs, preallocated outputs, and FFTW.MEASURE plans for better performance:

julia> using BenchmarkTools, FFTW, LinearAlgebra

julia> x1021 = randn(ComplexF64, 1021); x1023 = randn(ComplexF64, 1023); x2048 = randn(ComplexF64, 2048);

julia> y1021 = randn(ComplexF64, 1021); y1023 = randn(ComplexF64, 1023); y2048 = randn(ComplexF64, 2048);

julia> fft1021 = plan_fft(x1021, flags=FFTW.MEASURE); fft1023 = plan_fft(x1023, flags=FFTW.MEASURE); fft2048 = plan_fft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, fft1021, x1021); @btime mul!(y1023, fft1023, x1023); @btime mul!(y2048, fft2048, x2048);
  14.210 ΞΌs (0 allocations: 0 bytes)
  11.446 ΞΌs (0 allocations: 0 bytes)
  3.935 ΞΌs (0 allocations: 0 bytes)

Notice that the length-2048 complex FFT is now about 3.6x faster than the length-1021 FFT. (This is on Intel.)

If you’re going to zero-pad at all, you must pad to a length \ge 2n-1 or you can’t easily recover the desired cyclic convolution of length n, as I explained above.

3 Likes

And the difference is even bigger if you are looking at rfft (real-to-complex transforms), as is the case for the OP:

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> y1021 = randn(ComplexF64, 1021Γ·2+1); y1023 = randn(ComplexF64, 1023Γ·2+1); y2048 = randn(ComplexF64, 2048Γ·2+1);

julia> rfft1021 = plan_rfft(x1021, flags=FFTW.MEASURE); rfft1023 = plan_rfft(x1023, flags=FFTW.MEASURE); rfft2048 = plan_rfft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.423 ΞΌs (0 allocations: 0 bytes)
  8.792 ΞΌs (0 allocations: 0 bytes)
  2.366 ΞΌs (0 allocations: 0 bytes)

The size-2048 rfft is 15x faster than the size-1021 rfft.

Prime-size rffts in FFTW still use an O(n \log n) algorithm (based on Rader’s algorithm), but the constant factors are worse than in the complex case, especially compared to highly composite even lengths like 2048.

3 Likes

Do note, however, that the 1021 RFFT is 2.5x slower than the (more general) 1021 FFT. If you compare a 2048 RFFT to the more performant 1021 FFT, the gap is only 6x. People have killed for less, but the 15x figure is more a reflection of a lackluster implementation than anything fundamental. I’d even call it something of a β€œperformance bug” unless one wants to justify the slower algorithm by a smaller memory footprint.

I will also point out that I commonly see 1.5-2x reported performance differences among repeated @btime calls to FFTW (even non-allocating). I don’t know why it’s so variable, but in any case the relative numbers in your benchmarks happened to come out as roughly reflective of the best I saw from repeated trials.

1 Like

I can’t reproduce.

For example:

julia> using BenchmarkTools, FFTW, LinearAlgebra

julia> x1021 = randn(1021); x1023 = randn(1023); x2048 = randn(2048);

julia> y1021 = randn(ComplexF64, 1021Γ·2+1); y1023 = randn(ComplexF64, 1023Γ·2+1); y2048 = randn(ComplexF64, 2048Γ·2+1);

julia> rfft1021 = plan_rfft(x1021, flags=FFTW.MEASURE); rfft1023 = plan_rfft(x1023, flags=FFTW.MEASURE); rfft2048 = plan_rfft(x2048, flags=FFTW.MEASURE);

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  26.591 ΞΌs (0 allocations: 0 bytes)
  11.763 ΞΌs (0 allocations: 0 bytes)
  1.375 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  18.691 ΞΌs (0 allocations: 0 bytes)
  6.456 ΞΌs (0 allocations: 0 bytes)
  1.379 ΞΌs (0 allocations: 0 bytes)

julia> @benchmark mul!(y1021, rfft1021, x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.617 ΞΌs … 64.019 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     19.422 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   20.057 ΞΌs Β±  1.869 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–ˆβ–‡β–†β–…β–…β–†β–†β–†β–…β–…β–„β–„β–ƒβ–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–  ▁▁    ▁▁▁             β–ƒ
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–‡β–…β–†β–†β–…β–‡β–‡ β–ˆ
  18.6 ΞΌs      Histogram: log(frequency) by time      26.5 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(y1021, rfft1021, x1021)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.312 ΞΌs … 71.018 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     28.750 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   29.565 ΞΌs Β±  2.305 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                β–ƒβ–ƒβ–„β–ˆβ–†β–…β–„β–ƒβ–ƒβ–ƒβ–β–β–             ▁   ▁▁              β–‚
  β–ƒβ–β–β–β–β–„β–β–β–„β–ƒβ–†β–†β–†β–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–‡β–†β–…β–…β–„ β–ˆ
  24.3 ΞΌs      Histogram: log(frequency) by time        39 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I’m not used to discrepancies being even this big for calls this long. And the range of runtimes seems somewhat unusually large for a deterministic and nonallocating function (unless it multithreads? but that seems unlikely to be profitable at this size).

It could just be a quirk of my specific machine. It’s plugged in so shouldn’t be throttling for power, but may still throttle for heat or other reasons.

I’m getting a much smaller variance:

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  36.669 ΞΌs (0 allocations: 0 bytes)
  8.814 ΞΌs (0 allocations: 0 bytes)
  2.472 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.679 ΞΌs (0 allocations: 0 bytes)
  8.594 ΞΌs (0 allocations: 0 bytes)
  2.417 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.712 ΞΌs (0 allocations: 0 bytes)
  8.608 ΞΌs (0 allocations: 0 bytes)
  2.463 ΞΌs (0 allocations: 0 bytes)

Note that it should be single threaded if Threads.nthreads() == 1.
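
One way to rule threading out is to force single-threaded plans before creating them, via the standard FFTW.jl call:

julia> FFTW.set_num_threads(1)  # plans created after this call are single-threaded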

Similar results to @stevengj’s, on an Intel laptop:

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  35.400 ΞΌs (0 allocations: 0 bytes)
  9.700 ΞΌs (0 allocations: 0 bytes)
  2.267 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  34.600 ΞΌs (0 allocations: 0 bytes)
  9.300 ΞΌs (0 allocations: 0 bytes)
  2.167 ΞΌs (0 allocations: 0 bytes)

julia> @btime mul!(y1021, rfft1021, x1021); @btime mul!(y1023, rfft1023, x1023); @btime mul!(y2048, rfft2048, x2048);
  34.600 ΞΌs (0 allocations: 0 bytes)
  9.300 ΞΌs (0 allocations: 0 bytes)
  2.222 ΞΌs (0 allocations: 0 bytes)

Thanks! It turned out that I was already using brfft and folding the scaling into the convolved vector (which is constantly reused) :sweat_smile: Also, I’m using plans, of course.
It seems that what I’ve been doing is already pretty close to optimal.