Computing linear convolution efficiently

Hi, I am looking for a way to efficiently compute a linear convolution. I’m thinking about using the ToeplitzMatrices package, storing one of the two signals compactly in its Circulant matrix type and applying that matrix to the other signal to be convolved with. According to the documentation, this multiplication is done by FFT, so it is asymptotically faster than direct convolution. However, I will need to zero-pad the second signal, multiply by the Circulant matrix, and truncate the result.

My question is: is there an O(N log N) implementation that handles these steps internally, without allocating the additional zeros? In other words, is there a linear convolution implementation that makes use of the FFT internally and efficiently? If there is a better way to do this, please also let me know. Thanks.
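
For concreteness, the pad-multiply-truncate scheme described above can be sketched like this (a minimal sketch assuming ToeplitzMatrices.jl’s Circulant type; the toy lengths are only illustrative):

```julia
using ToeplitzMatrices

f = [1.0, 2, 3]                            # first signal
g = [4.0, 5]                               # second signal
L = length(f) + length(g) - 1              # full linear-convolution length
C = Circulant([f; zeros(L - length(f))])   # embed f in an L×L circulant
h = C * [g; zeros(L - length(g))]          # multiply (FFT internally for large L)
# h ≈ [4.0, 13.0, 22.0, 15.0], the linear convolution of f and g
```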

DSP.conv is already an implementation that automatically chooses the FFT when appropriate. I believe this fits “internally and efficiently,” but not “without allocating additional zeros”: if you look at the code, the methods call _zeropad(), which does allocate some zeros. Strictly speaking, conv does not satisfy your requirements, but it is considered quite efficient.
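
A minimal usage sketch (toy sizes; conv only switches to the FFT path for larger inputs):

```julia
using DSP

f = [1.0, 1, 1]
g = [1.0, 2, 3]
h = conv(f, g)    # full linear convolution, length(f) + length(g) - 1 samples
# h ≈ [1.0, 3.0, 6.0, 5.0, 3.0]
```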

However, how sure are you that it’s better to avoid allocations? My naive understanding is that the FFT is fast because it boils down to a bunch of parallel shift-adds, which are very fast to compute. You can certainly pad an array virtually, e.g. with PaddedViews.jl, but that adds overhead and is probably not SIMD-friendly. I’d be surprised if that could be faster than FFTW. I’d also be surprised if you could build your own convolution on Toeplitz matrices and do better than conv, unless there’s something special about your problem. Many others can provide better answers than this one, but it would help if you could provide some context.


You can also have a look at
https://fluxml.ai/Flux.jl/stable/models/nnlib/#NNlib.conv
which has quite a few options.


It sounds like conv is already doing things right. However, my problem is to compute h = f*g, where * is convolution. Here g has a fixed length N, h should have the same length N, and f is treated as an operator whose length is as long as needed (2N−1). In other words, the length of h is NOT length(f) + length(g) − 1. If I use conv, it looks like I need to truncate afterwards. Is there an option to avoid that?
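
To illustrate the sizing in question, a sketch with a toy N, keeping the central N samples of the full convolution (assuming those are the ones wanted):

```julia
using DSP

N = 4
f = collect(1.0:2N-1)    # "operator" signal, length 2N - 1 (toy values)
g = ones(N)              # input, length N
hfull = conv(f, g)       # full linear convolution, length 3N - 2
h = hfull[N:2N-1]        # truncate to the central N samples
# h ≈ [10.0, 14.0, 18.0, 22.0]
```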

Very unlikely to be relevant outside of deep learning though. Convolution in deep learning and in signal processing (which I assume this is about) have very different characteristics in how they are used.

Again replying naively, I doubt you can avoid the truncation. If your array is big enough to merit the FFT, then you end up with an ifft to get the result, and it doesn’t seem straightforward to muck with the internals to avoid the truncation. Then again, I don’t quite understand what you’re doing. Shouldn’t convolving lengths N and 2N−1 yield a length of 3N−2 or so? Is your f already zero-padded, or does it have some structure where you know you’re only interested in a length-N output?

As @GunnarFarneback points out, are you doing neural networks or signal processing? In signal processing the arrays can be long, which makes the FFT best. For neural networks the kernels are quite small, and it’s best to do regular convolution, especially with edge padding and the like. We still have no idea what you’re trying to do. [EDIT:] Is memory the critical issue for you?

It gets a little complicated, because some libraries do support FFT-based convolutions (well, the usual “cross-correlation but we’re calling it convolution”, though I believe NNlib does flip the kernel by default). In particular, both NNPACK and CuDNN (which NNlib uses for CPU and GPU fast paths respectively) do.

Edit: apparently NNPACK is no longer used by default as of Pull Request #227 on FluxML/NNlib.jl (“requires for nnpack” by CarloLucibello).

Thank you for replying. What I’m doing is some numerical computation that has a convolution structure in it. The convolution is treated as a linear operator h=f*g, where g is the input, h is the output, and f is the “linear system”. The input and output are both truncated from infinite length at some fixed N, so f is a Toeplitz matrix with the appropriate diagonals (2N−1 of them). DSP.conv probably does it in such a way that the output is longer than the input, so I’ll have to truncate it properly. What I want is for this linear operation (convolution by f) to be implemented efficiently so that I can apply it iteratively. (Here N is on the order of thousands, but could be larger as the numerical problem grows.)

So what I need is for the Toeplitz matrix to be efficiently stored and applied. I am not sure if this is the case for the ToeplitzMatrices package, as the documentation only says a Circulant matrix is applied via FFT.

I somewhat doubt that those are well-optimized for large 1D images and kernels with a single channel and batch size one, but what I primarily was thinking of was that options like stride, pad, dilation, flipped, and groups have at best marginal relevance in signal processing (assuming you set flipped the right way).

Yes, Circulant is actually stored as a single vector; you can feed it either the first column, or a matrix whose first column is used. The code is here. You can do ?Circulant to see the doc string.

As for the multiplication, the relevant snippet from mul! is here:

# Large case: use FFT
mul!(y, factorize(A), x, α, β)

If N is large enough, it does a smart factorize with the fft. Viewing the code is left as an exercise for the reader, but it’s at the same link above.
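
So if the same Circulant is applied repeatedly, it may pay to call factorize once and reuse the result. A sketch (the 5-argument mul! here is the same call the snippet above dispatches to):

```julia
using LinearAlgebra, ToeplitzMatrices

C  = Circulant([1.0, 2, 3, 4])
Cf = factorize(C)            # precomputes the FFT of the stored column
x  = [1.0, 0, 0, 1]
y  = zeros(4)
mul!(y, Cf, x, 1.0, 0.0)     # y = C * x, reusing the factorization
# y ≈ Matrix(C) * x
```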

My impression is that you will have some truncating to do. The FFT approach already involves a bunch of zero-padding and such, and since convolution ends with an ifft, it’s not easy to truncate until after the result is formed. The FFT involves some overhead that doesn’t seem easy to avoid.

I’m still a bit confused by your problem. Is f a vector to begin with, and are you choosing to set up your own convolution by turning it into a Toeplitz matrix (say F = Circulant(f)) and then doing the matrix multiplication F*g? If so, just do conv(f, g). Both ways will use the FFT, but it seems like conv will have less overhead. I don’t see what’s requiring you to set up a Toeplitz matrix, but if you do, at least it doesn’t waste a lot of memory. It seems like you’ll have to truncate either way.

This is difficult to say with any certainty because the libraries in question generally tune for a bunch of different input sizes/types that are dispatched to at runtime (because higher-level deep learning APIs generally don’t allow users to do their own tuning). The biggest deciding factor would be batch size, but I don’t think the recommendation was unreasonable given that information wasn’t provided.

Anyhow, this is all a moot point because NNlib doesn’t expose FFT- or even Winograd-based 1D convolutions on CPU, while the question was explicitly asking about FFTs.

I tend to agree. If I’m not misunderstanding what’s going on, you need all that padding to avoid circular wrap-around in the FFT. With a direct convolution you can easily skip computing the output samples you’re not interested in, but you won’t get the complexity gains from the FFT.
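
The wrap-around is easy to see on toy data: the length-N circular convolution equals the full linear convolution with its tail folded back onto the front (a sketch using DSP.jl and ToeplitzMatrices.jl):

```julia
using DSP, ToeplitzMatrices

f = [1.0, 2, 3, 4]
g = [5.0, 6, 7, 8]
lin  = conv(f, g)          # linear convolution, length 7
circ = Circulant(f) * g    # circular convolution, length 4
# Without padding, the tail wraps around: circ ≈ lin[1:4] + [lin[5:7]; 0]
```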

Thank you all for the input. I did a benchmark test with 3 possible implementations of the convolution h=f*g, where g and h are both N = 1 million in length, and f is 2N−1 long. The 3 implementations are direct convolution with DSP.conv, Toeplitz convolution with ToeplitzMatrices, and circular convolution with FFTW. Zero-padding and truncation are done explicitly wherever necessary.
Here is the result:

It looks like DSP.conv is fastest, but ToeplitzMatrices is more memory efficient. They are of the same order of magnitude, though. I guess the FFT was used in all cases.


Interesting. I’m surprised that Toeplitz could save any memory. Looking at DSP.conv, it’s hard to see where there might be unnecessary allocations that could be removed. Are you sure you don’t have unnecessary allocations in your function directconv? There might be subtle allocation differences between your implementations. Perhaps you can share your test.

By the way, there may be a way to save memory at the end, using views. Rather than allocating storage for the truncated output, you might get away with hfull = conv(f, g) and then h = @view hfull[from:to], which makes no copy and acts as if the array had been truncated.
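
A sketch of that, with illustrative truncation bounds:

```julia
using DSP

f = collect(1.0:7)       # length 2N - 1 with N = 4
g = [1.0, 2, 3, 4]
hfull = conv(f, g)       # one allocation, length 3N - 2 = 10
h = @view hfull[4:7]     # no copy; reads like the truncated length-N result
```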

The optimal solution depends on:

1. What are the dimensions of the data and the convolution kernel?
2. Which boundary type are you after? Periodic, Replicate, Symmetric, etc.?
3. Are you applying the same kernel on multiple images? Or is it one time thing?

Once you answer those questions we’ll be able to figure the appropriate solution.

I believe the Toeplitz matrix does exactly what I want without truncation, namely, multiplying an N×N Toeplitz matrix by an N-vector. DSP.conv convolves an N-vector with a (2N−1)-vector, producing a (3N−2)-vector, which is then truncated to an N-vector. The FFT approach zero-pads the N-vector to 2N−1, applies the FFT, multiplies, applies the IFFT, then truncates to N.
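
That FFT recipe, with the padding and truncation written out explicitly, might look like this (a sketch assuming length(f) == 2N − 1 and that the first N circular-convolution samples are the ones wanted; circconv_trunc is a hypothetical name):

```julia
using FFTW

# Circular convolution of f (length 2N-1) with zero-padded g, truncated to N.
function circconv_trunc(f::Vector{Float64}, g::Vector{Float64})
    N = length(g)
    gpad = [g; zeros(length(f) - N)]       # zero-pad g to 2N - 1
    h = real.(ifft(fft(f) .* fft(gpad)))   # circular convolution via FFT
    return h[1:N]                          # truncate back to N
end

h = circconv_trunc(collect(1.0:7), [1.0, 2, 3, 4])
# h ≈ [53.0, 49.0, 38.0, 20.0]
```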

So it looks reasonable that the Toeplitz matrix is the most memory-efficient for my purpose. I am wondering if it is possible to tell DSP.conv exactly which kind of convolution I need?

At times it sounds like you want a circular convolution, in which case there may indeed be some savings to be found, but hard to know unless we understand exactly what you want.

A circular convolution should be roughly ifft(fft(f) .* fft(g)) plus some overhead. At one point it sounded like your Toeplitz was 2N−1, now it sounds like N? Why were you doing your own zero-padding? Was that part of your problem requirements? Manual padding shouldn’t be necessary for a regular conv, or for using Toeplitz to do circular convolution.

Can you post your benchmark so we see the exact definition? And answer @RoyiAvital’s questions? (I guess I can answer N is roughly 1 million.)

Here is the exact code I implemented

The convolution kernel is 2 million in length, the vector is 1 million, and the convolution result is truncated to 1 million, corresponding exactly to the Toeplitz matrix multiplication. The Toeplitz matrix has the first half of f as its first column and the second half as its first row. The kernel will be used repeatedly.

I’d appreciate any insights, especially on whether the direct convolution can somehow avoid computing the truncated part.

For a fair comparison between circconv and toepconv you should reuse fft(f) like you reuse F.

Not really, the Toeplitz matrix F is just a vector. The fft of this vector is done internally in the F*g operation.
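
Still, if you drop down to FFTW yourself, you can make the reuse explicit: precompute the kernel’s FFT (and an FFT plan) once and apply it to many inputs. A sketch under the same sizing assumptions as above (make_convolver is a hypothetical helper):

```julia
using FFTW

# Build a closure that applies a fixed kernel f (length 2N-1) to length-N
# inputs, reusing the kernel FFT and the FFT plan on every call.
function make_convolver(f::Vector{Float64}, N::Int)
    M  = length(f)                        # 2N - 1
    P  = plan_fft(zeros(ComplexF64, M))   # reusable forward plan
    Ff = P * ComplexF64.(f)               # FFT of the kernel, computed once
    # P \ x applies the inverse transform using the same plan
    return g -> real.(P \ (Ff .* (P * ComplexF64.([g; zeros(M - N)]))))[1:N]
end

apply_f = make_convolver(collect(1.0:7), 4)
h = apply_f([1.0, 2, 3, 4])
# h ≈ [53.0, 49.0, 38.0, 20.0]
```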