But both give me an output of 200 elements. Do I then just have to manually remove the zeros at the end of the result? Is there a better way of doing this?

I don't know why this choice was made, but it certainly means that in order to get the equivalent of MATLAB's 'same' operation you'd need to do some work with the indices of the result.

The size of the returned data is not well documented (nor whether the fft() step applies linear or cyclic convolution), and I couldn't figure it out from the code.

Probably if you write the loop yourself using LoopVectorization.jl you'll get better performance and be able to customize the indices of the loop to match MATLAB's full, same, or valid modes.
An idea of how to apply convolution using LoopVectorization.jl is given at the Image Filtering example.
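As a rough illustration of that idea, here is a minimal sketch of a direct "full" convolution loop in plain Base Julia (the function name is made up for this example; LoopVectorization.jl's `@turbo` could replace the `@inbounds`/`@simd` annotations on the inner loop, but Base is used here so the snippet runs on its own):

```julia
# Direct "full" linear convolution: output has length N + M - 1.
# A sketch only, not DSP.jl's implementation.
function direct_conv_full(a::AbstractVector{T}, b::AbstractVector{T}) where {T}
    N, M = length(a), length(b)
    out = zeros(T, N + M - 1)
    @inbounds for j in 1:M
        bj = b[j]
        @simd for i in 1:N
            out[i + j - 1] += a[i] * bj  # scatter a scaled copy of a
        end
    end
    return out
end

direct_conv_full([1.0, 2.0, 3.0], [1.0, 1.0])  # [1.0, 3.0, 5.0, 3.0]
```

Restricting the `i + j - 1` output range is then all it takes to compute only the 'same' or 'valid' samples.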

Well, it is only faster for really large cases where the kernel and the signal have similar sizes, which is uncommon in practice. This is why, for instance, MATLAB chose a direct implementation for conv(), while for cross-correlation, which usually deals with signals of similar length, it uses an fft()-based method.
I'm pretty sure that implementing the vanilla loop with vectorization would beat it easily for most real-life cases.

Also, the documentation isn't good in my opinion.
When it says it uses fft(), does it expect the user to pad the arrays in order to get the linear convolution result?

I think imfilter() applies correlation, not convolution. Moreover, I am not sure the centered() function yields the result of MATLAB's 'same'. If I read the documentation correctly, it only shifts the kernel axes, so it shouldn't change the output values.

I think this is one of those things that's pretty field-specific. For instance I'm generally not convolving with filters that are small enough for the direct implementation to be efficient. As you mention, for similarly-sized inputs the FFT approach is generally better, and when they are very differently-sized you want to use overlap-save convolution, which uses FFTs in a block-wise fashion. The conv implementation currently selects between those two algorithms depending on the size of the input, but it would be good for it to also select the direct method when one of the inputs is small enough.

Basically it's a matter of perspective: some folks say "FFT convolution is only useful for really big inputs" and others say "direct convolution is only useful for really small inputs". Seems like conv could be smart enough to always do the right thing, but someone needs to add that check.

I agree the docs could be improved. I usually expect the convolution of a length-N array and a length-M array to be length N+M-1, but that should be documented.

Though it uses the FFT algorithm, DSP.jl's conv implementation takes care of zero-padding internally, so it does linear convolution, not circular convolution.
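To see why that padding matters: a length-L circular convolution equals the linear one once both inputs are zero-padded to L = N + M - 1. The `circ_conv` helper below is a naive O(L²) stand-in for the FFT-based pointwise product an FFT implementation would actually use:

```julia
# Naive circular convolution of two equal-length vectors (stand-in for
# ifft(fft(x) .* fft(y))); mod1 handles the wrap-around indexing.
function circ_conv(x, y)
    L = length(x)
    [sum(x[mod1(n - k + 1, L)] * y[k] for k in 1:L) for n in 1:L]
end

a, b = [1.0, 2.0, 3.0], [1.0, 1.0]
L = length(a) + length(b) - 1            # 4: full linear-convolution length
ap = vcat(a, zeros(L - length(a)))       # zero-pad both inputs to L
bp = vcat(b, zeros(L - length(b)))
circ_conv(ap, bp)                        # [1.0, 3.0, 5.0, 3.0], the linear result
```

Without the padding (L = 3), the tail samples would wrap around and corrupt the start of the output.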

To answer the original question, it seems that what you want depends on the characteristics of your filter b. First off, I'm assuming that the idea is for the output to "line up" with the input. If the filter peaks on the first sample (like exp.(0:-0.5:-10)), then you probably want something like conv(a, b)[1:length(a)]. But if you have some filter that includes a delay, like gaussian(21, 0.15), then you might want conv(a, b)[11:end-10].
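For the delay-centered case, a general sketch of MATLAB-style 'same' extraction might look like this (to my understanding, MATLAB keeps the central length(a) samples of the full result starting at index length(b) ÷ 2 + 1; the tiny `full` helper stands in for DSP.jl's `conv` so the snippet runs on its own):

```julia
# Stand-in for a full linear convolution (e.g. DSP.jl's conv).
full(a, b) = [sum(a[i] * b[n - i + 1]
                  for i in max(1, n - length(b) + 1):min(n, length(a)))
              for n in 1:length(a) + length(b) - 1]

# MATLAB-style 'same': the central length(a) samples of the full result,
# starting at length(b) ÷ 2 + 1 (assumed to match MATLAB's centering choice).
same_conv(a, b) = full(a, b)[(length(b) ÷ 2 + 1):(length(b) ÷ 2 + length(a))]

same_conv([1.0, 2.0, 3.0], [1.0, 1.0])  # [3.0, 5.0, 3.0]
```

Note that for even-length kernels this centering differs from, e.g., NumPy's 'same' mode, so it is worth checking against MATLAB output for your particular kernel length.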

Do you have a speed comparison?
I'd bet that a direct method with proper vectorization and efficient code will beat overlap-save.

I'm not sure how up to date this package is: https://github.com/aamini/FastConv.jl, but as you can see it was faster. And its implementation isn't even as optimal as it could be.

I'd be happy to see numbers.
In the past, using MATLAB, my testing showed you need very long signals and kernels to justify DFT-based methods (overlap-save, etc.).

Ah okay, I was indexing like you mentioned in your post and was wondering if there was a direct method. I needed this for a MATLAB-based tutorial I was redoing in Julia, and since MATLAB seemed to find the 'same' option important, I assumed it had broader application.

Right, so the kernel should be reversed to do convolution. My example also had different border padding than MATLAB's, so the correct version to get the behavior of MATLAB's conv(..., 'same') is

The output has the same dimensions as the signal, regardless of the use of centered (you can get a smaller result by passing the padding Inner(), which is like MATLAB's 'valid'). Centering the kernel means you would get the central part of the full correlation.

We have the signal length and the kernel length. I only covered the case where the signal is no shorter than the kernel, so the upper triangle of the matrix, in dark blue, isn't relevant.

In the diagonal and lower triangle, the color shows which method was faster:

Light Blue - Direct.

Green - Overlap and Save.

Yellow - Frequency Domain.

I am not sure what cases you use, but as you can see, there is almost no reason to use overlap-save.
In my opinion, as long as you have the whole data in memory, use either the direct method (cases where the kernel is shorter than ~400-500 samples in the implementation above) or the frequency domain.

I'd even mention that MATLAB's conv() is probably much slower than what you can do with LoopVectorization.jl in Julia.

@ElectronicTeaCup, in the code I posted in the answer to the question, you will be able to see exactly which samples you need to extract in order to replicate MATLAB's 'same'.
Though the optimal thing to do would be implementing direct convolution using LoopVectorization.jl, computing only the required samples (full, same, or valid). The current approach in DSP.jl is suboptimal in my opinion.

Overlap-save convolution just turns a convolution with unequal lengths into a bunch of equal-length convolutions (one for each block). So if the blocks are large enough (larger than about 600 samples from your plot), then you want to use FFT processing for each block, not direct convolution.
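The blockwise structure can be sketched as follows (a toy version, not DSP.jl's implementation; the circular step is a naive loop here, and `Nfft` is an arbitrary block size for illustration, whereas in practice each block would be an FFT multiply, which is where the speed comes from):

```julia
# Naive circular convolution (stand-in for an FFT multiply per block).
function circ_conv(x, h)
    L = length(x)
    [sum(x[mod1(n - k + 1, L)] * h[k] for k in 1:L) for n in 1:L]
end

# Overlap-save: blocks of length Nfft overlap by M - 1 samples; the first
# M - 1 outputs of each block are circular-wrap artifacts and are discarded.
function overlap_save(a, b; Nfft = 8)
    M = length(b)
    S = Nfft - M + 1                        # useful output samples per block
    hp = vcat(b, zeros(Nfft - M))           # kernel zero-padded to block size
    Lout = length(a) + M - 1                # full linear-convolution length
    nblocks = cld(Lout, S)
    x = vcat(zeros(M - 1), a, zeros(nblocks * S - length(a)))
    out = zeros(Lout)
    for blk in 0:nblocks - 1
        seg = x[(blk * S + 1):(blk * S + Nfft)]
        y = circ_conv(seg, hp)
        keep = y[M:end]                     # drop the M - 1 wrapped samples
        lo = blk * S + 1
        hi = min(lo + S - 1, Lout)
        out[lo:hi] = keep[1:hi - lo + 1]
    end
    return out
end

overlap_save([1.0, 2.0, 3.0], [1.0, 1.0]; Nfft = 4)  # [1.0, 3.0, 5.0, 3.0]
```

Each block is an equal-length (Nfft-point) circular convolution, which is the point being made above: once Nfft is large, an FFT per block beats a direct loop per block.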

But I agree that for short convolutions (seemingly less than about 400 samples from your plot) conv should be doing direct convolution. There's actually already some code for direct convolution in DSP.jl, so it would be a pretty good intro PR for someone to include it within conv.

As I mentioned before, I think we are using different definitions of "short" and "long". Your plot only goes up to N=2000, whereas I'm frequently convolving a signal of millions of samples with a signal of some hundreds of thousands, which is a common use case for overlap-save convolution.

@galenlynch did some similar analysis with kernels from 1 to 2^20 that compared FFT to overlap-save, both in CPU time and memory usage:

edit: that plot has different dimensionalities as well, so for this conversation you want the N=1 (one-dimensional) subplot

There is a closed-form equation for the optimal DFT window for a given kernel length.
The code I used to create the above image uses the optimal DFT window.
Still, there are no cases where overlap-save is superior to the direct or frequency-domain methods.

Overlap-save and overlap-add were conceived with a constant stream of data in mind.
For data which is loaded into memory, use either the direct or the frequency-domain method. I think the cases where one would gain from overlap-save are edge cases (probably more related to correlation [template matching] than convolution [filtering]).

In cases of a very, very long kernel and a signal still order(s) of magnitude longer, we might see overlap-save dominate the frequency-domain method (the direct method has nothing to offer once the kernel is bigger than the CPU cache). But I think those cases are more related to template matching (correlation). Though they can still be implemented using the same ideas (as it is just a flipped kernel), they should have their own dedicated function (xcorr() or something).

I think the point where "fast" (fft) convolution techniques become faster than direct convolution will be at a much smaller kernel size than 400 with well-optimized code. Many sources I've seen suggest that the break-even point is a kernel size around 128, while some simple testing with 1D arrays in DSP.jl's conv (fft) and filt (direct) functions suggests it's 58 with our convolution code. As you can see from the plots that Russel linked to, there are many cases where overlap-save is faster than simple fft convolution: basically most cases where the two input vectors are not the same size. I can't speak to why you have different results in your Matlab code. I would be curious to see how DSP.jl's conv times compare to your Matlab times, if you would be willing to look at that.

As I think Russel pointed out, we already have very fast (and vectorized) direct convolution implemented in DSP.jl as part of the filt function for one dimension, but unfortunately I think it may have slightly different behavior than conv which prevents us from using that code for conv as well (at least that is my recollection). It has been a long-term goal of mine to combine FIR filtering code with convolution code, to allow direct convolution to be used more generally. One difficulty is that I have found it difficult to write fast direct convolution code that works with arbitrary array indexing and an arbitrary number of dimensions, which is currently what conv supports. It might be an easy PR to calculate the break-even point and use direct convolution for small kernels when using just one dimension and 1-indexed arrays.

Ultimately, the choice to use fft convolution for conv was made a long time ago in Julia base. Even though that conv code has since been moved into DSP.jl and modified to use overlap-save when it is faster, we have not yet added a direct convolution branch when that would be faster.

In terms of wishing for better documentation, you're right that it could be better documented. PRs are, of course, welcome.

Well, talking to @ssfrr, I gather that the Fourier-domain convolution pads to the next power of 2. This is the reason you get the results you get.
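To give a rough sense of the difference (an illustrative calculation with made-up lengths, not a benchmark), padding to the next power of two can noticeably inflate the FFT length compared with padding to exactly N + M - 1:

```julia
# Hypothetical lengths: a 1000-sample signal and a 400-sample kernel.
N, M = 1000, 400
exact = N + M - 1          # 1399 — minimal padding for linear convolution
pow2  = nextpow(2, exact)  # 2048 — next-power-of-2 padding, ~46% longer
```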

The image I posted clearly shows that the direct method easily beats overlap-save for kernels up to 400 samples. I bet with highly tuned code (say, like in Intel IPP) the threshold would be even higher.

Have a look at my code, implement the frequency domain the way I implemented it (pad with zeros to M + N - 1 and not nextpow2(M + N - 1)), and you'll see OVS doesn't shine as much as it seems to.

The cases where OVS is useful will be the rare cases where the kernel is huge and the data is still orders of magnitude bigger.
As I wrote above, these cases mostly happen for template matching (correlation).
This is the reason MATLAB implements conv() with the direct method, and for those rare cases it has xcorr(), which uses the frequency domain.