Convolution (conv) with same size output

Is there a way of using conv(a, b) to produce an output of the same size (as vector a) output? (Like the MATLAB option conv(a, b, 'same'))

I was advised by @staticfloat to pad vector to the length size(a) .+ size(b) .- 1 but when I did the following:

signal = rand(100)
kernel = rand(51)
padding_count = Int((length(kernel) - 1) / 2)
padding_signal = vcat(zeros(padding_count), signal, zeros(padding_count))
convolution = xcorr(padded_signal, kernel, padmpde=:none)
convolution2 = conv(padded_signal, kernel)

But both give me an output of 200 elements. Do I then just have to manually remove the zeros at the end of the result? Is there a better way of doing this?

1 Like

It seems convolution is implemented using fft().

I don’t know why this choice was made but it certainly means that in order to get the equivalent of MATLAB’s same operation you’d need to do some work with the indices of the result.

It is not well documented what is the size of the returned data (Or even if it is applying linear convolution of cyclic convolution using the fft() step) and I couldn’t figure it out from the code.

Probably if you write the loop yourself using LoopVectorization.jl you’ll get better performance and be able to customize the indices of the loop to match MATLAB’s full, same or valid modes.
An idea of how to apply convolution using LoopVectorization.jl is given at the Image Filtering example.

1 Like

The reason fft is used is because it is a much faster way of doing large convolutions (n log(n) vs n^2).

Well, it is only faster for really large cases and where the kernel and the signal have similar size.
In most cases this is not the case. This is the reason, for instance, that MATLAB chose direct implementation for conv() while for cross correlation, which usually deals with signals with similar length, it uses fft() based method.
I’m pretty sure that implementing the vanilla loop with vectorization would beat it easily for most real life cases.

Also, The documentation isn’t good in my opinion.
When it says it uses fft(), does it expect the user to pad arrays in order to get the linear convolution result?

1 Like

It’s probably easiest to use the ImageFiltering package. Despite its name, it’s useful for all sorts of arrays, not just images.

using ImageFiltering

signal = rand(100)
kernel = centered(rand(51)) 

EDIT: corrected version which matches the bahavior of MATLAB conv(..., 'same') below

1 Like

I think imfilter() applies correlation and not convolution. Moreover, I am not sure the centered() function won’t yield the result of the same in MATLAB. If I get documentation it only shifts the kernel axis but it shouldn’t change the output result.

1 Like

I think this is one of those things that’s pretty field-specific. For instance I’m generally not convolving with filters that are small enough for the direct implementation to be efficient. As you mention, for similarly-sized inputs the FFT approach is generally better, and when they are very differently-sized you want to use overlap-save convolution, which uses FFTs in a block-wise fashion. The conv implementation currently selects between those two algorithms depending on the size of the input, but it would be good for it to also select the direct method when one of the inputs is small enough.

Basically it’s a matter of perspective - some folks say “FFT convolution is only useful for really big inputs” and other say “direct convolution is only useful for really small inputs”. Seems like conv could be smart enough to always do the right thing, but someone needs to add that check.

I agree the docs could be improved. I usually expect the convolution of a length-N array and a length-M array to be length N+M-1, but that should be documented.

Though it uses the FFT algorithm, DSP.jl’s conv implementation takes care of zero-padding internally, so it does linear convolution, not circular convolution.

To answer the original question, it seems that what you want depends on the characteristics of your filter b. First off, I’m assuming that the idea is for the output to “line up” with the input. If the filter peaks on the first sample (like exp.(0:-0.5:-10)), then you probably want something like conv(a, b)[1:length(a)]. But if you have some filter that includes a delay, like gaussian(21, 0.15), then you might want conv(a,b)[11:end-10].


Do you have speed comparison?
I’d bet that direct method with proper vectorization and efficient code will beat overlap and save.

I’m not sure how up to date this package is: GitHub - aamini/FastConv.jl: Fast Convolutions in Julia but as you can see it was faster. Also its implementation isn’t as optimal as it should be.

I’d be happy to see numbers.
In the past, using MATLAB, my testing showed you need very long signals and kernels to justify using methods using DFT (Overlap and Save, etc…).

Ah okay, I was indexing like you mentioned in your post and was wondering if there was a direct method. I needed this for a MATLAB based tutorial I was doing in Julia, and I was wondering since MATLAB seemed to find the ‘same’ option important I was assuming it had broader application.


Right, so the kernel should be reversed to do convolution. My example also had the different border padding than MATLAB’s, so the correct version to get the behavior of MATLAB’s conv(...,'same') is

signal = 1:100
kernel = 1:3
imfilter( signal, reflect(centered(kernel)), Fill(0) )

The output result has the same dimensions as signal, regardless of the use of centered (you can get a smaller result by passing the padding Inner(), which is like MATLAB’s 'valid'). Centering the kernel means you would get the central part of the full correlation.

1 Like


I did a comparison between 3 methods of convolution:

  1. Direct (Using MATLAB’s conv()).
  2. Using Overlap and Save (Implemented in MATLAB, no allocation in the loop).
  3. In frequency domain (Using fft()).

This is the result I got:

We have signal length and kernel length. I only did the case the signal isn’t shorter than the kernel.
So the upper triangle of the matrix, in Dark Blue isn’t relevant.

In the diagonal and lower triangle we have which method was faster:

  1. Light Blue - Direct.
  2. Green - Overlap and Save.
  3. Yellow - Frequency Domain.

I am not sure what cases you use, but as you can see, there is almost no reason to use Overlap & Save.
In my opinion, as long you have the whole data in memory use either direct (Cases the kernel is shorter than ~400-500 samples in the implementation above) or Frequency Domain.

I’d even mention that MATLAB’s conv() is probably much slower than what you can do with LoopVectorization.jl in Julia.

This was an answer to Convolution Strategy / Method for the Fastest 1D Convolution.

@ElectronicTeaCup, In the code I posted in the answer to the question you will be able to see which exact samples you need to extract in order to replicate MATLAB’s same.
Though the optimal thing to do will be implementing direct convolution using LoopVecotrization.jl which on the computation only compute the required samples (full, same or valid). The current approach in DSP.jl is sub optimal in my opinion.


Overlap-save convolution just turns a convolution with un-equal lengths into a bunch of convolutions with equal-lengths (for each block). So if the blocks are large enough (larger than about 600 samples from your plot) than you want to use FFT processing for each block, not use direct convolution.

But I agree that for short convolutions (seemingly less than about 400 samples from your plot) conv should be doing direct convolution. There’s actually already some code for direct covolution in DSP.jl, so it would be a pretty good intro PR for someone to include it within conv.

As I mentioned before, I think we are using different definitions of “short” and “long”. Your plot only goes up to N=2000, whereas I’m frequently convolving a signal of millions of samples with a signal with some hundreds of thousands, which is a common use-case for overlap-save convolution.

@galenlynch did some similar analysis with kernels from 1 to 2^20 that compared FFT to overlap-save, both in CPU time and memory usage:

edit: that plot has different dimensionalities as well, so for this conversation you want the N=1 (one-dimensional) subplot

There is a closed form equation to the optimal DFT Window for a given length of kernel.
The code I used to create the above image uses the optimal DFT Window.
Still there are no cases Overlap and Save is superior to Direct or Frequency Domain.

Overlap and Save and Overlap and Add were conceived with a constant stream of data in mind.
For data which is loaded to memory either use direct or frequency. I think the cases one would gain from Overlap and Save are edge cases (Probably more related to correlation [Template Matching] than convolution [Filtration]).

In cases of very very long kernel and still order(s) of magnitude longer signal we might see Overlap and Save dominate Frequency Based (Direct method has nothing to sell once the Kernel is bigger than the CPU cache). But I think those cases are more related to template matching (Correlation). Though they are still be implemented using same ideas (As it is flipped kernel) they should have their own dedicated function (xcorr() or something).

I think the point where “fast” (fft) convolution techniques are faster than direct convolution will be with a much smaller kernel size than 400 with well-optimized code. Many sources I’ve seen suggest that the break-even point is a kernel size around 128, while some simple testing with 1D arrays in DSP.jl’s conv (fft) and filt (direct) functions suggests it’s 58 with our convolution code. As you can see from the plots that Russel linked to, there are many cases where overlap-save is faster than simple fft convolution: basically most cases where the two input vectors are not the same size. I can’t speak to why you have different results in your Matlab code. I would be curious to see how DSP.jl’s conv times compare to your Matlab times, if you would be willing to look at that.

As I think Russel pointed out, we already have very fast (and vectorized) direct convolution implemented in DSP.jl as part of the filt function for one dimension, but unfortunately I think it may have slightly different behavior than conv which prevents us from using that code for conv as well (at least that is my recollection). It has been a long-term goal of mine to combine FIR filtering code with convolution code, to allow direct convolution to be used more generally. One difficulty is that I have found it difficult to write fast direct convolution code that works with arbitrary array indexing and an arbitrary number of dimensions, which is currently what conv supports. It might be an easy PR to calculate the break-even point and use direct convolution for small kernels when using just one dimension and 1-indexed arrays.

Ultimately, the choice to use fft convolution for conv was made a long time ago in Julia base. Even though that conv code has since been moved into DSP.jl and modified to use overlap-save when it is faster, we have not yet added a direct convolution branch when that would be faster.

In terms of wishing for better documentation, you’re right that it could be better documented. PRs are, of course, welcome.

1 Like

Well, talking to @ssfrr, I get that Fourier Domain Convolution is using the next power of 2. This is the reason you get the result you get.

The image I posted shows you clearly that Direct Method easily beat Overlap and Save for kernels up to 400 samples. I bet with highly tuned code (Let’s say like in Intel IPP) it will be even higher.

Have a look at my code, implement the Frequency Domain the way I implemented it (Pad with zeros to M + N - 1 and not nextpow2(M + N - 1)) and you’ll see OVS doesn’t shine as it seems.

The cases OVS is useful wii be the rare cases where the Kernel is huge and the data is still order of magnitude bigger.
As I wrote above, this cases mostly happen for Template Matching (Correlation).
This is the reason MATLAB implements conv() in the direct method and for those rare cases it has xcorr() which is done using the Frequency Domain.

Conversation continued at DSP.jl issue #355.


OK, this thread died a long time ago but I’ve spent some time trying to figure out what the optimal 2D convolution is and I’ve settled in ImageFilter.jl for my use case (5x5 Wallis Filter applied to a 7000x7000 image). Here is the timing:

naive loop
28.633 s (565838525 allocations: 32.49 GiB)
naive loop - multiple threads
10.809 s (701927312 allocations: 34.52 GiB)
naive mapwindow - multiple threads
8.786 s (50063722 allocations: 12.10 GiB)
ImageFilter.imfilter - multiple threads
** 627.990 ms (1890 allocations: 2.56 GiB)**
LoopVectorization - multiple threads
803.172 ms (15 allocations: 2.19 GiB)
DSP.conv - multiple threads
1.266 s (196 allocations: 2.56 GiB)

1 Like

And here is the code in case anyone finds it useful or has comments:

using Statistics
using ImageFiltering
using DSP
using BenchmarkTools
using LoopVectorization, OffsetArrays, Images

function filter_test()
w = 5;
x = 7000;
y = x;
I = rand(x,y)
F = zeros(x,y)

test lesser types

#I = convert(Matrix{Float32}, I)
#F = convert(Matrix{Float32}, F)

define wallis filter

function wallis(chip)
chip_mean = mean(chip)
kernel_sum = length(chip)
chip_std = sqrt(sum((chip .- chip_mean).^2)/kernel_sum)
return out = (chip[ceil(Int,kernel_sum/2)] - chip_mean) / chip_std

ww = floor(Int,w/2)

now using 1 core [time = 27s for x = 7000]

println(“naive loop”)
@btime for i = (ww+1):1:(x-ww-1), j = (ww+1):1:(x-ww-1)
@inbounds F[i,j] = wallis(I[(i-ww):(i+ww), (j-ww):(j+ww)])

now using mutiplt threads (if Julia setup to use multiple threads) [time = 11s for x = 7000]

println(“naive loop - multiple threads”)
@btime Threads.@threads for i = (ww+1):1:(x-ww-1)
for j = (ww+1):1:(x-ww-1)
@inbounds F[i,j] = wallis(I[(i-ww):(i+ww), (j-ww):(j+ww)])

now using mapwindow and mutiplt threads (if Julia setup to use multiple threads) [time = 9s for x = 7000]

println(“naive mapwindow - multiple threads”)
@btime mapwindow(wallis, I, (w,w));

now try using imfilter [time = ~0.75s for x = 7000]

function wallis_fast(I,w)
I = I - imfilter(I, ones(w,w)/w^2);
I = I ./ sqrt.(imfilter(I.^2, ones(w,w)/w^2))
println(“imfilter - multiple threads”)
@btime wallis_fast(I,w)

now try using loop LoopVectorization

function conv2(out::AbstractMatrix, A::AbstractMatrix, kern)
@turbo for J in CartesianIndices(out)
tmp = zero(eltype(out))
for I ∈ CartesianIndices(kern)
@inbounds tmp += A[I + J] * kern[I]
out[J] = tmp

function wallis_fast2(I,w)
kern =,w))
out = OffsetArray(similar(I, size(I) .- size(kern) .+ 1), -1 .- kern.offsets);

I[3:end-2,3:end-2] = I[3:end-2,3:end-2] - OffsetArrays.no_offset_view(conv2(out,I, kern));
I[3:end-2, 3:end-2] = I[3:end-2, 3:end-2] ./ sqrt.(OffsetArrays.no_offset_view(conv2(out,I.^2, kern)))


println(“LoopVectorization - multiple threads”)
@btime wallis_fast2(I,w)

now try using DSP.conv

function wallis_fast3(I,w)
I = I - DSP.conv(ones(w,w)/w^2, I)[3:end-2, 3:end-2];
I = I ./ sqrt.(DSP.conv(I.^2, ones(w,w)/w^2)[3:end-2, 3:end-2])
println(“DSP.conv - multiple threads”)
@btime wallis_fast3(I,w)


run test


To use threading with LoopVectorization you need to use the @tturbo macro (“threaded turbo”). So I assume you could squeeze out even a little more performance out of it :slight_smile:

A point which hasn’t been mentioned here (and which admittedly rarely is relevant) is that the FTT IFFT approach introduces more round-off errors than direct convolution. This can be relevant when the value you estimate, for instance, indirectly from convolutions, is smaller than roughly the eps value of the precision used in your calculations (for example, smaller than 10^{-16} for calculations in Float64s).