Convolution (conv) with same size output

Is there a way of using `conv(a, b)` to produce an output of the same size as vector `a`? (Like the MATLAB option `conv(a, b, 'same')`.)

I was advised by @staticfloat to pad the vectors to length `size(a) .+ size(b) .- 1`, so I did the following:

```julia
using DSP

signal = rand(100)
kernel = rand(51)
padding_count = Int((length(kernel) - 1) / 2)
conv([zeros(padding_count); signal; zeros(padding_count)], kernel)
conv(signal, [zeros(padding_count); kernel; zeros(padding_count)])
```

But both give me an output of 200 elements. Do I then just have to manually remove the zeros at the end of the result? Is there a better way of doing this?


It seems convolution is implemented using `fft()`.

I donât know why this choice was made but it certainly means that in order to get the equivalent of MATLABâs `same` operation youâd need to do some work with the indices of the result.

It is not well documented what the size of the returned data is (or even whether it applies linear or cyclic convolution in the `fft()` step), and I couldn't figure it out from the code.

Probably if you write the loop yourself using `LoopVectorization.jl` you'll get better performance and be able to customize the indices of the loop to match MATLAB's `full`, `same` or `valid` modes.
An idea of how to apply convolution using `LoopVectorization.jl` is given in the Image Filtering example.
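For illustration, here is a minimal direct-convolution loop of the kind being suggested (a sketch using plain `@inbounds`/`@simd` rather than `LoopVectorization.jl`, to stay dependency-free, and assuming `Float64` vectors). The output range is what you would adjust to get `full`, `same` or `valid`; this version computes `valid`:

```julia
# Direct "valid" convolution: only output samples where the kernel
# fully overlaps the signal are computed.
function direct_conv_valid(x::Vector{Float64}, h::Vector{Float64})
    M = length(h)
    y = zeros(length(x) - M + 1)
    @inbounds for n in eachindex(y)
        s = 0.0
        @simd for k in 1:M
            s += h[k] * x[n + M - k]   # flipped-kernel inner product
        end
        y[n] = s
    end
    return y
end
```

Changing the range of `n` (and clamping `k` at the borders) gives the `full` and `same` variants with no wasted work.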


The reason fft is used is because it is a much faster way of doing large convolutions (`n log(n)` vs `n^2`).

Well, it is only faster for really large cases where the kernel and the signal have similar sizes.
Most cases are not like that. This is the reason, for instance, that MATLAB chose a direct implementation for `conv()`, while for cross-correlation, which usually deals with signals of similar length, it uses an `fft()`-based method.
I'm pretty sure that implementing the vanilla loop with vectorization would beat it easily for most real-life cases.

Also, the documentation isn't good in my opinion.
When it says it uses `fft()`, does it expect the user to pad the arrays in order to get the linear convolution result?


Itâs probably easiest to use the `ImageFiltering` package. Despite its name, itâs useful for all sorts of arrays, not just images.

```julia
using ImageFiltering

signal = rand(100)
kernel = centered(rand(51))
imfilter(signal, kernel)
```

EDIT: a corrected version which matches the behavior of MATLAB's `conv(..., 'same')` is below.


I think `imfilter()` applies correlation, not convolution. Moreover, I am not sure the `centered()` function will yield the result of MATLAB's `same`. If I read the documentation correctly, it only shifts the kernel axes; it shouldn't change the output result.


I think this is one of those things that's pretty field-specific. For instance I'm generally not convolving with filters that are small enough for the direct implementation to be efficient. As you mention, for similarly-sized inputs the FFT approach is generally better, and when they are very differently-sized you want to use overlap-save convolution, which uses FFTs in a block-wise fashion. The `conv` implementation currently selects between those two algorithms depending on the size of the input, but it would be good for it to also select the direct method when one of the inputs is small enough.

Basically itâs a matter of perspective - some folks say âFFT convolution is only useful for really big inputsâ and other say âdirect convolution is only useful for really small inputsâ. Seems like `conv` could be smart enough to always do the right thing, but someone needs to add that check.

I agree the docs could be improved. I usually expect the convolution of a length-`N` array and a length-`M` array to be length `N+M-1`, but that should be documented.

Though it uses the FFT algorithm, DSP.jlâs `conv` implementation takes care of zero-padding internally, so it does linear convolution, not circular convolution.
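As a sketch of what that internal zero-padding amounts to (assuming FFTW.jl's `fft`/`ifft`; `fftconv` is my own illustrative name, not DSP.jl's function), padding both inputs to length `N + M - 1` makes the circular convolution of the padded vectors coincide with the linear convolution:

```julia
using FFTW

# Linear convolution via FFT: zero-pad both inputs to N + M - 1 so that
# the circular wrap-around of the DFT product never touches real data.
function fftconv(a::Vector{Float64}, b::Vector{Float64})
    n = length(a) + length(b) - 1
    A = fft([a; zeros(n - length(a))])
    B = fft([b; zeros(n - length(b))])
    return real(ifft(A .* B))   # imaginary parts are roundoff noise
end
```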

To answer the original question, it seems that what you want depends on the characteristics of your filter `b`. First off, I'm assuming that the idea is for the output to "line up" with the input. If the filter peaks on the first sample (like `exp.(0:-0.5:-10)`), then you probably want something like `conv(a, b)[1:length(a)]`. But if you have some filter that includes a delay, like `gaussian(21, 0.15)`, then you might want `conv(a,b)[11:end-10]`.
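For the delayed-filter case, a small helper along those lines (a sketch; `conv_same` is not part of DSP.jl) that mimics MATLAB's `conv(a, b, 'same')` for an odd-length kernel:

```julia
using DSP

# MATLAB-style conv(a, b, 'same'): take the central length(a) samples
# of the full convolution (assumes length(b) <= length(a), length(b) odd).
function conv_same(a, b)
    full = conv(a, b)                 # length(a) + length(b) - 1 samples
    offset = (length(b) - 1) ÷ 2      # group delay of a centered kernel
    return full[offset+1 : offset+length(a)]
end

signal = rand(100)
kernel = rand(51)
length(conv_same(signal, kernel))  # 100
```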


Do you have a speed comparison?
I'd bet that the direct method with proper vectorization and efficient code will beat overlap-save.

Iâm not sure how up to date this package is: GitHub - aamini/FastConv.jl: Fast Convolutions in Julia but as you can see it was faster. Also its implementation isnât as optimal as it should be.

Iâd be happy to see numbers.
In the past, using MATLAB, my testing showed you need very long signals and kernels to justify using methods using DFT (Overlap and Save, etcâŚ).

Ah okay, I was indexing like you mentioned in your post and was wondering if there was a direct method. I needed this for a MATLAB-based tutorial I was doing in Julia, and since MATLAB seemed to find the 'same' option important I assumed it had broader application.

Thanks!

Right, so the kernel should be reversed to do convolution. My example also had different border padding than MATLAB's, so the correct version to get the behavior of MATLAB's `conv(...,'same')` is

```julia
signal = 1:100
kernel = 1:3
imfilter(signal, reflect(centered(kernel)), Fill(0))
```

The output result has the same dimensions as `signal`, regardless of the use of `centered` (you can get a smaller result by passing the padding `Inner()`, which is like MATLAB's `'valid'`). Centering the kernel means you would get the central part of the full correlation.
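As a quick sanity check (a sketch, assuming DSP.jl's `conv` for the reference result), this does match the central part of the full convolution:

```julia
using DSP, ImageFiltering

signal = collect(1.0:100.0)
kernel = collect(1.0:3.0)

full = conv(signal, kernel)        # 102 samples ('full')
same = full[2:end-1]               # central 100 samples, MATLAB 'same'

# reflect() turns the correlation kernel into a convolution kernel;
# Fill(0) matches MATLAB's implicit zero padding at the borders.
filtered = imfilter(signal, reflect(centered(kernel)), Fill(0))

same ≈ filtered                    # the two approaches should agree
```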


I did a comparison between 3 methods of convolution:

1. Direct (Using MATLABâs `conv()`).
2. Using Overlap and Save (Implemented in MATLAB, no allocation in the loop).
3. In frequency domain (Using `fft()`).

This is the result I got:

The axes are the signal length and the kernel length. I only tested the cases where the signal isn't shorter than the kernel, so the upper triangle of the matrix, in dark blue, isn't relevant.

In the diagonal and lower triangle we have which method was faster:

1. Light Blue - Direct.
2. Green - Overlap and Save.
3. Yellow - Frequency Domain.

I am not sure which cases you use, but as you can see, there is almost no reason to use Overlap and Save.
In my opinion, as long as you have the whole data in memory, use either direct (in cases where the kernel is shorter than ~400-500 samples in the implementation above) or frequency domain.

Iâd even mention that MATLABâs `conv()` is probably much slower than what you can do with `LoopVectorization.jl` in Julia.

This was an answer to Convolution Strategy / Method for the Fastest 1D Convolution.

@ElectronicTeaCup, in the code I posted in the answer to that question you will be able to see which exact samples you need to extract in order to replicate MATLAB's `same`.
Though the optimal thing to do would be implementing direct convolution using `LoopVectorization.jl`, which only computes the required samples (`full`, `same` or `valid`). The current approach in `DSP.jl` is sub-optimal in my opinion.


Overlap-save convolution just turns a convolution with unequal lengths into a bunch of convolutions with equal lengths (one per block). So if the blocks are large enough (larger than about 600 samples from your plot) then you want to use FFT processing for each block, not direct convolution.
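For reference, a bare-bones overlap-save sketch (the function name and the default block length `L` are my own choices for illustration, not DSP.jl's implementation):

```julia
using FFTW

# Overlap-save: filter a long signal x with kernel h using length-L FFT
# blocks. Each block's circular convolution is valid except for its first
# M - 1 samples, which are discarded; blocks overlap by M - 1 samples.
function overlap_save(x, h; L = 8 * nextpow(2, length(h)))
    M = length(h)
    step = L - M + 1                       # new output samples per block
    H = fft([h; zeros(L - M)])             # kernel transform, computed once
    xpad = [zeros(M - 1); x; zeros(L)]     # M-1 leading zeros, padded tail
    y = zeros(length(x) + M - 1)           # full linear convolution
    pos = 1
    while pos <= length(y)
        block = xpad[pos:pos+L-1]
        yblock = real(ifft(fft(block) .* H))
        n = min(step, length(y) - pos + 1)
        y[pos:pos+n-1] .= yblock[M:M+n-1]  # keep only the alias-free part
        pos += step
    end
    return y
end
```

Each block costs two length-`L` FFTs regardless of the signal length, which is where the advantage over one huge FFT comes from when the inputs are very differently sized.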

But I agree that for short convolutions (seemingly less than about 400 samples from your plot) `conv` should be doing direct convolution. There's actually already some code for direct convolution in DSP.jl, so it would be a pretty good intro PR for someone to include it within `conv`.

As I mentioned before, I think we are using different definitions of "short" and "long". Your plot only goes up to N=2000, whereas I'm frequently convolving a signal of millions of samples with a signal of some hundreds of thousands, which is a common use-case for overlap-save convolution.

@galenlynch did some similar analysis with kernels from 1 to 2^20 that compared FFT to overlap-save, both in CPU time and memory usage:

edit: that plot has different dimensionalities as well, so for this conversation you want the `N=1` (one-dimensional) subplot

There is a closed-form equation for the optimal DFT window length given the length of the kernel.
The code I used to create the above image uses the optimal DFT window.
Still, there are no cases where Overlap and Save is superior to Direct or Frequency Domain.

Overlap and Save and Overlap and Add were conceived with a constant stream of data in mind.
For data which is loaded into memory, use either direct or frequency domain. I think the cases where one would gain from Overlap and Save are edge cases (probably more related to correlation [template matching] than convolution [filtering]).

In cases of a very, very long kernel with a signal still order(s) of magnitude longer, we might see Overlap and Save dominate the frequency-based method (the direct method has nothing to sell once the kernel is bigger than the CPU cache). But I think those cases are more related to template matching (correlation). Though they can still be implemented using the same ideas (as it is just a flipped kernel), they should have their own dedicated function (`xcorr()` or something).
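To illustrate the template-matching case (a sketch; the sizes and variable names here are arbitrary), correlation really is just convolution with a flipped kernel:

```julia
using DSP

signal = rand(10_000)
template = signal[4001:4200]             # the pattern we want to locate

# Cross-correlation via convolution with the reversed template.
score = conv(signal, reverse(template))

# The correlation score peaks where the template lines up with itself,
# i.e. near sample 4200 (= 4001 + length(template) - 1).
argmax(score)
```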

I think the point where "fast" (fft) convolution techniques are faster than direct convolution will be at a much smaller kernel size than 400 with well-optimized code. Many sources I've seen suggest that the break-even point is a kernel size around 128, while some simple testing with 1D arrays in DSP.jl's `conv` (fft) and `filt` (direct) functions suggests it's 58 with our convolution code. As you can see from the plots that Russel linked to, there are many cases where overlap-save is faster than simple fft convolution: basically most cases where the two input vectors are not the same size. I can't speak to why you have different results in your Matlab code. I would be curious to see how DSP.jl's conv times compare to your Matlab times, if you would be willing to look at that.

As I think Russel pointed out, we already have very fast (and vectorized) direct convolution implemented in DSP.jl as part of the `filt` function for one dimension, but unfortunately I think it may have slightly different behavior than `conv` which prevents us from using that code for `conv` as well (at least that is my recollection). It has been a long-term goal of mine to combine FIR filtering code with convolution code, to allow direct convolution to be used more generally. One difficulty is that I have found it difficult to write fast direct convolution code that works with arbitrary array indexing and an arbitrary number of dimensions, which is currently what `conv` supports. It might be an easy PR to calculate the break-even point and use direct convolution for small kernels when using just one dimension and 1-indexed arrays.
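As an illustration of reusing that FIR path (a sketch; `conv_via_filt` is a hypothetical helper, not DSP.jl API), appending `M - 1` zeros to the signal makes `filt` emit the full convolution:

```julia
using DSP

# Full linear convolution via the direct FIR filtering path: filt with
# numerator h and denominator 1 is an FIR filter, and padding the input
# with M - 1 trailing zeros lets the filter tail run out completely.
function conv_via_filt(x, h)
    return filt(h, one(eltype(h)), [x; zeros(eltype(x), length(h) - 1)])
end

x, h = rand(1000), rand(20)
conv_via_filt(x, h) ≈ conv(x, h)   # should agree up to roundoff
```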

Ultimately, the choice to use fft convolution for `conv` was made a long time ago in Julia base. Even though that `conv` code has since been moved into DSP.jl and modified to use overlap-save when it is faster, we have not yet added a direct convolution branch when that would be faster.

In terms of wishing for better documentation, you're right that it could be better documented. PRs are, of course, welcome.


Well, talking to @ssfrr, I get that the Fourier-domain convolution uses the next power of 2. This is the reason you get the result you get.

The image I posted clearly shows that the direct method easily beats Overlap and Save for kernels up to 400 samples. I bet with highly tuned code (let's say like Intel IPP) it would be even higher.

Have a look at my code, implement the frequency domain the way I implemented it (pad with zeros to `M + N - 1` and not `nextpow2(M + N - 1)`) and you'll see OVS doesn't shine as it seems to.

The cases where OVS is useful will be the rare cases where the kernel is huge and the data is still orders of magnitude bigger.
As I wrote above, these cases mostly happen in template matching (correlation).
This is the reason MATLAB implements `conv()` with the direct method, while for those rare cases it has `xcorr()`, which is done in the frequency domain.

Conversation continued at DSP.jl issue #355.


OK, this thread died a long time ago, but I've spent some time trying to figure out what the optimal 2D convolution is and I've settled on ImageFiltering.jl for my use case (a 5x5 Wallis filter applied to a 7000x7000 image). Here is the timing:

naive loop: 28.633 s (565838525 allocations: 32.49 GiB)
threaded loop: 10.809 s (701927312 allocations: 34.52 GiB)
mapwindow: 8.786 s (50063722 allocations: 12.10 GiB)
**imfilter: 627.990 ms (1890 allocations: 2.56 GiB)**
LoopVectorization conv2: 803.172 ms (15 allocations: 2.19 GiB)
DSP.conv: 1.266 s (196 allocations: 2.56 GiB)


And here is the code in case anyone finds it useful or has comments:

```julia
using Statistics
using ImageFiltering
using DSP
using BenchmarkTools
using LoopVectorization, OffsetArrays, Images

function filter_test()
    w = 5;
    x = 7000;
    y = x;
    I = rand(x, y)
    F = zeros(x, y)

    # test lesser types
    #I = convert(Matrix{Float32}, I)
    #F = convert(Matrix{Float32}, F)

    # define wallis filter
    function wallis(chip)
        chip_mean = mean(chip)
        kernel_sum = length(chip)
        chip_std = sqrt(sum((chip .- chip_mean).^2) / kernel_sum)
        return (chip[ceil(Int, kernel_sum / 2)] - chip_mean) / chip_std
    end

    ww = floor(Int, w / 2)

    # now using 1 core [time = 27s for x = 7000]
    println("naive loop")
    @btime for i = (ww+1):1:(x-ww-1), j = (ww+1):1:(x-ww-1)
        @inbounds F[i, j] = wallis(I[(i-ww):(i+ww), (j-ww):(j+ww)])
    end

    # now using multiple threads (if Julia is set up to use multiple threads) [time = 11s for x = 7000]
    @btime Threads.@threads for i = (ww+1):1:(x-ww-1)
        for j = (ww+1):1:(x-ww-1)
            @inbounds F[i, j] = wallis(I[(i-ww):(i+ww), (j-ww):(j+ww)])
        end
    end

    # now using mapwindow and multiple threads (if Julia is set up to use multiple threads) [time = 9s for x = 7000]
    @btime mapwindow(wallis, I, (w, w));

    # now try using imfilter [time = ~0.75s for x = 7000]
    function wallis_fast(I, w)
        I = I - imfilter(I, ones(w, w) / w^2);
        I = I ./ sqrt.(imfilter(I.^2, ones(w, w) / w^2))
    end
    @btime wallis_fast(I, w)

    # now try using LoopVectorization
    function conv2(out::AbstractMatrix, A::AbstractMatrix, kern)
        @turbo for J in CartesianIndices(out)
            tmp = zero(eltype(out))
            for I in CartesianIndices(kern)
                @inbounds tmp += A[I + J] * kern[I]
            end
            out[J] = tmp
        end
        out
    end

    function wallis_fast2(I, w)
        kern = Images.Kernel.box((w, w))
        out = OffsetArray(similar(I, size(I) .- size(kern) .+ 1), -1 .- kern.offsets);
        I[3:end-2, 3:end-2] = I[3:end-2, 3:end-2] - OffsetArrays.no_offset_view(conv2(out, I, kern));
        I[3:end-2, 3:end-2] = I[3:end-2, 3:end-2] ./ sqrt.(OffsetArrays.no_offset_view(conv2(out, I.^2, kern)))
    end
    @btime wallis_fast2(I, w)

    # now try using DSP.conv
    function wallis_fast3(I, w)
        I = I - DSP.conv(ones(w, w) / w^2, I)[3:end-2, 3:end-2];
        I = I ./ sqrt.(DSP.conv(I.^2, ones(w, w) / w^2)[3:end-2, 3:end-2])
    end
    @btime wallis_fast3(I, w)
end
```
To use threading with LoopVectorization you need to use the `@tturbo` macro ("threaded turbo"), so I assume you could squeeze a little more performance out of it.