# Optimizing Direct 1D Convolution Code

I am trying to create a function for a direct (Not `fft()` based) 1D convolution.

Originally I used the classic implementation:

``````using BenchmarkTools;

function _Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

lenA = length(vA);
lenB = length(vB);

fill!(vO, zero(T));
for idxB in 1:lenB
@simd for idxA in 1:lenA
@inbounds vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
end
end

return vO;

end

numSamplesA = 1000;
numSamplesB = 15;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

@benchmark _Conv1D!(\$vO, \$vA, \$vB)
``````

The code is pretty fast yet MATLAB’s `conv()` is faster.
It is crucial for me to get at least closer to MATLAB’s.

I read @Elrod 's blog post - Orthogonalize Indices.
I tried to generate a code where the index of the output array is linear:

``````using LinearAlgebra;

function Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

lenA = length(vA);
lenB = length(vB);
lenO = length(vO);
vC   = view(vB, lenB:-1:1);
@simd for ii in 1:lenO
# Rolling vB over vA
startIdxA = max(1, ii - lenB + 1);
endIdxA   = min(lenA, ii);
startIdxC = max(lenB - ii + 1, 1);
endIdxC   = min(lenB, lenO - ii + 1);
@inbounds vO[ii] = dot(view(vA, startIdxA:endIdxA), view(vC, startIdxC:endIdxC));
end

return vO;

end
``````

Yet it made things much slower (~40 times slower).

I was wondering, is there a way to have a faster code?
Make it more SIMD friendly?

1 Like

I think that the loop order plays a big role here. Also, I think your second `Conv1D!` (outer loop over `vO`) cannot SIMD due to the inner loop used by `dot`.

I wrote two implementations using different loop orders. Apart from the SIMD issue, I don’t see how mine are fundamentally different than yours. I believe your first version is more SIMD-friendly since it can accumulate to multiple outputs at once (and I get much better performance for my analog).

``````function conv1y!(z::AbstractVector{T},x::AbstractVector,y::AbstractVector) where T
fill!(z,zero(T))
@inbounds for iy in eachindex(y)
vy = y[iy]
@simd for ix in intersect(eachindex(x), eachindex(z) .- (iy - 1))
z[iy-1+ix] += x[ix] * vy
end
end
return z
end

function conv1z!(z::AbstractVector{T},x::AbstractVector,y::AbstractVector) where T
@inbounds for iz in eachindex(z)
vz::T = zero(T)
@simd for ix in intersect(eachindex(x), (iz + 1) .- eachindex(y))
vz += x[ix] * y[iz+1-ix]
end
z[iz] = vz
end
return z
end

x = randn(1000); y = randn(15); z = similar(x,length(x)+length(y)-1);
@btime conv1y!(\$z,\$x,\$y); # 1.490 μs (0 allocations: 0 bytes)
@btime conv1y!(\$z,\$y,\$x); # 8.500 μs (0 allocations: 0 bytes)
@btime conv1z!(\$z,\$x,\$y); # 34.800 μs (0 allocations: 0 bytes)
@btime conv1z!(\$z,\$y,\$x); # 37.200 μs (0 allocations: 0 bytes)
``````

I won’t promise I got all the index ordering right, but it seemed correct for some short inputs I tried.

I named each function for the outer loop variable. The performance difference is massive. Also notice the big performance difference depending on the argument order for `conv1y!`: the inner SIMD loop should be the longer vector. The function should check this up-front and switch the inputs accordingly, but I left it here for demonstration.

A hyper-optimized implementation might do some sort of blocking for cache efficiency, but that’s another level beyond this.

EDIT: I hoisted `y[iy]` out of the inner loop in `conv1y!` and got better performance. I would have expected this to happen automatically, but it seems it did not. The implementation and timings have been updated.

2 Likes

What’s the cost of using `intersect()`? how come it is faster than the explicit execution of the indices?

I think that the loop order plays a big role here. Also, I think your second `Conv1D!` (outer loop over `vO`) cannot SIMD due to the inner loop used by `dot`.

I totally agree. Hence I was looking for assistance from those with experience and knowledge about generating the SIMD optimized order.

The main reason I chose to use `intersect` is because it seemed more clear. I don’t expect it to be faster or slower than your manual implementation, since it does roughly the same thing. Try `@edit intersect(1:10,5:15)` to look into details, if you want. I mostly used it because I’ve written multidimensional convolution before and it gets much nicer when using `CartesianIndices` and `intersect` rather than `length`, `min`, and `max` to determine valid ranges. It also generalizes more easily to arrays with nonstandard indexing.

Be warned that your `_Conv1D!` will access out-of-bounds memory (and maybe segfault) if your `vO` is too short.

If my `conv1y!` is significantly faster than your `_Cond1D!`, I couldn’t say why. Semantically, they look the same except for my extra check for the indices of `z`. When I wrote it I was thinking there was some other difference, but I can’t notice it any more so maybe there never was. If a significant difference exists, some subtle change must have led to a compiler optimization succeeding/failing. I’ll leave that for you to dig into.

From my analysis I find that your `conv1y!` is faster than `Conv1D!()` yet slower than `_Conv1D!()` which I’d like to beat.

Unless I’ve missed an algorithmic/ordering opportunity somewhere, I think that’s close to as well as you can easily do with idiomatic base Julia. The next step to speed is probably to check out the `LoopVectorization.jl` package. It’s very good at optimizing these sorts of kernels, so may be able to wring out some extra performance.

I don’t know whether MATLAB’s implementation is multi threaded. If `LoopVectorization.jl` can’t get you to parity single-threaded, then MATLAB’s probably is. `LoopVectorization.jl` also has support for multi-threading.

Keep in mind that MATLAB is calling highly optimized code and libraries under the hood, so (apart from threading differences) I would not expect another language to be able to do notably better.

How much faster is MATLAB?

Adding `@tturbo` from LoopVectorization.jl to your inner loop speeds up things by ~25%.

As a side note, DSP.jl’s `conv()` seems to be 16x slower for your example.

It’s best to add it to the outer loop.
You may have to transform your loops and apply masks to make LV happy.
See the `adjA` kernels from SimpleChains.jl as an example:

input (`K2` above) and output (`K3` above) channels are `1` in the examples here.

1 Like

Why are you interested in computing convolutions in this way? Are your arrays super small so the fft is useless? How many times do you need to do this convolutions?

If one array is much smaller than the other, the direct method (or something like winnograd, popular in machine learning) are much faster.

I think using fft is relatively rare in machine learning, where a convolutional layer convolves a large image with a small kernel, but I could be mistaken.

Given a small fixed size kernel, you could view the direct method as being O(N) in input size, vs O(N*log(N)) for the fft.

1 Like

This is what I tried to do with no success. It gave worse performance.
Probably as I did it not as it meant in your blog post.

Of course I need to convert the `dot` into a loop, but beside that, did I miss something?

In general, if you use overlap-add methods or similar, convolving with a filter of length F is O(N \log F) if you use a sequence of small FFTs of length \ge 2F-1 to perform a sequence of zero-padded circular convolutions and then add them up.

However, to minimize the overhead of using FFTs for convolutions of small sizes (\lesssim 128), I think you really want to generate and hard-code the small-size convolutions. (This is much the same as how FFTW generates hard-coded FFTs of small sizes as the base cases of the algorithm. In fact, FFTW’s generator has an undocumented feature allowing it to generate hard-coded convolutions of small sizes, from an experiment long ago. At that time, we found that it was better to switch to Karatsuba’s algorithm for sizes \lesssim 64 or so IIRC.) I’ve often thought it would be useful to have an FFTW-like library for convolutions that knows all of these tricks, but never had the time to work on it.

7 Likes

I also find that direct convolution is much less susceptible to occssionally-inconvenient computational artifacts. For example, direct convolution will always produce integer results from integer inputs, nonnegative results from nonnegative inputs, etc. Since fast convolution involves finite-precision operations in the Fourier domain, these invariants can easily be violated.

Obviously, these sorts of small errors are usually trivial to fix with post-processing. But they still lead me to favor direct convolution for small inputs or where performance is noncritical.

I haven’t reexamined in a bit, but I often find the performance of fast streaming convolution algorithms (OLS/OLA) to be underwhelming. Usually there is a narrow regime where they offer slightly superior performance to either direct or fast convolution, but too small and finicky to get excited about. I tend to only reach for them when a streaming algorithm is operationally necessary. But perhaps a better implementation (than whatever I tried years ago) would have a significant enough benefit in a wide enough regime to be generally useful.

Yes, I think if you don’t hard-code the small-size fast-convolution code, the overhead of a generic-n FFT (or rather, two of them!) will kill any advantage for small sizes. Code-generation/metaprogramming is probably essential here.

``````using LoopVectorization, StrideArraysCore
using StrideArraysCore: static, static_length, zero_offsets

function conv3!(
_c::AbstractVector{T},
_a::AbstractVector{T},
_b::AbstractVector{T},
) where {T}
c = zero_offsets(_c)
a = zero_offsets(_a)
b = zero_offsets(_b)
I = static_length(c)
K = static_length(b)
J = I - K + static(1)
J < K && return conv3!(_c, _b, _a)
@turbo for i = 0:K-2
s = zero(T)
for k = 0:K-1
ib0 = (i >= k)
oa = ib0 ? a[i-k] : zero(T)
s += b[k] * oa
end
c[i] = s
end
@turbo inline=true for i = K-1:J-1
s = zero(T)
for k = 0:K-1
s += b[k] * a[i-k]
end
c[i] = s
end
@turbo for i = J:I-1
s = zero(T)
for k = 0:K-1
ib0 = (i < J+k)
oa = ib0 ? a[i-k] : zero(T)
s += b[k] * oa
end
c[i] = s
end
end
``````

I get

``````julia> @benchmark conv3!(\$vO, \$vA, \$vB)
BenchmarkTools.Trial: 9568 samples with 188 evaluations.
Range (min … max):  546.271 ns … 595.362 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     547.266 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   547.850 ns ±   1.916 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▂▃▅▇███▆▄▂                                  ▁▂▃▃▂▂▁▁▂▁▁     ▂
▅████████████▇▆▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▄▇██████████████▇▇ █
546 ns        Histogram: log(frequency) by time        554 ns <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _Conv1D!(\$vO, \$vA, \$vB)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.251 μs …  1.526 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.255 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.256 μs ± 13.999 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▆█
▃██▇▃▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
1.25 μs        Histogram: frequency by time        1.35 μs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> vOc = similar(vO);

julia> _Conv1D!(vOc, vA, vB);

julia> conv3!(vO, vA, vB)

julia> vOc ≈ vO
true
``````

This is a modification of the SimpleChains.jl code.

The basic idea is hoist the checks to remove them, so that the main loop doesn’t have them.

I imagine we can probably do better. The main loop looks like

``````.LBB0_27:                               # %L708
#   Parent Loop BB0_26 Depth=1
# =>  This Inner Loop Header: Depth=2
vfmadd231pd     zmm0, zmm8, zmmword ptr [rdi + 8*rax - 448] # zmm0 = (zmm8 * mem) + zmm0
vfmadd231pd     zmm1, zmm8, zmmword ptr [rdi + 8*rax - 384] # zmm1 = (zmm8 * mem) + zmm1
vfmadd231pd     zmm2, zmm8, zmmword ptr [rdi + 8*rax - 320] # zmm2 = (zmm8 * mem) + zmm2
vfmadd231pd     zmm3, zmm8, zmmword ptr [rdi + 8*rax - 256] # zmm3 = (zmm8 * mem) + zmm3
vfmadd231pd     zmm4, zmm8, zmmword ptr [rdi + 8*rax - 192] # zmm4 = (zmm8 * mem) + zmm4
vfmadd231pd     zmm5, zmm8, zmmword ptr [rdi + 8*rax - 128] # zmm5 = (zmm8 * mem) + zmm5
vfmadd231pd     zmm6, zmm8, zmmword ptr [rdi + 8*rax - 64] # zmm6 = (zmm8 * mem) + zmm6
vfmadd231pd     zmm7, zmm8, zmmword ptr [rdi + 8*rax] # zmm7 = (zmm8 * mem) + zmm7
dec     rax
jne     .LBB0_27
``````

However, it’s easy to see why it’s around twice as fast as the original `_Conv1D!` implementation:

``````for idxA in 1:lenA
vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
end
``````

is going to have 2 loads per fma. Twice as many loads, twice as slow.

2 Likes

@Elrod , If I got your code properly you’re handling the edge cases specifically.
If I use regular vectors (No offsets or tricky indexing), may I skip using the `StrideArraysCore` package?

1 vs 0-based Indexing

Oh, I see.
I thought that `zero_offsets` was about removing existing offsets :-).

Yes, which makes it zero-based 1 Like

OK, so based on @Elrod case, which optimizes the head and tail (Basically the edges, the non valid part, of the convolution) I created:

``````function ___Conv1D!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}
# Based on https://discourse.julialang.org/t/97658/15
# Doesn't require `StrideArraysCore` (Uses 1 based index vs. 0 based index)

J = length(vA);
K = length(vB); #<! Assumed to be the Kernel

# Optimized for the case the kernel is in vB (Shorter)
J < K && return ___Conv1D!(vO, vB, vA);

I = J + K - 1; #<! Output length

@turbo for ii in 0:(K - 1) #<! Head
sumVal = zero(T);
for kk in 1:K
ib0 = (ii >= kk);
oa = ib0 ? vA[ii - kk + 1] : zero(T);
sumVal += vB[kk] * oa;
end
vO[ii] = sumVal;
end
@turbo inline=true for ii in K:(J - 1) #<! Middle
sumVal = zero(T);
for kk in 1:K
sumVal += vB[kk] * vA[ii - kk + 1];
end
vO[ii] = sumVal;
end
@turbo for ii in J:I #<! Tail
sumVal = zero(T);
for kk in 1:K
ib0 = (ii < J + kk);
oa = ib0 ? vA[ii - kk + 1] : zero(T);
sumVal += vB[kk] * oa;
end
vO[ii] = sumVal;
end
return vO
end
``````

So far, it is the fastest :-).