Optimizing Direct 1D Convolution Code

I am trying to create a function for a direct (Not fft() based) 1D convolution.

Originally I used the classic implementation:

using BenchmarkTools;

function _Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);

    fill!(vO, zero(T));
    for idxB in 1:lenB
        @simd for idxA in 1:lenA
            @inbounds vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
        end
    end

    return vO;

end

numSamplesA = 1000;
numSamplesB = 15;

vA = rand(numSamplesA);
vB = rand(numSamplesB);
vO = rand(numSamplesA + numSamplesB - 1);

@benchmark _Conv1D!($vO, $vA, $vB)

The code is pretty fast, yet MATLAB’s conv() is faster.
It is crucial for me to get at least close to MATLAB’s performance.

I read @Elrod 's blog post - Orthogonalize Indices.
I tried to write a version where the output array is indexed linearly:

using LinearAlgebra;

function Conv1D!( vO :: Array{T, 1}, vA :: Array{T, 1}, vB :: Array{T, 1} ) :: Array{T, 1} where {T <: Real}

    lenA = length(vA);
    lenB = length(vB);
    lenO = length(vO);
    vC   = view(vB, lenB:-1:1);
    @simd for ii in 1:lenO
        # Rolling vB over vA
        startIdxA = max(1, ii - lenB + 1);
        endIdxA   = min(lenA, ii);
        startIdxC = max(lenB - ii + 1, 1);
        endIdxC   = min(lenB, lenO - ii + 1);
        @inbounds vO[ii] = dot(view(vA, startIdxA:endIdxA), view(vC, startIdxC:endIdxC));
    end

    return vO;

end

Yet it made things much slower (~40 times slower).

I was wondering, is there a way to have a faster code?
Make it more SIMD friendly?

I think that the loop order plays a big role here. Also, I think your second Conv1D! (outer loop over vO) cannot SIMD due to the inner loop used by dot.

I wrote two implementations using different loop orders. Apart from the SIMD issue, I don’t see how mine are fundamentally different from yours. I believe your first version is more SIMD-friendly since it can accumulate to multiple outputs at once (and I get much better performance for my analogous version).

function conv1y!(z::AbstractVector{T},x::AbstractVector,y::AbstractVector) where T
	fill!(z,zero(T))
	@inbounds for iy in eachindex(y)
		vy = y[iy]
		@simd for ix in intersect(eachindex(x), eachindex(z) .- (iy - 1))
			z[iy-1+ix] += x[ix] * vy
		end
	end
	return z
end

function conv1z!(z::AbstractVector{T},x::AbstractVector,y::AbstractVector) where T
	@inbounds for iz in eachindex(z)
		vz::T = zero(T)
		@simd for ix in intersect(eachindex(x), (iz + 1) .- eachindex(y))
			vz += x[ix] * y[iz+1-ix]
		end
		z[iz] = vz
	end
	return z
end

x = randn(1000); y = randn(15); z = similar(x,length(x)+length(y)-1);
@btime conv1y!($z,$x,$y); # 1.490 μs (0 allocations: 0 bytes)
@btime conv1y!($z,$y,$x); # 8.500 μs (0 allocations: 0 bytes)
@btime conv1z!($z,$x,$y); # 34.800 μs (0 allocations: 0 bytes)
@btime conv1z!($z,$y,$x); # 37.200 μs (0 allocations: 0 bytes)

I won’t promise I got all the index ordering right, but it seemed correct for some short inputs I tried.

I named each function for the outer loop variable. The performance difference is massive. Also notice the big performance difference depending on the argument order for conv1y!: the inner SIMD loop should run over the longer vector. The function should check this up-front and swap the inputs accordingly, but I left it as-is here for demonstration.
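
Such a check could be a thin wrapper (a minimal sketch; conv1y_swapped! is just an illustrative name):

function conv1y_swapped!(z::AbstractVector, x::AbstractVector, y::AbstractVector)
	# Convolution is commutative, so always run the inner @simd loop over the longer input.
	return length(x) >= length(y) ? conv1y!(z, x, y) : conv1y!(z, y, x)
end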

A hyper-optimized implementation might do some sort of blocking for cache efficiency, but that’s another level beyond this.

EDIT: I hoisted y[iy] out of the inner loop in conv1y! and got better performance. I would have expected this to happen automatically, but it seems it did not. The implementation and timings have been updated.

What’s the cost of using intersect()? How come it is faster than computing the indices explicitly?

I think that the loop order plays a big role here. Also, I think your second Conv1D! (outer loop over vO) cannot SIMD due to the inner loop used by dot.

I totally agree. Hence I was looking for assistance from those with the experience and knowledge to generate a SIMD-optimized loop order.

The main reason I chose to use intersect is that it seemed clearer. I don’t expect it to be faster or slower than your manual implementation, since it does roughly the same thing. Try @edit intersect(1:10,5:15) to look into details, if you want. I mostly used it because I’ve written multidimensional convolution before and it gets much nicer when using CartesianIndices and intersect rather than length, min, and max to determine valid ranges. It also generalizes more easily to arrays with nonstandard indexing.
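
For a concrete case (my own sanity check, not benchmarked), the intersect form and the explicit max/min form produce the same range:

x = randn(10); z = zeros(12); iy = 3;
intersect(eachindex(x), eachindex(z) .- (iy - 1))   # 1:10
max(1, 2 - iy):min(length(x), length(z) - iy + 1)   # 1:10, same range via max/min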

Be warned that your _Conv1D! will access out-of-bounds memory (and maybe segfault) if your vO is too short.

If my conv1y! is significantly faster than your _Conv1D!, I couldn’t say why. Semantically, they look the same except for my extra check for the indices of z. When I wrote it I was thinking there was some other difference, but I can’t notice it any more so maybe there never was. If a significant difference exists, some subtle change must have led to a compiler optimization succeeding/failing. I’ll leave that for you to dig into.

From my benchmarks I find that your conv1y! is faster than Conv1D!() yet slower than _Conv1D!(), which I’d like to beat.

Unless I’ve missed an algorithmic/ordering opportunity somewhere, I think that’s close to as well as you can easily do with idiomatic base Julia. The next step for speed is probably to check out the LoopVectorization.jl package. It’s very good at optimizing these sorts of kernels, so it may be able to wring out some extra performance.

I don’t know whether MATLAB’s implementation is multi threaded. If LoopVectorization.jl can’t get you to parity single-threaded, then MATLAB’s probably is. LoopVectorization.jl also has support for multi-threading.

Keep in mind that MATLAB is calling highly optimized code and libraries under the hood, so (apart from threading differences) I would not expect another language to be able to do notably better.

How much faster is MATLAB?

Adding @tturbo from LoopVectorization.jl to your inner loop speeds up things by ~25%.
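
Concretely, that means replacing the inner @simd with @tturbo (a sketch of the variant I timed; _Conv1D_turbo! is just a placeholder name):

using LoopVectorization;

function _Conv1D_turbo!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}
    fill!(vO, zero(T));
    for idxB in 1:length(vB)
        @tturbo for idxA in 1:length(vA) #<! Threaded LoopVectorization inner loop
            vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
        end
    end
    return vO;
end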

As a side note, DSP.jl’s conv() seems to be 16x slower for your example.

It’s best to add it to the outer loop.
You may have to transform your loops and apply masks to make LV happy.
See the adjA kernels from SimpleChains.jl as an example:

input (K2 above) and output (K3 above) channels are 1 in the examples here.

Why are you interested in computing convolutions in this way? Are your arrays super small, so that the FFT is useless? How many times do you need to do these convolutions?

If one array is much smaller than the other, the direct method (or something like Winograd, popular in machine learning) is much faster.

I think using fft is relatively rare in machine learning, where a convolutional layer convolves a large image with a small kernel, but I could be mistaken.

Given a small fixed size kernel, you could view the direct method as being O(N) in input size, vs O(N*log(N)) for the fft.

This is what I tried to do with no success. It gave worse performance.
Probably because I didn’t do it the way it was meant in your blog post.

Of course I need to convert the dot into a loop, but besides that, did I miss something?

In general, if you use overlap-add methods or similar, convolving with a filter of length F is O(N log F) if you use a sequence of small FFTs of length ≥ 2F − 1 to perform a sequence of zero-padded circular convolutions and then add them up.
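
A rough sketch of the overlap-add idea (my own illustration; conv_ola and blockLen are made up for this example, and the lengths are not tuned):

using FFTW

function conv_ola(x::Vector{Float64}, h::Vector{Float64}, blockLen::Int = 64)
    F = length(h)
    L = blockLen + F - 1                   # FFT length per block (zero-padded)
    H = fft([h; zeros(L - F)])             # filter spectrum, computed once
    y = zeros(length(x) + F - 1)
    for start in 1:blockLen:length(x)
        stop = min(start + blockLen - 1, length(x))
        n = stop - start + 1
        blk = [x[start:stop]; zeros(L - n)]
        yBlk = real.(ifft(fft(blk) .* H))  # circular conv equals linear conv thanks to the padding
        y[start:stop+F-1] .+= yBlk[1:n+F-1]  # overlap-add the block's tail into the output
    end
    return y
end

# e.g. conv_ola(vA, vB) should match _Conv1D!(similar(vO), vA, vB) up to roundoff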

However, to minimize the overhead of using FFTs for convolutions of small sizes (≲ 128), I think you really want to generate and hard-code the small-size convolutions. (This is much the same as how FFTW generates hard-coded FFTs of small sizes as the base cases of the algorithm. In fact, FFTW’s generator has an undocumented feature, left over from an experiment long ago, allowing it to generate hard-coded convolutions of small sizes. At that time, we found that it was better to switch to Karatsuba’s algorithm for sizes ≲ 64 or so, IIRC.) I’ve often thought it would be useful to have an FFTW-like library for convolutions that knows all of these tricks, but never had the time to work on it.
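
As a toy illustration of what a hard-coded small-size convolution looks like (my own sketch, not FFTW’s generator output), here is the kernel-length-3 case written out with no inner loop:

function conv_k3!(vO::Vector{T}, vA::Vector{T}, b1::T, b2::T, b3::T) where {T}
    lenA = length(vA);
    @inbounds begin
        vO[1] = vA[1] * b1;                           # head: partial overlap
        vO[2] = vA[2] * b1 + vA[1] * b2;
        for ii in 3:lenA                              # middle: full overlap, no branches
            vO[ii] = vA[ii] * b1 + vA[ii-1] * b2 + vA[ii-2] * b3;
        end
        vO[lenA+1] = vA[lenA] * b2 + vA[lenA-1] * b3; # tail: partial overlap
        vO[lenA+2] = vA[lenA] * b3;
    end
    return vO;
end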

I also find that direct convolution is much less susceptible to occasionally inconvenient computational artifacts. For example, direct convolution will always produce integer results from integer inputs, nonnegative results from nonnegative inputs, etc. Since fast convolution involves finite-precision operations in the Fourier domain, these invariants can easily be violated.

Obviously, these sorts of small errors are usually trivial to fix with post-processing. But they still lead me to favor direct convolution for small inputs or where performance is noncritical.
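
For instance (a toy example of my own, assuming FFTW is available):

using FFTW

a = [1, 2, 3]; b = [4, 5, 6];
direct = [4, 13, 28, 27, 18];   # exact integer result of conv(a, b), done by hand
n = length(a) + length(b) - 1;
fast = real.(ifft(fft([a; zeros(Int, n - length(a))]) .* fft([b; zeros(Int, n - length(b))])));
# fast ≈ direct, but its entries are generally not exactly integer due to roundoff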

I haven’t reexamined this in a while, but I often find the performance of fast streaming convolution algorithms (OLS/OLA) to be underwhelming. Usually there is a narrow regime where they offer slightly superior performance to either direct or fast convolution, but it’s too small and finicky to get excited about. I tend to only reach for them when a streaming algorithm is operationally necessary. But perhaps a better implementation (than whatever I tried years ago) would have a significant enough benefit in a wide enough regime to be generally useful.

Yes, I think if you don’t hard-code the small-size fast-convolution code, the overhead of a generic-n FFT (or rather, two of them!) will kill any advantage for small sizes. Code-generation/metaprogramming is probably essential here.
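
A small sketch of the code-generation idea (my own; conv_gen! is a made-up name, and a real generator would go much further): unroll the inner sum for a kernel length K known at compile time, with a per-term bounds guard so the head and tail don’t need separate loops.

@generated function conv_gen!(vO::Vector{T}, vA::Vector{T}, vB::Vector{T}, ::Val{K}) where {T, K}
    # Build the fully unrolled sum over k of vA[ii - k + 1] * vB[k], with a guard per term.
    terms = [:((1 <= ii - $kk + 1 <= lenA ? vA[ii - $kk + 1] : zero(T)) * vB[$kk]) for kk in 1:K]
    return quote
        lenA = length(vA)
        @inbounds for ii in 1:(lenA + $K - 1)
            vO[ii] = +($(terms...))
        end
        return vO
    end
end

# Usage: conv_gen!(vO, vA, vB, Val(15)) — this only pays off when the kernel length is a
# compile-time constant at the call site. A real generator would also pick between direct,
# Karatsuba, and FFT forms per size, as described above.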

using LoopVectorization, StrideArraysCore
using StrideArraysCore: static, static_length, zero_offsets

function conv3!(
  _c::AbstractVector{T},
  _a::AbstractVector{T},
  _b::AbstractVector{T},
) where {T}
  c = zero_offsets(_c)
  a = zero_offsets(_a)
  b = zero_offsets(_b)
  I = static_length(c)
  K = static_length(b)
  J = I - K + static(1)
  J < K && return conv3!(_c, _b, _a)
  @turbo for i = 0:K-2
    s = zero(T)
    for k = 0:K-1
      ib0 = (i >= k)
      oa = ib0 ? a[i-k] : zero(T)
      s += b[k] * oa
    end
    c[i] = s
  end
  @turbo inline=true for i = K-1:J-1
    s = zero(T)
    for k = 0:K-1
      s += b[k] * a[i-k]
    end
    c[i] = s
  end
  @turbo for i = J:I-1
    s = zero(T)
    for k = 0:K-1
      ib0 = (i < J+k)
      oa = ib0 ? a[i-k] : zero(T)
      s += b[k] * oa
    end
    c[i] = s
  end
end

I get

julia> @benchmark conv3!($vO, $vA, $vB)
BenchmarkTools.Trial: 9568 samples with 188 evaluations.
 Range (min … max):  546.271 ns … 595.362 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     547.266 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   547.850 ns ±   1.916 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▃▅▇███▆▄▂                                  ▁▂▃▃▂▂▁▁▂▁▁     ▂
  ▅████████████▇▆▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▄▇██████████████▇▇ █
  546 ns        Histogram: log(frequency) by time        554 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _Conv1D!($vO, $vA, $vB)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.251 μs …  1.526 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.255 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.256 μs ± 13.999 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆█
  ▃██▇▃▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
  1.25 μs        Histogram: frequency by time        1.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> vOc = similar(vO);

julia> _Conv1D!(vOc, vA, vB);

julia> conv3!(vO, vA, vB)

julia> vOc ≈ vO
true

This is a modification of the SimpleChains.jl code.

The basic idea is to hoist the checks out of the main loop, so that the main loop doesn’t have them.

I imagine we can probably do better. The main loop looks like

.LBB0_27:                               # %L708
                                        #   Parent Loop BB0_26 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vbroadcastsd    zmm8, qword ptr [rdx]
        vfmadd231pd     zmm0, zmm8, zmmword ptr [rdi + 8*rax - 448] # zmm0 = (zmm8 * mem) + zmm0
        vfmadd231pd     zmm1, zmm8, zmmword ptr [rdi + 8*rax - 384] # zmm1 = (zmm8 * mem) + zmm1
        vfmadd231pd     zmm2, zmm8, zmmword ptr [rdi + 8*rax - 320] # zmm2 = (zmm8 * mem) + zmm2
        vfmadd231pd     zmm3, zmm8, zmmword ptr [rdi + 8*rax - 256] # zmm3 = (zmm8 * mem) + zmm3
        vfmadd231pd     zmm4, zmm8, zmmword ptr [rdi + 8*rax - 192] # zmm4 = (zmm8 * mem) + zmm4
        vfmadd231pd     zmm5, zmm8, zmmword ptr [rdi + 8*rax - 128] # zmm5 = (zmm8 * mem) + zmm5
        vfmadd231pd     zmm6, zmm8, zmmword ptr [rdi + 8*rax - 64] # zmm6 = (zmm8 * mem) + zmm6
        vfmadd231pd     zmm7, zmm8, zmmword ptr [rdi + 8*rax] # zmm7 = (zmm8 * mem) + zmm7
        add     rdx, 8
        dec     rax
        jne     .LBB0_27

So we have 1 broadcast and loads of loads+fmas.
However, it’s easy to see why it’s around twice as fast as the original _Conv1D! implementation:

for idxA in 1:lenA
    vO[idxA + idxB - 1] += vA[idxA] * vB[idxB];
end

is going to have 2 loads per fma. Twice as many loads, twice as slow.

@Elrod , if I understood your code properly, you’re handling the edge cases explicitly.
If I use regular vectors (no offsets or tricky indexing), may I skip using the StrideArraysCore package?

1 vs 0-based Indexing

Oh, I see.
I thought that zero_offsets was about removing existing offsets :-).

Yes, which makes it zero-based :wink:

OK, so based on @Elrod 's code, which handles the head and tail (basically the edges, the non-valid part of the convolution) separately, I created:

function ___Conv1D!( vO :: Vector{T}, vA :: Vector{T}, vB :: Vector{T} ) where {T <: Real}
    # Based on https://discourse.julialang.org/t/97658/15
    # Doesn't require `StrideArraysCore` (uses 1-based indexing instead of 0-based); still needs `LoopVectorization` for `@turbo`

    J = length(vA);
    K = length(vB); #<! Assumed to be the Kernel
    
    # Optimized for the case the kernel is in vB (Shorter)
    J < K && return ___Conv1D!(vO, vB, vA);
    
    I = J + K - 1; #<! Output length
	
    @turbo for ii in 1:(K - 1) #<! Head
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii >= kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end
    @turbo inline=true for ii in K:J #<! Middle
        sumVal = zero(T);
        for kk in 1:K
            sumVal += vB[kk] * vA[ii - kk + 1];
        end
        vO[ii] = sumVal;
    end
    @turbo for ii in (J + 1):I #<! Tail
        sumVal = zero(T);
        for kk in 1:K
            ib0 = (ii < J + kk);
            oa = ib0 ? vA[ii - kk + 1] : zero(T);
            sumVal += vB[kk] * oa;
        end
        vO[ii] = sumVal;
    end
    return vO;
end

So far, it is the fastest :-).
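
A quick sanity check and timing against the reference implementation from the top of the thread (my own snippet, reusing vA, vB, vO and numSamplesA / numSamplesB defined above):

vORef = _Conv1D!(zeros(numSamplesA + numSamplesB - 1), vA, vB);
___Conv1D!(vO, vA, vB);
vORef ≈ vO #<! Should be true
@benchmark ___Conv1D!($vO, $vA, $vB)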