# Optimizing Direct 2D Convolution Code

I have implemented 2D convolution using direct calculation:

``````
using BenchmarkTools;

function _Conv2D!( mO :: Matrix{T}, mI :: Matrix{T}, mK :: Matrix{T} ) where {T <: AbstractFloat}

numRowsI, numColsI = size(mI);
numRowsK, numColsK = size(mK);

for jj ∈ 1:(numColsK - 1), ii ∈ 1:(numRowsK - 1) #<! Top Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn) && (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ 1:(numColsK - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ 1:(numColsK - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn) && (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ 1:(numRowsK - 1) #<! Top Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
@inbounds sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1];
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ 1:(numRowsK - 1) #<! Top Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn) && (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn) && (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

end

mA = rand(1000, 1000);
mK = rand(9, 9);
mO = zeros(size(mA) .+ size(mK) .- (1, 1));

@btime _Conv2D!($mO, $mA, $mK);
``````

I wonder how I can optimize the code further.
In MATLAB, the built-in direct implementation `conv2()` executes the above in `6 [Milli Sec]`.
The Julia implementation above takes `60 [Milli Sec]`.
I’d be happy for tips to narrow the gap (MATLAB’s code is probably SIMD vectorized).

Moreover, the `@btime` output reports allocations, yet there is no allocation in the code. How is that?

``````
julia> @benchmark _Conv2D!($mO, $mA, $mK)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 6.086 s (1.46% GC) to evaluate,
with a memory estimate of 6.07 GiB, over 407121552 allocations.
``````

I’m using `Julia 1.9.4` on Windows.

Remark: The concept of the loop is heavily influenced by past code of @Elrod.
Remark: The allocation issue was a typo of mine, as noted by @Dan and @Elrod. It is fixed now.

the easy speedup is to move your boundary handling outside the main loop. also if you are convolving with a small array, you might want to use a StaticArray so the compiler knows the size.

Could you elaborate?

I think Oscar means it is better to keep the loops going over the bulk of the 2D arrays as simple (and as fast) as possible, without any conditionals checking for edge cases.
The edges can be handled separately, outside the ‘hot’ loops.
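
A minimal sketch of that idea (hypothetical function name, not the OP's code): one branch-free loop over the interior where every index into `mI` is valid, plus a checked pass restricted to the thin border region:

```julia
function conv2d_split!(mO::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    numRowsK, numColsK = size(mK)
    # Interior: every index into mI is valid here, so no conditional is needed.
    for jj in numColsK:numColsI, ii in numRowsK:numRowsI
        sumVal = zero(T)
        @inbounds for nn in 1:numColsK, mm in 1:numRowsK
            sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1]
        end
        mO[ii, jj] = sumVal
    end
    # Border: checked indexing, but only over the edge region that needs it.
    for jj in 1:(numColsI + numColsK - 1), ii in 1:(numRowsI + numRowsK - 1)
        (numRowsK <= ii <= numRowsI) && (numColsK <= jj <= numColsI) && continue
        sumVal = zero(T)
        for nn in 1:numColsK, mm in 1:numRowsK
            inB = (1 <= ii - mm + 1 <= numRowsI) && (1 <= jj - nn + 1 <= numColsI)
            sumVal += inB ? mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1] : zero(T)
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

The interior loop has no branch at all, so it is free to vectorize; the conditionals only run over O(boundary) elements instead of the whole output.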


Could you give a concrete example?
I can see how to make the loops on `mm` and `nn` depend on `ii` and `jj`.

Yet I refrain from this as it was a guideline by @Elrod that doing so will hurt the vectorization.

There is a reference to `mA` in the function which is a global. It’s a typo.

1. The compiler knows nothing about your arrays, so the loops will be slow. Use StaticArrays for an easy win.

2. Use conditionals only for the corners of each matrix you are looping over. You don’t need to check every position. Knowing their size at compile time helps calculations here too.
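
Point 1 might look like the following minimal sketch (hypothetical function name; assumes StaticArrays.jl is installed). Passing the kernel as an `SMatrix` puts its size into the type, so the inner loops have compile-time trip counts and can be fully unrolled:

```julia
using StaticArrays

# Hypothetical sketch: the kernel size (M, N) is part of the type, so the
# compiler knows the trip counts of the inner loops at compile time.
function conv2d_interior!(mO::Matrix{T}, mI::Matrix{T}, mK::SMatrix{M, N, T}) where {M, N, T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    for jj in N:numColsI, ii in M:numRowsI
        sumVal = zero(T)
        @inbounds for nn in 1:N, mm in 1:M
            sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end

sK = SMatrix{9, 9}(rand(9, 9))  # convert the kernel once, outside the hot loop
```

The conversion to `SMatrix` should happen once, outside the timed loop, since it is the call site that pays the specialization cost.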

Stencils.jl does these things, and your code would be something like:

``````
mapstencil(kernelproduct, Kernel(Window(4), mK), mO)
``````

(I’m not sure that’s exactly what your code does, but it looks like it - `kernelproduct` just does a non-recursive dot product of the kernel and each window.)

It’s threaded too, so on a laptop I get:

``````
julia> @time mapstencil(kernelproduct, Kernel(Window(4), mK), mO)
0.034262 seconds (241 allocations: 7.765 MiB, 0.00% compilation time)
``````

and that timing also includes allocating the new output array.

It seems there is something else wrong with your code (like a type instability) to get this timing; it shouldn’t be that much slower.

Just handle the edges by padding the arrays with the desired boundary conditions.
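
For instance (a hypothetical sketch, not the OP's code): zero-pad the input into a preallocated buffer, and the `full` convolution becomes a branch-free `valid` convolution over the padded array:

```julia
# Hypothetical sketch: mP is a preallocated buffer of size
# size(mI) .+ 2 .* (size(mK) .- 1), reused across calls.
function conv2d_padded!(mO::Matrix{T}, mP::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsK, numColsK = size(mK)
    fill!(mP, zero(T))  # zero boundary condition
    mP[numRowsK:(numRowsK + size(mI, 1) - 1), numColsK:(numColsK + size(mI, 2) - 1)] .= mI
    for jj in axes(mO, 2), ii in axes(mO, 1)
        sumVal = zero(T)
        @inbounds for nn in 1:numColsK, mm in 1:numRowsK
            sumVal += mK[mm, nn] * mP[ii + numRowsK - mm, jj + numColsK - nn]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

Other boundary conditions (replicate, mirror, periodic) just change how the border of `mP` is filled; the hot loop stays identical.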


It allocates because `mA` is a global variable.
You should probably rename the second input argument, `mI`, to `mA`.


I missed that! Now the allocations are solved.
Thanks for catching that.

Most of your hot loops look like this, with a branch in the central part, preventing vectorization.

Indeed.
For that I have the `valid` variant of the code:

``````
using LoopVectorization; # Provides the `@turbo` macro used below.

function _Conv2DValid!( mO :: Matrix{T}, mI :: Matrix{T}, mK :: Matrix{T} ) where {T <: AbstractFloat}

numRowsI, numColsI = size(mI);
numRowsK, numColsK = size(mK);

for jj ∈ 1:(numColsI - numColsK + 1)
@turbo for ii in 1:(numRowsI - numRowsK + 1)
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
@inbounds sumVal += mK[mm, nn] * mI[ii - mm + numRowsK, jj - nn + numColsK];
end
mO[ii, jj] = sumVal;
end
end

end
``````

Yet I wonder if there is something to optimize for the `full` case.
Specifically, beyond adding `@turbo` to the `ii` loop.

Maybe using linear indices can speed this `_Conv2DValid!` up; that is, treating the inputs as vectors.
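
A hypothetical sketch of that idea (names are mine): precompute the linear offsets of the kernel taps into `vec(mI)` once, so the inner loop becomes a single flat reduction:

```julia
# Hypothetical sketch: same "valid" convolution as _Conv2DValid!, but the
# kernel taps are addressed through precomputed linear offsets.
function conv2d_valid_linear!(mO::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    numRowsK, numColsK = size(mK)
    vI = vec(mI)
    vK = vec(mK)
    # Linear offset of tap (mm, nn) relative to the output pixel's base index.
    vOffs = vec([(numRowsK - mm) + (numColsK - nn) * numRowsI for mm in 1:numRowsK, nn in 1:numColsK])
    for jj in 1:(numColsI - numColsK + 1), ii in 1:(numRowsI - numRowsK + 1)
        base = ii + (jj - 1) * numRowsI
        sumVal = zero(T)
        @inbounds for kk in eachindex(vK)
            sumVal += vK[kk] * vI[base + vOffs[kk]]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

Whether this beats the nested-loop version likely depends on how well the compiler vectorizes the flat reduction; it is worth benchmarking both.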

It’s not always clear that this is better, even after the allocation. In Stencils.jl, `Conditional` (runtime checks without padding) can sometimes beat or come very close to `Halo` (physical padding).


I’m assuming you are doing this over and over if you care about performance, whereas you only need to allocate the padding once. Ideally you adjust your code so that the padding is built in from the beginning.
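
For example (hypothetical names), the padded buffer can be allocated once up front and refilled in place on every call, so the hot path only pays for a copy of the interior:

```julia
# Hypothetical sketch: zero-fill the reusable buffer, then copy the image into
# its interior; no allocation happens per call.
function pad_into!(mP::Matrix{T}, mI::Matrix{T}, padRows::Int, padCols::Int) where {T}
    fill!(mP, zero(T))
    mP[(padRows + 1):(padRows + size(mI, 1)), (padCols + 1):(padCols + size(mI, 2))] .= mI
    return mP
end

mI = rand(1000, 1000)
mP = zeros(1000 + 2 * 8, 1000 + 2 * 8)  # allocated once for a 9×9 kernel
pad_into!(mP, mI, 8, 8)                 # reused, allocation-free, on every call
```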