# Optimizing Direct 2D Convolution Code

I have implemented 2D convolution using direct calculation:

``````
using BenchmarkTools;

function _Conv2D!( mO :: Matrix{T}, mI :: Matrix{T}, mK :: Matrix{T} ) where {T <: AbstractFloat}

numRowsI, numColsI = size(mI);
numRowsK, numColsK = size(mK);

for jj ∈ 1:(numColsK - 1), ii ∈ 1:(numRowsK - 1) #<! Top Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn) && (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ 1:(numColsK - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ 1:(numColsK - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Left
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj >= nn) && (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ 1:(numRowsK - 1) #<! Top Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
@inbounds sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1];
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsK:(numColsI - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Middle
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ 1:(numRowsK - 1) #<! Top Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn) && (ii >= mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ numRowsK:(numRowsI - 1) #<! Middle Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

for jj ∈ numColsI:(numColsI + numColsK - 1), ii ∈ numRowsI:(numRowsI + numRowsK - 1) #<! Bottom Right
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
ib0 = (jj < numColsI + nn) && (ii < numRowsI + mm);
@inbounds oa = ib0 ? mI[ii - mm + 1, jj - nn + 1] : zero(T);
@inbounds sumVal += mK[mm, nn] * oa;
end
mO[ii, jj] = sumVal;
end

end

mA = rand(1000, 1000);
mK = rand(9, 9);
mO = zeros(size(mA) .+ size(mK) .- (1, 1));

@btime _Conv2D!($mO, $mA, $mK);
``````

I wonder how I can optimize the code further.
In MATLAB, the built-in direct implementation `conv2()` executes the above in `6 [Milli Sec]`.
The Julia implementation above takes `60 [Milli Sec]`.
I’d be happy for tips to narrow the gap (MATLAB’s code is probably SIMD vectorized).

Moreover, the `@btime` output reports allocations, yet there is no allocation in the code. How is that?

``````
julia> @benchmark _Conv2D!($mO, $mA, $mK)
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 6.086 s (1.46% GC) to evaluate,
with a memory estimate of 6.07 GiB, over 407121552 allocations.
``````

I’m using `Julia 1.9.4` on Windows.

Remark: The concept of the loop is heavily influenced by past code of @Elrod.
Remark: The allocation issue was a typo of mine, as noted by @Dan and @Elrod. It is fixed now.

the easy speedup is to move your boundary handling outside the main loop. also if you are convolving with a small array, you might want to use a StaticArray so the compiler knows the size.

Could you elaborate?

I think Oscar means it is better to keep the loops going over the bulk of the 2D arrays as simple (and as fast) as possible, without any conditionals checking for edge cases.
The edges can be handled separately, outside the ‘hot’ loops.
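
A minimal sketch of that idea (hypothetical function name, not the OP's code): one branch-free loop over the interior where every index into `mI` is valid, plus a checked pass restricted to the thin border region:

```julia
function conv2d_split!(mO::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    numRowsK, numColsK = size(mK)
    # Interior: every index into mI is valid here, so no conditional is needed.
    for jj in numColsK:numColsI, ii in numRowsK:numRowsI
        sumVal = zero(T)
        @inbounds for nn in 1:numColsK, mm in 1:numRowsK
            sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1]
        end
        mO[ii, jj] = sumVal
    end
    # Border: checked indexing, but only over the edge region that needs it.
    for jj in 1:(numColsI + numColsK - 1), ii in 1:(numRowsI + numRowsK - 1)
        (numRowsK <= ii <= numRowsI) && (numColsK <= jj <= numColsI) && continue
        sumVal = zero(T)
        for nn in 1:numColsK, mm in 1:numRowsK
            inB = (1 <= ii - mm + 1 <= numRowsI) && (1 <= jj - nn + 1 <= numColsI)
            sumVal += inB ? mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1] : zero(T)
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

The interior loop has no branch at all, so it is free to vectorize; the conditionals only run over O(boundary) elements instead of the whole output.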


Could you give a concrete example?
I can see how to make the loops on `mm` and `nn` depend on `ii` and `jj`.

Yet I refrain from this as it was a guideline by @Elrod that doing so will hurt the vectorization.

There is a reference to `mA` in the function which is a global. It’s a typo.

1. The compiler knows nothing about your arrays, so the loops will be slow. Use StaticArrays for an easy win.

2. Use conditionals only for the corners of each matrix you are looping over. You don’t need to check every position. Knowing their size at compile time helps calculations here too.
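
Point 1 might look like the following minimal sketch (hypothetical function name; assumes StaticArrays.jl is installed). Passing the kernel as an `SMatrix` puts its size into the type, so the inner loops have compile-time trip counts and can be fully unrolled:

```julia
using StaticArrays

# Hypothetical sketch: the kernel size (M, N) is part of the type, so the
# compiler knows the trip counts of the inner loops at compile time.
function conv2d_interior!(mO::Matrix{T}, mI::Matrix{T}, mK::SMatrix{M, N, T}) where {M, N, T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    for jj in N:numColsI, ii in M:numRowsI
        sumVal = zero(T)
        @inbounds for nn in 1:N, mm in 1:M
            sumVal += mK[mm, nn] * mI[ii - mm + 1, jj - nn + 1]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end

sK = SMatrix{9, 9}(rand(9, 9))  # convert the kernel once, outside the hot loop
```

The conversion to `SMatrix` should happen once, outside the timed loop, since it is the call site that pays the specialization cost.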

Stencils.jl does these things, and your code would be something like:

``````
mapstencil(kernelproduct, Kernel(Window(4), mK), mO)
``````

(I’m not sure that’s exactly what your code does, but it looks like it - `kernelproduct` just does a non-recursive dot product of the kernel and each window.)

It’s threaded too, so on a laptop I get:

``````
julia> @time mapstencil(kernelproduct, Kernel(Window(4), mK), mO)
0.034262 seconds (241 allocations: 7.765 MiB, 0.00% compilation time)
``````

and that timing also includes allocating the new output array.

It seems there is something else wrong with your code (like a type instability) to get this timing; it shouldn’t be that much slower.

Just handle the edges by padding the arrays with the desired boundary conditions.
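
For instance (a hypothetical sketch, not the OP's code): zero-pad the input into a preallocated buffer, and the `full` convolution becomes a branch-free `valid` convolution over the padded array:

```julia
# Hypothetical sketch: mP is a preallocated buffer of size
# size(mI) .+ 2 .* (size(mK) .- 1), reused across calls.
function conv2d_padded!(mO::Matrix{T}, mP::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsK, numColsK = size(mK)
    fill!(mP, zero(T))  # zero boundary condition
    mP[numRowsK:(numRowsK + size(mI, 1) - 1), numColsK:(numColsK + size(mI, 2) - 1)] .= mI
    for jj in axes(mO, 2), ii in axes(mO, 1)
        sumVal = zero(T)
        @inbounds for nn in 1:numColsK, mm in 1:numRowsK
            sumVal += mK[mm, nn] * mP[ii + numRowsK - mm, jj + numColsK - nn]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

Other boundary conditions (replicate, mirror, periodic) just change how the border of `mP` is filled; the hot loop stays identical.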


It allocates because `mA` is a global variable.
You should probably rename the second input argument, `mI`, to `mA`.


I missed that! Now the allocations are solved.
Thanks for catching that.

Most of your hot loops look like this, with a branch in the central part, preventing vectorization.

Indeed.
For that I have the `valid` variant of the code:

``````
using LoopVectorization; # Provides the `@turbo` macro used below.

function _Conv2DValid!( mO :: Matrix{T}, mI :: Matrix{T}, mK :: Matrix{T} ) where {T <: AbstractFloat}

numRowsI, numColsI = size(mI);
numRowsK, numColsK = size(mK);

for jj ∈ 1:(numColsI - numColsK + 1)
@turbo for ii in 1:(numRowsI - numRowsK + 1)
sumVal = zero(T);
for nn ∈ 1:numColsK, mm ∈ 1:numRowsK
@inbounds sumVal += mK[mm, nn] * mI[ii - mm + numRowsK, jj - nn + numColsK];
end
mO[ii, jj] = sumVal;
end
end

end
``````

Yet I wonder if there is something to optimize for the `full` case.
Specifically, beyond adding `@turbo` to the `ii` loop.

Maybe using linear indices can speed this `_Conv2DValid!` up; that is, treating the inputs as vectors.
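
A hypothetical sketch of that idea (names are mine): precompute the linear offsets of the kernel taps into `vec(mI)` once, so the inner loop becomes a single flat reduction:

```julia
# Hypothetical sketch: same "valid" convolution as _Conv2DValid!, but the
# kernel taps are addressed through precomputed linear offsets.
function conv2d_valid_linear!(mO::Matrix{T}, mI::Matrix{T}, mK::Matrix{T}) where {T<:AbstractFloat}
    numRowsI, numColsI = size(mI)
    numRowsK, numColsK = size(mK)
    vI = vec(mI)
    vK = vec(mK)
    # Linear offset of tap (mm, nn) relative to the output pixel's base index.
    vOffs = vec([(numRowsK - mm) + (numColsK - nn) * numRowsI for mm in 1:numRowsK, nn in 1:numColsK])
    for jj in 1:(numColsI - numColsK + 1), ii in 1:(numRowsI - numRowsK + 1)
        base = ii + (jj - 1) * numRowsI
        sumVal = zero(T)
        @inbounds for kk in eachindex(vK)
            sumVal += vK[kk] * vI[base + vOffs[kk]]
        end
        mO[ii, jj] = sumVal
    end
    return mO
end
```

Whether this beats the nested-loop version likely depends on how well the compiler vectorizes the flat reduction; it is worth benchmarking both.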

It’s not always clear that this is better, even after the allocation. In Stencils.jl, `Conditional` (runtime checks without padding) can sometimes beat or come very close to `Halo` (physical padding).


I’m assuming you are doing this over and over if you care about performance, whereas you only need to allocate the padding once. Ideally you adjust your code so that the padding is built in from the beginning.
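
For example (hypothetical names), the padded buffer can be allocated once up front and refilled in place on every call, so the hot path only pays for a copy of the interior:

```julia
# Hypothetical sketch: zero-fill the reusable buffer, then copy the image into
# its interior; no allocation happens per call.
function pad_into!(mP::Matrix{T}, mI::Matrix{T}, padRows::Int, padCols::Int) where {T}
    fill!(mP, zero(T))
    mP[(padRows + 1):(padRows + size(mI, 1)), (padCols + 1):(padCols + size(mI, 2))] .= mI
    return mP
end

mI = rand(1000, 1000)
mP = zeros(1000 + 2 * 8, 1000 + 2 * 8)  # allocated once for a 9×9 kernel
pad_into!(mP, mI, 8, 8)                 # reused, allocation-free, on every call
```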