PaddedViews very slow

I’m sure I’m doing something silly, but I can’t seem to get PaddedViews.jl up to speed. Consider the image filtering example from LoopVectorization.jl:

using LoopVectorization, OffsetArrays, Images, PaddedViews
kern = Images.Kernel.gaussian((1, 1), (3, 3))
A = rand(130,130);
Apadded = PaddedView(0.0, rand(128,128), (130,130), (2,2))
out = OffsetArray(similar(A, size(A) .- size(kern) .+ 1), -1 .- kern.offsets);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end

Filtering A takes 18.501 μs (using @btime), but filtering Apadded takes 312.320 μs!

I tried manually unrolling the inner loop and using @inbounds @simd instead of @turbo, but the timings were still 19.550 μs vs 111.862 μs. Am I making an error here? I thought this would be one of the main use cases for PaddedView.
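
For reference, the unrolled @inbounds @simd variant was along these lines (a reconstructed sketch, not the exact code I benchmarked; it writes out the 3×3 kernel by hand using kern’s offset axes of -1:1):

function filter2dsimd!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @inbounds @simd for J in CartesianIndices(out)
        x, y = Tuple(J)
        # nine terms of A[x+i, y+j] * kern[i, j] for i, j in -1:1, written out by hand
        out[J] = A[x-1, y-1] * kern[-1, -1] + A[x, y-1] * kern[0, -1] + A[x+1, y-1] * kern[1, -1] +
                 A[x-1, y  ] * kern[-1,  0] + A[x, y  ] * kern[0,  0] + A[x+1, y  ] * kern[1,  0] +
                 A[x-1, y+1] * kern[-1,  1] + A[x, y+1] * kern[0,  1] + A[x+1, y+1] * kern[1,  1]
    end
    out
end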

(Side question: how do you automate unrolling inner loops for @simd? Doing it manually wasn’t fun.)


Here’s one way:

julia> using Tullio, BenchmarkTools

julia> out1 = copy(@btime filter2davx!($out, $A, $kern));
  17.833 μs (0 allocations: 0 bytes)

julia> out2 = copy(@btime @tullio $out[x,y] = $A[x+i, y+j] * $kern[i,j]);
  18.458 μs (0 allocations: 0 bytes)

julia> out1 ≈ out2
true

julia> out3 = copy(@btime filter2davx!($out, $Apadded, $kern));
  267.291 μs (0 allocations: 0 bytes)

julia> Adata = Apadded.data;

julia> out4 = copy(@btime @tullio $out[x,y] = $Adata[pad(x+i,2), pad(y+j,2)] * $kern[i,j]);
  26.791 μs (50 allocations: 2.94 KiB)

julia> out3 ≈ out4
true

julia> @less getindex(Apadded, 1, 1)

I’m not sure this is ideal. It moves the branch on whether the index is in range out of PaddedView’s getindex (which @less shows above) into code that @turbo can see; you could probably do that by hand too. The indexing is then into Adata, which is an ordinary dense Array. The other approach would be to split the iteration space explicitly into bulk and boundary regions, which might be faster.
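
By hand, that might look something like this. A rough sketch, untested: filter2dpad! is a made-up name, off = 1 translates Apadded’s indices to its parent’s (matching the (2,2) construction above), and the clamped load keeps every access in-bounds while ifelse selects the fill value:

function filter2dpad!(out::AbstractMatrix, data::AbstractMatrix, kern,
                      off::Int = 1, fillval = zero(eltype(out)))
    m, n = size(data)
    @turbo for jy in axes(out, 2), jx in axes(out, 1)
        tmp = zero(eltype(out))
        for iy in axes(kern, 2), ix in axes(kern, 1)
            x = jx + ix - off                  # parent-array index
            y = jy + iy - off
            inb = (1 <= x) & (x <= m) & (1 <= y) & (y <= n)
            xc = min(max(x, 1), m)             # clamped, so the load is always in-bounds
            yc = min(max(y, 1), n)
            tmp += ifelse(inb, data[xc, yc], fillval) * kern[ix, iy]
        end
        out[jx, jy] = tmp
    end
    out
end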


That avoids PaddedView’s internals completely, so there would not be much of a reason to use it.

Also, does @tullio work with CartesianIndices? My code is multidimensional, so I would prefer to use those.

It does not. But you may take this as a proof of concept for one approach. It’s not without overhead: the padded case above is using 4 threads. I believe LV is turning the if-else on the indices into masked loads of the underlying array.

Tullio doesn’t know about types, so it can’t see the PaddedView directly. But LV does, so it could in principle make this translation automatically; that would still skip the internals in favour of a re-implementation, though. I think that, right now, it notices that the type is not something it understands (via LoopVectorization.check_args(Apadded)) and falls back to simple loops.
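
You can check that directly; I’d expect something like:

julia> LoopVectorization.check_args(Apadded)
false

julia> LoopVectorization.check_args(Apadded.data)
true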

Thanks! Certainly worth looking into!

I suspect that paying the upfront cost of allocating extra memory for non-virtual halos would be worth it for the performance gained by avoiding branches in the inner loop, but Julia doesn’t have a widely adopted package for efficiently handling halos. There’s padarray from ImageFiltering.jl, but it lacks the in-place version needed to avoid allocations and doesn’t support several useful boundary conditions. Oceananigans uses halos for some boundary conditions, but its halo machinery is built around its own data structures and doesn’t seem to be intended for broader use.
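
As a sketch of what such a package could provide (pad! is a hypothetical helper, constant fill only):

function pad!(dest::AbstractMatrix, src::AbstractMatrix, h::Int,
              fillval = zero(eltype(dest)))
    @assert size(dest) == size(src) .+ 2h
    fill!(dest, fillval)                       # writes the halo (and, wastefully, the interior)
    @views dest[h+1:end-h, h+1:end-h] .= src   # copy the bulk region in place
    return dest
end

Abig = Matrix{Float64}(undef, 130, 130)
pad!(Abig, rand(128, 128), 1)   # reuses Abig, so no fresh halo allocation per call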


Good point. No idea if this is optimised at all, but an upper bound of the cost is:

julia> @btime Matrix($Apadded);
  4.351 μs (2 allocations: 132.11 KiB)

Yup, this is what I’m doing now. I was trying to push some of the linear algebra code out for broader use, but it seems like allocating the bigger array and defining a “bulk” region to iterate over inside it is the fastest choice.
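
Concretely, that ends up looking like this (a sketch, reusing the hypothetical pad! from above):

Abig = zeros(130, 130)
pad!(Abig, rand(128, 128), 1)   # refill the halo array in place each step
filter2davx!(out, Abig, kern)   # same dense inner loop as the fast case above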