I’m surely doing something silly, but I can’t seem to get PaddedViews.jl up to speed. Consider the image filtering example from LoopVectorization.jl:
using LoopVectorization, OffsetArrays, Images, PaddedViews
kern = Images.Kernel.gaussian((1, 1), (3, 3))
A = rand(130, 130);
Apadded = PaddedView(0.0, rand(128, 128), (130, 130), (2, 2))
out = OffsetArray(similar(A, size(A) .- size(kern) .+ 1), -1 .- kern.offsets);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end
Filtering A takes 18.501 μs (using @btime), but filtering Apadded takes 312.320 μs! I tried manually unrolling the inner loop and using @inbounds @simd instead of @turbo, but the timings were still 19.550 μs vs 111.862 μs. Am I making an error here? I thought this would be the main use case for PaddedView.
(Side question: how do you automate unrolling inner loops for @simd? Doing it manually wasn’t fun.)
Here’s one way:
julia> using Tullio, BenchmarkTools
julia> out1 = copy(@btime filter2davx!($out, $A, $kern));
17.833 μs (0 allocations: 0 bytes)
julia> out2 = copy(@btime @tullio $out[x,y] = $A[x+i, y+j] * $kern[i,j]);
18.458 μs (0 allocations: 0 bytes)
julia> out1 ≈ out2
true
julia> out3 = copy(@btime filter2davx!($out, $Apadded, $kern));
267.291 μs (0 allocations: 0 bytes)
julia> Adata = Apadded.data;
julia> out4 = copy(@btime @tullio $out[x,y] = $Adata[pad(x+i,2), pad(y+j,2)] * $kern[i,j]);
26.791 μs (50 allocations: 2.94 KiB)
julia> out3 ≈ out4
true
julia> @less getindex(Apadded, 1, 1)
I’m not sure this is ideal. It moves the branch on whether the index is in range out of a function and into code that @turbo can see; you could probably do that by hand too. The indexing is then of Adata, which is an ordinary dense array. The other approach would be to break the iteration space explicitly into bulk/boundary regions, which might be faster.
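Roughly this shape, for instance (the helper name and hard-coded ranges are mine, matching the (130, 130) example above; untested):

function filter2d_split!(out, Ap, kern)
    Adata = OffsetArray(Ap.data, 1, 1)        # parent data at its position in the padded view (2:129)
    inner = CartesianIndices((3:128, 3:128))  # J for which J + I never leaves Adata
    @turbo for J in inner                     # fast, branch-free bulk
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += Adata[I + J] * kern[I]
        end
        out[J] = tmp
    end
    for J in CartesianIndices(out)            # slow path: the boundary ring only
        J in inner && continue
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += Ap[I + J] * kern[I]        # PaddedView supplies the fill value here
        end
        out[J] = tmp
    end
    out
end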
That avoids PaddedView’s internals completely, so there would not be much of a reason to use it. Also, does @tullio work with CartesianIndices? My code is multidimensional, so I would prefer to use those.
It does not. But you may take this as a proof of concept for one approach. It’s not without overhead: the padded case above is using 4 threads. I believe LV is turning if-else statements on indices into masked loads of the underlying array.
Tullio doesn’t know about types, so it can’t see the PaddedView directly. But LV does, so it could in principle make this translation automatically; it would still be skipping the internals in favour of a re-implementation, though. I think that, right now, it notices that the type is not something it understands (via LoopVectorization.check_args(Apadded)) and so falls back to simple loops.
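That check is easy to reproduce; I’d expect it to return false here, something like:

julia> LoopVectorization.check_args(Apadded)  # false means @turbo falls back to plain loops
false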
Thanks! Certainly worth looking into!
I suspect that paying the upfront cost of allocating extra memory for non-virtual halos would be worth it for the performance gain of avoiding branches in the inner loop, but Julia doesn’t have a widely adopted package for efficiently handling halos. There’s padarray from ImageFiltering.jl, but it lacks the in-place version needed to avoid allocations and doesn’t support several useful boundary conditions. Oceananigans uses halos for some BCs, but those are meant for its own data structures and don’t seem to be intended for broader use.
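A minimal sketch of the idea, with a hypothetical helper name and a zero boundary condition (a real version would refill only the rim rather than the whole buffer):

# Allocate the padded buffer once up front, then refill it in place each
# step, so the hot loop never sees a branch.
function fill_halo!(dest::AbstractMatrix, src::AbstractMatrix, h::Integer)
    fill!(dest, zero(eltype(dest)))          # fill value everywhere, including the halo
    dest[h+1:end-h, h+1:end-h] .= src        # dotted assignment writes through a view, no copy
    return dest
end

buf = zeros(130, 130)
fill_halo!(buf, rand(128, 128), 1)           # reuse buf across iterations, allocation-free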
Good point. No idea if this is optimised at all, but an upper bound on the cost is:
julia> @btime Matrix($A);
4.351 μs (2 allocations: 132.11 KiB)
Yup, this is what I’m doing now. I was trying to push some of the linear algebra stuff out for broader use, but it seems like allocating the bigger array and defining a “bulk” region to iterate inside is the fastest choice.