I’m surely doing something silly, but I can’t seem to get PaddedViews.jl up to speed. Consider the image filtering example from LoopVectorization.jl:
using LoopVectorization, OffsetArrays, Images, PaddedViews
kern = Images.Kernel.gaussian((1, 1), (3, 3))
A = rand(130, 130);
Apadded = PaddedView(0.0, rand(128, 128), (130, 130), (2, 2))
out = OffsetArray(similar(A, size(A) .- size(kern) .+ 1), -1 .- kern.offsets);
function filter2davx!(out::AbstractMatrix, A::AbstractMatrix, kern)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kern)
            tmp += A[I + J] * kern[I]
        end
        out[J] = tmp
    end
    out
end
Filtering A takes 18.501 μs (using @btime), but filtering Apadded takes 312.320 μs! I tried manually unrolling the inner loop and using @inbounds @simd instead of @turbo, but the timings were still 19.550 μs vs 111.862 μs. Am I making an error here? I thought this would be the main use case for PaddedView.
(Side question: how do you automate unrolling inner loops for @simd? Doing it manually wasn’t fun.)
Here’s one way:
julia> using Tullio, BenchmarkTools
julia> out1 = copy(@btime filter2davx!($out, $A, $kern));
17.833 μs (0 allocations: 0 bytes)
julia> out2 = copy(@btime @tullio $out[x,y] = $A[x+i, y+j] * $kern[i,j]);
18.458 μs (0 allocations: 0 bytes)
julia> out1 ≈ out2
true
julia> out3 = copy(@btime filter2davx!($out, $Apadded, $kern));
267.291 μs (0 allocations: 0 bytes)
julia> Adata = Apadded.data;
julia> out4 = copy(@btime @tullio $out[x,y] = $Adata[pad(x+i,2), pad(y+j,2)] * $kern[i,j]);
26.791 μs (50 allocations: 2.94 KiB)
julia> out3 ≈ out4
true
julia> @less getindex(Apadded, 1, 1)
I’m not sure this is ideal. It moves the branch on whether the index is in range out of a function and into code that @turbo can see; you could probably do that by hand too. The indexing is then of Adata, which is an ordinary dense array. The other approach would be to break the iteration space explicitly into bulk/boundary regions, which might be faster.
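Roughly this shape, for instance (the helper name and hard-coded ranges are mine, matching the (130, 130) example above; untested):

function filter2d_split!(out, Ap, kern)
    Adata = OffsetArray(Ap.data, 1, 1)        # parent data at its position in the padded view (2:129)
    inner = CartesianIndices((3:128, 3:128))  # J for which J + I never leaves Adata
    @turbo for J in inner                     # fast, branch-free bulk
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += Adata[I + J] * kern[I]
        end
        out[J] = tmp
    end
    for J in CartesianIndices(out)            # slow path: the boundary ring only
        J in inner && continue
        tmp = zero(eltype(out))
        for I in CartesianIndices(kern)
            tmp += Ap[I + J] * kern[I]        # PaddedView supplies the fill value here
        end
        out[J] = tmp
    end
    out
end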
That avoids PaddedView’s internals completely, so there would not be much of a reason to use it. Also, does @tullio work with CartesianIndices? My code is multidimensional, so I would prefer to use those.
It does not. But you may take this as a proof of concept for one approach. It’s not without overhead: the padded case above is using 4 threads. I believe LV is turning if-else statements on indices into masked loads of the underlying array.
Tullio doesn’t know about types, so it can’t see the PaddedView directly. But LV does, so it could in principle make this translation automatically; it would still be skipping the internals in favour of a re-implementation, though. I think that, right now, it notices that the type is not something it understands (via LoopVectorization.check_args(Apadded)) and so falls back to simple loops.
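That check is easy to reproduce; I’d expect it to return false here, something like:

julia> LoopVectorization.check_args(Apadded)  # false means @turbo falls back to plain loops
false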
Thanks! Certainly worth looking into!
I suspect that paying the upfront cost of allocating extra memory for non-virtual halos would be worth it for the performance gain of avoiding branches in the inner loop, but Julia doesn’t have a widely adopted package for efficiently handling halos. There’s padarray from ImageFiltering.jl, but it lacks the in-place version needed to avoid allocations and doesn’t support several useful boundary conditions. Oceananigans uses halos for some BCs, but those are meant for its own data structures and don’t seem to be intended for broader use.
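A minimal sketch of the idea, with a hypothetical helper name and a zero boundary condition (a real version would refill only the rim rather than the whole buffer):

# Allocate the padded buffer once up front, then refill it in place each
# step, so the hot loop never sees a branch.
function fill_halo!(dest::AbstractMatrix, src::AbstractMatrix, h::Integer)
    fill!(dest, zero(eltype(dest)))          # fill value everywhere, including the halo
    dest[h+1:end-h, h+1:end-h] .= src        # dotted assignment writes through a view, no copy
    return dest
end

buf = zeros(130, 130)
fill_halo!(buf, rand(128, 128), 1)           # reuse buf across iterations, allocation-free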
Good point. No idea if this is optimised at all, but an upper bound on the cost is:
julia> @btime Matrix($A);
4.351 μs (2 allocations: 132.11 KiB)
Yup, this is what I’m doing now. I was trying to push some of the linear algebra stuff out for broader use, but it seems like allocating the bigger array and defining a “bulk” region to iterate inside is the fastest choice.