Clamped View

I am trying to define a performant read-only view into an array where the index is clamped to lie within the array axes. The motivation is to apply stencils like finite differences and convolutions without needing padding, offsets, or subarrays. Here’s a quick version, but it’s much slower than using an OffsetArray. (PaddedViews has similar performance to mine.)

struct ClampedView{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent :: A
    @inline ClampedView(parent::AbstractArray{T,N}) where {T,N} = new{T,N,typeof(parent)}(parent)
end
Base.parent(A::ClampedView) = A.parent
Base.size(A::ClampedView) = size(parent(A))
Base.@propagate_inbounds @inline Base.getindex(A::ClampedView{<:Any,N}, I::Vararg{Int,N}) where N = 
    @inbounds parent(A)[clamp.(I,axes(parent(A)))...]

@inline δ(i) = CartesianIndex(ntuple(j -> j==i ? 1 : 0, 2))
laplace!(b,a) = @inbounds @simd for I ∈ CartesianIndices(b)
    b[I] = a[I-δ(1)]+a[I+δ(1)]+a[I-δ(2)]+a[I+δ(2)]-4a[I]
end

N = (2^10,2^10)
a = ClampedView(rand(Float32,N));
a[CartesianIndex(-1,1)] # works
b = zeros(Float32,N);
laplace!(b,a); # works

using OffsetArrays: Origin
aPadded = Origin(0)(rand(Float32,N.+2));

using BenchmarkTools
@btime laplace!(b,a)       #1.163 ms (0 allocations: 0 bytes)
@btime laplace!(b,aPadded) #135.400 μs (0 allocations: 0 bytes)

(Edited based on @mikmoore 's suggestion)

Indexing a view (SubArray) is essentially the same cost as indexing the original array. Is it possible to get similar performance for this clamped-view idea? What is the secret sauce that allows views to be so fast?


I2 and I1 are global non-constants. Move them inside the function definition.

Your use of @inbounds in laplace! is unsafe because you haven’t validated the bounds of a. Instead, your ClampedView is always technically @inbounds in its access. However, you haven’t quite set up its getindex correctly, so I don’t believe it’s actually eliding the inner bounds checks. Further, internally it can still access out of bounds if the parent array does not have 1:N axes. I’m trying to find good documentation of its use, but the best I’ve found is this bit.

In any case, try something like

Base.getindex(A::ClampedView{<:Any,N}, I::Vararg{Int,N}) where N = @inbounds parent(A)[clamp.(I,axes(parent(A)))...]

You will never need to @inbounds an access to a ClampedView because it always checks the parent array and ensures it’s accessing the closest valid index. Meanwhile, with the above definition it is always using @inbounds to access the parent array at the clamped index.
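
As a quick sanity check with the definition above (a tiny parent array, just for illustration):

c = ClampedView(rand(Float32, 4, 4))
c[0, 0]  == c[1, 1]   # true: both indices clamp up to (1, 1)
c[10, 2] == c[4, 2]   # true: the row index clamps down to 4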


Hoping you get something simple and performant. I can remember a couple of times where I wanted something like this.


Thanks for the feedback! The globals were just from writing the MWE too quickly. I’ve replaced them with inlined functions in the code above.

Your version of the clamping is certainly better, and I like that it will work with general axes. The timings using ClampedView and OffsetArray now match when running laplace! without @inbounds. But with that macro applied (which is safe in the context where I’ll be using the function), the OffsetArray becomes 8x faster while the ClampedView sees no speed-up.

@btime laplace!(b,a)       #1.163 ms (0 allocations: 0 bytes)
@btime laplace!(b,aPadded) #135.400 μs (0 allocations: 0 bytes)

Indeed, in a sense the ClampedView always does bounds checks because it has to move the indices inward if they aren’t legal. The OffsetArray is padded out to excess size, so (in this setting) it is always accessed inbounds and no checking of any kind is necessary.

I’m surprised at the 8x performance difference you see, however. I suspect this has something to do with SIMD, but I don’t have a great idea of what’s going wrong or how to fix it. You might add a @fastmath to your b[I] = ... line to allow it to re-order the summation, but I don’t expect that to change things significantly.

At the tedious end of your options, you could replace your CartesianIndices(b) loop with a split loop over the rows/columns: handle the first and last row/column specially (outside the main loop), and then a can simply be a normal Array with @inbounds access over the interior.
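
For illustration, a sketch of that split (the name laplace_split! and the exact edge handling are made up here; it assumes b and a share 1-based axes and that a is a plain Array):

function laplace_split!(b, a)
    n1, n2 = size(b)
    # interior: every neighbour access is guaranteed in-bounds, so plain @inbounds/@simd applies
    @inbounds @simd for I in CartesianIndices((2:n1-1, 2:n2-1))
        b[I] = a[I-δ(1)] + a[I+δ(1)] + a[I-δ(2)] + a[I+δ(2)] - 4a[I]
    end
    # boundary: reuse clamped indexing, but only for the edge rows/columns
    c = ClampedView(a)
    for I in CartesianIndices(b)
        (1 < I[1] < n1 && 1 < I[2] < n2) && continue   # skip points the interior loop already did
        b[I] = c[I-δ(1)] + c[I+δ(1)] + c[I-δ(2)] + c[I+δ(2)] - 4c[I]
    end
    return b
end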

Actually, to my surprise, @simd is not helping here at all! Same timings if I take it off. @fastmath speeds up both loops a little, but doesn’t address the question.

So it must be the clamp that’s slowing us down, either directly or by blocking some compiler optimizations.

But this brings me back to thinking about how views work. How is the shift in indices accomplished so well with views and can it be replicated here?

Small aside: KernelAbstractions.jl is pretty good at closing the gap between ClampedView and OffsetArray.

using KernelAbstractions, CUDA, CUDA.CUDAKernels, Adapt
CUDA.allowscalar(false)

@kernel function lap_kernel(b, a)
    I = @index(Global, Cartesian)
    @fastmath @inbounds b[I] = a[I-δ(1)]+a[I+δ(1)]+a[I-δ(2)]+a[I+δ(2)]-4a[I]
end
lap_CUDA!(b,a) = lap_kernel(CUDABackend(), 64)(b, a, ndrange=size(b))
lap_CPU!(b,a) = lap_kernel(CPU(), 64)(b, a, ndrange=size(b))

@btime lap_CPU!(b,a)       #215.500 μs (0 allocations: 0 bytes)
@btime lap_CPU!(b,aPadded) # 60.000 μs (0 allocations: 0 bytes) ! 3.5x faster !

b = adapt(CuArray,b);
a = adapt(CuArray,a);
aPadded = adapt(CuArray,aPadded);

@btime lap_CUDA!(b,a)       #2.067 μs (0 allocations: 0 bytes)
@btime lap_CUDA!(b,aPadded) #2.178 μs (0 allocations: 0 bytes)

CUDA is happily burning through both at the same speed.

My experience is that boundary checks inside inner loops are always going to be slow, and they are inflexible as well. What if you want to change to periodic boundary conditions, for example? Better to use padding/ghost elements, which lets you completely decouple the boundary conditions from the inner loops.
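
A minimal sketch of a ghost-cell refresh for the Origin(0) padded array above (the name fill_ghosts! is made up; this bakes in the clamped boundary condition, and periodic conditions would just change which values get copied):

function fill_ghosts!(a)                 # a has axes 0:n1+1 × 0:n2+1
    n1, n2 = size(a) .- 2                # interior is 1:n1 × 1:n2
    for j in 1:n2                        # first-dimension ghost rows
        a[0, j]    = a[1, j]
        a[n1+1, j] = a[n1, j]
    end
    for i in 0:n1+1                      # second-dimension ghost columns, including corners
        a[i, 0]    = a[i, 1]
        a[i, n2+1] = a[i, n2]
    end
    return a
end

After fill_ghosts!(aPadded), the inner stencil loop never needs any clamping.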

See also Arrays with periodic boundaries - #4 by stevengj or Seemingly unnecessary allocation within for loop - #8 by stevengj

The shift in indices for views is not conditional — it’s just a constant offset. Indeed, the compiler can probably pull the offset completely out of an inner loop by simply offsetting the data pointer.
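
Roughly speaking (this is just the shape of the index arithmetic, not the actual SubArray/OffsetArrays source):

offset_index(i, o)  = i + o                         # o is loop-invariant, so the add can be hoisted or folded into the pointer
clamped_index(i, r) = clamp(i, first(r), last(r))   # a min and a max on every access, per dimension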


I get what you mean, but
getindex(A,I)=parent(A)[mod.(I,axes(A))...]
is a gorgeous way to do periodic conditions. It also saves me from going through and updating ghost values every iteration, and it works regardless of the stencil size. Not for every use-case, but it would clearly have its uses.
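
Spelled out in the same style as the ClampedView above (a hypothetical PeriodicView; mod(i, r::AbstractUnitRange) wraps i back into the range r):

struct PeriodicView{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent :: A
    PeriodicView(parent::AbstractArray{T,N}) where {T,N} = new{T,N,typeof(parent)}(parent)
end
Base.parent(A::PeriodicView) = A.parent
Base.size(A::PeriodicView) = size(parent(A))
Base.getindex(A::PeriodicView{<:Any,N}, I::Vararg{Int,N}) where N =
    @inbounds parent(A)[mod.(I, axes(parent(A)))...]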

I suspect you are right. Applying a shift or skip to an index doesn’t seem like much less work than a min/max or mod, so the compiler might be avoiding that math completely somehow with views.

However, I’m happy to be proven wrong… Anyone have ideas on how to help the compiler figure this out?

Would that it were so simple. Conditionals are expensive. Mod is expensive. And both inhibit SIMD vectorization.

But it’s just as fast on the GPU as OffsetArrays. Are GPUs better at this?

Maybe your GPU calculation is limited by the speed of memory access.


Yes, branches are expensive and prohibit SIMD – but not all conditionals are branches. Most min/max operations can be computed branchlessly via cmov/blend-style instructions (if not native min/max) and do not intrinsically restrict SIMD. mod appears to be branchless (but still expensive, maybe complicated enough to not SIMD) for fixed divisors but is branchy for variable divisors, so would likely prohibit SIMD for compile-time unknown divisors.
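
For example, these two spellings of a scalar clamp typically compile to the same branch-free min/max or select instructions (a toy illustration, not taken from Base):

clamp_branchy(i, lo, hi)    = i < lo ? lo : (i > hi ? hi : i)   # conditional in the source, usually branchless after compilation
clamp_branchless(i, lo, hi) = max(lo, min(i, hi))               # explicit min/max, SIMD-friendly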

However, I imagine the more fundamental issue might be that the index modifications break SIMD by producing irregular memory access patterns.

Dividing the kernel into a boundary component and an interior component would at least allow the interior a good shot at vectorization.


True, and this is what I’m doing now. It’s just a lot of code which distracts from the main point and introduces potential errors.
