Why is there a performance penalty in evaluating the axes of an OffsetArray?

Currently evaluating the axes of an OffsetArray is considerably slower than that of an Array as the former uses a custom axis type that wraps the axes of the parent. I am trying to understand how to improve the performance of this operation.

To give an example (using the master branch of OffsetArrays.jl):

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> XO = OffsetArray(X, -1, -2, -3, 1, 2, 3);

julia> @btime axes($X);
  4.286 ns (0 allocations: 0 bytes)

julia> @btime axes($XO);
  9.056 ns (0 allocations: 0 bytes)

Axes in either case are constructed using a map, and the performance of the map is identical as expected.

# Axes for an Array
julia> @btime map(Base.OneTo, size($X));
  4.288 ns (0 allocations: 0 bytes)

# Axes for an OffsetArray
julia> @btime map(OffsetArrays.IdOffsetRange, axes(parent($XO)), $XO.offsets);
  9.310 ns (0 allocations: 0 bytes)

The performance penalty appears to arise in constructing the type, and not in evaluating the axes of the parent. We may check this as

julia> axOp = axes(parent(XO));

julia> @btime map(OffsetArrays.IdOffsetRange, $axOp, $XO.offsets);
  9.036 ns (0 allocations: 0 bytes)

We check the performance of the constructors:

julia> @btime OffsetArrays.IdOffsetRange($(Ref(Base.OneTo(4)))[], $(Ref(0))[]);
  3.217 ns (0 allocations: 0 bytes)

julia> @btime Base.OneTo($(Ref(4))[]);
  2.717 ns (0 allocations: 0 bytes)

# Constructing tuples of these types
julia> @btime ntuple(x->$(Ref(Base.OneTo(4)))[], 6);
  3.221 ns (0 allocations: 0 bytes)

julia> @btime ntuple(x->$(Ref(OffsetArrays.IdOffsetRange(Base.OneTo(4),0)))[], 6);
  4.261 ns (0 allocations: 0 bytes)

I’m not sure if there’s much difference here. Why is the map so slow, and how to improve the performance of this operation?

1 Like

Please try a more involved benchmark, I am not sure that timings on the order of nanoseconds are very meaningful.

1 Like

Yep. LICM generally makes this kind of overhead irrelevant for any real-world example.

https://compileroptimizations.com/category/hoisting.htm

1 Like

There seems to be some overhead to doing writes to random locations.

Also a few places in DynamicGrids.jl gave me small performance improvements indexing into the parent instead of the OffsetArray.

Would be good to document those, as they may be fixable with an @inline. Hoisting requires a place to hoist to (i.e., a caller that accesses the array multiple times), and it has to be in the same compiled blob.

2 Likes

Sure I’ll look into it next time I’m working on that.