There’s a 9x performance hit in the allocating version and a 3.5x hit in the in-place version. What is the reason for this difference? Incidentally, why is there any allocation at all in the in-place version?
One way to avoid the performance issue seems to be to work with the parent arrays directly, if we know a priori that the axes are identical.
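A minimal sketch of the parent-array workaround, assuming the OffsetArrays package is installed. The arrays `a`, `b`, and `dest` and their sizes are hypothetical, made up only for illustration, and the approach is only valid when the axes really do match:

```julia
using OffsetArrays  # third-party package, assumed to be installed

# Hypothetical example data: two offset vectors with identical axes.
a = OffsetArray(rand(1000), -500:499)
b = OffsetArray(rand(1000), -500:499)

# The workaround is only safe when the axes are known to agree.
@assert axes(a) == axes(b)

# Preallocate a destination with the same axes as `a`.
dest = similar(a)

# Broadcast over the underlying Arrays instead of the OffsetArray wrappers.
# `parent` returns the wrapped array, so this is an ordinary Array broadcast.
parent(dest) .= parent(a) .+ parent(b)
```

Because `dest` shares its memory with `parent(dest)`, the result is still indexed with the original offsets afterwards.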
Subtracting the offset might account for a couple of nanoseconds, but the difference is much larger than this. Moreover, that can’t account for the memory allocation. This was pretty tricky to investigate, but the issue turns out to be that the compiler generates worse code because some of the OffsetArray manipulations are sufficiently complex that the compiler’s reasoning is not as successful as for simpler cases.
For example, broadcasting calls length on the axes of the array; when those axes are Base.OneTo (like for Array), length is trivial, but for a general UnitRange it uses checked arithmetic because it can’t otherwise be sure the result won’t overflow. Checked arithmetic is much slower, and moreover its extra complexity nixes some of the inlining. The lack of inlining (due to this and other issues), in turn, means that the Extruded wrapper does not get elided. Since we currently have to heap-allocate any wrapper that references heap-allocated memory, this accounts for the memory allocation. (If the compiler can elide the wrapper due to sufficient inlining, this ceases to be an issue.)
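To illustrate the cost difference described above, here is a rough sketch using only Base. The functions `len_oneto` and `len_checked` are hypothetical stand-ins, not the actual definitions of `length` in Base (which differ between Julia versions and also handle empty ranges):

```julia
using Base.Checked: checked_add, checked_sub

# Hypothetical stand-in for the Base.OneTo case: the length is just the
# stored stop value, so there is nothing to compute or check.
len_oneto(stop::Int) = stop

# Hypothetical stand-in for a general start:stop range: stop - start + 1
# can overflow, so each step uses checked arithmetic, which adds a branch
# and a possible throw. Empty ranges are ignored in this sketch.
len_checked(start::Int, stop::Int) = checked_add(checked_sub(stop, start), 1)

n1 = len_oneto(5)
n2 = len_checked(3, 7)

# The extreme case that the checks guard against: the subtraction overflows.
overflowed = try
    len_checked(typemin(Int), typemax(Int))
    false
catch err
    err isa OverflowError
end
```

The extra error path is the kind of added complexity that can make the compiler less willing to inline the surrounding code.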
Fixing all this seemed to require a pretty major redesign of OffsetArrays. For reference, here’s master after pull request 89 (referenced above) was merged: