Why is there a performance hit on broadcasting with OffsetArrays?

Please have a look at this example:

julia> using OffsetArrays, BenchmarkTools

julia> a = zeros(3,3); b = zeros(1:3,1:3);

julia> @btime @. $a + $a;
  49.681 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  440.101 ns (4 allocations: 288 bytes)

julia> @btime @. $a += $a; 
  16.124 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  55.822 ns (2 allocations: 96 bytes)

There’s a ~9x performance hit in the allocating version and a ~3.5x hit in the in-place one. What is the reason for this difference? Incidentally, why is there any allocation in the in-place version at all?
One way to avoid the performance hit seems to be to work with the parent arrays directly, if we know a priori that the axes are identical:

julia> @btime OffsetArray(parent($b) + parent($b), axes($b)...);
  66.628 ns (2 allocations: 192 bytes)
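The same idea applies to the in-place case: broadcasting over the parents skips the offset machinery entirely. A minimal sketch, assuming the axes are known to match (the explicit `OffsetArray(zeros(3, 3), 1:3, 1:3)` constructor is used here rather than `zeros(1:3, 1:3)`, since the latter depends on the OffsetArrays version):

```julia
using OffsetArrays

b = OffsetArray(zeros(3, 3), 1:3, 1:3)  # same axes as in the example above

# parent(b) returns the underlying Array without copying, so this
# in-place broadcast writes through to b:
parent(b) .+= parent(b)
```

Because `parent` shares memory with `b`, this mutates `b` while broadcasting over plain `Array`s, avoiding the OffsetArray code paths.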

Most of this is eliminated by https://github.com/JuliaArrays/OffsetArrays.jl/pull/89. There’s still a 2x difference on my laptop, which I ran out of time to investigate.


Maybe because there is an additional operation to subtract the offset?

Subtracting the offset might account for a couple of nanoseconds, but the difference is much larger than that. Moreover, it can’t account for the memory allocation. This was pretty tricky to investigate, but the issue turns out to be that the compiler generates worse code: some of the OffsetArray manipulations are sufficiently complex that the compiler’s analysis is not as successful as in simpler cases.

For example, broadcasting calls length on the axes of the array; when those axes are Base.OneTo (like for Array), length is trivial, but for a general UnitRange it uses checked arithmetic because it can’t otherwise be sure the result won’t overflow. Checked arithmetic is much slower, and moreover its extra complexity nixes some of the inlining. The lack of inlining (due to this and other issues), in turn, means that the Extruded wrapper does not get elided. Since we currently have to heap-allocate any wrapper that references heap-allocated memory, this accounts for the memory allocation. (If the compiler can elide the wrapper due to sufficient inlining, this ceases to be an issue.)
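The distinction can be seen directly on the axis types themselves (a sketch; the exact generated code depends on the Julia version):

```julia
a = zeros(3, 3)

# A plain Array's axes are Base.OneTo, whose length just reads the
# stored stop field:
axes(a, 1)                 # Base.OneTo(3)
length(Base.OneTo(3))      # trivially 3

# A general UnitRange, as an OffsetArray's axes might be, computes
# last(r) - first(r) + 1, using checked arithmetic because that
# subtraction could overflow for extreme endpoints:
r = 1:3
length(r)
```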

Trying to fix all of this seemed to require a fairly major redesign of OffsetArrays. For reference, here’s master after that pull request (#89, referenced above) was merged:

julia> using OffsetArrays, BenchmarkTools

julia> a = zeros(3,3); b = zeros(1:3,1:3);

julia> @btime @. $a + $a;
  42.915 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  94.399 ns (4 allocations: 288 bytes)

julia> @btime @. $a += $a;
  15.320 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  51.763 ns (2 allocations: 96 bytes)

And here’s https://github.com/JuliaArrays/OffsetArrays.jl/pull/90:

julia> @btime @. $a + $a;
  42.999 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  52.319 ns (2 allocations: 192 bytes)

julia> @btime @. $a += $a;
  15.308 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  21.121 ns (0 allocations: 0 bytes)

So there’s still a small hit, but it’s on the order of ~30% rather than 3x or more.
