Why is there a performance hit on broadcasting with OffsetArrays?

jishnub · December 12, 2019, 1:10pm

Please have a look at this example:

julia> a = zeros(3,3); b = zeros(1:3,1:3);

julia> @btime @. $a + $a;
  49.681 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  440.101 ns (4 allocations: 288 bytes)

julia> @btime @. $a += $a; 
  16.124 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  55.822 ns (2 allocations: 96 bytes)

There’s a 9x performance hit in the allocating and a 3.5x hit in the in-place version. What is the reason for this difference? Incidentally why is there any allocation in the in-place version?
One way to avoid the performance issue seems to be by working with the parent arrays directly, if we know a-priori that the axes are identical.

julia> @btime OffsetArray(parent($b) + parent($b),axes($b)...);
  66.628 ns (2 allocations: 192 bytes)

tim.holy · December 12, 2019, 3:37pm

Most of this is eliminated by https://github.com/JuliaArrays/OffsetArrays.jl/pull/89. There’s still a 2x difference on my laptop, which I ran out of time to investigate.

sairus7 · December 12, 2019, 4:41pm

Maybe because there is an additional operation to subtract the offset?

tim.holy · December 15, 2019, 3:20pm

Subtracting the offset might account for a couple of nanoseconds, but the difference is much larger than this. Moreover, that can’t account for the memory allocation. This was pretty tricky to investigate, but the issue turns out to be that the compiler generates worse code because some of the OffsetArray manipulations are sufficiently complex that the compiler’s reasoning is not as successful as for simpler cases.

For example, broadcasting calls length on the axes of the array; when those axes are Base.OneTo (like for Array), length is trivial, but for a general UnitRange it uses checked arithmetic because it can’t otherwise be sure the result won’t overflow. Checked arithmetic is much slower, and moreover its extra complexity nixes some of the inlining. The lack of inlining (due to this and other issues), in turn, means that the Extruded wrapper does not get elided. Since we currently have to heap-allocate any wrapper that references heap-allocated memory, this accounts for the memory allocation. (If the compiler can elide the wrapper due to sufficient inlining, this ceases to be an issue.)

Trying to fix all this seemed to require a pretty major redesign of OffsetArrays. For reference, here’s master after that pull request 89 (referenced above) got merged:

julia> using OffsetArrays, BenchmarkTools

julia> a = zeros(3,3); b = zeros(1:3,1:3);

julia> @btime @. $a + $a;
  42.915 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  94.399 ns (4 allocations: 288 bytes)

julia> @btime @. $a += $a;
  15.320 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  51.763 ns (2 allocations: 96 bytes)

And here’s https://github.com/JuliaArrays/OffsetArrays.jl/pull/90:

julia> @btime @. $a + $a;
  42.999 ns (1 allocation: 160 bytes)

julia> @btime @. $b + $b;
  52.319 ns (2 allocations: 192 bytes)

julia> @btime @. $a += $a;
  15.308 ns (0 allocations: 0 bytes)

julia> @btime @. $b += $b;
  21.121 ns (0 allocations: 0 bytes)

So there’s still a small hit, but it’s on the order of ~30% rather than 3x.

Topic		Replies	Views
Performance of OffsetArrays General Usage	9	2197	July 12, 2018
Why is there a performance penalty in evaluating the axes of an OffsetArray? Performance question , performance	5	543	October 18, 2020
OffsetArrays vs Arrays speed New to Julia	4	396	June 17, 2023
Offset array package with offsets in the type? General Usage	12	652	May 27, 2022
Memory allocation in broadcast assignment Performance question	3	863	July 24, 2020

Why is there a performance hit on broadcasting with OffsetArrays?

Related topics