Performance of `zero`, `oneunit` etc. in constructing unsigned ranges

maxkapur · April 14, 2022, 6:10am

In my code, I often need to iterate over the range 0:m or 1:m where the type of each range element is a primitive type T, usually UInt8 or UInt16.

m itself is already of type T.

This leaves me with many options for specifying the ranges:

For the zero-based range, I can use T(0):m or zero(T):m
For the one-based range, I can use T(1):m or one(T):m or oneunit(T):m

From the following discussions

I have gleaned that oneunit and zero are counterparts, and the one function is the odd one out: it is designed to serve as the multiplicative identity. So, for range construction, I should probably be using either T(0):m and T(1):m, or zero(T):m and oneunit(T):m, right?

The following benchmark is perplexing (Julia v1.8.0-beta1).


julia> T = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 893 evaluations.
 Range (min … max):  125.346 ns …  1.205 μs  ┊ GC (min … max): 0.00% … 89.07%
 Time  (median):     126.688 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.751 ns ± 29.922 ns  ┊ GC (mean ± σ):  0.23% ±  1.54%

  █▄▆▃▂▂▂▃▁▁▁▁▁   ▁▁                                           ▁
  ████████████████████▇███▇▇▇▇▇▇▆▆▆▅▆▅▆▅▅▄▄▃▄▃▄▃▃▄▅▅▅▃▆▄▄▃▇▄▅▆ █
  125 ns        Histogram: log(frequency) by time       254 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(zero(T):m)
BenchmarkTools.Trial: 10000 samples with 763 evaluations.
 Range (min … max):  164.857 ns …  1.753 μs  ┊ GC (min … max): 0.00% … 88.43%
 Time  (median):     169.464 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   180.464 ns ± 39.168 ns  ┊ GC (mean ± σ):  0.16% ±  1.24%

  █▆▆▂▂▃▂▃▂▂▂▂▂▁▁▁▁▁ ▁▁▁▁                                      ▂
  ████████████████████████████▇▇█▇▇▅▇▆▆▅▇▇▇▅▅▆▅▅▅▅▅▄▃▅▅▆▆▃▄▃▁▃ █
  165 ns        Histogram: log(frequency) by time       326 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T(1):m)
BenchmarkTools.Trial: 10000 samples with 896 evaluations.
 Range (min … max):  124.862 ns …  1.361 μs  ┊ GC (min … max): 0.00% … 90.30%
 Time  (median):     126.137 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.762 ns ± 25.122 ns  ┊ GC (mean ± σ):  0.17% ±  1.26%

  █▃ ▅▅▃   ▂▁ ▂▁▁▂  ▂▁                   ▁                     ▁
  ████████████████████████▇▇▇▇██▇▇▇▇▇▇▇▇███▇▆▆▆▆▆▅▆▅▅▅▅▆▅▅▄▅▅▅ █
  125 ns        Histogram: log(frequency) by time       200 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 550 evaluations.
 Range (min … max):  208.671 ns …  2.465 μs  ┊ GC (min … max): 0.00% … 88.79%
 Time  (median):     213.423 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.345 ns ± 42.964 ns  ┊ GC (mean ± σ):  0.18% ±  1.26%

  █▂▇▅▃▃▁▁ ▂▁▂▂▁▁▁  ▁ ▁                                        ▁
  ███████████████████████▇▇▆▇█████▇▇▇▇████▇▆▇▇▇▇▆▆▇▆▅▅▅▅▄▅▆▄▄▅ █
  209 ns        Histogram: log(frequency) by time       332 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(oneunit(T):m)
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  124.437 ns …  1.447 μs  ┊ GC (min … max): 0.00% … 89.94%
 Time  (median):     125.773 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.538 ns ± 32.727 ns  ┊ GC (mean ± σ):  0.26% ±  1.55%

  █▄▆▃▃▃▂▃▁▁▂▂▂▁  ▁▁▁▁▁  ▁                                     ▂
  █████████████████████████▇▇▇▇▆▇▆▇▆▆▅▇▆▅▅▅▅▅▅▃▅▁▄▄▄▆▃▄▃▅▄▅▅▅▄ █
  124 ns        Histogram: log(frequency) by time       253 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

In summary:

T(0), T(1), and oneunit(T) are all very fast
one(T) is slowest for this task (as expected)
The surprising part: zero(T) is about 30% slower than T(0)

What’s going on here? What’s the best way to create these ranges?

Another weird discovery:

julia> @benchmark sum(zero(0):m)
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  29.512 ns …  1.634 μs  ┊ GC (min … max): 0.00% … 97.40%
 Time  (median):     30.435 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.947 ns ± 37.935 ns  ┊ GC (mean ± σ):  2.58% ±  2.39%

  ▇█▅▂▁▃▁    ▁▄▄▃▃▁          ▁▁                               ▁
  ███████▆▅▅███████▇▇███▅▅▆▅███▇▆▅▆▆▅▃▂▅▄▅▆▅▅▅▅▅▄▅▅▄▃▄▅▅▄▄▄▅▄ █
  29.5 ns      Histogram: log(frequency) by time      73.8 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.

This is not really doing the same thing as the others, because zero(0):m gets promoted to a normal Int64 range, but it seems strange that Julia can add up Int64s faster than UInt16s, right?
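The promotion mentioned above can be checked by inspecting the element types of the two ranges. This is only a small sketch using Base functions; `m` mirrors the value from the benchmarks, and the names `r_typed` and `r_promoted` are made up for illustration.

```julia
m = UInt16(25)

r_typed = zero(UInt16):m   # both endpoints are UInt16
r_promoted = zero(0):m     # zero(0) is an Int, so the endpoints get promoted

println(eltype(r_typed))     # element type of the all-UInt16 range
println(eltype(r_promoted))  # element type after promotion with Int
println(sum(r_typed) == sum(r_promoted))  # both ranges add up to the same value
```

So the two expressions iterate over the same values but build ranges of different types, which is why the timings are not directly comparable.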

Sukera · April 14, 2022, 6:35am

You probably want to declare that alias const, or use the setup capabilities of @benchmark:

Click me for benchmarks

julia> using BenchmarkTools

julia> T = UInt16
UInt16

julia> const T_const = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  141.601 ns …  1.099 μs  ┊ GC (min … max): 0.00% … 83.46%
 Time  (median):     142.897 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   147.828 ns ± 19.837 ns  ┊ GC (mean ± σ):  0.18% ±  1.43%

  ▇█▁  ▅▅▅▃▁▁                                                  ▁
  ███▅▆███████▅▅▅▄▆▅▅▇▆▇▇▇▆▇▇▇▇█▇▇▇▇██▇▆▅▅█▇▇▆▄▅▄▃▄▃▄▆▅▆█▇▆▆▅▇ █
  142 ns        Histogram: log(frequency) by time       196 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T_const(0):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.611 ns …  1.189 μs  ┊ GC (min … max): 0.00% … 93.67%
 Time  (median):     23.002 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.580 ns ± 18.417 ns  ┊ GC (mean ± σ):  1.16% ±  1.64%

  ▃▆█▇▄▃▂▁  ▁▁ ▅▄▃▃▁  ▁ ▁ ▁▁ ▂▃▁▁▁                            ▂
  ████████▇████████████▇██████████▆▆▅▆▆▅▆▅▆▅▅▅▅▄▄▅▄▅▅▅▅▅▅▃▄▄▅ █
  21.6 ns      Histogram: log(frequency) by time      45.8 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.637 ns … 33.581 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.817 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.800 ns ±  0.495 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                         ▅█▂                  
  ▅▅▄▃▂▁▂▁▁▁▁▁▃▄▃▂▂▁▁▁▁▁▁▁▂▃▅▃▃▂▁▁▁▁▁▁▁▁▂███▇▅▂▁▁▁▁▁▁▁▁▁▃▄▃▃ ▃
  1.64 ns        Histogram: frequency by time         1.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 842 evaluations.
 Range (min … max):  140.952 ns …  1.031 μs  ┊ GC (min … max): 0.00% … 82.26%
 Time  (median):     143.641 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.574 ns ± 17.405 ns  ┊ GC (mean ± σ):  0.17% ±  1.44%

   ▅██▆▄▂▃▅▆▅▅▄▃▂▁                                             ▂
  ▇█████████████████▇▇▇▅▆▆▆▅▅▅▄▅▅▄▅▆▅▅▆▅▅▆▅▆▆▄▅▅▄▇▆▆▆▅▄▂▅▆▇▇▇▇ █
  141 ns        Histogram: log(frequency) by time       181 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T_const):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.799 ns … 966.815 ns  ┊ GC (min … max): 0.00% … 93.62%
 Time  (median):     22.685 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.908 ns ±  16.343 ns  ┊ GC (mean ± σ):  1.08% ±  1.64%

  ▁▆▇█▇▆▄▂               ▁▄▄▃▂▃▃▁                   ▁▂▃▂▁ ▁▁   ▂
  ████████▇▆▆▄▅▅▅▅▄▆████▆████████▇██▇▇▇█▇▆▆▇▇▇▇██████████████▇ █
  21.8 ns       Histogram: log(frequency) by time      34.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

If it’s not const and you don’t supply it via setup, the compiler has to insert global lookups for it and can’t constant fold it into the function. If m is also a known constant or literal, the compiler can even fold the whole computation, as in the third-to-last benchmark.

zero(0):m is faster because it doesn’t include the global lookup for T: zero is already known and constant (as all function bindings are).

TL;DR: Don’t benchmark with non-const global variables/avoid globals.
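One way to see the effect without any timing at all is to ask the compiler what it can infer. The sketch below assumes hypothetical names (`T_global`, `T_fixed`, `sum_nonconst`, `sum_const`) and uses `Base.return_types`, which reports the inferred return type of a method.

```julia
T_global = UInt16        # non-const global: may be rebound to anything later
const T_fixed = UInt16   # const global: the compiler can rely on its value

m_global = T_global(25)
const m_fixed = T_fixed(25)

# Same computation, differing only in which globals it reads
sum_nonconst() = sum(T_global(0):m_global)
sum_const() = sum(T_fixed(0):m_fixed)

rt_nonconst = only(Base.return_types(sum_nonconst, ()))
rt_const = only(Base.return_types(sum_const, ()))

println("non-const globals: ", rt_nonconst, " (concrete: ", isconcretetype(rt_nonconst), ")")
println("const globals:     ", rt_const, " (concrete: ", isconcretetype(rt_const), ")")
```

With non-const globals the inferred type should not be concrete, so every call goes through dynamic dispatch; with const globals the compiler knows the types up front. Both functions still return the same value.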

Sukera · April 14, 2022, 6:42am

I guess another question is - are the benchmarks you’re doing here representative of how this looks & behaves in your code? Have you profiled this and determined that it’s the cause of an unexpected slowdown?

I’m asking because as long as your code is type stable (i.e. T is a known constant at compile time, either through inference or being const) you shouldn’t see any difference between these different methods of specifying the ranges.

maxkapur · April 14, 2022, 6:54am

You’re right. It was a type stability thing in the REPL.

I saw a speedup in my code after changing one to oneunit, and created this benchmark to see if I might squeeze anything better out of it. But in the code, T is a type parameter of the function arguments. Along the lines of

function foo(m::T, n::T) where {T <: Unsigned}
    sum(zero(T):m) + sum(one(T):n)
end
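For a function shaped like this, a rough way to confirm that the spelling of the range start doesn't matter is to compare variants with `@inferred` from the Test standard library. The three `range_sum_*` functions below are hypothetical stand-ins, not code from the actual project.

```julia
using Test  # standard library; provides @inferred

# Hypothetical variants that differ only in how the range start is written
range_sum_convert(m::T) where {T<:Unsigned} = sum(T(1):m)
range_sum_oneunit(m::T) where {T<:Unsigned} = sum(oneunit(T):m)
range_sum_one(m::T) where {T<:Unsigned} = sum(one(T):m)

m = UInt16(25)

# @inferred throws an error if the call's return type cannot be inferred concretely
a = @inferred range_sum_convert(m)
b = @inferred range_sum_oneunit(m)
c = @inferred range_sum_one(m)

println(a == b == c)  # all three variants agree
```

Because T is known from the argument type, each variant is compiled with T fixed, so there is no global lookup left to pay for.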

DNF · April 14, 2022, 7:10am

This looks to me to just be faulty benchmarks. Sums of ranges are converted to a version of m*(m+1)÷2, and should take approximately 1ns to evaluate.

Benchmarking with non-const globals will not give correct results.
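To illustrate the closed form, here is a minimal sketch that compares it against `sum` on a range. `gauss_sum` is a made-up helper, not Base’s actual implementation, and it widens to Int as an assumption so the product doesn’t wrap around for small unsigned types.

```julia
# Closed form for 0 + 1 + ... + m, computed in Int to avoid wraparound in UInt8/UInt16
gauss_sum(m::Integer) = (n = Int(m); n * (n + 1) ÷ 2)

m = UInt16(25)

println(gauss_sum(m))                         # closed-form result
println(sum(zero(UInt16):m) == gauss_sum(m))  # agrees with summing the range
```

A computation like this is a handful of integer operations, which is why timings in the hundreds of nanoseconds point at benchmark overhead rather than at the sum itself.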

Sukera · April 14, 2022, 7:26am

Yes, that’s also why the @benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25)) benchmark in my collapsed code above gives ~1ns for this: nothing has to be looked up in global scope. The first benchmark had to look up both T (and its associated conversion function) and m, while the second one “only” looked up m. The setup-based one is fastest because when everything is a constant/known, it can either be folded completely or inlined completely, allowing a closed-form solution.
