# Performance of `zero`, `oneunit` etc. in constructing unsigned ranges

In my code, I often need to iterate over the range `0:m` or `1:m` where the type of each range element is a primitive type `T`, usually `UInt8` or `UInt16`.

`m` itself is already of type `T`.

This leaves me with many options for specifying the ranges:

• For the zero-based range, I can use `T(0):m` or `zero(T):m`
• For the one-based range, I can use `T(1):m` or `one(T):m` or `oneunit(T):m`

From the following discussion

https://github.com/JuliaLang/julia/issues/24853

I have gleaned that `oneunit` and `zero` are counterparts, and the `one` function is the odd one out: it is designed to serve as the multiplicative identity. So, for range construction, I should probably be using either `T(0):m` and `T(1):m`, or `zero(T):m` and `oneunit(T):m`, right?
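For plain machine integers, all three functions produce the same value, which is easy to confirm (my own quick sanity check; the distinction only matters for types whose multiplicative identity is not of the same type as the values, e.g. unitful quantities):

``````
# For a bits integer type like UInt16, one and oneunit coincide.
T = UInt16

@show zero(T)     # 0x0000
@show one(T)      # 0x0001
@show oneunit(T)  # 0x0001

@assert zero(T) === T(0)
@assert one(T) === T(1)
@assert oneunit(T) === T(1)
``````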

The following benchmark is perplexing (v1.8.0b1).

``````
julia> T = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 893 evaluations.
 Range (min … max):  125.346 ns …  1.205 μs  ┊ GC (min … max): 0.00% … 89.07%
 Time  (median):     126.688 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.751 ns ± 29.922 ns  ┊ GC (mean ± σ):  0.23% ±  1.54%

  [histogram omitted]
  125 ns        Histogram: log(frequency) by time       254 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(zero(T):m)
BenchmarkTools.Trial: 10000 samples with 763 evaluations.
 Range (min … max):  164.857 ns …  1.753 μs  ┊ GC (min … max): 0.00% … 88.43%
 Time  (median):     169.464 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   180.464 ns ± 39.168 ns  ┊ GC (mean ± σ):  0.16% ±  1.24%

  [histogram omitted]
  165 ns        Histogram: log(frequency) by time       326 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T(1):m)
BenchmarkTools.Trial: 10000 samples with 896 evaluations.
 Range (min … max):  124.862 ns …  1.361 μs  ┊ GC (min … max): 0.00% … 90.30%
 Time  (median):     126.137 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.762 ns ± 25.122 ns  ┊ GC (mean ± σ):  0.17% ±  1.26%

  [histogram omitted]
  125 ns        Histogram: log(frequency) by time       200 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 550 evaluations.
 Range (min … max):  208.671 ns …  2.465 μs  ┊ GC (min … max): 0.00% … 88.79%
 Time  (median):     213.423 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.345 ns ± 42.964 ns  ┊ GC (mean ± σ):  0.18% ±  1.26%

  [histogram omitted]
  209 ns        Histogram: log(frequency) by time       332 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(oneunit(T):m)
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  124.437 ns …  1.447 μs  ┊ GC (min … max): 0.00% … 89.94%
 Time  (median):     125.773 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.538 ns ± 32.727 ns  ┊ GC (mean ± σ):  0.26% ±  1.55%

  [histogram omitted]
  124 ns        Histogram: log(frequency) by time       253 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
``````

In summary:

• `T(0)`, `T(1)`, and `oneunit(T)` are all very fast
• `one(T)` is slowest for this task (as expected)
• The surprising part: `zero(T)` is about 30% slower than `T(0)`

## What's going on here? What's the best way to create these ranges?

Another weird discovery:

``````
julia> @benchmark sum(zero(0):m)
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  29.512 ns …  1.634 μs  ┊ GC (min … max): 0.00% … 97.40%
 Time  (median):     30.435 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.947 ns ± 37.935 ns  ┊ GC (mean ± σ):  2.58% ±  2.39%

  [histogram omitted]
  29.5 ns      Histogram: log(frequency) by time      73.8 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
``````

This is not really doing the same thing as the others, because `zero(0):m` gets promoted to a normal `Int64` range, but it seems strange that Julia can add up `Int64`s faster than `UInt16`s, right?
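For reference, the promotion is easy to see directly (my own quick check; `Int` is `Int64` on a 64-bit machine):

``````
m = UInt16(25)

# zero(0) is an Int, so the colon promotes both endpoints to Int
@show typeof(zero(0):m)        # UnitRange{Int64} on 64-bit

# zero(UInt16) keeps the whole range in UInt16
@show typeof(zero(UInt16):m)   # UnitRange{UInt16}
``````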

You probably want to `const` that alias or use the `setup` capabilities of `@benchmark`:

Click me for benchmarks
``````
julia> using BenchmarkTools

julia> T = UInt16
UInt16

julia> const T_const = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  141.601 ns …  1.099 μs  ┊ GC (min … max): 0.00% … 83.46%
 Time  (median):     142.897 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   147.828 ns ± 19.837 ns  ┊ GC (mean ± σ):  0.18% ±  1.43%

  [histogram omitted]
  142 ns        Histogram: log(frequency) by time       196 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T_const(0):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.611 ns …  1.189 μs  ┊ GC (min … max): 0.00% … 93.67%
 Time  (median):     23.002 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.580 ns ± 18.417 ns  ┊ GC (mean ± σ):  1.16% ±  1.64%

  [histogram omitted]
  21.6 ns      Histogram: log(frequency) by time      45.8 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.637 ns … 33.581 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.817 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.800 ns ±  0.495 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  1.64 ns        Histogram: frequency by time         1.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 842 evaluations.
 Range (min … max):  140.952 ns …  1.031 μs  ┊ GC (min … max): 0.00% … 82.26%
 Time  (median):     143.641 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.574 ns ± 17.405 ns  ┊ GC (mean ± σ):  0.17% ±  1.44%

  [histogram omitted]
  141 ns        Histogram: log(frequency) by time       181 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T_const):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.799 ns … 966.815 ns  ┊ GC (min … max): 0.00% … 93.62%
 Time  (median):     22.685 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.908 ns ±  16.343 ns  ┊ GC (mean ± σ):  1.08% ±  1.64%

  [histogram omitted]
  21.8 ns       Histogram: log(frequency) by time      34.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
``````

If it's not `const`, or you don't bind it via `setup`, the compiler has to insert a global lookup for it and can't constant-fold it into the function. And if even `m` is a known constant/literal, the whole computation can be folded away, as in the third-to-last benchmark.

That's also why `zero(0):m` is faster: it doesn't involve a global lookup for `T` at all, since `zero` itself is already known and constant (as all function bindings are).

TL;DR: Don't benchmark with non-`const` global variables; avoid globals in general.
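As an alternative to `const` or `setup`, BenchmarkTools also lets you splice the current value of a global into the benchmarked expression with `$`; a sketch (not from the original post):

``````
using BenchmarkTools

T = UInt16   # non-const global
m = T(25)

# $ interpolates the current values into the benchmark expression,
# so the compiled benchmark kernel sees them as constants instead of
# doing a dynamic global lookup on every evaluation.
@btime sum(zero($T):$m)
@btime sum($T(0):$m)
``````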


I guess another question is: are the benchmarks you're doing here representative of how this looks and behaves in your real code? Have you profiled it and determined that this is the cause of an unexpected slowdown?

I'm asking because, as long as your code is type stable (i.e. `T` is a known constant at compile time, either through inference or by being `const`), you shouldn't see any difference between these ways of specifying the ranges.
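One direct way to check that is the `Test` stdlib's `@inferred`; a sketch with a made-up helper `g` mirroring the pattern in question:

``````
using Test

# hypothetical helper: T is a compile-time constant inside the body
g(m::T) where {T <: Unsigned} = sum(zero(T):m)

# @inferred throws if the return type cannot be concretely inferred;
# it passes here because the code is type stable.
@inferred g(UInt16(25))
@assert g(UInt16(25)) == 325
``````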

You're right. It was a type-stability thing in the REPL.

I saw a speedup in my code after changing `one` to `oneunit`, and created this benchmark to see if I could squeeze anything more out of it. But in my code `T` is a type parameter of the function arguments, along the lines of:

``````
function foo(m::T, n::T) where {T <: Unsigned}
    sum(zero(T):m) + sum(one(T):n)
end
``````
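Making that snippet self-contained for a quick sanity check (the assertion values are mine, not from the thread):

``````
function foo(m::T, n::T) where {T <: Unsigned}
    # inside the body, T is a compile-time constant: no global lookups,
    # and both range sums reduce to closed-form arithmetic
    sum(zero(T):m) + sum(one(T):n)
end

@assert foo(UInt16(25), UInt16(25)) == 325 + 325   # 0:25 and 1:25 both sum to 325
``````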

These look to me to be simply faulty benchmarks. A sum over a range is lowered to a closed-form version of `m*(m+1)÷2` and should take roughly 1 ns to evaluate.
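That closed form is easy to verify by hand (my own quick check):

``````
m = UInt16(25)

# Gauss's formula: sum(1:m) == m*(m+1)÷2
@assert sum(UInt16(1):m) == m * (m + UInt16(1)) ÷ UInt16(2)   # both are 325
@assert sum(UInt16(0):m) == sum(UInt16(1):m)                  # the leading zero adds nothing
``````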

Benchmarking with non-`const` globals will not give correct results.


Yes, and that's also why the `@benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25))` benchmark in my collapsed code above comes in at ~1 ns: nothing has to be looked up in global scope. The first benchmark had to look up both `T` (and its associated conversion function) and `m`, while the second one only looked up `m`. The last one is fastest because when everything is a known constant, the computation can be folded or inlined completely, allowing a closed-form solution.
