In my code, I often need to iterate over the range `0:m` or `1:m`, where the type of each range element is a primitive type `T`, usually `UInt8` or `UInt16`. `m` itself is already of type `T`. This leaves me with many options for specifying the ranges:
- For the zero-based range, I can use `T(0):m` or `zero(T):m`.
- For the one-based range, I can use `T(1):m`, `one(T):m`, or `oneunit(T):m`.
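For a primitive `T`, all of these spellings construct the same `UnitRange{T}`, so the choice is a matter of style; a quick check:

```julia
julia> T = UInt16; m = T(25);

julia> T(0):m == zero(T):m
true

julia> T(1):m == one(T):m == oneunit(T):m
true

julia> typeof(zero(T):m)
UnitRange{UInt16}
```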
From the following discussion (a GitHub issue opened 29 Nov 2017, labeled "speculative"):

> We may want to revise `zero` to match how `one` works. This would also mean we would need to create a `zerounit`.
>
> ```julia
> julia> using Dates
>
> julia> one(Minute)
> 1
>
> julia> oneunit(Minute)
> 1 minute
>
> julia> zero(Minute)
> 0 minutes
>
> julia> zerounit(Minute)
> ERROR: UndefVarError: zerounit not defined
> ```
>
> Following this reasoning, shouldn't we also have `zerounit` for consistency? `zero` can be defined as the additive identity. I find it very strange that we need to write code with a mix of `oneunit` and `zero` instead of `oneunit` and `zerounit`.
I have gleaned that `oneunit` and `zero` are counterparts, and the `one` function is the odd one out: it is designed to serve as the multiplicative identity. So, for range construction, I should probably be using either `T(0):m` and `T(1):m`, or `zero(T):m` and `oneunit(T):m`, right?
The following benchmarks are perplexing (Julia v1.8.0-beta1).
```julia
julia> T = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 893 evaluations.
 Range (min … max):  125.346 ns … 1.205 μs   ┊ GC (min … max): 0.00% … 89.07%
 Time  (median):     126.688 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   136.751 ns ± 29.922 ns  ┊ GC (mean ± σ):  0.23% ± 1.54%
 [histogram: log(frequency) by time, 125 ns to 254 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(zero(T):m)
BenchmarkTools.Trial: 10000 samples with 763 evaluations.
 Range (min … max):  164.857 ns … 1.753 μs   ┊ GC (min … max): 0.00% … 88.43%
 Time  (median):     169.464 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   180.464 ns ± 39.168 ns  ┊ GC (mean ± σ):  0.16% ± 1.24%
 [histogram: log(frequency) by time, 165 ns to 326 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T(1):m)
BenchmarkTools.Trial: 10000 samples with 896 evaluations.
 Range (min … max):  124.862 ns … 1.361 μs   ┊ GC (min … max): 0.00% … 90.30%
 Time  (median):     126.137 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.762 ns ± 25.122 ns  ┊ GC (mean ± σ):  0.17% ± 1.26%
 [histogram: log(frequency) by time, 125 ns to 200 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 550 evaluations.
 Range (min … max):  208.671 ns … 2.465 μs   ┊ GC (min … max): 0.00% … 88.79%
 Time  (median):     213.423 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   224.345 ns ± 42.964 ns  ┊ GC (mean ± σ):  0.18% ± 1.26%
 [histogram: log(frequency) by time, 209 ns to 332 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(oneunit(T):m)
BenchmarkTools.Trial: 10000 samples with 898 evaluations.
 Range (min … max):  124.437 ns … 1.447 μs   ┊ GC (min … max): 0.00% … 89.94%
 Time  (median):     125.773 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.538 ns ± 32.727 ns  ┊ GC (mean ± σ):  0.26% ± 1.55%
 [histogram: log(frequency) by time, 124 ns to 253 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.
```
In summary:

- `T(0)`, `T(1)`, and `oneunit(T)` are all very fast.
- `one(T)` is slowest for this task (as expected).
- The surprising part: `zero(T)` is about 30% slower than `T(0)`.
What's going on here? What's the best way to create these ranges?
Another weird discovery:

```julia
julia> @benchmark sum(zero(0):m)
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  29.512 ns … 1.634 μs   ┊ GC (min … max): 0.00% … 97.40%
 Time  (median):     30.435 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   34.947 ns ± 37.935 ns  ┊ GC (mean ± σ):  2.58% ± 2.39%
 [histogram: log(frequency) by time, 29.5 ns to 73.8 ns; bars omitted]
 Memory estimate: 32 bytes, allocs estimate: 1.
```
This is not really doing the same thing as the others, because `zero(0):m` gets promoted to a normal `Int64` range, but it seems strange that Julia can add up `Int64`s faster than `UInt16`s, right?
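A one-liner confirms the promotion (on a 64-bit machine, the `Int` zero wins over the `UInt16` endpoint):

```julia
julia> typeof(zero(0):m)   # zero(0) isa Int, so the UInt16 endpoint is promoted
UnitRange{Int64}
```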
**Sukera** (April 14, 2022, 6:35am)
> **maxkapur:** `T = UInt16`

You probably want to `const` that alias or use the `setup` capabilities of `@benchmark`:
Click me for benchmarks
```julia
julia> using BenchmarkTools

julia> T = UInt16
UInt16

julia> const T_const = UInt16
UInt16

julia> m = T(25)
0x0019

julia> @benchmark sum(T(0):m)
BenchmarkTools.Trial: 10000 samples with 839 evaluations.
 Range (min … max):  141.601 ns … 1.099 μs   ┊ GC (min … max): 0.00% … 83.46%
 Time  (median):     142.897 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   147.828 ns ± 19.837 ns  ┊ GC (mean ± σ):  0.18% ± 1.43%
 [histogram: log(frequency) by time, 142 ns to 196 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(T_const(0):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.611 ns … 1.189 μs   ┊ GC (min … max): 0.00% … 93.67%
 Time  (median):     23.002 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.580 ns ± 18.417 ns  ┊ GC (mean ± σ):  1.16% ± 1.64%
 [histogram: log(frequency) by time, 21.6 ns to 45.8 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.637 ns … 33.581 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.817 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.800 ns ± 0.495 ns   ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram: frequency by time, 1.64 ns to 1.9 ns; bars omitted]
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark sum(one(T):m)
BenchmarkTools.Trial: 10000 samples with 842 evaluations.
 Range (min … max):  140.952 ns … 1.031 μs   ┊ GC (min … max): 0.00% … 82.26%
 Time  (median):     143.641 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   146.574 ns ± 17.405 ns  ┊ GC (mean ± σ):  0.17% ± 1.44%
 [histogram: log(frequency) by time, 141 ns to 181 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(one(T_const):m)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  21.799 ns … 966.815 ns  ┊ GC (min … max): 0.00% … 93.62%
 Time  (median):     22.685 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.908 ns ± 16.343 ns   ┊ GC (mean ± σ):  1.08% ± 1.64%
 [histogram: log(frequency) by time, 21.8 ns to 34.6 ns; bars omitted]
 Memory estimate: 16 bytes, allocs estimate: 1.
```
If it's not `const`, or you don't specify it using `setup`, the compiler has to insert global lookups for it and can't constant-fold it into the function. If even `m` is then a known constant/literal, it can fold the whole computation, as in the third-to-last benchmark.
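The usual fix in real code is a function barrier, which has the same effect as `const` or `setup`. A minimal sketch (with a hypothetical helper `sumrange`, not from the benchmarks above):

```julia
using BenchmarkTools

# behind the function barrier, T is a compile-time constant
sumrange(m::T) where {T<:Unsigned} = sum(zero(T):m)

m = UInt16(25)
@benchmark sumrange($m)   # $ interpolates m, so no non-const global lookup remains
```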
> **maxkapur:** This is not really doing the same thing as the others, because `zero(0):m` gets promoted to a normal `Int64` range, but it seems strange that Julia can add up `Int64`s faster than `UInt16`s, right?
That's because this doesn't have the global lookup for `T` included: `zero` is already known and constant (as all function bindings are).
TL;DR: Don't benchmark with non-`const` global variables; avoid globals.
**Sukera** (April 14, 2022, 6:42am)
I guess another question is: are the benchmarks you're doing here representative of how this looks and behaves in your code? Have you profiled this and determined that it's the cause of an unexpected slowdown?

I'm asking because as long as your code is type stable (i.e. `T` is a known constant at compile time, either through inference or being `const`), you shouldn't see any difference between these different methods of specifying the ranges.
**maxkapur** (April 14, 2022)

You're right. It was a type-stability thing in the REPL.
I saw a speedup in my code after changing `one` to `oneunit`, and created this benchmark to see if I might squeeze anything more out of it. But in my code, `T` is a type parameter of the function arguments, along the lines of:

```julia
function foo(m::T, n::T) where {T <: Unsigned}
    sum(zero(T):m) + sum(one(T):n)
end
```
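One standard way to confirm that such a function is type stable is `@code_warntype`; a sketch (not from the thread):

```julia
using InteractiveUtils

function foo(m::T, n::T) where {T <: Unsigned}
    sum(zero(T):m) + sum(one(T):n)
end

# every intermediate type should print as concrete (no Any in the output)
@code_warntype foo(UInt16(25), UInt16(25))
```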
**DNF** (April 14, 2022, 7:10am)
This looks to me to just be faulty benchmarks. Sums of ranges are converted to a version of `m * (m + 1) ÷ 2`, and should take approximately 1 ns to evaluate. Benchmarking with non-`const` globals will not give correct results.
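The closed form is easy to verify for the zero-based case:

```julia
julia> m = 25; sum(0:m) == m * (m + 1) ÷ 2
true
```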
**Sukera** (April 14, 2022, 7:26am)
Yes, that's also why the `@benchmark sum(Mytype(0):n) setup=(Mytype=UInt16; n=Mytype(25))` benchmark in my collapsed code above gives ~1 ns for this: nothing has to be looked up in global scope, whereas the first benchmark had to look up both `T` (and its associated conversion function) and `m`, while the second one "only" looked up `m`. The last one is fastest because when everything is a constant/known, it can either be folded completely or inlined completely, allowing a closed-form solution.
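When even the endpoints are literals, the fold can be observed directly; a sketch (the exact output depends on the Julia version):

```julia
julia> f() = sum(UInt16(0):UInt16(25))   # everything is visible to the compiler
f (generic function with 1 method)

julia> @code_llvm f()   # typically reduces to just returning the constant (325)
```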