# Range (statistics)

I often need to compute a statistical range (difference between maximum and minimum).
Is there a function in the `Statistics` standard library to compute the range? Or should we use: ` maximum(x) - minimum(x)`?

Since `extrema` is slightly faster than calling both functions separately, you could do this

`rangeof(x) = (tmp = extrema(x); return tmp - tmp)`

or

`rangeof(x) = ((low, high) = extrema(x); return high - low)`

1 Like
``````julia> using BenchmarkTools

julia> datarange(x) = -(-(extrema(x)...))
datarange (generic function with 1 method)

julia> @btime datarange(\$x)
161.395 ns (0 allocations: 0 bytes)
9.91438063817977

julia> @btime maximum(\$x) - minimum(\$x)
257.181 ns (0 allocations: 0 bytes)
9.91438063817977
``````
1 Like

Is this computation done for scaling a given set of samples? If yes, you may be interested in using TableTransforms.jl’s MinMax or Scale:

https://juliaml.github.io/TableTransforms.jl/stable/transforms/#MinMax

1 Like

Thanks, so I guess I better define one of the above range functions in my startup.jl, if there is nothing ready-made.

Fyi, I found this minmax algorithm here, which seems to perform better than the previous solutions:

``````function range2(x)
min = max = first(x)
for xi in view(x, 2:length(x))
min > xi ? min = xi : max < xi && (max = xi)
end
return max - min
end

x = 10 * rand(100)
@btime range2(\$x)        # 123 ns (0 allocs: 0 bytes)
``````

Since range is one of the important measures of variability, an optimized version should be available in the standard library, IMHO.

For performance, branches (as generated by `? :` operator) are bad. An alternate version:

``````function range3(x)
min = typemax(eltype(x))
max = typemin(eltype(x))
for xi in x
min = ifelse(min > xi, xi, min)
max = ifelse(max < xi, xi, max)
end
return max - min
end
``````

is faster on my machine.

``````julia> @btime range3(\$x)        # 123
98.743 ns (0 allocations: 0 bytes)
``````
4 Likes

Thanks @Dan for the insights, I see 40% speedup here!

This just replicates the functionality of `extrema` followed by a subtraction. If this is faster than `extrema`, that means there’s something wrong with the latter.

3 Likes

Hmm, I amend my statement a bit. `extrema` handles `NaN`s, which `range3` does not (nor does it handle empty collections):

``````julia> range3([1.0, NaN, 2.0])
1.0

julia> extrema([1.0, NaN, 2.0])
(NaN, NaN)

julia> range3(Float64[])
-Inf

julia> range3(Int[])
1
``````

If you want this to be in a stdlib, it should be robust against things like this.

3 Likes

Fixing this does not entail so much overhead:

``````function range4(x)
isempty(x) && error("Does not support empty vectors")
min = typemax(eltype(x))
max = typemin(eltype(x))
hasnan = false
for xi in x
hasnan |= isnan(xi)
min = ifelse(min > xi, xi, min)
max = ifelse(max < xi, xi, max)
end
hasnan && error("Does not support NaNs in vectors")
return max - min
end
``````
``````julia> x = 10 * rand(100);

julia> y = 10 * rand(1000000);

julia> @btime range4(\$x);
107.822 ns (0 allocations: 0 bytes)

julia> @btime extrema(\$x);
318.175 ns (0 allocations: 0 bytes)

julia> @btime range4(\$y);
1.397 ms (0 allocations: 0 bytes)

julia> @btime extrema(\$y);
3.285 ms (0 allocations: 0 bytes)
``````

There is a ~3X gap here (persisting into large vectors), and this trivial (and popular) operation could probably be optimized further (SIMD, threads etc.)

2 Likes

Can you take a look at `Base.extrema` and see if you can tell what, if anything, is wrong with it, performance-wise?

Because I really think the right approach is to use `extrema` for this, and fix any problems there.

2 Likes

Dan, sorry to ask, but is `hasnan` required?
Could we write `range4()` as follows:

``````function range4b(x)
isempty(x) && error("Does not support empty vectors")
min = typemax(eltype(x))
max = typemin(eltype(x))
for xi in x
isnan(xi) && error("Does not support NaNs in vectors")
min = ifelse(min > xi, xi, min)
max = ifelse(max < xi, xi, max)
end
return max - min
end
``````

You probably want to avoid branches.

You wouldn’t want to read the vector from memory twice. The memory transfers are critical for performance.

Whoops, the code changed… but now DNF is right, the branch is bad.

Sorry, you seem to have captured a temporary change in the code, the original snippet is restored.

But benchmarking is the best disinfectant. You are welcome to post results (my machine’s results are up there already).

Had a quick glance, but will look some more. Basically, there is an attempt to capture many reduce-like operations in one template, and perhaps the overhead of `reduce` with support to `dims` option is taking a toll.

On the NaN / empty-vec front, I think actually a vector type which prohibits NaNs can be useful (i.e. throws when operations generate a NaN). With this type, specialized cases can be dispatched.

In your examples which do not contain `NaN`, for the small array case, `range4b()` is ~5% faster, and for the large array case, `range4()` is ~2% faster.

1 Like

…and compared to `extrema`? is it still 3X?