Range (statistics)

I often need to compute a statistical range (difference between maximum and minimum).
Is there a function in the Statistics standard library to compute the range? Or should we use: maximum(x) - minimum(x)?

Since extrema is slightly faster than calling both functions separately, you could do this

rangeof(x) = (tmp = extrema(x); return tmp[2] - tmp[1])

or

rangeof(x) = ((low, high) = extrema(x); return high - low)

1 Like
julia> using BenchmarkTools

julia> datarange(x) = -(-(extrema(x)...))
datarange (generic function with 1 method)

julia> @btime datarange($x)
  161.395 ns (0 allocations: 0 bytes)
9.91438063817977

julia> @btime maximum($x) - minimum($x)
  257.181 ns (0 allocations: 0 bytes)
9.91438063817977
1 Like

Is this computation done for scaling a given set of samples? If yes, you may be interested in using TableTransforms.jl’s MinMax or Scale:

https://juliaml.github.io/TableTransforms.jl/stable/transforms/#MinMax

1 Like

Thanks, so I guess I better define one of the above range functions in my startup.jl, if there is nothing ready-made.

Fyi, I found this minmax algorithm here, which seems to perform better than the previous solutions:

function range2(x)
    min = max = first(x)
    for xi in view(x, 2:length(x))
        min > xi ? min = xi : max < xi && (max = xi)
    end
    return max - min
end

x = 10 * rand(100)
@btime range2($x)        # 123 ns (0 allocs: 0 bytes)

Since range is one of the important measures of variability, an optimized version should be available in the standard library, IMHO.

For performance, branches (as generated by ? : operator) are bad. An alternate version:

function range3(x)
    min = typemax(eltype(x))
    max = typemin(eltype(x))
    for xi in x
        min = ifelse(min > xi, xi, min)
        max = ifelse(max < xi, xi, max)
    end
    return max - min
end

is faster on my machine.

julia> @btime range3($x)        # 123
  98.743 ns (0 allocations: 0 bytes)
4 Likes

Thanks @Dan for the insights, I see 40% speedup here!

This just replicates the functionality of extrema followed by a subtraction. If this is faster than extrema, that means there’s something wrong with the latter.

3 Likes

Hmm, I amend my statement a bit. extrema handles NaNs, which range3 does not (nor does it handle empty collections):

julia> range3([1.0, NaN, 2.0])
1.0

julia> extrema([1.0, NaN, 2.0])
(NaN, NaN)

julia> range3(Float64[])
-Inf

julia> range3(Int[])
1

If you want this to be in a stdlib, it should be robust against things like this.

3 Likes

Fixing this does not entail so much overhead:

function range4(x)
    isempty(x) && error("Does not support empty vectors")
    min = typemax(eltype(x))
    max = typemin(eltype(x))
    hasnan = false
    for xi in x
        hasnan |= isnan(xi)
        min = ifelse(min > xi, xi, min)
        max = ifelse(max < xi, xi, max)
    end
    hasnan && error("Does not support NaNs in vectors")
    return max - min
end
julia> x = 10 * rand(100);

julia> y = 10 * rand(1000000);

julia> @btime range4($x);
  107.822 ns (0 allocations: 0 bytes)

julia> @btime extrema($x);
  318.175 ns (0 allocations: 0 bytes)

julia> @btime range4($y);
  1.397 ms (0 allocations: 0 bytes)

julia> @btime extrema($y);
  3.285 ms (0 allocations: 0 bytes)

There is a ~3X gap here (persisting into large vectors), and this trivial (and popular) operation could probably be optimized further (SIMD, threads etc.)

2 Likes

Can you take a look at Base.extrema and see if you can tell what, if anything, is wrong with it, performance-wise?

Because I really think the right approach is to use extrema for this, and fix any problems there.

2 Likes

Dan, sorry to ask, but is hasnan required?
Could we write range4() as follows:

function range4b(x)
    isempty(x) && error("Does not support empty vectors")
    min = typemax(eltype(x))
    max = typemin(eltype(x))
    for xi in x
        isnan(xi) && error("Does not support NaNs in vectors")
        min = ifelse(min > xi, xi, min)
        max = ifelse(max < xi, xi, max)
    end
    return max - min
end

You probably want to avoid branches.

You wouldn’t want to read the vector from memory twice. The memory transfers are critical for performance.

Whoops, the code changed… but now DNF is right, the branch is bad.

Sorry, you seem to have captured a temporary change in the code, the original snippet is restored.

But benchmarking is the best disinfectant. You are welcome to post results (my machine’s results are up there already).

Had a quick glance, but will look some more. Basically, there is an attempt to capture many reduce-like operations in one template, and perhaps the overhead of reduce with support to dims option is taking a toll.

On the NaN / empty-vec front, I think actually a vector type which prohibits NaNs can be useful (i.e. throws when operations generate a NaN). With this type, specialized cases can be dispatched.

In your examples which do not contain NaN, for the small array case, range4b() is ~5% faster, and for the large array case, range4() is ~2% faster.

1 Like

…and compared to extrema? is it still 3X?