Statistics.var wasting time calculating the mean?

I was trying to understand why Statistics.jl has both var and varm functions (since var already accepts a mean parameter). Doing @less var([1,2,3]) I see:

var(A::AbstractArray; corrected::Bool=true, mean=nothing, dims=:) = _var(A, corrected, mean, dims)

_var(A::AbstractArray, corrected::Bool, mean, dims) =
    varm(A, something(mean, Statistics.mean(A, dims=dims)); corrected=corrected, dims=dims)

_var(A::AbstractArray, corrected::Bool, mean, ::Colon) =
    real(varm(A, something(mean, Statistics.mean(A)); corrected=corrected))

Am I missing something here, or is the method recalculating the mean even when it is provided? Isn’t it wasteful?

Bonus question: I’m still not clear why Statistics exposes both var and varm.

It looks like the default value for mean in the first method is nothing. It is eventually passed to

varm(A, something(mean, Statistics.mean(A, dims=dims)); corrected=corrected, dims=dims)

If the first argument to something is nothing, the result is the second argument:

julia> something(nothing, 2)
2

If the first argument to something is not nothing, the result is the first argument:

julia> something(3, 2)
3

So it appears that the mean is not recalculated if it is provided.

I’m not sure why there are two methods with the same functionality. My guess is that varm might be deprecated at some point, or might eventually be for internal use only. Good question.

I believe these functions existed before Julia had keyword arguments.

Since something is a function and not control flow, I think it must evaluate both arguments, even if it then chooses to return the first. So to me it looks like it still recalculates the mean, which is surprising.
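
To make that concrete, here is a minimal sketch using a hypothetical expensive_default helper (a stand-in for Statistics.mean(A), not part of Statistics.jl) that counts how often it runs. The @something macro used for comparison assumes Julia 1.7 or later.

```julia
# Hypothetical helper standing in for Statistics.mean(A); it counts its own calls.
const CALLS = Ref(0)
expensive_default() = (CALLS[] += 1; 2.0)

# `something` is an ordinary function, so both arguments are evaluated
# before the call, even though the first one ends up being returned.
a = something(3.0, expensive_default())
calls_eager = CALLS[]

# `@something` is a macro: it stops at the first value that is not `nothing`,
# so the fallback expression is never evaluated here.
CALLS[] = 0
b = @something(3.0, expensive_default())
calls_lazy = CALLS[]
```

Comparing calls_eager with calls_lazy shows whether the fallback ran in each case; both a and b hold the provided value.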

This is quite easy to test for yourself:

julia> using Statistics, BenchmarkTools

julia> let v = randn(1_000)
           @btime var($v)
           @btime var($v, mean=$(mean(v)))
           @btime varm($v, $(mean(v)))
           @btime mean($v)
       end;
  197.689 ns (0 allocations: 0 bytes)
  195.053 ns (0 allocations: 0 bytes)
  117.394 ns (0 allocations: 0 bytes)
  79.235 ns (0 allocations: 0 bytes)

Yes, it’s quite clear that var(v) and var(v; mean=meanv) have essentially identical timings (about 197 ns), whereas the gap between them and varm(v, meanv) (about 80 ns) is almost exactly the time it takes to calculate mean(v).

I’m sure a PR to Statistics.jl rectifying this would be quite welcome.
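
One possible shape for such a change, as a rough sketch only: var_lazy below is a hypothetical wrapper, not the actual Statistics.jl code or any eventual PR, and it ignores the dims keyword for brevity. It assumes the only goal is to skip the mean computation when the caller supplies one.

```julia
using Statistics

# Hypothetical wrapper (not Statistics.jl's implementation): compute the mean
# only when the caller did not pass one, then defer to varm.
function var_lazy(A::AbstractArray; corrected::Bool=true, mean=nothing)
    # The keyword `mean` shadows the function here, hence `Statistics.mean`.
    m = mean === nothing ? Statistics.mean(A) : mean
    return real(varm(A, m; corrected=corrected))
end

v = randn(1_000)
same_without_mean = var_lazy(v) ≈ var(v)                # mean computed internally
same_with_mean    = var_lazy(v; mean=mean(v)) ≈ var(v)  # supplied mean used as-is
```

The explicit === nothing branch keeps the fallback out of the argument list, so it is evaluated only when needed.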

I assumed that something contained control flow logic to prevent unnecessary evaluations. Good catch!

Edit: Oh man. I just realized that control flow doesn’t even matter because the mean has to be calculated before it’s passed to something. :laughing: Perhaps that’s why it was lurking there for years.

That makes sense. And there’s already an issue about this: https://github.com/JuliaLang/Statistics.jl/issues/5.

@Mason thanks for the benchmarks!

I’ll make a PR (in a few days).

I believe this is the result of constant propagation of nothing, which should statically dispatch to the appropriate something method.