Bug in var() when supplying means as vectors

I’m trying to standardize (subtract the mean and divide by the standard deviation) of each column in a matrix. In doing so, I think I’v discovered a bug in the var() function when supplying the means.

The following code calculates the variance of the second column of Y as NaN or Inf about 1% of the time.

using Statistics

n = 1000000
failed = 0

for i = 1:n

    Y = [-0.1541210046743307 -1.4703177132816831 0.13467071528166455 -0.09852533308058764 0.877991436467067;
        1.0834881229604125 -3.027334287373694 -1.2882941558820766 -1.5294166969571168 -0.7046862823383356;
        -0.3527617642771778 -1.9171259707603048 0.38208240655404657 -0.1838590266087666 0.18088977119725;
        -0.05385169300191628 -0.5331876236894679 0.13781372390866994 -0.1991690640034743 -0.4252461450819313;
        0.847451616756052 0.13212361707950138 -1.3483017839678413 -0.6442240850236565 -0.49114238748086886]

    μ = vec(mean(Y, dims=1))
    σ2 = var(Y, dims = 1, mean = μ)

    if isnan(σ2[2]) || σ2[2] == Inf
        global failed += 1
    end

end

@show failed / n

While the same code without the vec() wrapping the means does not result in the same error.

More critically, if I add a column to the array and run the code, I get an EXCEPTION_ACCESS_VIOLATION.

The code

using Statistics

n = 1000000
failed = 0

for i = 1:n

    Y = [-0.1541210046743307 -1.4703177132816831 0.13467071528166455 -0.09852533308058764 0.877991436467067 1.0;
        1.0834881229604125 -3.027334287373694 -1.2882941558820766 -1.5294166969571168 -0.7046862823383356 2.0;
        -0.3527617642771778 -1.9171259707603048 0.38208240655404657 -0.1838590266087666 0.18088977119725 3.0;
        -0.05385169300191628 -0.5331876236894679 0.13781372390866994 -0.1991690640034743 -0.4252461450819313 4.0;
        0.847451616756052 0.13212361707950138 -1.3483017839678413 -0.6442240850236565 -0.49114238748086886 5.0]

    μ = vec(mean(Y, dims=1))
    σ2 = var(Y, dims = 1, mean = μ)

    if isnan(σ2[2]) || σ2[2] == Inf
        global failed += 1
    end

end

@show failed / n

produces the following error:

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x1dac3b1f – getindex at .\array.jl:788 [inlined]
getindex at .\multidimensional.jl:543 [inlined]
centralize_sumabs2! at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:247
in expression starting at C:\Users\gabeh\OneDrive\Documents\Running Scripts\test_nan.jl:6
getindex at .\array.jl:788 [inlined]
getindex at .\multidimensional.jl:543 [inlined]
centralize_sumabs2! at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:247
#varm!#10 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:269
varm!##kw at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:265 [inlined]
_varm at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:299 [inlined]
#varm#11 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:297 [inlined]
varm##kw at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:297 [inlined]
_var at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:337
#var#15 at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:335 [inlined]
var##kw at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Statistics\src\Statistics.jl:335 [inlined]
top-level scope at C:\Users\gabeh\OneDrive\Documents\Running Scripts\test_nan.jl:15
jl_toplevel_eval_flex at /cygdrive/d/buildbot/worker/package_win64/build/src\toplevel.c:808
jl_parse_eval_all at /cygdrive/d/buildbot/worker/package_win64/build/src\ast.c:872
include_string at .\loading.jl:1080
#200 at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:164
withpath at C:\Users\gabeh.julia\packages\CodeTools\kosGY\src\utils.jl:30
withpath at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:9
#199 at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:161 [inlined]
with_logstate at .\logging.jl:398
with_logger at .\logging.jl:505 [inlined]
#198 at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:160 [inlined]
hideprompt at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\repl.jl:140
macro expansion at C:\Users\gabeh.julia\packages\Media\ItEPc\src\dynamic.jl:24 [inlined]
evalall at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:150
jl_apply at /cygdrive/d/buildbot/worker/package_win64/build/src\julia.h:1692 [inlined]
do_apply at /cygdrive/d/buildbot/worker/package_win64/build/src\builtins.c:643
macro expansion at C:\Users\gabeh.julia\packages\Atom\wlPiw\src\eval.jl:39 [inlined]
#172 at .\task.jl:358
unknown function (ip: 000000001DA7BF83)
jl_apply at /cygdrive/d/buildbot/worker/package_win64/build/src\julia.h:1692 [inlined]
start_task at /cygdrive/d/buildbot/worker/package_win64/build/src\task.c:687
Allocations: 309493749 (Pool: 309480613; Big: 13136); GC: 531

Similarly, when I remove the vec() wrapper around the mean, there is no error.

I am using Julia 1.4.0 on a Windows 10 machine. I’ve also run this code on Julia 1.3.0 and 1.3.1 with similar results.

1 Like

Good catch. Looks like there’s a missing check: the means must have a shape compatible with what mean(Y, dims=1) returns, but by calling vec you’re reshaping a 1-row matrix into a column vector, so indexing goes out of bounds. I’ll make a pull request to throw an error instead.

1 Like

See Add missing shape checks for the means argument to var[m] and std[m] by nalimilan · Pull Request #32 · JuliaStats/Statistics.jl · GitHub.

Thanks!

You might also make use of the available standardization:

https://juliastats.org/StatsBase.jl/stable/transformations/