Nanmean options?

Up until Julia 0.6 mean ignored NaN’s. What are the options for those of us that want a simple nanmean function?

Somewhere I remember seeing an alternative function in Base but I haven’t been able to find it again.

There were two options mentioned in the commit linked above: NanMath.jl and DataArrays.jl. NaNMath looks like a good option if there’s nothing available in Base but DataArrays looks like overkill for a simple nanmean function.

Matlab (just for those searching for nanmean usage like in Matlab)

You can define a function, but

mean(filter(!isnan, x))

is already very compact.

Yes, but that doesn’t work if x is more that 1D and you want to apply mean to one dimension, e. g.

mean(filter(!isnan, x), 2)

Define nanmean as I suggested, then nanmean(X, d, dims) with mapslices.

1 Like

OK. It would be nice if this was done in NaNMath so it was a full replacement for the functions it defines.

1 Like

Here is a concise example based on @Tamas_Papp 's suggestions.

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,y)

julia> y = [NaN 2 3 4;5 6 NaN 8;9 10 11 12]
3×4 Array{Float64,2}:
 NaN     2.0    3.0   4.0
   5.0   6.0  NaN     8.0
   9.0  10.0   11.0  12.0

julia> nanmean(y)
7.0

julia> nanmean(y,1)
1×4 Array{Float64,2}:
 7.0  6.0  7.0  8.0

5 Likes

Is mean(filter(!isnan, x) fast compared to filtering inside the accumulator? Does it require more memory?

mapslices solution is awesome but seems obscure to me, compared to nanmean. I always need nanmean for processing geophysical data, and would find it useful in StatsBase.

Also note, like Matlab’s, this nanmean along multiple dimensions is not associative in general, but depends on order of operations on dimensions,
nanmean(nanmean((x,1),2) != nanmean(nanmean((x,2),1).

Some useful comments are posted after the closed issue Fail to ignore NaN when calculating mean of an array #4552. Note the appropriate place for comments is here in Discourse. So to move the thread here:

@timholy’s Images.jl and meanfinite have fast, flexible dimensioned averages that ignore non-finite values.

I have reworked meanfinite as condmean in ConditionalMean to accumulate subject to an arbitrary condition (which could be to ignore sentinel values, i.e. NaN, -999). It also can average a callable function of the array. nanmean() and nanstd() methods are provided. Tests and pull requests are welcome.

Now this devolves into a ease-of-use complaint: The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it’s impossible hard to tell what works, what’s supported, deprecated, or broken.

I sympathize with the frustration about this issue, but note that things are in transition now because of the anticipated efficiency gain for small unions in v0.7. Hopefully the semantics will settle soon after the next stable release.

2 Likes

thanks daneel. here is a more generic version of your code above:

using Statistics, Test

_nanfunc(f, A, ::Colon) = f(filter(!isnan, A))
_nanfunc(f, A, dims) = mapslices(a->_nanfunc(f,a,:), A, dims=dims)
nanfunc(f, A; dims=:) = _nanfunc(f, A, dims)

A = [1 2 3; 4 5 6; 7 8 9; NaN 11 12]

@test isapprox(nanfunc(mean, A), mean(filter(!isnan, A)))
@test nanfunc(mean, A, dims=1) == [4.0 6.5 7.5]
@test nanfunc(mean, A, dims=2) == transpose([2.0 5.0 8.0 11.5])

@test isapprox(nanfunc(var, A), var(filter(!isnan, A)))
@test nanfunc(var, A, dims=1) == [9.0 15.0 15.0]
@test nanfunc(var, A, dims=2) == transpose([1.0 1.0 1.0 0.5])
1 Like

nanmean(y,1) throws an error for me:

MethodError: no method matching mapslices(::typeof(nanmean), ::Array{Float64,2}, ::Int64)

Good catch. At some point mapslices changed how it inputs dimensions. Apparently I never use that case in my code so I hadn’t noticed it. Fortunately the fix is simple, ydims=y

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,dims=y)
1 Like

When using mean(filter(!isnan, A), dims=2) for a 2D matrix A, the filter flattens it into a vector before taking the mean. How can I do a mean over the columns of matrix ignoring the NaN values?

1 Like

Maybe GitHub - brenhinkeller/NaNStatistics.jl: Fast summary statistics, histograms, and binning – ignoring NaNs could help here.

1 Like

Cool! That worked for me, thanks! Seems like this should be achievable with the skipmissing feature in Statistics, but it doesn’t seem to accept both skipmissing and dims.

1 Like

Thanks, this thread is a lifesaver and such a feature does feel like a conspicuous absence! Particularly the lack of support for using both skipmissing and dims= as @thompsonmj mentions.

In my case I needed to find the minimum value in each column of a 2d array mat, ignoring zeros. I find that all three of the methods below work:

mat = rand(0:5, 5, 11)

1: mapslices(x -> minimum(skipmissing(x)), replace(mat, 0 => missing), dims = 1)
2: mapslices(x -> minimum(filter(!isnan, x)), replace(mat, 0 => NaN), dims = 1)
3: mapslices(x -> minimum(filter(!iszero, x)), mat, dims = 1)

However, these approaches only work if there are no columns which only contain zeros. Which will lead to the error:

ERROR: MethodError: reducing over an empty collection is not allowed

There is a huge performance penalty when doing this with anonymous functions, e.g.

@elapsed minimum(mat, dims = 1) ~ 3.8735e-5s
@elapsed mapslices(x -> minimum(filter(!iszero, x)), mat, dims = 1) ~ 0.159694082s

Defining the function beforehand helps, e.g.

nzmin(x) = minimum(filter(!iszero, x))
nzmin(x,y) = mapslices(nzmin, mat, dims = y)
@elapsed nzmin(mat,1) ~ 0.000339807s

or using a modified version of the generalised function @bjarthur outlines:

_nzfunc(f, A, ::Colon) = f(filter(!iszero, A))
_nzfunc(f, A, dims) = mapslices(a->_nzfunc(f,a,:), A, dims=dims)
nzfunc(f, A; dims=:) = _nzfunc(f, A, dims)

@elapsed nzfunc(minimum, mat, dims=1) ~ 0.000850458

Which incurs a small penalty in passing a function through, relative to defining a specific function.

I’m new new to Julia, so may have missed something here!

1 Like
  1. minimum(filter(!iszero, x)) materializes an actual filtered array in memory, making it not that efficient. The skip function (from Skipper.jl) would be better here.

  2. mapslices seem to be suboptimal, regular map() on eachcol(mat) can be faster.

  3. For measuring time taken by fast pieces of code (<< 1 sec), use BenchmarkTools.jl instead of @time/@elapsed.

Comparison:

# baseline:
julia> @btime minimum($mat, dims = 1)
  215.762 ns (3 allocations: 240 bytes)

julia> using Skipper

# skip zeros using mapslices
julia> @btime mapslices(x -> minimum(skip(iszero, x)), $mat, dims=1)
  499.139 ns (15 allocations: 624 bytes)

# skip zeros using map(eachcol)
julia> @btime map(x -> minimum(skip(iszero, x)), eachcol($mat))
  64.626 ns (1 allocation: 144 bytes)
3 Likes

Thank you very much @aplavin!! I didn’t know about Skipper, just what I needed!