Nanmean options?

Up until Julia 0.6 mean ignored NaN’s. What are the options for those of us that want a simple nanmean function?

Somewhere I remember seeing an alternative function in Base but I haven’t been able to find it again.

There were two options mentioned in the commit linked above: NanMath.jl and DataArrays.jl. NaNMath looks like a good option if there’s nothing available in Base but DataArrays looks like overkill for a simple nanmean function.

Matlab (just for those searching for nanmean usage like in Matlab)

You can define a function, but

mean(filter(!isnan, x))

is already very compact.

Yes, but that doesn’t work if x is more that 1D and you want to apply mean to one dimension, e. g.

mean(filter(!isnan, x), 2)

Define nanmean as I suggested, then nanmean(X, d, dims) with mapslices.

OK. It would be nice if this was done in NaNMath so it was a full replacement for the functions it defines.

1 Like

Here is a concise example based on @Tamas_Papp 's suggestions.

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,y)

julia> y = [NaN 2 3 4;5 6 NaN 8;9 10 11 12]
3×4 Array{Float64,2}:
 NaN     2.0    3.0   4.0
   5.0   6.0  NaN     8.0
   9.0  10.0   11.0  12.0

julia> nanmean(y)

julia> nanmean(y,1)
1×4 Array{Float64,2}:
 7.0  6.0  7.0  8.0


Is mean(filter(!isnan, x) fast compared to filtering inside the accumulator? Does it require more memory?

mapslices solution is awesome but seems obscure to me, compared to nanmean. I always need nanmean for processing geophysical data, and would find it useful in StatsBase.

Also note, like Matlab’s, this nanmean along multiple dimensions is not associative in general, but depends on order of operations on dimensions,
nanmean(nanmean((x,1),2) != nanmean(nanmean((x,2),1).

Some useful comments are posted after the closed issue Fail to ignore NaN when calculating mean of an array #4552. Note the appropriate place for comments is here in Discourse. So to move the thread here:

@timholy’s Images.jl and meanfinite have fast, flexible dimensioned averages that ignore non-finite values.

I have reworked meanfinite as condmean in ConditionalMean to accumulate subject to an arbitrary condition (which could be to ignore sentinel values, i.e. NaN, -999). It also can average a callable function of the array. nanmean() and nanstd() methods are provided. Tests and pull requests are welcome.

Now this devolves into a ease-of-use complaint: The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it’s impossible hard to tell what works, what’s supported, deprecated, or broken.

I sympathize with the frustration about this issue, but note that things are in transition now because of the anticipated efficiency gain for small unions in v0.7. Hopefully the semantics will settle soon after the next stable release.


thanks daneel. here is a more generic version of your code above:

using Statistics, Test

_nanfunc(f, A, ::Colon) = f(filter(!isnan, A))
_nanfunc(f, A, dims) = mapslices(a->_nanfunc(f,a,:), A, dims=dims)
nanfunc(f, A; dims=:) = _nanfunc(f, A, dims)

A = [1 2 3; 4 5 6; 7 8 9; NaN 11 12]

@test isapprox(nanfunc(mean, A), mean(filter(!isnan, A)))
@test nanfunc(mean, A, dims=1) == [4.0 6.5 7.5]
@test nanfunc(mean, A, dims=2) == transpose([2.0 5.0 8.0 11.5])

@test isapprox(nanfunc(var, A), var(filter(!isnan, A)))
@test nanfunc(var, A, dims=1) == [9.0 15.0 15.0]
@test nanfunc(var, A, dims=2) == transpose([1.0 1.0 1.0 0.5])

nanmean(y,1) throws an error for me:

MethodError: no method matching mapslices(::typeof(nanmean), ::Array{Float64,2}, ::Int64)

Good catch. At some point mapslices changed how it inputs dimensions. Apparently I never use that case in my code so I hadn’t noticed it. Fortunately the fix is simple, ydims=y

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,dims=y)

When using mean(filter(!isnan, A), dims=2) for a 2D matrix A, the filter flattens it into a vector before taking the mean. How can I do a mean over the columns of matrix ignoring the NaN values?

Maybe GitHub - brenhinkeller/NaNStatistics.jl: Fast summary statistics, histograms, and binning – ignoring NaNs could help here.

1 Like

Cool! That worked for me, thanks! Seems like this should be achievable with the skipmissing feature in Statistics, but it doesn’t seem to accept both skipmissing and dims.