 # Nanmean options?

Up until Julia 0.6 `mean` ignored NaN’s. What are the options for those of us that want a simple `nanmean` function?

Somewhere I remember seeing an alternative function in Base but I haven’t been able to find it again.

There were two options mentioned in the commit linked above: NanMath.jl and DataArrays.jl. NaNMath looks like a good option if there’s nothing available in Base but DataArrays looks like overkill for a simple nanmean function.

Matlab (just for those searching for nanmean usage like in Matlab)

You can define a function, but

``````mean(filter(!isnan, x))
``````

is already very compact.

Yes, but that doesn’t work if x is more that 1D and you want to apply mean to one dimension, e. g.

``````mean(filter(!isnan, x), 2)
``````

Define `nanmean` as I suggested, then `nanmean(X, d, dims)` with `mapslices`.

OK. It would be nice if this was done in `NaNMath` so it was a full replacement for the functions it defines.

1 Like

Here is a concise example based on @Tamas_Papp 's suggestions.

``````nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,y)
``````

``````julia> y = [NaN 2 3 4;5 6 NaN 8;9 10 11 12]
3×4 Array{Float64,2}:
NaN     2.0    3.0   4.0
5.0   6.0  NaN     8.0
9.0  10.0   11.0  12.0

julia> nanmean(y)
7.0

julia> nanmean(y,1)
1×4 Array{Float64,2}:
7.0  6.0  7.0  8.0

``````
3 Likes

Is `mean(filter(!isnan, x)` fast compared to filtering inside the accumulator? Does it require more memory?

`mapslices` solution is awesome but seems obscure to me, compared to `nanmean`. I always need `nanmean` for processing geophysical data, and would find it useful in StatsBase.

Also note, like Matlab’s, this `nanmean` along multiple dimensions is not associative in general, but depends on order of operations on dimensions,
`nanmean(nanmean((x,1),2) != nanmean(nanmean((x,2),1)`.

Some useful comments are posted after the closed issue Fail to ignore NaN when calculating mean of an array #4552. Note the appropriate place for comments is here in Discourse. So to move the thread here:

@timholy’s Images.jl and `meanfinite` have fast, flexible dimensioned averages that ignore non-finite values.

I have reworked `meanfinite` as `condmean` in ConditionalMean to accumulate subject to an arbitrary condition (which could be to ignore sentinel values, i.e. `NaN`, -999). It also can average a callable function of the array. `nanmean()` and `nanstd()` methods are provided. Tests and pull requests are welcome.

Now this devolves into a ease-of-use complaint: The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include `NA`, `missing`, `Null`, Unions thereof, and sentinel values for numeric types (e.g. `NaN`).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it’s impossible hard to tell what works, what’s supported, deprecated, or broken.

I sympathize with the frustration about this issue, but note that things are in transition now because of the anticipated efficiency gain for small unions in `v0.7`. Hopefully the semantics will settle soon after the next stable release.

2 Likes

thanks daneel. here is a more generic version of your code above:

``````using Statistics, Test

_nanfunc(f, A, ::Colon) = f(filter(!isnan, A))
_nanfunc(f, A, dims) = mapslices(a->_nanfunc(f,a,:), A, dims=dims)
nanfunc(f, A; dims=:) = _nanfunc(f, A, dims)

A = [1 2 3; 4 5 6; 7 8 9; NaN 11 12]

@test isapprox(nanfunc(mean, A), mean(filter(!isnan, A)))
@test nanfunc(mean, A, dims=1) == [4.0 6.5 7.5]
@test nanfunc(mean, A, dims=2) == transpose([2.0 5.0 8.0 11.5])

@test isapprox(nanfunc(var, A), var(filter(!isnan, A)))
@test nanfunc(var, A, dims=1) == [9.0 15.0 15.0]
@test nanfunc(var, A, dims=2) == transpose([1.0 1.0 1.0 0.5])
``````

nanmean(y,1) throws an error for me:

MethodError: no method matching mapslices(::typeof(nanmean), ::Array{Float64,2}, ::Int64)

Good catch. At some point `mapslices` changed how it inputs dimensions. Apparently I never use that case in my code so I hadn’t noticed it. Fortunately the fix is simple, `y``dims=y`

``````nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,dims=y)
``````