Nanmean options?

Daneel · July 21, 2017, 8:09am

Up until Julia 0.6 mean ignored NaN’s. What are the options for those of us that want a simple nanmean function?

Somewhere I remember seeing an alternative function in Base but I haven’t been able to find it again.

There were two options mentioned in the commit linked above: NanMath.jl and DataArrays.jl. NaNMath looks like a good option if there’s nothing available in Base but DataArrays looks like overkill for a simple nanmean function.

Matlab (just for those searching for nanmean usage like in Matlab)

Tamas_Papp · July 21, 2017, 2:03pm

You can define a function, but

mean(filter(!isnan, x))

is already very compact.

bobportmann · July 21, 2017, 2:32pm

Yes, but that doesn’t work if x is more that 1D and you want to apply mean to one dimension, e. g.

mean(filter(!isnan, x), 2)

Tamas_Papp · July 21, 2017, 4:34pm

Define nanmean as I suggested, then nanmean(X, d, dims) with mapslices.

bobportmann · July 21, 2017, 6:07pm

OK. It would be nice if this was done in NaNMath so it was a full replacement for the functions it defines.

Daneel · July 26, 2017, 11:45am

Here is a concise example based on @Tamas_Papp 's suggestions.

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,y)

julia> y = [NaN 2 3 4;5 6 NaN 8;9 10 11 12]
3×4 Array{Float64,2}:
 NaN     2.0    3.0   4.0
   5.0   6.0  NaN     8.0
   9.0  10.0   11.0  12.0

julia> nanmean(y)
7.0

julia> nanmean(y,1)
1×4 Array{Float64,2}:
 7.0  6.0  7.0  8.0

deszoeke · March 5, 2018, 6:36pm

Is mean(filter(!isnan, x) fast compared to filtering inside the accumulator? Does it require more memory?

mapslices solution is awesome but seems obscure to me, compared to nanmean. I always need nanmean for processing geophysical data, and would find it useful in StatsBase.

Also note, like Matlab’s, this nanmean along multiple dimensions is not associative in general, but depends on order of operations on dimensions,
nanmean(nanmean((x,1),2) != nanmean(nanmean((x,2),1).

deszoeke · May 18, 2018, 5:27pm

Some useful comments are posted after the closed issue Fail to ignore NaN when calculating mean of an array #4552. Note the appropriate place for comments is here in Discourse. So to move the thread here:

@timholy’s Images.jl and meanfinite have fast, flexible dimensioned averages that ignore non-finite values.

I have reworked meanfinite as condmean in ConditionalMean to accumulate subject to an arbitrary condition (which could be to ignore sentinel values, i.e. NaN, -999). It also can average a callable function of the array. nanmean() and nanstd() methods are provided. Tests and pull requests are welcome.

Now this devolves into a ease-of-use complaint: The unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this include NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).

This diversity and changing architecture leads to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it’s ~~impossible~~ hard to tell what works, what’s supported, deprecated, or broken.

Tamas_Papp · May 18, 2018, 5:40pm

I sympathize with the frustration about this issue, but note that things are in transition now because of the anticipated efficiency gain for small unions in v0.7. Hopefully the semantics will settle soon after the next stable release.

bjarthur · June 8, 2020, 8:17pm

thanks daneel. here is a more generic version of your code above:

using Statistics, Test

_nanfunc(f, A, ::Colon) = f(filter(!isnan, A))
_nanfunc(f, A, dims) = mapslices(a->_nanfunc(f,a,:), A, dims=dims)
nanfunc(f, A; dims=:) = _nanfunc(f, A, dims)

A = [1 2 3; 4 5 6; 7 8 9; NaN 11 12]

@test isapprox(nanfunc(mean, A), mean(filter(!isnan, A)))
@test nanfunc(mean, A, dims=1) == [4.0 6.5 7.5]
@test nanfunc(mean, A, dims=2) == transpose([2.0 5.0 8.0 11.5])

@test isapprox(nanfunc(var, A), var(filter(!isnan, A)))
@test nanfunc(var, A, dims=1) == [9.0 15.0 15.0]
@test nanfunc(var, A, dims=2) == transpose([1.0 1.0 1.0 0.5])

Nash · April 8, 2021, 7:31am

nanmean(y,1) throws an error for me:

MethodError: no method matching mapslices(::typeof(nanmean), ::Array{Float64,2}, ::Int64)

Daneel · April 8, 2021, 8:20am

Good catch. At some point mapslices changed how it inputs dimensions. Apparently I never use that case in my code so I hadn’t noticed it. Fortunately the fix is simple, y ⟹ dims=y

nanmean(x) = mean(filter(!isnan,x))
nanmean(x,y) = mapslices(nanmean,x,dims=y)

thompsonmj · June 3, 2021, 7:56pm

When using mean(filter(!isnan, A), dims=2) for a 2D matrix A, the filter flattens it into a vector before taking the mean. How can I do a mean over the columns of matrix ignoring the NaN values?

aramirezreyes · June 3, 2021, 8:33pm

Maybe GitHub - brenhinkeller/NaNStatistics.jl: Fast summary statistics, histograms, and binning – ignoring NaNs could help here.

thompsonmj · June 3, 2021, 8:57pm

Cool! That worked for me, thanks! Seems like this should be achievable with the skipmissing feature in Statistics, but it doesn’t seem to accept both skipmissing and dims.

ZekeMarshall · February 11, 2024, 10:55am

Thanks, this thread is a lifesaver and such a feature does feel like a conspicuous absence! Particularly the lack of support for using both skipmissing and dims= as @thompsonmj mentions.

In my case I needed to find the minimum value in each column of a 2d array mat, ignoring zeros. I find that all three of the methods below work:

mat = rand(0:5, 5, 11)

1: mapslices(x -> minimum(skipmissing(x)), replace(mat, 0 => missing), dims = 1)
2: mapslices(x -> minimum(filter(!isnan, x)), replace(mat, 0 => NaN), dims = 1)
3: mapslices(x -> minimum(filter(!iszero, x)), mat, dims = 1)

However, these approaches only work if there are no columns which only contain zeros. Which will lead to the error:

ERROR: MethodError: reducing over an empty collection is not allowed

There is a huge performance penalty when doing this with anonymous functions, e.g.

@elapsed minimum(mat, dims = 1) ~ 3.8735e-5s
@elapsed mapslices(x -> minimum(filter(!iszero, x)), mat, dims = 1) ~ 0.159694082s

Defining the function beforehand helps, e.g.

nzmin(x) = minimum(filter(!iszero, x))
nzmin(x,y) = mapslices(nzmin, mat, dims = y)
@elapsed nzmin(mat,1) ~ 0.000339807s

or using a modified version of the generalised function @bjarthur outlines:

_nzfunc(f, A, ::Colon) = f(filter(!iszero, A))
_nzfunc(f, A, dims) = mapslices(a->_nzfunc(f,a,:), A, dims=dims)
nzfunc(f, A; dims=:) = _nzfunc(f, A, dims)

@elapsed nzfunc(minimum, mat, dims=1) ~ 0.000850458

Which incurs a small penalty in passing a function through, relative to defining a specific function.

I’m new new to Julia, so may have missed something here!

aplavin · February 11, 2024, 1:12pm

minimum(filter(!iszero, x)) materializes an actual filtered array in memory, making it not that efficient. The skip function (from Skipper.jl) would be better here.
mapslices seem to be suboptimal, regular map() on eachcol(mat) can be faster.
For measuring time taken by fast pieces of code (<< 1 sec), use BenchmarkTools.jl instead of @time/@elapsed.

Comparison:

# baseline:
julia> @btime minimum($mat, dims = 1)
  215.762 ns (3 allocations: 240 bytes)

julia> using Skipper

# skip zeros using mapslices
julia> @btime mapslices(x -> minimum(skip(iszero, x)), $mat, dims=1)
  499.139 ns (15 allocations: 624 bytes)

# skip zeros using map(eachcol)
julia> @btime map(x -> minimum(skip(iszero, x)), eachcol($mat))
  64.626 ns (1 allocation: 144 bytes)

ZekeMarshall · February 11, 2024, 6:50pm

Thank you very much @aplavin!! I didn’t know about Skipper, just what I needed!

Topic		Replies	Views
Nanmean for 3d array General Usage question	3	799	June 9, 2021
Ignoring NaN in elementwise aggregations General Usage question	7	6346	December 20, 2019
NA-ignoring aggregations General Usage	15	5258	January 5, 2019
Reducing verbosity in array calculations General Usage question , statistics , array , syntax	8	1315	February 5, 2018
Arithmetic operations on multi-dimensional arrays with Missings New to Julia	4	866	June 27, 2018

Nanmean options?

Related topics