Somewhere I remember seeing an alternative function in Base but I haven’t been able to find it again.
There were two options mentioned in the commit linked above: NaNMath.jl and DataArrays.jl. NaNMath looks like a good option if there’s nothing available in Base, but DataArrays looks like overkill for a simple nanmean function.
Matlab (just for those searching for nanmean usage like in Matlab)
Is mean(filter(!isnan, x)) fast compared to filtering inside the accumulator? Does it require more memory?
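For comparison, here is a minimal sketch of the variants in question. The names nanmean_filter, nanmean_lazy and nanmean_loop are made up for illustration and are not from Base or any package. filter on an array returns a new array of the kept elements, so the first version allocates a temporary; the other two skip NaNs while accumulating. Which one is faster would need to be measured on real data.

```julia
using Statistics

# Allocating version: filter builds a temporary vector of the non-NaN values.
nanmean_filter(x) = mean(filter(!isnan, x))

# Lazy version: Iterators.filter yields the kept values without a temporary array.
nanmean_lazy(x) = mean(Iterators.filter(!isnan, x))

# Explicit single pass: the NaN check happens inside the accumulator loop.
function nanmean_loop(x)
    total = zero(float(eltype(x)))
    n = 0
    for v in x
        if !isnan(v)
            total += v
            n += 1
        end
    end
    return total / n   # 0/0 gives NaN when every element is NaN
end

x = [1.0, NaN, 3.0, NaN, 5.0]
nanmean_filter(x)  # 3.0
nanmean_lazy(x)    # 3.0
nanmean_loop(x)    # 3.0
```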
The mapslices solution is awesome but seems obscure to me compared to nanmean. I always need nanmean for processing geophysical data, and would find it useful in StatsBase.
Also note that, like Matlab’s, this nanmean along multiple dimensions is not associative in general; the result depends on the order in which the dimensions are reduced: nanmean(nanmean(x,1),2) != nanmean(nanmean(x,2),1).
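A small worked case of that order dependence, assuming a nanmean built from mapslices and filter (this definition is a sketch, and the dims keyword form assumes Julia 1.0 or later):

```julia
using Statistics

# Sketch definitions: NaN-ignoring mean of a collection, and along a dimension.
nanmean(x) = mean(filter(!isnan, x))
nanmean(x, d) = mapslices(nanmean, x; dims=d)

x = [1.0 2.0;
     NaN 4.0]

# Reduce dimension 1 first: column means are [1.0 3.0], then their mean is 2.0.
a = nanmean(nanmean(x, 1), 2)

# Reduce dimension 2 first: row means are [1.5; 4.0], then their mean is 2.75.
b = nanmean(nanmean(x, 2), 1)

a[1] == b[1]  # false
```

The two results differ because the NaN changes how many values each intermediate mean is weighted by.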
@timholy’s Images.jl and meanfinite have fast, flexible dimensioned averages that ignore non-finite values.
I have reworked meanfinite as condmean in ConditionalMean to accumulate subject to an arbitrary condition (which could be to ignore sentinel values, e.g. NaN or -999). It can also average a callable function of the array. nanmean() and nanstd() methods are provided. Tests and pull requests are welcome.
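To illustrate the idea of averaging under an arbitrary condition, here is a rough sketch using only Statistics. It is not ConditionalMean's API; mean_where and is_valid_obs are hypothetical names, and -999 is just the example sentinel from above.

```julia
using Statistics

# Hypothetical helpers (not from ConditionalMean): average the elements
# that satisfy `keep`, optionally applying a function `f` to each one first.
mean_where(keep, x) = mean(Iterators.filter(keep, x))
mean_where(f, keep, x) = mean(f, Iterators.filter(keep, x))

# Example condition: drop NaN and the sentinel value -999.
is_valid_obs(v) = !isnan(v) && v != -999

data = [2.0, -999.0, NaN, 4.0]
mean_where(is_valid_obs, data)         # 3.0, the mean of 2.0 and 4.0
mean_where(abs2, is_valid_obs, data)   # 10.0, the mean of 4.0 and 16.0
```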
Now this devolves into an ease-of-use complaint: the unsettledness of standard(s) for how to handle missing data is an impediment to developing useful functions and methods. The long list of breaking, late-breaking, and deprecated ways to do this includes NA, missing, Null, Unions thereof, and sentinel values for numeric types (e.g. NaN).
This diversity and changing architecture lead to severe usability problems. Searching the discussions, I find different computer science and data science philosophies and practical reasons for one approach over another. I respect these arguments, but it’s hard to tell what works, what’s supported, deprecated, or broken.
I sympathize with the frustration about this issue, but note that things are in transition now because of the anticipated efficiency gain for small unions in v0.7. Hopefully the semantics will settle soon after the next stable release.
Good catch. At some point mapslices changed how it inputs dimensions. Apparently I never use that case in my code so I hadn’t noticed it. Fortunately the fix is simple, y ⟹ dims=y
When using mean(filter(!isnan, A), dims=2) for a 2D matrix A, the filter flattens it into a vector before taking the mean. How can I take a mean over the columns of a matrix while ignoring the NaN values?
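One way to do this, as a sketch along the lines of the mapslices approach discussed earlier in the thread (the nanmean definitions here are illustrative, with dims passed as a keyword as in Julia 1.0 and later):

```julia
using Statistics

# NaN-ignoring mean of a vector, applied slice by slice along `dims`.
nanmean(x) = mean(filter(!isnan, x))
nanmean(x, dims) = mapslices(nanmean, x; dims=dims)

A = [1.0 NaN 3.0;
     4.0 5.0 NaN]

nanmean(A, 1)  # 1×3 matrix, one mean per column: [2.5 5.0 3.0]
nanmean(A, 2)  # 2×1 matrix, one mean per row: [2.0; 4.5]
```

Because filter is applied to each slice separately, the matrix is never flattened as a whole.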
Cool! That worked for me, thanks! Seems like this should be achievable with skipmissing, but mean doesn’t seem to accept both a skipmissing wrapper and dims.