NA-ignoring aggregations

aplavin · December 30, 2018, 4:22pm

I often find myself in a need of aggregation functions which skip NA-like values, e.g. NaN or missing or nothing. There doesn’t seem to be a package yet providing even basic functions like sum, mean, median etc, so I rolled out a very simple wrapper myself, which can skip any specified value. The whole implementation is just a few lines of code:

using JuliennedArrays: julienne

agg(arr, func, dimscode; skip=undef) = _agg(arr, func, dimscode, skip)

_agg(arr, func, dimscode, skip::UndefInitializer) = map(func, julienne(arr, dimscode))
_agg(arr, func, dimscode, skip::Function) = map(a -> func(Iterators.filter(!skip, a)), julienne(arr, dimscode))
_agg(arr, func, dimscode, skip) = _agg(arr, func, dimscode, x -> x == skip)

and it works reasonably fast:

arr = rand(Float64, 100, 100, 100)
@btime agg(arr, mean, (:, *, *), skip=0.0);
@btime agg(arr, mean, (:, *, *));
@btime mean(arr, dims=1);

outputs

  1.317 ms (20007 allocations: 1015.83 KiB)
  352.940 μs (10004 allocations: 703.27 KiB)
  248.873 μs (3 allocations: 78.22 KiB)

One of the things I don’t like here is that I cannot specify aggregation axes by indices (dims=1 above) or names in case of AxisArray. Any reasonably simple way to implement this?

Another question - does it look general and useful enough to be included into some package?

ValdarT · December 30, 2018, 4:43pm

That’s what skipmissing() is there for. The reason you can’t find packages for it is that you shouldn’t need them - unlike in some other languages / frameworks, you don’t have special functions or arguments for functions to take care of this but you have generic helper functions. The upside of the Julia approach is that it’s generic and you don’t need to worry about special casing functions (e.g., handling this in the function definition or having two versions of the same function).

aplavin · December 30, 2018, 4:49pm

skipmissing works only for a very specific case, if I understand you correctly. Namely, when NAs are represented as missing and not NaN (or anything else) and when you need aggregation over the whole array.

ValdarT · December 30, 2018, 5:07pm

missing is what you should be using in Julia to represent missing values (NA in R or NULL in SQL); NaN is a different thing and its arguably bad that some systems conflate the two.

It returns an iterator so yes, it is used when you iterate over the array but how else would you apply an aggregation? As for the “whole” array, just filter the array as appropriate.

aplavin · December 30, 2018, 5:14pm

NA-like values other than missing can occur in various contexts, and one cannot just pretent such cases do not exist. Personally I often experience two scenarios. One is interoperability, where the other side uses NaN or even some sentinel value. Another is when the array represents some measurements, and nonpositive values represent failure to measure, so I want to just skip them when computing something. In both cases I cannot just replace the not-available measurements with missing, because the original values may be used by other computations.

It returns an iterator so yes, it is used when you iterate over the array but how else would you apply an aggregation?

Maybe I just don’t see the obvious, but how to compute mean(arr, dims=1) analogue ignoring missings using just skipmissing?

ValdarT · December 30, 2018, 6:15pm

Yes, of course other kinds of values to represent missing data will occur, I just meant that missing is the generic Julia way to represent and work with them. You can always convert from one type to the other, in case of NaN you can use, e.g., replace!(x -> isnan(x) ? missing : x, arr).

Ahh, now I get what you meant by the “whole array” before. The iterator, indeed, looses the shape of the array. It would actually be very nice if the mean(skipmissing(arr), dims=1) just worked but yes, it currently doesn’t. I guess the way to achieve it is mapslices(x -> mean(skipmissing(x)), arr, dims=1).

ValdarT · December 30, 2018, 6:19pm

By the way, the DataFrames library has many of such use cases nicely covered (although the main reason for it is to work with heterogenous data, of course).

aplavin · December 30, 2018, 6:22pm

You can always convert from one type to the other, in case of NaN you can use, e.g., replace!(x -> isnan(x) ? missing : x, arr) .

This requires making another copy of data, and makes some kinds of interoperability (when the memory layout is important) impossible inplace.

I guess the way to achieve it is mapslices(x -> mean(skipmissing(x)), arr, dims=1).

Yes, I know about that, but it’s much slower - see an older post of mine.

the DataFrames library has many of such use cases nicely covered

I though that dataframes are basically collections of 1D arrays, so don’t see how it can help with n-dimensional reductions.

ValdarT · December 30, 2018, 6:44pm

Hmm, didn’t know about the mapslices performance issue - hopefully it gets improved soon.

Depends on what you’re doing but colwise and by/ aggregate might be helpful depending on the context. I actually added this remark in case someone else stumbles on this thread but probably shouldn’t have as it isn’t directly relevant to your question as you rightly point out.

Paul_Soderlind · December 30, 2018, 7:03pm

I would like to bring up a related issue: how to handle NaN/missing in multivariate settings, for instance, in OLS.

The basic case is to fit the b in Y = X*b + residual, where y is a vector with T elements, X is a TxK matrix and b is a vector with K elements.

In case X[t,:] or Y[t] contains NaNs/missings, then a common approach is to drop row t from both Y and X and the proceed with the fitting.

While it’s pretty straightforward to code this up, a convenience function would be helpful and much appreciated by those doing statistics with Julia. It would be a sort of multivariate generalisation of skipmissing.

ValdarT · December 30, 2018, 8:07pm

There’s also an old issue for this problem open at Missings.jl.

For just dropping rows with missing values a good solution might be to just add dropmissing and completecases implementations for AbstractArray to Missings.jl (behaving the same way they do in DataFrames.jl)?

tkoolen · December 31, 2018, 6:37am

Maybe this calls for a MaskedArray <: AbstractArray? You can kind of do this using MappedArrays:

using MappedArrays
A = rand(Float64, 100, 100, 100)
A[:, :, 1] .= NaN
mean(mappedarray(x -> isnan(x) ? zero(x) : x, A), dims=1)

but the value with which to replace the NA-like value depends on the reduction.

aplavin · December 31, 2018, 11:14am

Replacing NaNs with zeros certainly changes mean value.
I agree that MaskedArray can really be useful for such cases. It probably requires way more Julia-skill to implement than the wrapper function I suggest in the first post. But it should be possible to make it general enough with transparent support of various AbstractArrays like AxisArray and memory-mapped arrays, and without extra memory usage.

tkoolen · December 31, 2018, 1:01pm

D’oh. I had sum initially and then replaced it with mean without thinking.

nalimilan · January 5, 2019, 4:29pm

This PR implements reductions over dimensions for skipmissing, but it has surprisingly received little support so far. It shouldn’t be hard to generalize to other values in the spirit of this other PR:
https://github.com/JuliaLang/julia/pull/28027

aplavin · January 5, 2019, 8:32pm

The general skip function is basically the same as filter, so as I understand the combination of ideas from these two PR is to support outputs of multidimensional filter calls in all aggregations. It’s probably easy for those using mapreduce, but much harder or impossible for functions from many separate libraries or custom ones. So this would certainly solve some of the issues, but the more I think about the more I like another approach.

Indeed, all the dims=... arguments and the proposed support for skipmissing are conceptually separate from the aggregation function implementation itself. So, it would be really convenient to have interface similar to what I propose in the first post here, which makes it easy to apply (1) any aggregation function written without even thinking of missing values or n-dim arrays (2) slice-wise to arrays of any dimensionality (3) keeping the AbstractArray class intact (not all dims-supporting functions do that automatically) and (4) easily skipping invalid values.
Of course, individual functions should have the option to manually implement the n-dimensional case for performance, but the general interface is very important.
Personally I use the agg function from the first post a lot recently. Even through I don’t like a few things about it, I didn’t find anything better.

Topic		Replies	Views
Ignoring NaN in elementwise aggregations General Usage question	7	6270	December 20, 2019
Statistics.mean() function with a Matrix containing missing values New to Julia	8	1167	February 6, 2023
Why are missing values not ignored by default? Internals & Design data , missing-values	330	8361	January 17, 2024
DataFrames, aggregate with missings Data dataframes	2	560	May 4, 2020
Arithmetic operations on multi-dimensional arrays with Missings New to Julia	4	834	June 27, 2018

NA-ignoring aggregations

Related topics