NA-ignoring aggregations


#1

I often find myself in a need of aggregation functions which skip NA-like values, e.g. NaN or missing or nothing. There doesn’t seem to be a package yet providing even basic functions like sum, mean, median etc, so I rolled out a very simple wrapper myself, which can skip any specified value. The whole implementation is just a few lines of code:

using JuliennedArrays: julienne

agg(arr, func, dimscode; skip=undef) = _agg(arr, func, dimscode, skip)

_agg(arr, func, dimscode, skip::UndefInitializer) = map(func, julienne(arr, dimscode))
_agg(arr, func, dimscode, skip::Function) = map(a -> func(Iterators.filter(!skip, a)), julienne(arr, dimscode))
_agg(arr, func, dimscode, skip) = _agg(arr, func, dimscode, x -> x == skip)

and it works reasonably fast:

arr = rand(Float64, 100, 100, 100)
@btime agg(arr, mean, (:, *, *), skip=0.0);
@btime agg(arr, mean, (:, *, *));
@btime mean(arr, dims=1);

outputs

  1.317 ms (20007 allocations: 1015.83 KiB)
  352.940 μs (10004 allocations: 703.27 KiB)
  248.873 μs (3 allocations: 78.22 KiB)

One of the things I don’t like here is that I cannot specify aggregation axes by indices (dims=1 above) or names in case of AxisArray. Any reasonably simple way to implement this?

Another question - does it look general and useful enough to be included into some package?


How to calculate a weighted mean with missing observations
#2

That’s what skipmissing() is there for. The reason you can’t find packages for it is that you shouldn’t need them - unlike in some other languages / frameworks, you don’t have special functions or arguments for functions to take care of this but you have generic helper functions. The upside of the Julia approach is that it’s generic and you don’t need to worry about special casing functions (e.g., handling this in the function definition or having two versions of the same function).


#3

skipmissing works only for a very specific case, if I understand you correctly. Namely, when NAs are represented as missing and not NaN (or anything else) and when you need aggregation over the whole array.


#4

missing is what you should be using in Julia to represent missing values (NA in R or NULL in SQL); NaN is a different thing and its arguably bad that some systems conflate the two.

It returns an iterator so yes, it is used when you iterate over the array but how else would you apply an aggregation? As for the “whole” array, just filter the array as appropriate.


#5

NA-like values other than missing can occur in various contexts, and one cannot just pretent such cases do not exist. Personally I often experience two scenarios. One is interoperability, where the other side uses NaN or even some sentinel value. Another is when the array represents some measurements, and nonpositive values represent failure to measure, so I want to just skip them when computing something. In both cases I cannot just replace the not-available measurements with missing, because the original values may be used by other computations.

It returns an iterator so yes, it is used when you iterate over the array but how else would you apply an aggregation?

Maybe I just don’t see the obvious, but how to compute mean(arr, dims=1) analogue ignoring missings using just skipmissing?


#6

Yes, of course other kinds of values to represent missing data will occur, I just meant that missing is the generic Julia way to represent and work with them. You can always convert from one type to the other, in case of NaN you can use, e.g., replace!(x -> isnan(x) ? missing : x, arr).

Ahh, now I get what you meant by the “whole array” before. The iterator, indeed, looses the shape of the array. It would actually be very nice if the mean(skipmissing(arr), dims=1) just worked but yes, it currently doesn’t. I guess the way to achieve it is mapslices(x -> mean(skipmissing(x)), arr, dims=1).


#7

By the way, the DataFrames library has many of such use cases nicely covered (although the main reason for it is to work with heterogenous data, of course).


#8

You can always convert from one type to the other, in case of NaN you can use, e.g., replace!(x -> isnan(x) ? missing : x, arr) .

This requires making another copy of data, and makes some kinds of interoperability (when the memory layout is important) impossible inplace.

I guess the way to achieve it is mapslices(x -> mean(skipmissing(x)), arr, dims=1).

Yes, I know about that, but it’s much slower - see an older post of mine.

the DataFrames library has many of such use cases nicely covered

I though that dataframes are basically collections of 1D arrays, so don’t see how it can help with n-dimensional reductions.


#9

Hmm, didn’t know about the mapslices performance issue - hopefully it gets improved soon.

Depends on what you’re doing but colwise and by/ aggregate might be helpful depending on the context. I actually added this remark in case someone else stumbles on this thread but probably shouldn’t have as it isn’t directly relevant to your question as you rightly point out.


#10

I would like to bring up a related issue: how to handle NaN/missing in multivariate settings, for instance, in OLS.

The basic case is to fit the b in Y = X*b + residual, where y is a vector with T elements, X is a TxK matrix and b is a vector with K elements.

In case X[t,:] or Y[t] contains NaNs/missings, then a common approach is to drop row t from both Y and X and the proceed with the fitting.

While it’s pretty straightforward to code this up, a convenience function would be helpful and much appreciated by those doing statistics with Julia. It would be a sort of multivariate generalisation of skipmissing.


#11

There’s also an old issue for this problem open at Missings.jl.

For just dropping rows with missing values a good solution might be to just add dropmissing and completecases implementations for AbstractArray to Missings.jl (behaving the same way they do in DataFrames.jl)?


#12

Maybe this calls for a MaskedArray <: AbstractArray? You can kind of do this using MappedArrays:

using MappedArrays
A = rand(Float64, 100, 100, 100)
A[:, :, 1] .= NaN
mean(mappedarray(x -> isnan(x) ? zero(x) : x, A), dims=1)

but the value with which to replace the NA-like value depends on the reduction.


#13

Replacing NaNs with zeros certainly changes mean value.
I agree that MaskedArray can really be useful for such cases. It probably requires way more Julia-skill to implement than the wrapper function I suggest in the first post. But it should be possible to make it general enough with transparent support of various AbstractArrays like AxisArray and memory-mapped arrays, and without extra memory usage.


#14

D’oh. I had sum initially and then replaced it with mean without thinking.


#15

This PR implements reductions over dimensions for skipmissing, but it has surprisingly received little support so far. It shouldn’t be hard to generalize to other values in the spirit of this other PR:


#16

The general skip function is basically the same as filter, so as I understand the combination of ideas from these two PR is to support outputs of multidimensional filter calls in all aggregations. It’s probably easy for those using mapreduce, but much harder or impossible for functions from many separate libraries or custom ones. So this would certainly solve some of the issues, but the more I think about the more I like another approach.

Indeed, all the dims=... arguments and the proposed support for skipmissing are conceptually separate from the aggregation function implementation itself. So, it would be really convenient to have interface similar to what I propose in the first post here, which makes it easy to apply (1) any aggregation function written without even thinking of missing values or n-dim arrays (2) slice-wise to arrays of any dimensionality (3) keeping the AbstractArray class intact (not all dims-supporting functions do that automatically) and (4) easily skipping invalid values.
Of course, individual functions should have the option to manually implement the n-dimensional case for performance, but the general interface is very important.
Personally I use the agg function from the first post a lot recently. Even through I don’t like a few things about it, I didn’t find anything better.