Statistics.mean() function with a Matrix containing missing values

I have a matrix with missing values and I want to calculate the column means. I’m not sure how to drop the missing values.

using Statistics

vec = [1, missing, 2]
mean(vec)                 # returns missing
mean(skipmissing(vec))    # gives 1.5, which is what I want, but for 2 dimensions

# now I have a Matrix A with missing
A = [1 5
       6 missing]

# unsure how to calculate column means
Statistics.mean(A, dims=1)    # error for missing value
Statistics.mean(skipmissing(A), dims=1)   # error

Everywhere I look I can only find vector examples… I don’t see any options in the mean function to drop missing in 2d. Any help would be appreciated.

skipmissing returns a linear iterator (one single axis) without allocating a copy:

julia> skipmissing(A) |> collect |> size
(3,)

The dims keyword does not work in that case because we have a single axis.
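To make the flattening concrete, here is a small sketch using the matrix from the question. It only uses `skipmissing`, `collect`, and `mean`; the variable name `flat` is just for illustration.

```julia
using Statistics

A = [1 5
     6 missing]

# skipmissing walks the matrix in column-major order and drops the missing
# entry, so the two-dimensional shape is gone once the values are collected.
flat = collect(skipmissing(A))   # [1, 6, 5]

# With a single axis there is only one mean to compute: the mean over all
# non-missing entries, not one mean per column.
overall = mean(skipmissing(A))   # 4.0
```

So `skipmissing(A)` is fine when you want one number for the whole matrix, but it has no rows or columns left for `dims` to refer to.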

You can make an iterator that goes over each slice you care about (with eachslice or eachcol or eachrow) and then broadcast skipmissing and mean on the resulting iterator of single-axis arrays.

mean.(skipmissing.(eachrow(A)))

There might be a cleaner way to do it. The underlying issue is that skipmissing cannot return a multi-axis array, because different rows might need a different number of skips due to a different number of missings.
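A quick sketch of that "different number of skips" point, using a made-up matrix `B` (not from the original question) where the rows have different numbers of missings:

```julia
using Statistics

# Hypothetical example matrix: row 1 has no missings, row 2 has two.
B = [1       2 3
     missing 5 missing]

# After dropping missings, each row has a different length, so the result
# cannot be stored as a rectangular matrix.
rows = [collect(skipmissing(r)) for r in eachrow(B)]
row_lengths = length.(rows)                     # [3, 1]

# Taking the mean slice by slice sidesteps the problem.
row_means = mean.(skipmissing.(eachrow(B)))     # [2.0, 5.0]
```

Each slice gets its own `skipmissing` iterator, so it does not matter that the slices end up with different lengths.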

Related issue: sum and mean of skipmissings don't accept the dims kwarg · Issue #40081 · JuliaLang/julia · GitHub

Perfect, this line worked:

mean.(skipmissing.(eachrow(A')))

(Small edit: I needed column means, so I take the transpose of my matrix, which is individuals (3k) by SNPs (45k).)

Okay, this was my worry. I hope the Statistics package improves this soon to deal with missing values, as this is more work than it needs to be (imo…). Thank you for your help!

Thank you very much for your suggestion here, I will read this over now.

There is also eachcol which gives an iterator over columns. It does not really matter whether you transpose or whether you switch from eachrow to eachcol.
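For the small matrix from the question, the two spellings can be compared side by side. This is only a sketch on a 2×2 example; the variable names are for illustration.

```julia
using Statistics

A = [1 5
     6 missing]

# Column means by iterating over columns directly.
col_means = mean.(skipmissing.(eachcol(A)))       # [3.5, 5.0]

# Same result by iterating over the rows of the (lazy) transpose.
via_transpose = mean.(skipmissing.(eachrow(A')))  # [3.5, 5.0]

same = col_means == via_transpose                 # true
```

`eachcol(A)` reads a bit more directly when column means are what you are after.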

Oh shoot, I tried eachcolumn() and it didn’t work. Thanks, I’ll use eachcol().

Well, you lose some performance in mean.(skipmissing.(eachcol(A))) compared to a potential mean(skipmissing(A), dims=1), but the former is more general: substitute any aggregation for mean and it’ll work, without special support from the function.
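As a rough sketch of that generality: the helper `colwise` below is a hypothetical name made up for this example (it is not part of Statistics or Base), and the 3×3 matrix is invented so that every column keeps at least one value.

```julia
using Statistics

# Made-up example data; each column has at least one non-missing entry.
A = [1 5       2
     6 missing 4
     8 9       missing]

# Hypothetical helper: apply any reduction `f` to each column after
# dropping its missing values.
colwise(f, M) = [f(skipmissing(c)) for c in eachcol(M)]

col_means   = colwise(mean, A)      # [5.0, 7.0, 3.0]
col_medians = colwise(median, A)    # [6.0, 7.0, 3.0]
col_maxima  = colwise(maximum, A)   # [8, 9, 4]

# If the 1×n row shape that a dims=1 reduction returns is wanted,
# the vector can be turned into a row afterwards.
row_shaped = permutedims(col_means) # 1×3 Matrix
```

The same pattern works for any function that accepts an iterator. A column that is entirely missing would need separate handling, since some reductions cannot deal with an empty collection.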

Anyway, there’s a long-stalled PR linked from the issue above (Support mapreduce over dimensions with SkipMissing by nalimilan · Pull Request #28027 · JuliaLang/julia · GitHub), so you may wish to update/promote it if this feature seems important.

Oh I see… Thank you for this information. Well then, they must be aware of it. I’m not much for development, I’m still trying to learn the basics. Julia is kind of a beast to learn compared to R. Thanks for all your help.
