Extending `dropmissing` to AbstractArrays

Dan · November 26, 2023, 4:40pm

A discussion of skipmissing in another thread made me think that dropmissing would be a convenient way to get rid of those pesky missings. DataFrames has dropmissing, but regular Arrays don’t. How could it be extended to them, and to higher-order tensors? That is the question.

Here is an initial suggestion:

julia> using Random, Missings

julia> Random.seed!(345);

julia> q = [rand() < 0.2 ? missing : rand() for i in 1:6,j in 1:6]
6×6 Matrix{Union{Missing, Float64}}:
 0.956056  0.567035  0.401763  0.457714   0.361413   0.55555
 0.810208  0.595364  0.734469  0.591841   0.75449    0.241072
  missing  0.651234  0.683582  0.362204   0.908998   0.811716
  missing  0.814048  0.752948  0.194032   0.32284     missing
 0.336461  0.167354  0.591084  0.37456    0.0637121  0.16276
 0.574073  0.363261  0.59856   0.0997761  0.364427   0.695076

julia> foldl((r,k)->stack(filter(s->!any(ismissing.(s)), eachslice(r, dims=k))), 1:ndims(q); init=q)
6×4 Matrix{Union{Missing, Float64}}:
 0.956056  0.810208  0.336461   0.574073
 0.567035  0.595364  0.167354   0.363261
 0.401763  0.734469  0.591084   0.59856
 0.457714  0.591841  0.37456    0.0997761
 0.361413  0.75449   0.0637121  0.364427
 0.55555   0.241072  0.16276    0.695076

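The foldl removes any ‘slice’ which contains a missing. But it’s wrong: all the missings are already removed in the first round (here by dropping rows, which stack then lays out as columns), so no columns are ever dropped and the result depends on the order of the dimensions. How can this be fixed?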
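To make the order dependence concrete, here is a small sketch on a 2×2 matrix. The helper name `dropslices` and the `order` argument are made up for illustration, and unlike the one-liner above it passes `dims=k` to `stack` so the kept slices go back where they came from (this assumes Julia 1.9 or later, where `stack` is available).

```julia
# Hypothetical helper: the fold from the post, with the dimension order as a parameter.
# `stack(...; dims=k)` puts each kept slice back along dimension k (no transposition).
dropslices(a, order) = foldl(
    (r, k) -> stack(filter(s -> !any(ismissing, s), eachslice(r, dims=k)); dims=k),
    order; init=a)

m = [1 missing; 3 4]

rows_first = dropslices(m, (1, 2))   # drops row 1, after which no missing is left
cols_first = dropslices(m, (2, 1))   # drops column 2, after which no missing is left

size(rows_first), size(cols_first)   # the two orders give different shapes
```

Neither order drops both row 1 and column 2, which is what a symmetric dropmissing would presumably do.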

ADDED:

For example, this is more correct, but it might still be wasteful for very tall or wide matrices, and it allocates somewhat:

julia> nonmissingtype(eltype(q)).(
  q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], 
  1:ndims(q))...])
4×4 Matrix{Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

This allocates less by using a view (which might not be good enough for many downstream uses):

@view q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], 1:ndims(q))...]

Now the problem is speed, as the comprehension reads the matrix ndims times, instead of once.
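For comparison, the same per-dimension masks can probably also be written as reductions instead of slice comprehensions. This is only a sketch: `keepmasks` is a made-up name, and it still scans the array once per dimension, so it does not address the speed concern.

```julia
# Hypothetical helper: for each dimension d, reduce `ismissing` over all the
# other dimensions, then negate to get a "keep this index" mask.
function keepmasks(q::AbstractArray)
    N = ndims(q)
    return ntuple(N) do d
        others = Tuple(setdiff(1:N, d))              # every dimension except d
        vec(.!any(ismissing, q; dims=others))        # true where slice d has no missing
    end
end

m = [1 missing; 3 4]
masks = keepmasks(m)
kept = m[masks...]    # logical indexing with one mask per dimension
```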

ParadaCarleton · November 26, 2023, 6:00pm

You might be looking for Filter(!ismissing) (using Transducers.jl).

Dan · November 26, 2023, 6:06pm

Don’t think so: it doesn’t keep the dimensionality of a tensor (which requires dropping some non-missing elements too).
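A tiny illustration of the difference, using skipmissing from Base as a stand-in for element-wise filtering (the Transducers.jl version is not shown here) and the mask indexing from the earlier post:

```julia
m = [1 missing; 3 4]

# Element-wise filtering keeps every non-missing value but flattens the matrix.
flat = collect(skipmissing(m))

# Slice-wise dropping keeps the matrix shape, at the price of also discarding
# the non-missing values that share a row or column with a missing.
keep = ntuple(i -> [!any(ismissing, s) for s in eachslice(m, dims=i)], ndims(m))
kept = m[keep...]

ndims(flat), ndims(kept)
```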

rafael.guerra · November 26, 2023, 8:52pm

One alternative that seems to allocate a bit less:

q[ntuple(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], ndims(q))...]

Dan · November 26, 2023, 9:53pm

Yes. ntuple is better than a Vector. But the crux would be to scan the source Array (tensor) just once to find all the relevant subsets. I’ll try working it out again…

A FEW MINS LATER:
This is a function which goes through the tensor once:

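function tensordropmissing(Q::AbstractArray{T,N}) where {T,N}
    s = size(Q)
    # one BitVector per dimension: goodidxs[k][i] stays true while slice i along dimension k has no missing
    goodidxs = ntuple(i->trues(s[i]),N)
    for I in CartesianIndices(Q)
        if ismissing(Q[I])
            # a missing at I rules out the slice through I along every dimension
            for k in 1:N
                goodidxs[k][I[k]] = false
            end
        end
    end
    # a view avoids copying; the masks assume 1-based axes (e.g. a plain Array)
    return @view Q[goodidxs...]
end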
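Since the question also asked about higher-order tensors, here is a quick check on a 3-D array. The definition is repeated so the snippet runs on its own, and the 2×3×4 data is made up for the example.

```julia
# Definition repeated from the post above so that this snippet is self-contained.
function tensordropmissing(Q::AbstractArray{T,N}) where {T,N}
    s = size(Q)
    goodidxs = ntuple(i -> trues(s[i]), N)
    for I in CartesianIndices(Q)
        if ismissing(Q[I])
            for k in 1:N
                goodidxs[k][I[k]] = false
            end
        end
    end
    return @view Q[goodidxs...]
end

# Illustrative 2×3×4 array with a single missing at index (1, 2, 3).
A = Array{Union{Missing,Int}}(reshape(1:24, 2, 3, 4))
A[1, 2, 3] = missing

V = tensordropmissing(A)             # view without index 1, 2, 3 along dims 1, 2, 3
D = nonmissingtype(eltype(A)).(V)    # dense copy with the Missing stripped from the eltype
size(V)
```

A single missing removes one slice along each of the three dimensions, so most of the discarded entries are non-missing ones.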

With it, and q as in OP:

julia> tensordropmissing(q)
4×4 view(::Matrix{Union{Missing, Float64}}, [1, 2, 5, 6], [2, 3, 4, 5]) with eltype Union{Missing, Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

julia> nonmissingtype(eltype(q)).(tensordropmissing(q))
4×4 Matrix{Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

and almost no allocation (@btime is from BenchmarkTools):

julia> @btime tensordropmissing($q);
  187.650 ns (6 allocations: 384 bytes)
