Extending `dropmissing` to AbstractArrays

In another thread there was a discussion of `skipmissing`, and it prompted me to think about `dropmissing`, a convenient way to get rid of those pesky `missing`s. DataFrames has `dropmissing`, but regular `Array`s don’t. How could it be extended to them? And to higher-order tensors? That is the question.

Here is an initial suggestion:

julia> using Random, Missings

julia> Random.seed!(345);

julia> q = [rand() < 0.2 ? missing : rand() for i in 1:6,j in 1:6]
6×6 Matrix{Union{Missing, Float64}}:
 0.956056  0.567035  0.401763  0.457714   0.361413   0.55555
 0.810208  0.595364  0.734469  0.591841   0.75449    0.241072
  missing  0.651234  0.683582  0.362204   0.908998   0.811716
  missing  0.814048  0.752948  0.194032   0.32284     missing
 0.336461  0.167354  0.591084  0.37456    0.0637121  0.16276
 0.574073  0.363261  0.59856   0.0997761  0.364427   0.695076

julia> foldl((r,k)->stack(filter(s->!any(ismissing.(s)), eachslice(r, dims=k))), 1:ndims(q); init=q)
6×4 Matrix{Union{Missing, Float64}}:
 0.956056  0.810208  0.336461   0.574073
 0.567035  0.595364  0.167354   0.363261
 0.401763  0.734469  0.591084   0.59856
 0.457714  0.591841  0.37456    0.0997761
 0.361413  0.75449   0.0637121  0.364427
 0.55555   0.241072  0.16276    0.695076

The `foldl` removes any slice which contains a `missing`. But it’s wrong: all the `missing`s are already gone after the first pass, so the later dimensions are never filtered, and the result depends on the order in which the dimensions are visited.
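For instance (just to demonstrate the problem, not to fix it), visiting the dimensions in reverse order keeps a different subset of the entries:

# same fold, but filtering along the last dimension first; the surviving
# entries (and the resulting shape) differ from the 1:ndims(q) order above
foldl((r,k)->stack(filter(s->!any(ismissing, s), eachslice(r, dims=k))), ndims(q):-1:1; init=q)

How can this be fixed?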

ADDED:

For example, this is more correct, but it allocates somewhat and may still be wasteful for very tall or wide matrices:

julia> nonmissingtype(eltype(q)).(
           q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)],
                 1:ndims(q))...])
4×4 Matrix{Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

This allocates less, by using a view (which might not be good enough for some downstream uses):

@view q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], 1:ndims(q))...]

Now the problem is speed: the comprehensions read the matrix `ndims(q)` times instead of once.


You might be looking for `Filter(!ismissing)` (using Transducers.jl).

Don’t think so:

it doesn’t keep the dimensionality of the tensor (keeping a rectangular shape requires dropping some non-missing elements too).
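
A minimal illustration of that point with Base’s `skipmissing`, which flattens in the same way:

collect(skipmissing(q))  # a flat Vector{Float64}; the 6×6 shape is lost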

One alternative that seems to allocate a bit less:

q[ntuple(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], ndims(q))...]

Yes, `ntuple` is better than a `Vector` here. But the crux would be to scan the source `Array` (tensor) just once to find all the relevant index subsets. I’ll try working it out again…

A FEW MINUTES LATER:
Here is a function which goes through the tensor only once:

function tensordropmissing(Q::AbstractArray{T,N}) where {T,N}
    s = size(Q)
    # one Boolean mask per dimension; all indices start out "good"
    goodidxs = ntuple(i->trues(s[i]), N)
    # single pass: every missing invalidates its index along each dimension
    for I in CartesianIndices(Q)
        if ismissing(Q[I])
            for k in 1:N
                goodidxs[k][I[k]] = false
            end
        end
    end
    return @view Q[goodidxs...]
end

With it, and `q` as in the OP:

julia> tensordropmissing(q)
4×4 view(::Matrix{Union{Missing, Float64}}, [1, 2, 5, 6], [2, 3, 4, 5]) with eltype Union{Missing, Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

julia> nonmissingtype(eltype(q)).(tensordropmissing(q))
4×4 Matrix{Float64}:
 0.567035  0.401763  0.457714   0.361413
 0.595364  0.734469  0.591841   0.75449
 0.167354  0.591084  0.37456    0.0637121
 0.363261  0.59856   0.0997761  0.364427

and almost no allocation:

julia> @btime tensordropmissing($q);
  187.650 ns (6 allocations: 384 bytes)
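
The function is written for any `N`, so it applies unchanged to higher-order tensors. For example, a quick 3-D check (freshly generated random data, so the surviving size will vary from run to run):

q3 = [rand() < 0.05 ? missing : rand() for i in 1:5, j in 1:5, k in 1:5]
tensordropmissing(q3)  # a view that drops, along each of the 3 dims, every slice touching a missing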