In another thread there was a discussion of skipmissing, which prompted me to think that dropmissing is a convenient way to get rid of those pesky missings. DataFrames has dropmissing, but regular Arrays don't. How could it be extended to arrays, and to higher-order tensors? That is the question.
Here is an initial suggestion:
julia> using Random, Missings
julia> Random.seed!(345);
julia> q = [rand() < 0.2 ? missing : rand() for i in 1:6,j in 1:6]
6×6 Matrix{Union{Missing, Float64}}:
0.956056 0.567035 0.401763 0.457714 0.361413 0.55555
0.810208 0.595364 0.734469 0.591841 0.75449 0.241072
missing 0.651234 0.683582 0.362204 0.908998 0.811716
missing 0.814048 0.752948 0.194032 0.32284 missing
0.336461 0.167354 0.591084 0.37456 0.0637121 0.16276
0.574073 0.363261 0.59856 0.0997761 0.364427 0.695076
julia> foldl((r,k)->stack(filter(s->!any(ismissing.(s)), eachslice(r, dims=k))), 1:ndims(q); init=q)
6×4 Matrix{Union{Missing, Float64}}:
0.956056 0.810208 0.336461 0.574073
0.567035 0.595364 0.167354 0.363261
0.401763 0.734469 0.591084 0.59856
0.457714 0.591841 0.37456 0.0997761
0.361413 0.75449 0.0637121 0.364427
0.55555 0.241072 0.16276 0.695076
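The foldl removes any 'slice' which contains a missing. But it's wrong: the first round (dims=1) already removes every missing, so the later rounds find nothing left to drop, and the result depends on the order of the dimensions. (The output is also transposed, since stack places each kept row as a column.) How can this be fixed?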
ADDED:
For example, this is more correct, since every mask is computed from the original q, but it might still be wasteful for very tall or wide matrices, and it allocates somewhat:
julia> nonmissingtype(eltype(q)).(
q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)],
1:ndims(q))...])
4×4 Matrix{Float64}:
0.567035 0.401763 0.457714 0.361413
0.595364 0.734469 0.591841 0.75449
0.167354 0.591084 0.37456 0.0637121
0.363261 0.59856 0.0997761 0.364427
This allocates less by using a view, which might not be good enough for many downstream uses (the element type still includes Missing):
@view q[map(i->[!any(ismissing,s) for s in eachslice(q, dims=i)], 1:ndims(q))...]
Now the problem is speed, as the comprehension reads the matrix ndims(q) times instead of once.
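One possible way to read the array only once is to walk its CartesianIndices and, for each missing found, mark the corresponding index along every dimension as dropped. The following is only a sketch: the name dropmissing_slices is hypothetical (it is not part of Base, Missings or DataFrames), it assumes ordinary 1-based arrays, and it has not been benchmarked against the versions above.

```julia
# Hypothetical helper, not an existing API: drop every slice, along every
# dimension, that contains at least one missing. Assumes 1-based indexing.
function dropmissing_slices(A::AbstractArray{T,N}) where {T,N}
    # keep[d][i] stays true as long as index i along dimension d
    # has not been seen together with a missing value
    keep = ntuple(d -> trues(size(A, d)), Val(N))
    for I in CartesianIndices(A)      # single pass over the data
        if ismissing(A[I])
            for d in 1:N
                keep[d][I[d]] = false
            end
        end
    end
    # logical indexing with one mask per dimension, then narrow the eltype
    return nonmissingtype(T).(A[keep...])
end

q = [1.0 missing 3.0;
     4.0 5.0     6.0;
     7.0 8.0     9.0]
r = dropmissing_slices(q)   # drops row 1 and column 2
```

The masks are still allocated (one BitVector per dimension), but their total length is only the sum of the dimension sizes, and the data itself is scanned a single time before the final indexing step. Replacing the last line with `@view A[keep...]` would give the view variant, with the same caveat about the element type.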