MaskArrays.jl is a small package for working with arrays containing missing values. It does this by expressing the array in four pieces:
- A “base array” with no missing values. For example, the base array for a `Matrix{Union{Float64, Missing}}` will have type `Matrix{Float64}`.
- A vector of `imputed` values, which comprise a view into the array at the missing components.
- An `indices` `Dict` mapping missing-valued `CartesianIndices` of the original array to their `Int` index in the `imputed` values.
- A “detachable `buffer`”, which is just another vector that copies into `imputed` upon a call to `sync!`. The buffer is useful for algorithms where imputation needs to be stored in a packed array separate from the base (see the sketch below).
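To make these pieces concrete, here's a minimal sketch. The `maskarray` and `imputed` calls appear in the demo below; the buffer-attachment API and the `sync!` signature are not shown in this demo, so the comments about them are assumptions based on the description above.

```julia
using MaskArrays

x = Vector{Union{Missing, Float64}}([1.0, missing, 3.0, missing])
ma = maskarray(x)            # base array is a plain Vector{Float64}

imputed(ma)                  # view into the two missing slots
imputed(ma) .= [2.0, 4.0]    # fill the missing values in place

# A detachable buffer holds candidate values in a separate packed vector;
# per the description above, sync!(ma) copies it into imputed(ma).
# (The buffer-attachment API and sync! signature are assumed here.)
```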
For a little demo, let’s try a very naive approach to sampling from a conditional `MvNormal`. First we load some packages:
```julia
julia> using MaskArrays, Distributions
```
Now build an `MvNormal`, using `z' * z` so the covariance matrix is symmetric and positive definite:
```julia
julia> begin
           z = randn(6,5)
           Σ = z' * z
           d = MvNormal(Σ)
       end
ZeroMeanFullNormal(
dim: 5
μ: 5-element Zeros{Float64}
Σ: [4.965238439513922 3.7198256058163923 … 1.1301380123231672 3.5765175753003065; 3.7198256058163923 4.26724167976455 … -1.3164975530435497 4.019398950615648; … ; 1.1301380123231672 -1.3164975530435497 … 4.283220719896374 -2.4998641240897035; 3.5765175753003065 4.019398950615648 … -2.4998641240897035 8.543626118929604]
)
```
Sample from it and “forget” some values:
```julia
julia> begin
           x = Vector{Union{Missing, Float64}}(undef, 5)
           x .= rand(d)
           x[[2,4]] .= missing
           x
       end
5-element Vector{Union{Missing, Float64}}:
 -0.4451772506128395
  missing
 -0.8788532569308308
  missing
 -0.5801365606645945
```
Now we’ll use `d` and the known values to sample from the unknowns. So we build a `MaskArray`:
```julia
julia> ma = maskarray(x)
5-element MaskArray{Float64,1}:
 -0.4451772506128395
  7.748604185489348e-304
 -0.8788532569308308
  7.748604185489348e-304
 -0.5801365606645945
```
Note that we haven’t lost track of what’s known vs unknown:
```julia
julia> imputed(ma)
2-element view(::Vector{Float64}, CartesianIndex{1}[CartesianIndex(2,), CartesianIndex(4,)]) with eltype Float64:
 7.748604185489348e-304
 7.748604185489348e-304
```
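Because `imputed(ma)` is a view into the base array, assigning to it mutates `ma` in place. For example (an illustrative continuation of the session; this output is not from the original transcript):

```julia
julia> imputed(ma) .= [0.0, 0.0];

julia> ma[2], ma[4]
(0.0, 0.0)
```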
Now for imputation. We’ll do something simple just to demonstrate the concept, and we’ll use TupleVectors.jl because it’s fun. Here we go again, loading packages:
```julia
julia> using TupleVectors, Sobol, StatsFuns, StatsBase
```
Now we’ll sample an iid normal proposal and assign log-weights. I don’t have fancy methods yet for sampling from Distributions (only MeasureTheory), so for now the inverse transform sampling is explicit.
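As a reminder of how that works: if U ~ Uniform(0,1) and Φ is the standard normal CDF, then Φ⁻¹(U) ~ Normal(0,1). StatsFuns exposes Φ⁻¹ as `norminvcdf`, so a single proposal draw looks like this sketch:

```julia
using StatsFuns

# Inverse transform sampling: push a uniform draw through the normal
# quantile function, then scale to widen the proposal.
u = rand()          # U ~ Uniform(0,1); the demo uses Sobol points instead
z = norminvcdf(u)   # z ~ Normal(0, 1)
x = 5 * z           # x ~ Normal(0, 25), a deliberately wide proposal
```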
```julia
julia> # Now sample and log-weight the result
       ω = SobolHypercube(2)
SobolHypercube{2}(2-dimensional Sobol sequence on [0,1]^2, [0.5, 0.5], Base.RefValue{Int64}(0))
```
```julia
julia> s = @with (;ω) 10000 begin
           # A very naive proposal
           x = 5 * norminvcdf(rand(ω))
           y = 5 * norminvcdf(rand(ω))
           imputed(ma) .= [x,y]

           # The log-weight
           ℓ = logpdf(d, ma)
           (;ℓ, x, y)
       end
10000-element TupleVector with schema (ℓ = Float64, x = Float64, y = Float64)
(ℓ = -95.0±110.0, x = -0.0±5.0, y = -0.0±5.0)
```
Finally, we can use the weights to resample and plot the result:
```julia
julia> rs = TupleVector(sample(s, Weights(exp.(s.ℓ)), 1000))
1000-element TupleVector with schema (ℓ = Float64, x = Float64, y = Float64)
(ℓ = -7.0±0.97, x = -0.89±0.57, y = 0.7±0.8)
```
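One caveat: exponentiating raw log-weights can underflow to zero when `ℓ` is very negative. A hypothetical variant of the resampling step above normalizes in log space first, using `logsumexp` from StatsFuns:

```julia
using StatsFuns: logsumexp
using StatsBase: Weights, sample

# Subtract logsumexp(s.ℓ) before exponentiating so the weights sum to 1
# without underflowing to zero.
w = exp.(s.ℓ .- logsumexp(s.ℓ))
rs = TupleVector(sample(s, Weights(w), 1000))
```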
```julia
julia> using UnicodePlots
```
```julia
julia> @with TupleVectors.unwrap(rs) begin
           scatterplot(x, y)
       end
   ┌────────────────────────────────────────┐
 4 │⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⢂⠀⠀⠀⠀⠀⠈⠀⠠⢀⠀⠠⣀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠈⠀⠀⠀⠀⠀⠄⠢⢚⠨⡢⠑⢍⠐⢐⡂⢀⠀⢀⢀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⢠⢠⠃⠒⡤⡢⡌⡢⠖⡀⢂⢔⢀⠀⠀⢊⡐⡀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠐⡀⠎⢅⠒⣔⡱⡐⠤⣋⢔⠭⡈⠱⢇⢌⠀⠂⠀⡇⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⢀⠀⠀⢀⠩⠂⡈⡌⡴⠊⢄⡵⠢⢩⡳⠰⢥⢩⡒⡀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⠀⠉⠀⠌⢊⣄⢋⢎⠴⣘⠙⡤⠖⢙⡩⡔⡕⡠⡷⠡⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠤⠤⠤⠤⠤⠤⠤⠤⠤⡬⠦⠴⠬⡤⠦⢦⡭⠦⢴⣭⠦⡼⣬⡽⡧⣧⠵⡦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│
   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠠⠀⠂⠈⢈⠴⠌⣎⡔⠪⡌⡌⡯⠀⡣⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠃⢊⠤⢊⠁⠴⠘⡗⠂⢂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠁⠀⠀⠀⠀⠀⠀⠐⠀⡇⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀│
-3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
   └────────────────────────────────────────┘
   -3                                      2
```