[ANN] MaskArrays.jl

MaskArrays.jl is a small package for working with arrays containing missing values. It represents such an array in four pieces:

  1. A “base array” with no missing values. For example, the base array for a Matrix{Union{Float64, Missing}} will have type Matrix{Float64}.
  2. A vector of imputed values, which is a view into the base array at the components that were missing
  3. An indices Dict mapping each missing-valued CartesianIndex of the original array to its Int position in the imputed vector
  4. A “detachable buffer”, which is just another vector whose contents are copied into imputed on a call to sync!.

The buffer is useful for algorithms where imputation needs to be stored in a packed array separate from the base.
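To make this concrete, here’s a minimal sketch of how the pieces fit together. Only maskarray and imputed appear verbatim later in this post; the one-argument sync! form is my assumption:

julia> using MaskArrays

julia> x = [1.0, missing, 3.0];

julia> ma = maskarray(x);    # builds the base array, imputed view, indices Dict, and buffer

julia> imputed(ma) .= 2.0;   # write into the missing slot through the packed view

julia> # An algorithm can instead fill the detachable buffer and then call
julia> # sync!(ma) to copy its contents into imputed (call form assumed)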

For a little demo, let’s try a very naive approach to sampling a conditional MvNormal. First we load some packages:

julia> using MaskArrays, Distributions

Now build an MvNormal:

julia> begin
           z = randn(6,5)
           Σ = z' * z
           d = MvNormal(Σ)
       end
ZeroMeanFullNormal(
dim: 5
μ: 5-element Zeros{Float64}
Σ: [4.965238439513922 3.7198256058163923 … 1.1301380123231672 3.5765175753003065; 3.7198256058163923 4.26724167976455 … -1.3164975530435497 4.019398950615648; … ; 1.1301380123231672 -1.3164975530435497 … 4.283220719896374 -2.4998641240897035; 3.5765175753003065 4.019398950615648 … -2.4998641240897035 8.543626118929604]
)

Sample from it and “forget” some values:

julia> begin
           x = Vector{Union{Missing, Float64}}(undef, 5)
           x .= rand(d)
           x[[2,4]] .= missing
           x
       end
5-element Vector{Union{Missing, Float64}}:
 -0.4451772506128395
   missing
 -0.8788532569308308
   missing
 -0.5801365606645945

Now we’ll use d and the known values to sample the unknowns, so we build a MaskArray:

julia> ma = maskarray(x)
5-element MaskArray{Float64,1}:
 -0.4451772506128395
  7.748604185489348e-304
 -0.8788532569308308
  7.748604185489348e-304
 -0.5801365606645945

Note that we haven’t lost track of what’s known vs unknown:

julia> imputed(ma)
2-element view(::Vector{Float64}, CartesianIndex{1}[CartesianIndex(2,), CartesianIndex(4,)]) with eltype Float64:
 7.748604185489348e-304
 7.748604185489348e-304

Now for imputation. We’ll do something simple just to demonstrate the concept. And we’ll use TupleVectors.jl because it’s fun 🙂

Here we go again, loading packages:

julia> using TupleVectors, Sobol, StatsFuns, StatsBase

Now we’ll sample an iid normal proposal and assign log-weights. I don’t have fancy methods yet for sampling from Distributions (only MeasureTheory), so for now the inverse transform sampling is explicit.

julia> # Now sample and log-weight the result
       ω = SobolHypercube(2)
SobolHypercube{2}(2-dimensional Sobol sequence on [0,1]^2, [0.5, 0.5], Base.RefValue{Int64}(0))

julia> s = @with (;ω) 10000 begin
           # A very naive proposal
           x = 5 * norminvcdf(rand(ω))
           y = 5 * norminvcdf(rand(ω))
           imputed(ma) .= [x,y]
           # The log-weight
           ℓ = logpdf(d, ma)
           (;ℓ, x, y)
       end
10000-element TupleVector with schema (ℓ = Float64, x = Float64, y = Float64)
(ℓ = -95.0±110.0, x = -0.0±5.0, y = -0.0±5.0)
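As an aside, if the @with macro looks opaque, here’s roughly the same sampler written as a plain loop, with ordinary rand() uniforms standing in for the Sobol stream and the same naive, uncorrected weighting:

julia> begin
           N = 10_000
           ℓ = Vector{Float64}(undef, N)
           xs = Vector{Float64}(undef, N)
           ys = Vector{Float64}(undef, N)
           for i in 1:N
               # Inverse transform sampling: push uniform draws through the
               # normal quantile function, then scale for a wide proposal
               xs[i] = 5 * norminvcdf(rand())
               ys[i] = 5 * norminvcdf(rand())
               imputed(ma) .= (xs[i], ys[i])   # fill the missing slots
               ℓ[i] = logpdf(d, ma)            # log-weight under the joint
           end
       end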

Finally, we can use the weights to resample and plot the result:

julia> rs = TupleVector(sample(s, Weights(exp.(s.ℓ)), 1000))
1000-element TupleVector with schema (ℓ = Float64, x = Float64, y = Float64)
(ℓ = -7.0±0.97, x = -0.89±0.57, y = 0.7±0.8)

julia> using UnicodePlots

julia> @with TupleVectors.unwrap(rs) begin
           scatterplot(x, y)
       end
      ┌────────────────────────────────────────┐ 
    4 │⠀⠀⠀⠀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⢂⠀⠀⠀⠀⠀⠈⠀⠠⢀⠀⠠⣀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠈⠀⠀⠀⠀⠀⠄⠢⢚⠨⡢⠑⢍⠐⢐⡂⢀⠀⢀⢀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⢠⢠⠃⠒⡤⡢⡌⡢⠖⡀⢂⢔⢀⠀⠀⢊⡐⡀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠐⡀⠎⢅⠒⣔⡱⡐⠤⣋⢔⠭⡈⠱⢇⢌⠀⠂⠀⡇⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⢀⠀⠀⢀⠩⠂⡈⡌⡴⠊⢄⡵⠢⢩⡳⠰⢥⢩⡒⡀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⠀⠉⠀⠌⢊⣄⢋⢎⠴⣘⠙⡤⠖⢙⡩⡔⡕⡠⡷⠡⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠤⠤⠤⠤⠤⠤⠤⠤⠤⡬⠦⠴⠬⡤⠦⢦⡭⠦⢴⣭⠦⡼⣬⡽⡧⣧⠵⡦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│ 
      │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠀⠀⠠⠀⠂⠈⢈⠴⠌⣎⡔⠪⡌⡌⡯⠀⡣⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠃⢊⠤⢊⠁⠴⠘⡗⠂⢂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠁⠀⠀⠀⠀⠀⠀⠐⠀⡇⠊⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠄⠀⠀⠀⠀⠀⠀│ 
   -3 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ 
      └────────────────────────────────────────┘ 
      -3                                       2

Nice! Hadn’t come across TupleVectors.jl before - how does it compare to TypedTables.jl? Does it also implement the Tables.jl API?


Funny, I think we had talked about TypedTables at one point, and I don’t remember the connection with TupleVectors occurring to me at the time.

TupleVectors was originally just a data structure in SampleChains. But I realized it could be useful on its own, so I broke it out into its own thing. I haven’t done much with TypedTables, but here are some things that seem different:

  • Columns are assumed to be nested (see the sketch after this list)
  • Lots of GeneralizedGenerated tricks for making things fast, mostly in NestedTuples.jl
  • Compact Measurements-like representation, important since we usually have a few thousand samples at a time
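A rough sketch of the nested-columns point; I’m assuming here that TupleVector can be built directly from a nested NamedTuple of column arrays, which is my reading of the intended use:

julia> using TupleVectors

julia> tv = TupleVector((x = randn(100), y = (a = randn(100), b = randn(100))));

julia> tv[1];   # a row comes back as a nested NamedTuple: (x = …, y = (a = …, b = …))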

I have a start on implementing the Tables API, but it’s not complete yet.

If you think TupleVectors could be absorbed into TypedTables without losing features or performance, I’d be open to that and happy to talk more (maybe open an issue if you like) 🙂

Thanks for the info!

Columns are assumed to be nested

TypedTables can do that, actually; I’m using it a lot. I always think of TypedTables as StructArrays for NamedTuples.

I have a start on implementing the Tables API, but it’s not complete yet.

That would be cool. To me, a vector/array of NamedTuples (ideally stored column-wise, of course) has always seemed the most Julian kind of table, semantically.


Just did. 🙂

I’m not a maintainer of TypedTables.jl, just an occasional contributor. But I’d love to see things like this come together. Personally, I prefer the column-stored array-of-NamedTuples approach to the DataFrame approach (due to type stability, but also because it seems so natural that a table is semantically just an array of NamedTuples that also behaves like, and is backed by, a NamedTuple of arrays).


Thanks, I have also replied on GitHub.

One thing I’m curious about: how does TupleVectors.jl differ from StructArrays.jl?

I always think of TypedTables as StructsArrays for NamedTuples.

Pretty much. The guiding principle is that table manipulation should work fine out of the box with few packages, storing data in a Vector{NamedTuple}; TypedTables is just a columnar-storage optimization of that (with some conveniences like a tabular version of show).
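For anyone who hasn’t used it, the basic pattern looks like this (standard TypedTables usage, not specific to this thread):

julia> using TypedTables

julia> t = Table(x = [1, 2, 3], y = ["a", "b", "c"]);

julia> t[1]    # rows are NamedTuples
(x = 1, y = "a")

julia> t.x     # columns are the stored vectors
3-element Vector{Int64}:
 1
 2
 3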


It sounds like we’re targeting very similar use cases. Compared to a StructArray, the biggest difference is that TupleVectors is more specialized (see the contrast sketch after this list):

  • Built with a focus on named tuples
  • Assumes names will be nested
  • Uses some GeneralizedGenerated tricks, mostly in NestedTuples.jl
  • Compact show representation, using a user-extensible summarize function
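For contrast, here’s the same row/column duality in plain StructArrays, which TupleVectors specializes:

julia> using StructArrays

julia> sa = StructArray((x = [1.0, 2.0], y = [3.0, 4.0]));

julia> sa[1]   # rows materialize as NamedTuples
(x = 1.0, y = 3.0)

julia> sa.x    # component arrays are stored separately
2-element Vector{Float64}:
 1.0
 2.0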