[ANN] Mice.jl - multiple imputation by chained equations in Julia

Hi everyone! This is just to introduce Mice.jl, a package for handling missing data via multiple imputation by chained equations, heavily based on the R package mice.

I wouldn’t necessarily trust it 100% just yet (it needs a lot more testing!) but I hope this will end up being helpful!

Documentation and examples at https://tom-metherell.github.io/Mice.jl.


Finally someone gets around to doing this, thank you! A few questions:

  • Why the focus on DataFrames? Seems like this should work with Tables.jl?
  • Why do you need Plots as a dependency? Plotting recipes can be defined with RecipesBase.jl without taking on the very heavy Plots dependency. If you really have some functionality that requires loading Plots, you should look into package extensions.
  • What’s the performance on the second run?
  • Relatedly, have you looked into PrecompileTools and other strategies to reduce latency?

Really excited to see a native Julia MICE implementation!
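
To illustrate the Tables.jl point, here is a minimal sketch of a function that accepts any Tables.jl-compatible source rather than a DataFrame specifically. `count_missing` is a hypothetical helper written for this example and is not part of Mice.jl; only the `Tables.istable`, `Tables.columns`, `Tables.columnnames` and `Tables.getcolumn` calls come from Tables.jl itself.

```julia
using Tables

# Hypothetical helper (not from Mice.jl): count the missing values in each
# column of any Tables.jl-compatible source.
function count_missing(table)
    Tables.istable(table) || throw(ArgumentError("input is not a Tables.jl table"))
    cols = Tables.columns(table)
    return Dict(name => count(ismissing, Tables.getcolumn(cols, name))
                for name in Tables.columnnames(cols))
end

# A NamedTuple of vectors is already a valid column table,
# so this runs without DataFrames.jl being loaded.
data = (age = [23, missing, 31], bmi = [missing, missing, 22.5])
counts = count_missing(data)
```

The same function should accept a DataFrame, a CSV.File or any other source that implements the Tables.jl interface, which is the main argument for targeting the interface instead of one concrete type.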


Hi @nilshg! In response:

  • To be honest, I mostly wrote this for myself, so I wasn’t focusing on making it compatible with packages that I don’t use :sweat_smile: but I will raise this as an issue to work on at some point.
  • Ah! Thanks for letting me know, will fix that
  • In the (super non-rigorous) benchmarking I did, performance plateaued at about 4x the speed of the R package on the 2nd run. But I’ve only been using Julia for about 9 months, so there is probably scope for significant further improvement
  • I did look into it briefly and found that it didn’t make any difference - but again that might be because I’m inexpert rather than because it can’t make a difference.

Thanks for your input and I hope this package will improve significantly with time!
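
For anyone wanting to repeat the first-run versus second-run comparison, a rough sketch of the idea is below. `fill_with_mean` is a hypothetical stand-in workload, not Mice.jl code, and `@elapsed` gives only a coarse timing. The first call includes JIT compilation time, while the second reuses the compiled code.

```julia
# Hypothetical stand-in workload (not Mice.jl code): replace each missing
# value with the mean of the observed values.
function fill_with_mean(x)
    m = sum(skipmissing(x)) / count(!ismissing, x)
    return [ismissing(v) ? m : v for v in x]
end

x = [1.0, missing, 3.0, missing, 5.0]

first_run = @elapsed fill_with_mean(x)    # includes compilation of fill_with_mean
second_run = @elapsed fill_with_mean(x)   # runs the already-compiled method

println("first run:  ", first_run, " s")
println("second run: ", second_run, " s")

filled = fill_with_mean(x)
```

For anything beyond a quick check, a dedicated benchmarking package that takes repeated samples would give more reliable numbers than a single `@elapsed` call.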


I second that. Opened an issue in the repository.

If you can switch to Tables.jl, we can easily implement the TableTransforms.jl interface and use Mice.jl with the many other transforms already available for tabular data.


Hello, a set of missing-value imputers that can produce multiple imputations is also provided by the Imputation sub-module of BetaML.

Currently, the provided imputers are:

  • FeatureBasedImputer: Impute data using the feature (column) mean, optionally normalised by l-norms of the records (rows) (fastest)
  • GMMImputer: Impute data using a Generative (Gaussian) Mixture Model (good trade off)
  • RFImputer: Impute missing data using Random Forests, with optional replicable multiple imputations (most accurate).
  • UniversalImputer: Impute missing data using a vector (one per column) of arbitrary learning models (classifiers/regressors) that implement m = Model([options]), fit!(m,X,Y) and predict(m,X) (not necessarily from BetaML).

Although multiple imputations are provided (in the models with stochastic imputers), there is no mechanism to pool the results of further analyses run on the imputed values.
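
For context, the pooling step referred to here is usually done with Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. Below is a minimal sketch for a single scalar parameter. `pool_rubin` is a hypothetical helper written for this example; it is not part of BetaML or Mice.jl, and it omits the degrees-of-freedom calculation.

```julia
using Statistics

# Hypothetical helper (not from BetaML or Mice.jl): pool one scalar estimate
# across m imputed datasets using Rubin's rules.
function pool_rubin(estimates::AbstractVector{<:Real}, variances::AbstractVector{<:Real})
    m = length(estimates)
    m == length(variances) || throw(ArgumentError("need one variance per estimate"))
    m > 1 || throw(ArgumentError("need at least two imputations"))
    qbar = mean(estimates)                  # pooled point estimate
    within = mean(variances)                # average within-imputation variance
    between = var(estimates)                # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between  # total variance of the pooled estimate
    return (estimate = qbar, within = within, between = between,
            total = total, se = sqrt(total))
end

# Three imputed datasets, each yielding an estimate and its squared standard error.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this example the between-imputation variance is 1.0, so the total variance is 0.5 + (4/3) × 1.0, noticeably larger than the within-imputation variance alone. That extra term is what reflects the uncertainty due to the missing data.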

v0.1.0 (hopefully) fixes a number of the issues that have been raised so far, most importantly dropping the reliance on DataFrames.jl and using Tables.jl instead. We’ve taken a small performance hit (though approximately half of that was because I accidentally missed a line out of the benchmarking code before, oops!). Will attempt to improve performance in the near future :smiley:


Little update - I’ve made some performance improvements, so Mice.jl is now roughly 2x as fast as before (5-6x as fast as the R equivalent on Linux, ~7x as fast on Windows). Next step is probably to add more methods - and of course if anyone has any requests please raise an issue!
