How to do multiple imputation in Julia?

I’ve found the package Impute.jl, but it seems it only supports these simple methods (a couple of them are sketched in plain Julia just after the list):

  • drop - remove missing
  • locf - last observation carried forward
  • nocb - next observation carried backward
  • interp - linear interpolation of values in vector
  • fill - replace with a specific value or a function which returns a value given the existing vector with missing values dropped.
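
To illustrate what these do, here are toy versions of two of them written directly against a vector with `missing` values. These are just illustrations of the ideas, not Impute.jl’s actual implementation:

```julia
# Toy LOCF: copy the previous entry forward into each missing slot
# (leading missings stay missing).
function locf(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i-1]
        end
    end
    return out
end

# Toy linear interpolation: fill each gap between two observed values;
# leading/trailing missings are left as-is.
function linear_interp(v::AbstractVector)
    out = copy(v)
    obs = findall(!ismissing, out)            # indices of observed values
    for (a, b) in zip(obs[1:end-1], obs[2:end])
        for i in (a+1):(b-1)
            t = (i - a) / (b - a)
            out[i] = (1 - t) * out[a] + t * out[b]
        end
    end
    return out
end

v = [1.0, missing, missing, 4.0, missing]
locf(v)           # [1.0, 1.0, 1.0, 4.0, 4.0]
linear_interp(v)  # [1.0, 2.0, 3.0, 4.0, missing]
```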

How can I do multiple imputation when I have several variables, or when I want to use more complex methods, for example: fully conditional specification (chained equations, mice), Bayesian methods, random forests, multilevel imputation, nested imputation, censored data, categorical data, survival data…
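
To be clear about what I mean by chained equations, here is a very rough single-pass sketch of the idea on a numeric matrix, using only plain least squares. Real mice also draws the regression parameters from their posterior and repeats the whole thing m times, handles categorical variables with appropriate models, and so on — this is only meant to show the shape of the algorithm:

```julia
using Statistics, Random

# Rough sketch of fully conditional specification on a numeric matrix X
# (rows = observations, columns = variables). Not a substitute for mice.
function fcs_impute(X::AbstractMatrix; iterations = 10, rng = Random.default_rng())
    miss = ismissing.(X)
    Xc = similar(X, Float64)
    for j in axes(X, 2)                        # start from column-mean imputation
        μ = mean(skipmissing(X[:, j]))
        Xc[:, j] = coalesce.(X[:, j], μ)
    end
    for _ in 1:iterations, j in axes(Xc, 2)
        any(miss[:, j]) || continue
        others = setdiff(axes(Xc, 2), j)
        A = [ones(size(Xc, 1)) Xc[:, others]]  # intercept plus the other columns
        obs = .!miss[:, j]
        β = A[obs, :] \ Xc[obs, j]             # least squares on the observed rows
        σ = std(Xc[obs, j] - A[obs, :] * β)    # residual spread
        pred = A[miss[:, j], :] * β
        Xc[miss[:, j], j] = pred .+ σ .* randn(rng, length(pred))  # stochastic draw
    end
    return Xc
end

X = [1.0 2.0; missing 3.9; 2.0 missing; 4.0 8.1; 3.0 missing]
fcs_impute(X)
```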

2 Likes

The same way as you would estimate any parameter.

1 Like

I mean without coding the whole process on my own, but with a package that does the imputation automatically. Other tools do have this (R, Stata, SAS…).

Instead of creating a new topic, I thought I’d rather bump this one. I am also looking for a package that implements advanced imputation methods. If there isn’t any, I’d like to get started on one, though only with a mentor, because 1) while I have some experience with multiple imputation methods, I don’t have a pure stats background, and 2) I don’t yet know Julia at the level needed to create such a package (how to expose an API, how to work with multiple dispatch, etc…). My vision is to create something equivalent to the R package mice.

Any takers?

1 Like

A good start would be a list of functions to implement, checking whether there is already a Julia implementation (or alternative) of each. After that, start implementing those functions. I saw that mice follows a book by van Buuren; perhaps use that as a guide?

In particular, I don’t know a lot about multiple imputation, but from what I read in mice’s documentation, it operates on datasets with missing values. For that, a good base is implementing over the Tables.jl interface. For distribution support, there is Distributions.jl.
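
For example, a hypothetical function built on those two packages could accept any Tables.jl-compatible source and draw replacement values from a fitted distribution. The name `impute_normal` and its behaviour are just a sketch, not an existing API:

```julia
using Tables, Distributions

# Hypothetical sketch: impute each numeric column of any Tables.jl table by
# drawing from a Normal distribution fitted to that column's observed values.
function impute_normal(table)
    cols = Tables.columntable(table)          # NamedTuple of column vectors
    imputed = map(cols) do col
        obs = collect(skipmissing(col))
        isempty(obs) && return col            # nothing to fit on
        eltype(obs) <: Real || return col     # leave non-numeric columns alone
        d = fit(Normal, Float64.(obs))        # Distributions.jl MLE fit
        [ismissing(x) ? rand(d) : x for x in col]
    end
    return imputed                            # a NamedTuple is itself a valid table
end

tbl = (a = [1.0, missing, 3.0], b = ["x", "y", "z"])
impute_normal(tbl)
```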

With respect to the design of the API, maybe you can follow mice’s names, while adhering to the Julia style guide when possible.

Finally, since what I see is that you want to implement such a package, I would recommend just going for it and learning by doing. From what I have seen, this community is one of smart, busy people, but they are glad to help whenever and however they can.

If you are not yet ready to write such a package, you can at least start the conversation about an API for multiple imputation. With a Discourse post containing an initial proposal, you can encourage other people to express their ideas of what they want, and even share some mockups of an initial implementation.
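
To get that conversation going, here is one possible mockup of what a mice-like entry point could look like. Everything below is hypothetical — none of these names exist in any package yet — with mice-like names where that is natural and Julia-style keywords otherwise:

```julia
# Hypothetical API mockup for a mice-like Julia package; nothing here exists yet.

abstract type ImputationMethod end
struct PMM       <: ImputationMethod end   # predictive mean matching
struct NormDraw  <: ImputationMethod end   # Bayesian linear regression draws
struct LogitDraw <: ImputationMethod end   # logistic regression for binary columns

struct MultiplyImputed{T}
    imputations::Vector{T}                 # m completed copies of the data
    method::Dict{Symbol,ImputationMethod}  # per-column imputation model
end

"""
    mice(table; m = 5, method = Dict(), maxit = 10)

Run fully conditional specification on any Tables.jl-compatible `table`,
returning `m` completed data sets wrapped in a `MultiplyImputed` object.
"""
function mice(table; m::Int = 5, method = Dict{Symbol,ImputationMethod}(), maxit::Int = 10)
    # ... chained-equations loop would go here ...
end

# Rubin's rules would then live in something like:
# pool(fits::Vector) -> pooled estimates and variances
```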

A high-impact title to draw attention here would be “Taking multiple imputation seriously”.

1 Like

Depending on what is considered advanced, I don’t think this is something one would package — the best method is always tailored to the model.

I think that “imputation” is a historical artifact: conceptually, it is much cleaner to think about missing values as unknown parameters, like all other parameters. It is true that in some special cases it may be advantageous to obtain point estimates, e.g. with EM, but generally that is not the case.
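
To make that concrete in Julia terms, one way to express “missing values are just parameters” is a probabilistic-programming model. The sketch below assumes Turing.jl’s documented handling of `missing` entries in observed data (they are sampled as latent parameters); the model and data are made up for illustration:

```julia
using Turing

# A trivial model in which missing observations are sampled jointly with
# μ and σ, i.e. treated like any other unknown parameter.
@model function normal_model(y)
    μ ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5); lower = 0)
    for i in eachindex(y)
        y[i] ~ Normal(μ, σ)   # missing y[i] become parameters of the posterior
    end
end

y = [1.2, missing, 0.8, missing, 1.5]
chain = sample(normal_model(y), NUTS(), 1_000)
# The posterior draws of y[2] and y[4] play the role of "imputations".
```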

I would recommend BDA, specifically working up to Chapter 18 on missing data, then the Missing Data and Other Opportunities part of Statistical Rethinking (again, it is at the end :wink:). This is a long ride, but you will be a super-well-trained statistician at the end.

As for mentoring: self-study is an option with the above resources; otherwise, if you are enrolled at a university, you should be able to find someone working on Bayesian stats and ask for their help.

2 Likes

While I agree with this in principle, I think the idea here is to have an “advanced standard”, if that makes sense: the R mice package is popular because, while it implements a lot of reasonably cutting-edge stuff (disclaimer: this isn’t really my area, so I’m speculating a bit here), ultimately most users will just call mice(my_data) and get back an imputed data set that looks face-valid.

To me this is similar to forecast, where most people would similarly just do auto_arima(my_timeseries) and be done with it. Of course it is always preferable to actually think about the methods one is using, but in many contexts the task of imputation (or forecasting) is only a minor input into a larger analysis pipeline, and having an “advanced standard” that produces reasonable defaults comes in very handy in these cases.

5 Likes

That said, have you reached out to the people at Invenia, who I think are behind Impute.jl? I’m sure they’d be open to PRs implementing some more advanced methods, and might be able to help along the way as well.

cc @oxinabox

@nilshg I did, one year ago, and they told me they would add multiple imputation to Impute.jl, but that they first needed to rewrite the package to use iterators and something else.
In the end I used R for my thesis, because it already has things that are important to me, such as multiple imputation and meta-analysis.

LowRankModels.jl has quite nice support for all sorts of imputation.
https://github.com/madeleineudell/LowRankModels.jl
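
If you just want the gist of why a low-rank model helps with imputation, a toy version of the idea in plain Julia looks roughly like this. This is not LowRankModels.jl’s API, just an illustration of iterative rank-k approximation:

```julia
using LinearAlgebra, Statistics

# Generic low-rank imputation sketch: alternately project onto a rank-k
# approximation and overwrite only the missing cells with that approximation.
function lowrank_impute(X::AbstractMatrix; k = 2, iterations = 50)
    miss = ismissing.(X)
    Xc = Float64.(coalesce.(X, mean(skipmissing(X))))  # start from the grand mean
    for _ in 1:iterations
        F = svd(Xc)
        approx = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
        Xc[miss] .= approx[miss]                       # keep observed entries fixed
    end
    return Xc
end

X = [1.0 2.0 3.0; 2.0 missing 6.0; 3.0 6.0 missing]
lowrank_impute(X; k = 1)
```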

3 Likes

Which has now been done, so it might now be in a position for such a PR.
Though I think it needs some redesign around how the Context works, as it tries to be too smart.

But people are busy, and it’s not particularly high priority vs. all the other things going on.

Also pending is LowRankModels integration,
so that you can use one interface for LOCF etc. and for the LowRankModels implementations.

If LowRankModels works for you though, I would just use that directly right now.

2 Likes

Impute.jl does need another refactor to better separate out and simplify the Context (iterator) API. I should have a bit more time to work on this shortly.

Regarding LowRankModels, it doesn’t seem to be very actively developed either, and the tests don’t pass (for which there is an open PR). I think fixing that is probably the first step towards integration with Impute.jl.

2 Likes

I have been working a bit on this lately; there is some functionality that works quite well for time series and matrices:
https://github.com/baggepinnen/TotalLeastSquares.jl#missing-data-imputation
It’s not a complete data-imputation package, but in the settings where it applies, it works well.

3 Likes

Another vote from me for more advanced data imputation in Julia!
The R package Multiple Imputation by Chained Equations (MICE) is super-popular but also super-slow.
This is where Julia can shine!
Data imputation is often one of the first steps in any ML project (or Kaggle competition).
This kind of feature can help make the case: Getting Julia support to Kaggle

For imputation of continuous variables by means of Gaussian Mixture Models (GMM) / the Expectation-Maximisation (EM) algorithm, there is also predictMissing from BetaML.jl.

Note that the DecisionTree/RandomForest from the same library, unlike most other implementations, operates on data with missing values, so it could perhaps be used to predict the missing values one dimension at a time…

4 Likes