I’ve found the package Impute.jl, but it seems it only supports these simple methods:

- `drop`: remove observations with missing values
- `locf`: last observation carried forward
- `nocb`: next observation carried backward
- `interp`: linear interpolation of values in a vector
- `fill`: replace with a specific value, or with a function that returns a value given the existing vector with missings dropped
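For reference, these basic methods are straightforward to call. A minimal sketch, assuming a recent Impute.jl release (exact function names and keyword arguments may differ between versions, so treat this as illustrative):

```julia
using Impute  # hypothetical usage; check the API of your installed Impute.jl version

v = [1.0, missing, missing, 4.0]

Impute.locf(v)             # carry 1.0 forward into the gap
Impute.nocb(v)             # carry 4.0 backward into the gap
Impute.interp(v)           # linear interpolation across the gap
Impute.fill(v; value=0.0)  # replace missings with a constant
```

These all work on a single vector at a time, which is exactly why they fall short for multivariate imputation.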
How can I do multiple imputation when I have several variables, or use more complex methods, for example: fully conditional specification (chained equations, as in mice), Bayesian methods, random forests, multilevel imputation, nested imputation, or methods for censored, categorical, and survival data?
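For context, the core of fully conditional specification can be sketched in a few lines of plain Julia. This is my own illustrative code, using simple least-squares regressions as the conditional models; real MICE additionally draws imputations from a predictive distribution and repeats the whole process m times to get m completed datasets for pooling:

```julia
using Statistics

# Illustrative sketch of fully conditional specification (chained
# equations): cycle over columns, regress each column with missings on
# all the others, and refresh its missing cells with the predictions.
function fcs_impute(X::AbstractMatrix; iters::Int=10)
    n, p = size(X)
    miss = ismissing.(X)
    Xf = Matrix{Float64}(undef, n, p)
    for j in 1:p                     # initialise missing cells with column means
        Xf[:, j] = coalesce.(X[:, j], mean(skipmissing(X[:, j])))
    end
    for _ in 1:iters, j in 1:p
        any(miss[:, j]) || continue
        A = [ones(n) Xf[:, setdiff(1:p, j)]]     # intercept + other columns
        obs = .!miss[:, j]
        β = A[obs, :] \ Xf[obs, j]               # least-squares conditional model
        Xf[miss[:, j], j] .= (A * β)[miss[:, j]] # refresh the missing cells
    end
    return Xf
end
```

On a matrix where the second column is an exact linear function of the first, the regression pass recovers the missing value exactly.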
Instead of creating a new topic, I thought I’d rather bump this one. I am also looking for a package that implements advanced imputation methods. If there isn’t one, I’d like to get started on it, though only with a mentor, because 1) while I have some experience with multiple imputation methods, I don’t have a pure stats background, and 2) I don’t yet know Julia at an advanced enough level to create such a package (taking into consideration how to expose an API, how to work with multiple dispatch, etc.). My vision is to create something equivalent to the R package mice.
A good start would be a list of functions to implement, then checking whether a Julia implementation (or alternative) already exists for each. After that, start implementing the missing ones. I saw that mice follows a book by van Buuren (Flexible Imputation of Missing Data); that could serve as a guide.
In particular, I don’t know a lot about multiple imputation, but from what I read in mice’s documentation, it operates on datasets with missing values. For that, a good base is implementing over the Tables.jl interface. For distribution support, there is Distributions.jl.
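As a hypothetical illustration of building on Tables.jl, a generic imputation function could consume any Tables.jl-compatible source roughly like this. `impute_mean` is my own invented name (mean imputation only), not an existing API, but the `Tables.columntable` access pattern is the standard one:

```julia
using Tables, Statistics

# Hypothetical generic mean imputation over any Tables.jl-compatible
# table: materialise the columns, then replace missings column by column.
function impute_mean(tbl)
    cols = Tables.columntable(tbl)   # NamedTuple of column vectors
    map(cols) do col
        if any(ismissing, col)
            coalesce.(col, mean(skipmissing(col)))  # fill with the column mean
        else
            col
        end
    end
end

# Any Tables.jl source works; a NamedTuple of vectors is the simplest one.
nt = (a = [1.0, missing, 3.0], b = [10.0, 20.0, 30.0])
imputed = impute_mean(nt)
```

The nice part of targeting Tables.jl is that DataFrames, CSV files, and in-memory named tuples all come for free.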
With respect to the design of an API, maybe you can follow mice’s names, but adhere to the Julia style guide where possible.
Finally, since you want to implement such a package, I would recommend just going for it and learning by doing. From what I’ve seen, this community is made up of smart, busy people, but they are glad to help whenever and however they can.
If you are not yet ready to write such a package, you can at least start the conversation about an API for multiple imputation. With a Discourse post containing an initial proposition, you can encourage other people to express their ideas of what they want, and even share some mockups of an initial implementation.
Depending on what is considered advanced, I don’t think this is something one would package — the best method is always tailored to the model.
I think that “imputation” is a historical artifact: conceptually, it is much cleaner to think about missing values as unknown parameters, like all other parameters. It is true that in some special cases it may be advantageous to obtain point estimates, eg with EM, but generally that is not the case.
I would recommend BDA (Gelman et al., Bayesian Data Analysis), specifically working up to Chapter 18 on missing data, then the “Missing Data and Other Opportunities” part of Statistical Rethinking (again, it is near the end). This is a long ride, but you will be a super-well-trained statistician at the end.
As for mentoring: self-study is an option with the above resources, otherwise, if you are enrolled at a university, you should be able to find someone working on Bayesian stats and ask for their help.
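To make the missing-values-as-parameters view concrete, here is a hedged sketch with Turing.jl, assuming its documented handling of `missing` data, where missing entries of the observation vector are sampled as latent parameters alongside the model parameters (check the current Turing docs before relying on this):

```julia
using Turing

# A simple Normal model: any missing element of x is treated as an
# unknown quantity and sampled jointly with μ and σ, so "imputation"
# is just posterior inference over those entries.
@model function gaussmodel(x)
    μ ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5); lower=0)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end

x = [1.0, missing, 3.0, 2.5, missing]
chain = sample(gaussmodel(x), NUTS(), 1000)
# the chain now contains posterior draws for x[2] and x[5] as well as μ and σ
```

Rather than a single filled-in value, you get a full posterior for each missing entry, which propagates imputation uncertainty into downstream estimates automatically.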
While I agree with this in principle, I think the idea here is to have an “advanced standard” if that makes sense - the R mice package is popular because while it implements a lot of reasonably cutting edge stuff (disclaimer: this isn’t really my area so I’m speculating a bit here), ultimately most users will just call mice(my_data) and get back an imputed data set that looks face valid.
To me this is similar to the R forecast package, where most people would similarly just do auto.arima(my_timeseries) and be done with it. Of course it is always preferable to actually think about the methods one is using, but in many contexts the task of imputation (or forecasting) is only a minor input into a larger analysis pipeline, and having an “advanced standard” that produces reasonable defaults comes in very handy in these cases.
That said, have you reached out to the folks at Invenia, who I think are behind Impute.jl? I’m sure they’d be open to PRs implementing some more advanced methods and might be able to help along the way as well.
@nilshg I did one year ago, and they told me they would add multiple imputation to Impute.jl, but they first needed to rewrite the package to use iterators, among other things.
Finally I used R for my thesis because it already has important things for me such as multiple imputation and meta-analysis.
Which has now been done. So it might now be in a position for such a PR.
Though I think it needs some redesign around how the Context works as it tries to be too smart.
but people are busy and it’s not particularly high priority vs. all the other things going on.
Also pending is LowRankModels integration.
So that you can use one interface both for LOCF etc. and for the LowRankModels implementations.
If LowRankModels works for you though, I would just use that directly right now.
Impute.jl does need another refactor to better separate out and simplify the Context (iterator) API. I should have a bit more time to work on this shortly.
Regarding LowRankModels, it doesn’t seem to be very actively developed either and the tests don’t pass (which there is an open PR for). I think that’s probably the first step to integration with Impute.jl.
Another vote from me for more advanced data imputation in Julia!
The R package mice (Multiple Imputation by Chained Equations) is super-popular but also super-slow.
This is where Julia can shine!
Data imputation is often one of the first steps in any ML project (or Kaggle competition).
This kind of feature can help make the case: Getting Julia support to Kaggle
For imputation of continuous variables by means of Gaussian Mixture Models (GMM) / the Expectation-Maximisation (EM) algorithm, there is also predictMissing from BetaML.jl.
Note that the DecisionTree/RandomForest from the same library, unlike most other implementations, operates on data with missing values, so it could perhaps be used to predict the missing values one dimension at a time…