How do you use JuliaDB for tasks such as fitting a regression model with random effects? (like MixedModels.jl, but with datasets larger than memory)
I believe MixedModels and similar packages are still only compatible with DataFrames, but that’s changing with the Tables.jl / StatsModels changes.
Could you add more information, please?
Does it mean we will soon be able to use MixedModels in a different way, or that JuliaDB/OnlineStats will be able to use MixedModels?
How can we do it?
IIUC it will be fit(LinearMixedModel, @formula(Y ~ 1 + (1|G)), table), where table can be any table that abides by the Tables.jl format, i.e. a DataFrame or a JuliaDB table or whatever.
Do you have any full example?
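Here is a minimal sketch of what such a call could look like, assuming a recent MixedModels release that accepts Tables.jl input. The toy data and the names g, group_effect and tbl are made up for illustration, and the table is a plain NamedTuple of vectors rather than a JuliaDB table.

```julia
using MixedModels, Random

Random.seed!(1)

# Hypothetical toy data: 6 groups with 20 observations each.
g = repeat(string.("g", 1:6), inner = 20)             # grouping factor as strings
group_effect = Dict(k => randn() for k in unique(g))  # one random intercept per group
Y = [10.0 + group_effect[k] + 0.5 * randn() for k in g]

# A NamedTuple of vectors is a valid Tables.jl column table.
tbl = (Y = Y, G = g)

# Same call shape as in the post above, applied to a non-DataFrame table.
m = fit(LinearMixedModel, @formula(Y ~ 1 + (1 | G)), tbl)
println(m)
```

Note that this only shows the input side. As far as I know the fit itself still builds its working arrays in memory, so it does not by itself make larger-than-memory data workable.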
It takes the input data from Tables and JuliaDB… but what about the fitting algorithm? Can it work with larger-than-memory datasets, or does it require the working matrices to fit in memory?
We need something that performs every operation on disk or by chunks.
I’ve never tried Tables.jl but I’ll try.
What is IIUC?
This link says that JuliaDB and OnlineStats allow users to run analyses like linear regression on data that is too big to fit in memory.
https://juliacomputing.com/blog/2019/02/27/juliadb.html
However, I haven’t seen any example of that.
Do the docs offer what you want?
The LinReg function seems to be able to run a linear regression. Is there any example that shows how it is used with JuliaDB?
There should be more examples in the docs! Please open an issue on GitHub asking for examples to be added to the docs!
Here is a solution:
using JuliaDB, OnlineStats
t = table((x = [1.2, 2.3, 3.4], y = [5.6, 9.1, 12.3]))
reduce(LinReg(), t, select = (:x, :y)) #regressors first, left-hand-side last
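To make the "by chunks" idea concrete, here is a rough sketch that uses OnlineStats directly, without JuliaDB. The read_chunk function is a hypothetical stand-in for whatever loads one piece of the data from disk. It assumes that LinReg() does not add an intercept on its own, so a column of ones is included by hand.

```julia
using OnlineStats, Random

Random.seed!(1)

# Hypothetical stand-in for reading one chunk from disk:
# returns a design matrix (with an intercept column) and a response vector.
function read_chunk(n)
    x = randn(n)
    X = hcat(ones(n), x)
    y = 2.0 .+ 3.0 .* x .+ 0.1 .* randn(n)
    return X, y
end

o = LinReg()
for _ in 1:10                      # ten chunks, never held in memory together
    X, y = read_chunk(1_000)
    fit!(o, zip(eachrow(X), y))    # update the running summary with this chunk
end

β = coef(o)                        # one coefficient per column of X
```

Only the current chunk and the small running summary inside o are alive at any time, which is why this pattern suits data that does not fit in memory.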
How about adding multiple regressors and fixed effects?
If you list symbols in the select, list the regressors first and the left-hand-side variable last. Regressors can be indicators, so you can add fixed effects manually. But something like FixedEffectModels is not supported…
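A small sketch of the "indicators by hand" idea, again using OnlineStats directly rather than JuliaDB. The variable names and the simulated data are invented for illustration, and it assumes LinReg() has no implicit intercept.

```julia
using OnlineStats, Random

Random.seed!(1)

n = 5_000
x1 = randn(n)
x2 = randn(n)
group = rand(["a", "b", "c"], n)   # hypothetical categorical variable

# Fixed effects by hand: one 0/1 indicator per level except the baseline "a".
is_b = Float64.(group .== "b")
is_c = Float64.(group .== "c")

y = 1.0 .+ 2.0 .* x1 .- 1.0 .* x2 .+ 0.5 .* is_b .+ 1.5 .* is_c .+ 0.1 .* randn(n)

# Assumption: no implicit intercept, so a column of ones is added explicitly.
X = hcat(ones(n), x1, x2, is_b, is_c)

o = fit!(LinReg(), zip(eachrow(X), y))
β = coef(o)    # order follows the columns of X
```

This is fine for a handful of levels. With thousands of levels the indicator matrix becomes very wide, which is exactly where dedicated fixed-effects or mixed-model packages are normally used.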
I need my model to also use random effects, because two of the regressors are categorical with thousands of levels: the identifier of the person and the city. I have repeated measures.
I want to fit a logistic regression with random effects slightly more complex than this:
Treatment ~ age + sex + weight + year + (1|City/ID), family = "binomial"
I have been using R to do it, but it’s really slow, and if the dataset is too large I get out-of-memory errors.
Another option would be using something like photon-ml, but I haven’t been able to find any example with code to do it, just theory.
The OnlineStats docs mention logistic regression models, but not one with random effects. It seems there is no off-the-shelf solution for your application, but it seems like an interesting summer project.
If your data does fit into an in-memory DataFrame, there is a way to deal with high-dimensional fixed-effect models and it is very fast and convenient. There is no JuliaDB support, though.
The reason I need to try JuliaDB or other alternatives is that the dataset is too big to fit in memory: MixedModels.jl, lme4 and other packages I’ve tried can’t deal with it.
Another option would be Spark with Photon-ML, but again I haven’t been able to find examples with code, just abstract theory.
Are there any plans to add more econometrics functionality to JuliaDB? It seems most big data solutions are designed for machine learning. However, there are a lot of academic users who want to run regressions on large data.
That is a question for @joshday
Some of what you want may already exist:
julia> using JuliaDB, OnlineStats
julia> t = table((randn(10^5) for _ in 1:20)...);
julia> o = reduce(LinRegBuilder(), t)
LinRegBuilder of 20 variables
julia> coef(o, x=1:5, y=6)
6-element Array{Float64,1}:
-0.0030005396011994296
-0.002424984309950336
-0.0018409523123960267
-0.005956781422760911
0.00111525992287166
0.0024908719702746666
I particularly think LinRegBuilder is cool because after you’ve fit it on a dataset you can then fit any regression you want without needing any of the actual data.
Adding things like interaction/quadratic terms is less straightforward at the moment, but I don’t think OnlineStats is the right place for that functionality to live (see StatsModels.jl).
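As a rough illustration of both points, the sketch below fits one LinRegBuilder and then pulls two different regressions out of it, with a quadratic term built by hand as an extra column. The data and column layout are made up, and plain matrix rows are used instead of a JuliaDB table.

```julia
using OnlineStats, Random

Random.seed!(1)

n = 10_000
x = randn(n)
z = randn(n)
y = 1.0 .+ 2.0 .* x .+ 0.5 .* x .^ 2 .- 1.0 .* z .+ 0.1 .* randn(n)

# Variables 1..4 are x, x^2, z, y; the squared term is constructed manually.
data = hcat(x, x .^ 2, z, y)

# One pass over the rows builds the summary.
o = fit!(LinRegBuilder(), eachrow(data))

# Different regressions from the same fitted object, with no second pass over the data.
β_full  = coef(o; y = 4, x = [1, 2, 3])   # y on x, x^2, z, plus a bias term
β_small = coef(o; y = 4, x = [1])         # y on x only, plus a bias term
```

Building transformed columns by hand like this works, but a formula interface such as the one in StatsModels.jl is the more natural home for it.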