Use of StatsModels?

mcreel · October 30, 2018, 6:12pm

I’m interested in making some of my modeling functions accept dataframes and fomulae as inputs, using the StatsModels package. Is the following the proper way to achieve this?

My plan would be to do something like

using CSV, StatsModels, DataFrames
nerlove = CSV.read("nerlove.csv")
nerlove[:lnC] = log.(nerlove[:cost])
nerlove[:lnQ] = log.(nerlove[:output])
nerlove[:lnPL] = log.(nerlove[:labor])
nerlove[:lnPF] = log.(nerlove[:fuel])
nerlove[:lnPK] = log.(nerlove[:capital])
f = @formula(lnC ~ 1 + lnQ + lnPL + lnPF + lnPK)

At this point, I have a formula and a dataframe. Then, I would call a fitting function, e.g., a linear regression function that accepts a formula and a dataframe, like this:

ols(f,nerlove)

Inside ols(), I would get the dependent variables and regressor matrix doing something like

m = ModelFrame(f, nerlove)
mm = ModelMatrix(m)
y = Float64.(nerlove[f.lhs])
x = mm.m
b = x\y
etc.

I have tried this, and it works. My doubts are whether or not the last block is the best way to create ordinary arrays for computing regression coefficients, etc. I also don’t know how to deal with the possibility that the dataframe has missings in it, which is why I do the type conversion, assuming that there aren’t any. Pointers to examples would be very welcome.

pdeffebach · October 30, 2018, 6:53pm

Is there a reason why you don’t want to use GLM.jl? If you have some custom estimators in mind, then you might want to explore your own regression output type, but GLM is very flexible, and its easy to write a closure to store the outputs you want.

mcreel · October 30, 2018, 7:31pm

Yes, the idea is to use it with more general estimators, OLS is just an example. I eventually would also like to use the table formatting, and other features in GLM, and related packages, instead of my home-cooked stuff.

pkofod · October 30, 2018, 7:38pm

For users, packages like GLM can be great, but for researchers and teachers there can be good reasons to avoid monolithic packages that end up working against you for your specific intents and purposes

pdeffebach · October 30, 2018, 8:23pm

If you want to go all the way for full inter-operability with the rest of the ecosystem, then you can create your own model object <: StatisticalModel that implements the methods described here.

nalimilan · October 31, 2018, 7:10am

IIRC observations with missing values are dropped automatically. Maybe @dave.f.kleinschmidt can confirm. You can find examples in GLM, Econometrics, Microeconometrics and FixedEffectsModels.

BTW, soon it should be possible to put the log calls directly in the formula (thanks to this PR).

mcreel · October 31, 2018, 10:20am

Thanks to all for the suggestions. I now have the basic stuff working, and I think I see where to look to find examples.

dave.f.kleinschmidt · October 31, 2018, 12:44pm

Yup, any row with a missing in a column involved in the formula are dropped. That happens (currently) in the ModelFrame constructor.

Note that you can get “automatic” support for formula/dataframe argument by specifying RegressionModel or StatisticalModel as the supertype of your model types. Then this fit method will apply automatically. That’s the approach that GLM takes. However, the downside is that what is returned is actually a DataFrameRegressionModel wrapper, not an instance of whatever you asked to be fit in the first place. For that reason, there’s discussion about removing this automatic fallback in this issue, which AFAIK everyone is in favor of but still not clear on how best to handle fitting models with formulae (or other kinds of data transformations).

Topic		Replies	Views
Wrappers for GLM Data glm	11	1457	June 19, 2018
Julia Ecosystem (respecting hierarchy and common API) - Statistical Models Internals & Design	4	1385	July 19, 2017
StatsModels not adapted yet? Statistics question , package	13	1220	November 13, 2019
Multivariate OLS General Usage	17	5845	November 13, 2018
Add column names to a GLM.LinearModel estimated in matrix form Statistics question , glm	11	1153	July 1, 2020

Use of StatsModels?

Related topics