Use of StatsModels?

I’m interested in making some of my modeling functions accept dataframes and fomulae as inputs, using the StatsModels package. Is the following the proper way to achieve this?

My plan would be to do something like

using CSV, StatsModels, DataFrames
nerlove = CSV.read("nerlove.csv")
nerlove[:lnC] = log.(nerlove[:cost])
nerlove[:lnQ] = log.(nerlove[:output])
nerlove[:lnPL] = log.(nerlove[:labor])
nerlove[:lnPF] = log.(nerlove[:fuel])
nerlove[:lnPK] = log.(nerlove[:capital])
f = @formula(lnC ~ 1 + lnQ + lnPL + lnPF + lnPK)

At this point, I have a formula and a dataframe. Then, I would call a fitting function, e.g., a linear regression function that accepts a formula and a dataframe, like this:

ols(f,nerlove)

Inside ols(), I would get the dependent variables and regressor matrix doing something like

m = ModelFrame(f, nerlove)
mm = ModelMatrix(m)
y = Float64.(nerlove[f.lhs])
x = mm.m
b = x\y
etc.

I have tried this, and it works. My doubts are whether or not the last block is the best way to create ordinary arrays for computing regression coefficients, etc. I also don’t know how to deal with the possibility that the dataframe has missings in it, which is why I do the type conversion, assuming that there aren’t any. Pointers to examples would be very welcome.

Is there a reason why you don’t want to use GLM.jl? If you have some custom estimators in mind, then you might want to explore your own regression output type, but GLM is very flexible, and its easy to write a closure to store the outputs you want.

Yes, the idea is to use it with more general estimators, OLS is just an example. I eventually would also like to use the table formatting, and other features in GLM, and related packages, instead of my home-cooked stuff.

1 Like

For users, packages like GLM can be great, but for researchers and teachers there can be good reasons to avoid monolithic packages that end up working against you for your specific intents and purposes

1 Like

If you want to go all the way for full inter-operability with the rest of the ecosystem, then you can create your own model object <: StatisticalModel that implements the methods described here.

IIRC observations with missing values are dropped automatically. Maybe @dave.f.kleinschmidt can confirm. You can find examples in GLM, Econometrics, Microeconometrics and FixedEffectsModels.

BTW, soon it should be possible to put the log calls directly in the formula (thanks to this PR).

2 Likes

Thanks to all for the suggestions. I now have the basic stuff working, and I think I see where to look to find examples.

2 Likes

Yup, any row with a missing in a column involved in the formula are dropped. That happens (currently) in the ModelFrame constructor.

Note that you can get “automatic” support for formula/dataframe argument by specifying RegressionModel or StatisticalModel as the supertype of your model types. Then this fit method will apply automatically. That’s the approach that GLM takes. However, the downside is that what is returned is actually a DataFrameRegressionModel wrapper, not an instance of whatever you asked to be fit in the first place. For that reason, there’s discussion about removing this automatic fallback in this issue, which AFAIK everyone is in favor of but still not clear on how best to handle fitting models with formulae (or other kinds of data transformations).