Multivariate OLS

  • GLM.jl seems more like an addon package for DataFrames (which are not even blessed as a basic julia building block, so they could go away) than a standalone linear regression package. Or am I mistaken? Can it operate on data series that are not in DataFrames? (R’s lm is base functionality and works without use of dataframes.)

  • I can code a multivariate regression using linear algebra (although the obvious version, inv(X’ X)*X’ y, has numerical stability problems), and I am guessing that internal implementation details should be hidden from novices.

so what does the julia ecosystem suggest at the moment for a simple example:

julia> y=[1:10;]; x1= y.^2; x2= y.^3;  ## julia
julia> GLM.lm( y ~ x1 + x2 )
ERROR: MethodError: no method matching ~(::Array{Int64,1}, ::Array{Int64,1})

wanted:

print( m <- lm( y ~ x1+x2 ));
print( sm <- summary(m) );
mycoefs <- coefs( sm )
using DataFrames, GLM
y = 1:10 
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)
sm = GLM.lm( @formula(y ~ x1 + x2), df )
coef(sm)

Does what you want.

4 Likes

indeed (and thank you), but it requires DataFrames.

Maybe a better name should be DataFramesGLM ?

1 Like

You can instead do X \ y; I believe this should be more stable:

julia> X = randn(40,4);

julia> y = randn(40);

julia> X \ y
4-element Array{Float64,1}:
  0.0236363
  0.139099 
 -0.0906928
  0.0776183

julia> inv(X' * X) * X' * y
4-element Array{Float64,1}:
  0.0236363
  0.139099 
 -0.0906928
  0.0776183

If you’re running into numerical stability problems, perhaps you could try some sort of reparameterization of your design matrix?
If it is nearly rank deficient, you aren’t going to end up with interpret-able coefficients anyway.

2 Likes

For OLS,

y = Vector{Float64}(1:10)
X = Matrix{Float64}(hcat(y.^2, y.^3))
β = X \ y # Uses QR

The \ operator is the short-hand for

Q, R = qr(X)
β = inv(factorize(R)) * Q.'y

For even higher precision, you may use the singular values decomposition,

β = pinv(X) * y

or its long-form

U, S, V = svd(X)
β = V * diagm(1 ./ S) * U.' * y

Other alternatives using the LinearAlgebra stdlib include taking the inverse of the normal matrix through Cholesky or LdLt, Bunch-Kaufman, etc. Less common in certain applications a few packages provide sparse iterative solvers such as LSQR and LSMR.

Just because something is not in Base, it doesn’t mean is not “blessed”. Base aims to only include the very foundation and basic building blocks (e.g., LinearAlgebra is not in Base, but a standard library). Standard libraries are the basic code that is deemed essential for the language, development or meant to be shared across the whole ecosystem (Missings.jl is probably going to be added as a standard library).

GLM doesn’t use DataFrames per se, but uses StatsModels meaning that if StatsModels eventually plays well with other tabular data packages, no reason for GLM not to support them as well. Same for all other regression packages which rely on StatsModels. The usual pipeline is StatsBase / StatsModels / some regression package (with most having a dependency on DataFrames and Distributions as well).

The idea of using regression packages rather than just linear algebra is that it allows for efficient and proper handling of cases. For example, contrasts (interactions), weights, linear dependent variables, different estimators (fixed effects, first-difference, between, 2SLS, random effects, regularization, generalized method of moments, etc.), displaying the variable names, computing various variance-covariance estimates, etc. Regression packages can also benefit for outsourcing calculations such as covariance matrices or sharing a nice standard API (StatsBase.RegressionModel).

Package development in Julia aims to provide lightweight packages. Rather than having one encompass all package these are usually a compilation of smaller packages. For example, DataFrames used to have I/O, tabular data representation, missing data support, categorical variables support, and statistical model transformations. Nowadays, there are packages for each of those functionalities separately and one can choose and select which ones one needs.

9 Likes

Thanks @Nosferican. Yes, this “blessed” nonsense really needs to stop. DataFrames.jl is a package in the Julia ecosystem that is depended upon by a lot of the statistics ecosystem. Just because it’s not part of the JuliaLang repo doesn’t mean it’s not important or widely used, it just means it’s not forced in the installation. Many of the core Julia contributors also contribute to DataFrames.jl. I agree that it’s not a basic Julia building block because Julia is a language, not a statistics engine or a scientific computing engine. But what is powerful about Julia is that packages are first-class addons to the language which can be just as performant as anything written in Base even when written in Julia. Dismissing packages is a good way to miss out on what Julia is all about.

This should look familiar:

That’s just pinv(X). Basically, OLS is just solving Ax=b where you add a column of 1s if you want an intercept, so matrix factorizations of A\b is OLS.

Anyways

That is a surprising admission to me. Does GLM not accept a matrix for usage with the @formula macro? It’s really not hard to code yourself though (I actually have that as a problem for people learning Julia for the first time). That said, IMO regression from an array should be added to GLM since it’s a pretty basic thing (but someone needs to write the PR!).

3 Likes

hi N: thank you (again).

I understand that Package functions should not (have to) be in Base. BUT julia fresh out of the box comes with many packages:

BLAS         Dates         Iterators     Libdl         Mmap          Serializer    Test
Base         Distributed   LAPACK        LinAlg        Operators     SparseArrays  Threads
Collections  Docs          LibGit2       Markdown      Pkg           StackTraces
Core         FFTW          Libc          Meta          Profile       Sys

I view these as “blessed” (ok, wrong word; long-term maintained; generally agreed upon; used and developed actively by the main julia guys; …). These are unlikely to die anytime soon. Others could become unmaintained. I hope this is unlikely with DataFrames, but reading discourse, I have learned that DataFrames replacement is not impossible in the longer run.

Packages deep down the hierarchy that are not blessed are more vulnerable to rot. DataFrames relies on Missing etc. GLM relies on DataFrames etc. Here, with DataFrames, with fewer competing solutions, I think I am safe for a while. With graphing, the situation is a bit more confusing—there is no principal package, deep dependencies, etc. Potentially fragile (in the long run).

Lightweight is good when the pieces have self-sufficient use cases and uses.

StatsModels is dependent on DataFrames, too, so I may as well use GLM.

I should probably switch from julia to julia pro, and consider their package sets to be blessed (wrong term again). Chris: Much of what you call “first-class addons,” I called blessed. cpan is full of half-working stuff that used to work and used to be important (even CGI.pm is now deprecated). I need to know what is considered “first-class” and what “less-serious-class.”

Chris—the advantage of not coding myself is not reinventing the wheel and relying on maintained solutions that worry more about edge cases. But, yes, OLS is easy to code.

And thanks for help and pointers, everyone.

In early Julia there were no packages so this scientific functionality was built in. Later on, packages were added. Since these are autonomous from the rest of Julia’s Base library, they were moved from the standard Julia Base module to “the standard library”, and are simply packages built into the sysimage and loaded by default. However, that’s mostly an engineering problem than anything else right now. The reason why they weren’t moved completely out of the JuliaLang repo is because then they can’t be in the sysimage and that would cause a pretty big regression to startup time in a lot of cases, so that’s where we currently are.

I am sure that with improvements to precompilation we will see many of these leave, and many on that list are already going away. Pkg is going away for Pkg3 which is in a separate repo. FFTW already has moved to GitHub - JuliaMath/FFTW.jl: Julia bindings to the FFTW library for fast Fourier transforms. I’m sure than when it’s possible to do so, there will be a discussion about moving BLAS/LAPACK, LinAlg, and LibGit2. IIRC Iterators is in there because other standard library libraries needed it. The Julia of the future seems to be more of a lean Base installation with a “standard installation”. I among others will be happy to see a lean Julia be a great solution for things like embedded applications. Then things like JuliaPro are likely the first among many “enlarged standard installations”.

Of course anything could be replaced. Julia could be replaced in the longer run. But it, and DataFrames.jl, have a pretty active community. So I don’t see that happening any time soon.

That has nothing to do with blessings, that just is because nobody has a perfect solution right now. If there was a blessed package (which would probably be Plots.jl + GR), it would likely get replaced in a year or so down the line by Makie.jl which is too immature to bless (my speculation). So the stability has nothing to do with whether a package is “the chosen one”, it has to do with whether the architecture worked and whether it covers people’s needs.

Yes, I suggested that GLM compatibility with arrays should be added. Your code was a little nonsensical because you never passed any data to GLM. But actually, to me it was a surprising enough omission so I started snooping around, and sure enough:

julia> xs = randn(46, 3)
julia> ys = randn(46)
julia> GLM.lm(xs, ys)
Coefficients:
       Estimate Std.Error   t value Pr(>|t|)
x1     0.128357  0.158524  0.809701   0.4226
x2      0.17579  0.140364   1.25238   0.2172
x3   -0.0838361  0.140523 -0.596601   0.5539

That’s a documentation issue. It should really be added to the README.

3 Likes

Identifying good packages is a problem in many languages including Python, R, Stata, LaTeX, etc. That is part of the status of the project (WIP, Depreciated, Maintained, etc.). For Julia, a good rule of thumb is that those that are within an established organization are better maintained in general: JuliaData, JuliaStats, JuliaMath, JuliaComputing, but still it is not always the case nor universal among packages within an organization. Another good strategy is how credible and reliable the ecosystem has been. For example, in R, tidyverse is extremely strong (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats). In R, data.frame sucks! It used to be that the Base tabular data was almost unusable. Tidyverse changed that and data.table provided an efficient alternative (the base tabular data functionality has improved a lot in the latest R releases). Another example, is the parallel packages bundled in the Microsoft R Open distribution. Those packages are maintained by a Microsoft Team as well.

For Julian packages, (1) check the badges, (2) make sure the package has C.I. and coverage results, (3) good documentation, (4) when was the last release or commit history, (5) who are the maintainers and main contributors, etc.

3 Likes

This is the reason I think its a huge blessing dataframes never made it into base.

the problem with R is that they calcified. they did not deprecate packages over time (with a trailoff) in favor of better ones. so, now it is a frankensystem. one system that is core, and a better system that is not.

the problem with julia is that we put a lot of burden researching onto newbies that just want to learn julia—before you can learn how to analyze data sets or do graphics (while learning the language, which is itself overwhelming), we ask them to research which package they should choose from among many. or we hope that near-newbies, like myself, who are appointed as their stats/finance teachers, tell them what they should learn. so we start off with a frankensystem.

which is why I would like a (deprecatable) best-practice current facility that I could comfortably recommend to my students.

as bad as R’s dataframe may be, or as bad as it graphics may be, at least they come fully suported. when my students install R, it is in the system and good enough for a start.

But it doesn’t come with ODE solvers. I don’t think it does rootfinders or parallelism either? I don’t think it does GMRES either.

of course, nothing comes with everything. (yes, R does come with rootfinders, parallelism, and optimizers in the box.) fortunately, there is one julia, and not ten dialects of julia with newbies having to figure out which one to choose.

it is a matter of where to draw the line.

I just want julia to be a generally viable alternative in finance, economics, and statistics classes, where the students spend their 10 weeks learning julia, rather than 10 weeks researching package ecosystems for minimum functionality. maybe julia does not want to go there. this would be a pity.

Why don’t you just tell your students to use JuliaPro? It comes with DataFrames and GLM built in. There’s no need to research anything about the package ecosystem.

1 Like

and you draw the line at whatever is needed for

But that’s quite arbitrary. MATLAB and Octave come with pretty terrible statistics support but that’s fine. However, if they pulled out the ODE solvers people would rebel because that’s such core functionality. So why wouldn’t you say it’s a core functionality?

Because what’s core is personal, except for the extreme basics. Julia is trying to make Base be those extreme basics because otherwise everything is core to some segment of the population and you get bloat.

How do you handle it? You could just tell your class about the 5 or so packages to use. Or if it’s just stats, point them to JuliaStats. It really shouldn’t take more than 5 minutes to introduce the stats packages. If you want it pre-installed, as @aaowens suggests, just have them install JuliaPro which was created exactly for this purpose.

But you’re not going to convince anyone that the top 100 packages should go to the Julia Base repository, or even that the top 3 packages you care about should. Julia’s already been there, and what it does is the opposite of what you’re thinking. Packages in Julia Base cannot update regularly because they are tied to Julia releases. They are harder for contributors to jump into since there’s so much other code around them. They are harder to test because they are tested with the rest of Julia. In the end what it does is cause stagnation due to the inertia of larger repositories, while on its own DataFrames.jl is nimble and can release bugfixes almost instantly. Julia had a lot of stuff in Base and is getting leaner for this reason.

This is getting off-topic now and so it should continue in that other thread.

2 Likes

To get back to the original question: GLM is perfectly capable of fitting (generalized) linear models without dataframes/formulae:

using GLM

X = randn(40, 2)
β = [1., -3.]
y = X * β .+ randn(40)
fitted = lm(X,y)

# GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}}:
# 
# Coefficients:
#      Estimate Std.Error  t value Pr(>|t|)
# x1    1.22315  0.161856  7.55705    <1e-8
# x2   -2.90029  0.134309 -21.5942   <1e-22

you can see for yourself using methods:

julia> methods(lm)
# 2 methods for generic function "lm":
lm(X, y) in GLM at /home/dave/.julia/v0.6/GLM/src/lm.jl:150
lm(X, y, allowrankdeficient::Bool) in GLM at /home/dave/.julia/v0.6/GLM/src/lm.jl:150

The link between the “standard” (matrix/vector) and DataFrames/formula-based fitting is provided by StatsModels.jl. That used to be part of DataFrames itself but was split out with an eye towards making things less dependent on DataFrames.

Edit: I see that @ChrisRackauckas beat me to the punch above here.

4 Likes

The community is aware of these issues, but there is no solution other than waiting for some “standard” packages to emerge and stabilize. This is likely to take years; moreover, it really helps if the users are willing to get their hands dirty, learn about the internals of some packages they depend on, and submit PRs. Otherwise they need to be prepared to wait for someone to fix bugs (which usually happens quickly) and implement new features (usually takes longer if complex).

At the moment, this should be understood as a cost of using Julia. Users should think carefully about their goals, and weigh these costs against the benefits.

2 Likes

It feels to me that dataframes is inherent non-julian in a way. Cos the tables are not typed using the column types. This may cause many parts of the program to not dispatch on the most efficient methods