Regression and goodness-of-fit?

Basic, technical questions on “goodness-of-fit” for regression models…

Using N data points, suppose I fit a regression model with n_\beta parameters \beta to get a predictor \hat{y}_i = \hat{\beta}\phi_i, with prediction error e_i = y_i - \hat{y}_i. The standard deviation \sigma_e is then given by:

\sigma_e^2 = \frac{1}{N} \sum^N_{i=1} e_i^2

and a corrected standard deviation s_e is given by

s_e^2 = \frac{1}{N-n_\beta} \sum_{i=1}^N e_i^2

where essentially the subtraction of n_\beta corrects for the fact that parameters \hat{\beta} have been estimated using the same data as used for computing the standard deviation. [Trivial example… \hat{y}_i = \bar{y} where n_\beta = 1.]

Two questions:

  1. Suppose instead that the n_\beta parameters have been computed from training data, while I want to compute the standard deviation over validation data which are different from the training data. I’d like to use the standard deviation as a measure of “goodness-of-fit”.
  • Since I didn’t use validation data to compute \hat{\beta}, would it be correct to compute the standard deviation over the validation data using the expression for \sigma_e? Or should I use the corrected expression, i.e., s_e?
  1. Is the correction by dividing with N-n_\beta based on an assumption of a linear regression model, or would the same idea be valid for nonlinear regression methods such as ANNs?

Sorry for bothering you with such trivial questions – I’m trying to convince some colleagues to take a look at Julia, and plan to use basic regression as a case study for them. (I’d like to understand what is going on in packages before I use them…)

If this is a Julia question, it’s not very clear what the question is :slight_smile: if it’s a stats question, maybe that is a better location to ask such questions. If you have specific questions about a given package like GLM then please link to the part of the code that you don’t understand, it’ll help people answer your questions more efficiently.

A short answer to your first question is that you can use whatever you want for goodness of fit, but what you use will have different guarantees/properties (and to take an extreme, if you use - say - “0” it won’t have any). But there’s no “correct” statistic, there are just common ones which enjoy nice properties like unbiasedness, CLTs, … For (2) the same answer could be given but maybe more pragmatically I don’t think I’ve ever seen people use this for ANNs (and I don’t think it’s appropriate). It seems to me you’re mixing some notions of classical statistics like properties of M estimators with some of machine learning like validation.


I apologize for being vague in my question… Essentially, what I’m after is the following. Suppose I have a number of candidate regression models, and I want to find which one is best. The models could be generated by ANNs (e.g., a single hidden layer with varying number of nodes), PCR, PLS, GLM, etc.

So the Julia relevance could be: is there a Julia package that takes a heterogeneous collection of models and helps the user determine which model is best? I realize the answer will depend on the data, etc., and that there may not be a unique answer.

And: if there is no such package, what would be a useful measure of “goodness-of-fit”.

What I seem to recall (I don’t have access to my private science book library at the moment due to office reconstruction…) is that if I train my models on all available data, then the Akaike Information Criterion or similar (BIC, etc.) should be used to choose model. However, in the field of chemometrics, it is common to split the data between a training set (which is used to find parameters of candidate models) and a validation set (used to choose the best model).

Some of the books in the field discuss measures for “goodness-of-fit” such as RMSE (which essentially is the standard deviation of the prediction error) and R2 (which has a related interpretation). These books also suggest “corrected” measures, e.g., SEE, where the standard deviation of the prediction error is “corrected” to take into account the reduced degree of freedom caused by estimating the model parameters.

It is possible that these notions from classical statistics are outdated, but they are used in a lot of the literature. I’m kind of puzzled why the SEE is sometimes used for goodness-of-fit based on validation data, when the model parameters have not be computed using validation data.

Anyway, if there is a package for comparing models, that would be super.

As a statistician who uses Julia I will comment, perhaps at too great a length, on the second question.

A linear regression model is closely linked to the Euclidean geometry of the N-dimensional response space, because the assumptions on the distribution of the response - independent, constant variance, multivariate Gaussian distribution with mean Xβ - implies that contours of constant probability are spheres centered at Xβ. In this space the set of possible fitted values, which is the column span of X, has dimension, or “degrees of freedom”, rank(X). Evaluation of the fitted values and the residuals consists of orthogonal projection into the column span of X and orthogonal to the column span. Thus the residuals are in a linear subspace of dimension N - rank(X).

So defining the mean squared residual with the denominator N-rank(X) applies to linear models exactly and only approximately to nonlinear models. Often this value is justified by saying that this gives an unbiased estimate of the variance in the probability model but that explanation assumes that you would want an unbiased estimate of a variance. If you think about it a bit, the distribution of the variance estimator is highly skewed and the bias relates to the mean or expected value of the estimator. Usually in an intro stats course we explain that the mean is not a good measure of a “central” value in a skewed distribution but somehow this point gets lost when talking about “unbiasedness”.

So the summary is that N-rank(X) is “exact” for linear regression models and only approximate for others. It is also somewhat irrelevant when N is large, as you would expect to have for models like ANN. Many people get themselves caught up in evaluating degrees of freedom for t or F distributions in hypothesis tests and it is just not important for values over one hundred or so.


For the model shootout, you could use MLJ however it’s probably not going to be super simple because here you will potentially want your own custom models. But, to simplify, let’s say you only consider models that are already provided like, a bunch of linear regression models, a decision tree or whatever. Then MLJ will make it relatively easy for you to compare a score you get on a some validation set. In the ML world this is a simple yet efficient and fairly standard way of doing this. (You could also just use ScikitLearn.jl to the same effect if you’re already familiar with it or with the python version)

For the stats front, the BIC and AIC are probably more used to penalise a single model complexity (eg encourage reducing the number of features used in a simple linear model while preserving some measure of quality) I think it can also be used to compare across model types but, personally, I’ve not seen folks seriously comparing BIC for an ANN vs a linear model.


By the way there’s a similar topic being discussed in a separate thread with pretty good answers by @rikh and others Bayesian Model Selection - #2 by rikh

1 Like

This really depends on your application (ie what aspect of the data you care about). See eg the references in

1 Like

I’m not well versed in statistics… but " the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values"… sounds like computing the “goodness-of-fit” from the training data while also taking into account the distribution in estimated parameters. But I may be wrong.

Anyway, thanks for references!

It can be thought of as related, but the important property of Bayesian posterior predictive checks from the point of view of this topic is that they impose a metric (you evaluate whatever is important for your purposes) and at the same time put a posterior distribution on it, so you can think about the data being far from or close to the model.

For example, suppose that a black box statistic tells you that your goodness of fit is 3.2. Is that large or small? You won’t know without context.

1 Like