MLJ probabilistic models basics

I am new to MLJ and trying to do a linear regression:

using MLJ

nexamples = 10
nfeatures = 3

# synthetic data from a noisy linear model
X = randn(nexamples, nfeatures)

A = randn(nfeatures, 1)
b = randn()
noise = 0.1*randn(nexamples)
y = vec(X * A .+ b) + noise
X = MLJ.table(X)  # MLJ expects tabular features

model = @load LinearRegressor pkg=GLM

mach = machine(model, X, y)
fit!(mach)
y_hat = predict(mach)  # predictions on the training data

This gives me a vector of probability distributions, which is really cool.

5-element Array{Distributions.Normal{Float64},1}:
 Distributions.Normal{Float64}(μ=-5.217306370475575, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.6777270495732746, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.4193911630828542, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-0.05493986909253701, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.7665791465850154, σ=0.04688655865968578)
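Each element is an ordinary Distributions.jl object, so I can already query it directly (a minimal sketch):

using Distributions

logpdf.(y_hat, y)        # log-density of each observed target under its prediction
quantile.(y_hat, 0.95)   # e.g. pointwise 95% quantiles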

However I would like to evaluate the model and also compare it against deterministic models.

How do I do that? One idea would be to make the model deterministic, e.g. by taking the mean of each predicted distribution. How would I code that?

You can use predict_mean (it also works in pipelines).
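For instance, continuing from the machine fitted above (a minimal sketch):

using Statistics   # for mean

y_point = predict_mean(mach)    # deterministic point predictions
# equivalently, take the mean of each predicted distribution:
y_point ≈ mean.(predict(mach))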


@jw3126 @tlienart
This looks really cool.
In the example above only the conditional expectation E[Y|X] changes with X.
This assumes:

  1. homoskedasticity: Var[Y|X] is constant.
  2. the conditional distribution Y|X is normal (due to GLM).

In reality that’s often not the case.
I tried this with the Boston housing data & got a constant skedastic function (the same σ for every prediction):

X, y = @load_boston;
model = @load LinearRegressor pkg=GLM
mach = machine(model, X, y)
fit!(mach)
y_hat = predict(mach)

Distributions.Normal{Float64}(μ=30.212372064783466, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=25.267233800068816, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=30.849358585028362, σ=4.787101274343532)

Assuming you have enough data (you’re gonna need a lot), is it possible to use MLJ to get heteroskedastic predictions?
This is particularly useful in finance/insurance, where users care a lot more about σ than μ.

Is it possible to get predictions which are not always normal? Either the best parametric fit (from some set of models), or a nonparametric one?
(I realize we’re prob gonna have to depart from linear models…)

For example suppose X is 1-dimensional:

For X ∈ (1.0,1.05), Y|X ∼ Distributions.Normal{Float64}(μ=3.3, σ=4.8) 
For X ∈ (2.2,2.35), Y|X ∼ Distributions.Normal{Float64}(μ=2.3, σ=1.6) 
For X ∈ (2.8,3.10), Y|X ∼ Distributions.LogNormal{Float64}(μ=5.0, σ=3.6) 
For X ∈ (3.5,3.72), Y|X ∼ NonParam{Float64}(μ̂=4.0, σ̂=3.3)

In grad school it is very routine to estimate Var[b̂|X] under heteroskedasticity using robust Eicker-Huber-White (EHW) standard errors, clustered SEs, or FGLS with ML methods; then we can infer E[Y|X]. But only for linear models.
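For concreteness, a minimal sketch of the EHW (HC0) sandwich estimator for plain OLS (ehw_vcov is just a name I made up):

using LinearAlgebra

# HC0 sandwich: Var(b̂) = (X'X)⁻¹ X' diag(ê²) X (X'X)⁻¹
function ehw_vcov(X, y)
    b = X \ y                         # OLS coefficients
    e = y - X * b                     # residuals
    bread = inv(X' * X)
    meat  = X' * Diagonal(e .^ 2) * X
    bread * meat * bread
end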

Is there a literature on this I’m not aware of? @Tamas_Papp @fipelle

I’d ultimately like to produce something like @mthelm85’s elegant figure.

Update 1: @fipelle suggests Adrian et al. (2019) (code in Matlab).
Their Figure 1: one-year-ahead predictive distribution of real GDP growth, based on quantile regressions with current real GDP growth and NFCI as conditioning variables.


A shortcoming of this method is that it only produces unimodal predictions…
Technically this is doable in MLJ, which has options for quantile regression; see the sketch below.
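For example, a minimal sketch using the QuantileRegressor from MLJLinearModels, reusing X, y from the Boston example above (I believe delta is the quantile-level keyword there, but check the docs):

using MLJ

@load QuantileRegressor pkg=MLJLinearModels

# one machine per quantile level
quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
machs = [machine(QuantileRegressor(delta=q), X, y) for q in quantiles]
foreach(fit!, machs)

# n × 5 matrix of estimated conditional quantiles
qpreds = hcat([predict(m) for m in machs]...)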

In addition, the authors fit a skewed t-distribution (Azzalini & Capitanio 2003) to smooth the quantile function and recover a probability density function.
Azzalini has an R package for the skew-normal/skew-t distribution.
The skewed-t is currently not in Distributions.jl (@simonbyrne @johnmyleswhite @andreasnoack).
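Rolling the density by hand is easy enough, though. A minimal sketch of Azzalini’s skew-normal pdf (the skew-t version swaps the normal kernel for a Student-t; skewnormal_pdf is just a name I picked):

using Distributions

# Azzalini skew-normal: f(x) = (2/ω) φ(z) Φ(αz), with z = (x - ξ)/ω
# ξ = location, ω = scale, α = slant (α = 0 recovers Normal(ξ, ω))
function skewnormal_pdf(x; ξ=0.0, ω=1.0, α=0.0)
    z = (x - ξ) / ω
    2/ω * pdf(Normal(), z) * cdf(Normal(), α*z)
end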

Update 2: it looks like @oxinabox’s DensityEstimationML.jl could be helpful as well, but I don’t see any concise examples right now.

Update 3: an emerging literature extends boosting to probabilistic forecasting:
XGBoostLSS, CatBoostLSS, Gamlss, GamboostLSS, bamlss, disttree, ngboost.
The slides, docs & guide are beautiful.

XGBoostLSS Paper:
“The ultimate goal of regression analysis is to obtain information about the [entire] conditional distribution of a response given a set of explanatory variables.” (Hothorn et al., 2014, emphasis added)

Julia can really shine here!


Recently, I have seen a lot of empirical work (in macroeconomics and macro-finance) on quantile regressions (for example: Adrian, Boyarchenko and Giannone, 2019). However, I am not exactly sure from your post if you are more interested in estimating the conditional mean or variance. If you can clarify further on that I can try to be more helpful.

I’d like to estimate the entire conditional distribution when possible.
If not, at least a heteroskedastic normal; see the two-stage sketch below.
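For that fallback, one classic trick is a two-stage fit: regress y on X for the mean, then regress the log absolute residuals on X for a conditional scale. A minimal sketch with X as a plain feature matrix (the exp step recovers σ only up to a multiplicative constant, which a careful version would correct):

using Distributions

# stage 1: conditional mean via OLS
b = X \ y
μ̂ = X * b

# stage 2: conditional scale via OLS on log absolute residuals
c = X \ log.(abs.(y .- μ̂) .+ eps())
σ̂ = exp.(X * c)

# heteroskedastic normal predictive distributions
y_hat = Normal.(μ̂, σ̂)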

Thanks for the link.
Looks like they use a skewed t-distribution developed by Azzalini and Capitanio (2003), coded in Matlab.

https://www.openicpsr.org/openicpsr/project/113169/version/V1/view?path=/openicpsr/113169/fcr:versions/V1/azzalini&type=folder

I’ve never seen this.

I would go with a parametrized Bayesian model (then model checking, and extending as necessary, repeated until satisfied), but that has nothing to do with MLJ so that would be off-topic here. Also, it is something that you can easily do with existing MCMC tools in Julia.
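For concreteness, here is a minimal sketch of such a model in Turing.jl (one of those MCMC tools): a normal likelihood where both the mean and the log scale are linear in a feature matrix X. The priors are purely illustrative:

using Turing

@model function hetero_reg(X, y)
    p = size(X, 2)
    a ~ filldist(Normal(0, 5), p)   # mean coefficients
    b ~ Normal(0, 5)                # mean intercept
    c ~ filldist(Normal(0, 2), p)   # log-scale coefficients
    d ~ Normal(0, 2)                # log-scale intercept
    μ = X * a .+ b
    σ = exp.(X * c .+ d)            # x-dependent standard deviation
    for i in eachindex(y)
        y[i] ~ Normal(μ[i], σ[i])
    end
end

chain = sample(hetero_reg(X, y), NUTS(), 1_000)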

Thanks! I tried the following, but it just makes Julia hang without output:

using Pkg
pkg"""
up
add MLJ MLJModels GLM MLJBase MLJModelInterface
"""

using MLJ

nfeatures = 3
nexamples = 10
X = randn(nexamples, nfeatures)
A = randn(nfeatures, 1)
noise = randn(nexamples)
y = vec(X * A) + noise
X = MLJ.table(X)


Xs = source(X)
ys = source(y, kind=:target)

pmodel = @load LinearRegressor pkg=GLM
pmach = machine(pmodel, Xs, ys)
y_hat = predict_mean(pmach)
model = @from_network Det(pmodel=pmodel) <= y_hat

evaluate(model, X, y, measure=rms)

You forgot a fit! in the mix; how about:

using MLJ

n, p = 10, 3

X = randn(n, p)
A = randn(p)
y = X*A + randn(n)

Xt = MLJ.table(X)

@load LinearRegressor pkg=GLM

pmach = machine(LinearRegressor(), Xt, y) 
fit!(pmach)

y_hat = predict_mean(pmach)

Thanks, that does work. However it is not quite what I am looking for. Really I would like to build a first-class deterministic model out of the probabilistic one, for instance so that I can evaluate it and compare it against other deterministic models using the rms measure.
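For the evaluation part at least, there is a lighter route: a minimal sketch, assuming an MLJ version whose evaluate accepts the operation keyword (recent releases do):

# evaluate the probabilistic model with a deterministic measure
# by choosing which prediction operation to use
evaluate(LinearRegressor(), Xt, y,
         resampling=CV(nfolds=5),
         measure=rms,
         operation=predict_mean)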

That was a research project that was never completed.
I think the math in there all holds and may actually be at least partially novel
(though one large bit of it, I found out after writing 80% of the paper, is definitely not novel).

I have now archived the repo.
I don’t know how much of the stuff that ended up in it actually worked or was usable.