MLJ probabilistic models basics

I am new to MLJ and trying to do a linear regression:

using MLJ

nexamples = 10
nfeatures = 3

# synthetic data from a noisy linear model
X = randn(nexamples, nfeatures)

A = randn(nfeatures, 1)
b = randn()
noise = 0.1*randn(nexamples)
y = vec(X * A .+ b) + noise
X = MLJ.table(X)  # MLJ expects tabular features

model = @load LinearRegressor pkg=GLM

mach = machine(model, X, y)
fit!(mach)
y_hat = predict(mach)  # predictions on the training data

This gives me a vector of probability distributions, which is really cool.

5-element Array{Distributions.Normal{Float64},1}:
 Distributions.Normal{Float64}(μ=-5.217306370475575, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.6777270495732746, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.4193911630828542, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-0.05493986909253701, σ=0.04688655865968578)
 Distributions.Normal{Float64}(μ=-1.7665791465850154, σ=0.04688655865968578)
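Each element is an ordinary Distributions.jl object, so I can already query it directly (a minimal sketch):

using Distributions

logpdf.(y_hat, y)        # log-density of each observed target under its prediction
quantile.(y_hat, 0.95)   # e.g. pointwise 95% quantiles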

However I would like to evaluate the model and also compare it against deterministic models.

How do I do that? One idea would be to make the model deterministic, e.g. by taking the mean of each predicted distribution. How would I code that?

You can use predict_mean (it also works in pipelines).
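For instance, continuing from the machine fitted above (a minimal sketch):

using Statistics   # for mean

y_point = predict_mean(mach)    # deterministic point predictions
# equivalently, take the mean of each predicted distribution:
y_point ≈ mean.(predict(mach))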


@jw3126 @tlienart
This looks really cool.
In the example above only the conditional expectation E[Y|X] changes with X.
This assumes:

  1. homoskedasticity: Var[Y|X] is constant.
  2. the conditional distribution Y|X is normal (due to GLM).

In reality that’s often not the case.
I tried this with the Boston housing data & got a constant skedastic function (the same σ for every prediction):

X, y = @load_boston;
model = @load LinearRegressor pkg=GLM
mach = machine(model, X, y)
fit!(mach)
y_hat = predict(mach)

Distributions.Normal{Float64}(μ=30.212372064783466, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=25.267233800068816, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=30.849358585028362, σ=4.787101274343532)

Assuming you have enough data (you’re gonna need a lot), is it possible to use MLJ to get heteroskedastic predictions?
This is particularly useful in finance/insurance, where users care a lot more about σ than μ.

Is it possible to get predictions which are not always normal? Either the best parametric fit (from some set of models), or a nonparametric one?
(I realize we’re prob gonna have to depart from linear models…)

For example suppose X is 1-dimensional:

For X ∈ (1.0,1.05), Y|X ∼ Distributions.Normal{Float64}(μ=3.3, σ=4.8) 
For X ∈ (2.2,2.35), Y|X ∼ Distributions.Normal{Float64}(μ=2.3, σ=1.6) 
For X ∈ (2.8,3.10), Y|X ∼ Distributions.LogNormal{Float64}(μ=5.0, σ=3.6) 
For X ∈ (3.5,3.72), Y|X ∼ NonParam{Float64}(μ̂=4.0, σ̂=3.3)

In grad school it is very routine to estimate Var[b̂|X] under heteroskedasticity using robust Eicker-Huber-White (EHW) standard errors, clustered SEs, or FGLS with ML methods; then we can infer E[Y|X]. But only for linear models.
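For concreteness, a minimal sketch of the EHW (HC0) sandwich estimator for plain OLS (ehw_vcov is just a name I made up):

using LinearAlgebra

# HC0 sandwich: Var(b̂) = (X'X)⁻¹ X' diag(ê²) X (X'X)⁻¹
function ehw_vcov(X, y)
    b = X \ y                         # OLS coefficients
    e = y - X * b                     # residuals
    bread = inv(X' * X)
    meat  = X' * Diagonal(e .^ 2) * X
    bread * meat * bread
end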

Is there a literature on this I’m not aware of? @Tamas_Papp @fipelle

I’d ultimately like to produce something like @mthelm85’s elegant figure.

Update 1: @fipelle suggests Adrian et al. (2019) (code in Matlab).
Their Figure 1: one-year-ahead predictive distribution of real GDP growth, based on quantile regressions with current real GDP growth and NFCI as conditioning variables.


A shortcoming of this method is that it only produces unimodal predictions…
Technically this is doable in MLJ, which has options for quantile regression; see the sketch below.
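For example, a minimal sketch using the QuantileRegressor from MLJLinearModels, reusing X, y from the Boston example above (I believe delta is the quantile-level keyword there, but check the docs):

using MLJ

@load QuantileRegressor pkg=MLJLinearModels

# one machine per quantile level
quantiles = [0.05, 0.25, 0.5, 0.75, 0.95]
machs = [machine(QuantileRegressor(delta=q), X, y) for q in quantiles]
foreach(fit!, machs)

# n × 5 matrix of estimated conditional quantiles
qpreds = hcat([predict(m) for m in machs]...)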

In addition, the authors fit a skewed t-distribution (Azzalini & Capitanio 2003) to smooth the quantile function and recover a probability density function.
Azzalini has an R package for the skew-normal/skew-t distribution.
The skewed-t is currently not in Distributions.jl (@simonbyrne @johnmyleswhite @andreasnoack).
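Rolling the density by hand is easy enough, though. A minimal sketch of Azzalini’s skew-normal pdf (the skew-t version swaps the normal kernel for a Student-t; skewnormal_pdf is just a name I picked):

using Distributions

# Azzalini skew-normal: f(x) = (2/ω) φ(z) Φ(αz), with z = (x - ξ)/ω
# ξ = location, ω = scale, α = slant (α = 0 recovers Normal(ξ, ω))
function skewnormal_pdf(x; ξ=0.0, ω=1.0, α=0.0)
    z = (x - ξ) / ω
    2/ω * pdf(Normal(), z) * cdf(Normal(), α*z)
end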

Update 2: it looks like @oxinabox’s DensityEstimationML.jl could be helpful as well, but I don’t see any concise examples right now.

Update 3: an emerging literature extends boosting to probabilistic forecasting:
XGBoostLSS, CatBoostLSS, Gamlss, GamboostLSS, bamlss, disttree, ngboost.
The slides, docs & guide are beautiful.

XGBoostLSS Paper:
“The ultimate goal of regression analysis is to obtain information about the [entire] conditional distribution of a response given a set of explanatory variables.” (Hothorn et al., 2014, emphasis added)

Julia can really shine here!


Recently, I have seen a lot of empirical work (in macroeconomics and macro-finance) on quantile regressions (for example: Adrian, Boyarchenko and Giannone, 2019). However, I am not exactly sure from your post if you are more interested in estimating the conditional mean or variance. If you can clarify further on that I can try to be more helpful.

I’d like to estimate the entire conditional distribution when possible.
If not, at least a heteroskedastic normal; see the two-stage sketch below.
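For that fallback, one classic trick is a two-stage fit: regress y on X for the mean, then regress the log absolute residuals on X for a conditional scale. A minimal sketch with X as a plain feature matrix (the exp step recovers σ only up to a multiplicative constant, which a careful version would correct):

using Distributions

# stage 1: conditional mean via OLS
b = X \ y
μ̂ = X * b

# stage 2: conditional scale via OLS on log absolute residuals
c = X \ log.(abs.(y .- μ̂) .+ eps())
σ̂ = exp.(X * c)

# heteroskedastic normal predictive distributions
y_hat = Normal.(μ̂, σ̂)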

Thanks for the link.
Looks like they use a skewed t-distribution developed by Azzalini and Capitanio (2003), coded in Matlab.

https://www.openicpsr.org/openicpsr/project/113169/version/V1/view?path=/openicpsr/113169/fcr:versions/V1/azzalini&type=folder

I’ve never seen this.

I would go with a parametrized Bayesian model (then model checking, and extending as necessary, repeated until satisfied), but that has nothing to do with MLJ so that would be off-topic here. Also, it is something that you can easily do with existing MCMC tools in Julia.
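For concreteness, here is a minimal sketch of such a model in Turing.jl (one of those MCMC tools): a normal likelihood where both the mean and the log scale are linear in a feature matrix X. The priors are purely illustrative:

using Turing

@model function hetero_reg(X, y)
    p = size(X, 2)
    a ~ filldist(Normal(0, 5), p)   # mean coefficients
    b ~ Normal(0, 5)                # mean intercept
    c ~ filldist(Normal(0, 2), p)   # log-scale coefficients
    d ~ Normal(0, 2)                # log-scale intercept
    μ = X * a .+ b
    σ = exp.(X * c .+ d)            # x-dependent standard deviation
    for i in eachindex(y)
        y[i] ~ Normal(μ[i], σ[i])
    end
end

chain = sample(hetero_reg(X, y), NUTS(), 1_000)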

Thanks! I tried the following, but it just makes Julia hang without output:

using Pkg
pkg"""
up
add MLJ MLJModels GLM MLJBase MLJModelInterface
"""

using MLJ

nfeatures = 3
nexamples = 10
X = randn(nexamples, nfeatures)
A = randn(nfeatures, 1)
noise = randn(nexamples)
y = vec(X * A) + noise
X = MLJ.table(X)


Xs = source(X)
ys = source(y, kind=:target)

pmodel = @load LinearRegressor pkg=GLM
pmach = machine(pmodel, Xs, ys)
y_hat = predict_mean(pmach)
model = @from_network Det(pmodel=pmodel) <= y_hat

evaluate(model, X, y, measure=rms)

You forgot a fit! in the mix; how about:

using MLJ

n, p = 10, 3

X = randn(n, p)
A = randn(p)
y = X*A + randn(n)

Xt = MLJ.table(X)

@load LinearRegressor pkg=GLM

pmach = machine(LinearRegressor(), Xt, y) 
fit!(pmach)

y_hat = predict_mean(pmach)

Thanks, that does work. However it is not quite what I am looking for. Really I would like to build a first-class deterministic model out of the probabilistic one, for instance so that I can evaluate it and compare it against other deterministic models using the rms measure.
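For the evaluation part at least, there is a lighter route: a minimal sketch, assuming an MLJ version whose evaluate accepts the operation keyword (recent releases do):

# evaluate the probabilistic model with a deterministic measure
# by choosing which prediction operation to use
evaluate(LinearRegressor(), Xt, y,
         resampling=CV(nfolds=5),
         measure=rms,
         operation=predict_mean)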

That was a research project that was never completed.
I think the math in there all holds and may actually be at least partially novel
(though one large bit of it, I found out after writing 80% of the paper, is definitely not novel).

I have now archived the repo.
I don’t know how much of the stuff that ended up in it actually worked or was usable.