Problems with Lathe TrainTestSplit

I am attempting to use Lathe for linear regression, but unfortunately the documentation is unreachable (the webpage lathe.ai/docs/ returns an error). All the tutorials I have seen about it (e.g. here), as well as the documentation shown when it throws an error, say the same thing, but it doesn’t work, so I am confused. I have a DataFrame called df and do the following:

# Train test split
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df, 0.75)

This gives this error:

ERROR: UndefVarError: dfTrainTestSplit not defined

If I try specifying the "at" parameter, as hinted in the command-line documentation, then I get this error:

train, test = TrainTestSplit(df, at=0.75)
got unsupported keyword argument "at"

If I try anything else (like not specifying the percentage for instance) then I get the following error:

ERROR: MethodError: no method matching TrainTestSplit(::DataFrame)
Closest candidates are:
TrainTestSplit(::DataFrame, ::Float64)
TrainTestSplit(::Array, ::Float64)

So this would suggest that the original call above, specifying my data frame (df) and a percentage (0.75), is correct. I can see the data frame and can do boxplots on it, etc., so I know that is not the problem, but without documentation I am lost.

P.S. I am using Julia 1.4.0 and have tried both Lathe (0.1.1) and Lathe#Butterball (0.1.2), and I get the same result.


Hi @Jasper_Hall.
The most polished ML interface in Julia is MLJ.jl. I recommend you use it.
Their tutorials are a great place to start.
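For instance, the train/test split you were trying to do with Lathe looks roughly like this with MLJ's partition (a rough sketch; df and the 0.75 fraction are from your example, and the shuffle/rng arguments are just illustrative):

using MLJ, DataFrames

# split the row indices of df into 75% train / 25% test
train_rows, test_rows = partition(1:nrow(df), 0.75, shuffle=true, rng=123)

train = df[train_rows, :]
test  = df[test_rows, :]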


Thanks for the tip. There seem to be so many options for ML in Julia that it’s not always clear where to start. I will certainly give MLJ a look.


Here’s an example that does 3 things:

  1. splits the data into train/test samples
  2. trains a linear regression model (as you intended); out-of-sample score rmse=4.4242
  3. trains a LightGBM regression model; out-of-sample score rmse=2.7198
julia> using Pkg
julia> Pkg.add(["MLJ", "MLJLinearModels"])

julia> using MLJ;
julia> X, y = @load_boston;
julia> train, test = partition(eachindex(y), .7, rng=333);

julia> load("LinearRegressor", pkg="MLJLinearModels");
julia> m = LinearRegressor();
julia> mach = machine(m, X, y);
julia> fit!(mach, rows=train)
julia> fitted_params(mach)
julia> ŷ = predict(mach, rows=test)
julia> rms(ŷ, y[test])
4.424214263064273

julia> Pkg.add(["LightGBM"])
julia> load("LGBMRegressor", pkg="LightGBM");
julia> m= LGBMRegressor(num_iterations=100);
julia> mach = machine(m, X, y);
julia> fit!(mach, rows=train);
julia> #fitted_params(mach)  #long and messy
julia> yΜ‚= predict(mach, rows=test)
julia> rms(yΜ‚, y[test])
2.71982122922397

That’s very helpful, thanks. I have got this working on my data after untangling both the need to coerce Float64 data to the Continuous scitype and the fact that X can’t be an array but must be a table, while y, on the other hand, apparently can’t be a table. I’m sure the developers of MLJ have reasons for all of this, but it seems less than intuitive.
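For anyone hitting the same wall, roughly what I ended up doing looked like this (a sketch; the DataFrame and column names are made up):

using MLJ, DataFrames

# made-up example data; assume the target column is :price
df = DataFrame(area  = [50, 80, 120],
               rooms = [2.0, 3.0, 4.0],
               price = [100.0, 160.0, 250.0])

# coerce any non-Continuous numeric columns (e.g. integer/Count) to Continuous
df = coerce(df, Count => Continuous)

# X must be a table (here a DataFrame); y must be a plain vector
y, X = unpack(df, ==(:price), col -> true)

schema(X)   # check the scitypes of the features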

So far so good. I have the coefficients and the intercept, but I want the p-values and the t-statistics. These are available using Lathe but aren’t even mentioned in the MLJ tutorial. Is there some call I can make to get these?

  1. MLJ.jl is just an interface to various ML packages in the Julia ecosystem.
    Currently, the primary focus of MLJ.jl is fitting and predicting, not statistical inference.
    However, after a model is fit you can extract the fitted parameters using fitted_params(mach). See the updated code above.
  2. fit, predict, and fitted_params work for basically any regression model.
    Only a few models have t-stats, standard errors, p-values, etc.
    Currently MLJ.jl doesn’t have a standard method to extract these statistics.
    For this I would use GLM.jl directly (see the sketch after this list).
  3. I’ve seen discussion about MLJ.jl incorporating some of these stats, but I don’t think it’s a priority right now.
    @tlienart @ablaom @dilumaluthge
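Here is a minimal sketch of getting p-values straight from GLM.jl (the DataFrame and formula are made up):

using GLM, DataFrames

# made-up data; replace with your own DataFrame and formula
df = DataFrame(x1 = randn(100), x2 = randn(100))
df.y = 2.0 .* df.x1 .- 0.5 .* df.x2 .+ randn(100)

model = lm(@formula(y ~ x1 + x2), df)
coeftable(model)   # coefficients, std. errors, t-stats, p-values, confidence intervals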

I think incorporating statistics into MLJ.jl properly would take A LOT of work. For example, to correctly estimate standard errors the MLJ.jl user must make assumptions about the covariance matrix of the residuals.

Suppose there existed MLJ.standard_error(mach).
This wouldn’t make much sense for many models such as LightGBM (but let’s ignore this…).
Consider y = X\beta + \varepsilon with \mathrm{Cov}(\varepsilon) = \Omega.
MLJ.fitted_params(mach) would give the usual \hat{\beta}_{ols}.
What should MLJ.standard_error(mach) give?
If you wanna assume the usual \Omega = \sigma^2 I_n, then \hat{\Omega} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n-k} I_n and the standard errors are the square roots of the diagonal of \hat{\sigma}^2 (X'X)^{-1}.
But this is a heroic assumption (rarely used nowadays).
There is a giant literature on robust standard errors: Eicker-Huber-White, clustered (Liang-Zeger), spatially correlated, auto-correlated, etc.
There are a variety of packages: @gragusa’s CovarianceMatrices.jl, @matthieu’s Vcov.jl, and, sadly, I’m working on one too.
Hence, this would require fitting 1. the original model and 2. the standard error model (skedastic function).
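To make the difference concrete, here is a bare-bones sketch of the spherical vs. Eicker-Huber-White variance estimators (plain linear algebra on made-up data, not an existing MLJ or package API):

using LinearAlgebra

# made-up data: n observations, k regressors (first column is the intercept)
n, k = 200, 3
X = hcat(ones(n), randn(n, k - 1))
β = [1.0, 2.0, -0.5]
ε = randn(n) .* (1 .+ abs.(X[:, 2]))     # heteroskedastic errors
y = X * β + ε

β̂ = X \ y                    # OLS estimate
ε̂ = y - X * β̂                # residuals
XtXinv = inv(X' * X)

# spherical (classical) standard errors: assume Ω = σ²Iₙ
σ̂² = dot(ε̂, ε̂) / (n - k)
se_spherical = sqrt.(diag(σ̂² * XtXinv))

# Eicker-Huber-White (HC0) standard errors: allow arbitrary heteroskedasticity
meat = X' * Diagonal(ε̂ .^ 2) * X
se_ehw = sqrt.(diag(XtXinv * meat * XtXinv))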


@ablaom and @tlienart may be able to weigh in further


You can use MLJ to fit a GLM.jl model, and then you can get the p-values for that model via MLJ. I don’t remember the exact syntax; @ablaom will know.


Here is the GLM interface to MLJ: https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl
I cannot find p-values anywhere.
However, it does have:

glm_report(fitresult) = ( deviance     = GLM.deviance(fitresult),
                          dof_residual = GLM.dof_residual(fitresult),
                          stderror     = GLM.stderror(fitresult),
                          vcov         = GLM.vcov(fitresult) )

This is accessed via report(mach):

julia> report(mach)
(deviance = 8502.480903578471,
 dof_residual = 341.0,
 stderror = [0.036681269553499325, 0.01737166925431697, 0.07680917833638921, 4.81712249226012, 0.5341500317712217, 0.016323170996032966, 0.2541450592739458, 0.0841378254119214, 0.004680968995291761, 0.16998443628436907, 0.0033963935217423864, 0.0638108519702615, 6.360054287151586],
 vcov = [0.0013455155360564766 -6.088746040570593e-5 … -0.00039008348465676903 -0.008230320370078825; -6.088746040570593e-5 0.0003017748926813815 … -8.860618288494182e-5 -0.0014983861144144224; … ; -0.00039008348465676903 -8.860618288494182e-5 … 0.004071824829170625 -0.13449472260664377; -0.008230320370078825 -0.0014983861144144224 … -0.13449472260664377 40.450290535515265],)

This can be used to obtain t-stats, t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{\widehat{\mathrm{s.e.}}(\hat{\beta})}, and then p-values \Pr(|T| > |t|) under H_0.
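For example, a quick sketch of turning the report(mach) output into t-stats and p-values (the numbers below are made up; Distributions.jl supplies the t distribution):

using Distributions

# made-up inputs: coefficient estimates, their standard errors, and the
# residual degrees of freedom (e.g. dof_residual from report(mach))
coefs   = [-0.1257, 0.0711, -0.0095]
stderrs = [ 0.0367, 0.0174,  0.0768]
dof_res = 341.0

tstats  = coefs ./ stderrs                          # tests H₀: β = 0
pvalues = 2 .* ccdf.(TDist(dof_res), abs.(tstats))  # two-sided p-values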

A few observations:

  1. It doesn’t look like the current MLJ-GLM interface exposes p-values directly.
  2. The glm_report standard errors currently assume the covariance matrix of the residuals is spherical, \Omega = \sigma^2 I_n.
    If you’re only interested in out-of-sample prediction, you don’t care about this.
    If you are doing statistical inference and want to submit a paper to a science journal, you need more flexibility.
    I’ve never seen a recent finance/econ paper that assumes \Omega = \sigma^2 I_n.
  3. The only honest solution is for MLJ.jl to add a fit_skedastic!(mach, type) function, where type is the type of standard error the user wants to compute, depending on her assumptions about the unknown \Omega.

It could work as follows:

# step 1: fit a linear model with your favorite MLJ linear regression
julia> load("LinearRegressor", pkg="MLJLinearModels");
julia> m = LinearRegressor();
julia> mach = machine(m, X, y);
julia> fit!(mach)

# step 2: use residuals from the fitted model to estimate the skedastic function
julia> SE1 = fit_skedastic(mach, spherical)             # usual naive standard errors
julia> SE2 = fit_skedastic(mach, EHW)                   # Eicker-Huber-White standard errors
julia> SE3 = fit_skedastic(mach, cluster(state, year))  # SEs clustered by state-year

MLJ.jl can actually be a great place for this because recently there has been a lot of interest in using ML methods to estimate skedastic functions for better statistical inference.

This could be super helpful to the research community (myself included) because currently computing multiple standard errors is a royal pain. Journals often want to see robust standard errors computed in different ways under various assumptions about the unknown data-generating process. In a recent paper I computed standard errors using 11 different methods.


What happens if you fit a GLM.jl model with MLJ, and then you print the fitresult directly?

I.e.

fit!(mach)
@show mach.fitresult

This should show the p-values and standard errors reported by GLM.jl.


Yep, it spits out the same output as GLM.jl, but it assumes spherical errors and doesn’t allow the user to fit a custom skedastic function:

julia> using MLJ;
julia> X, y = @load_boston;
julia> train, test = partition(eachindex(y), .7, rng=333);

julia> load("LinearRegressor", pkg="GLM");
[ Info: Loading into module "Main":
import MLJModels βœ”
import GLM βœ”
import MLJModels.GLM_ βœ”

julia> m= LinearRegressor();

julia> mach = machine(m, X, y);

julia> fit!(mach, rows=train)
[ Info: Training Machine{LinearRegressor} @164.
Machine{LinearRegressor} @164 trained 1 time.
  args:
    1:  Source @649 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:  Source @820 ⏎ `AbstractArray{Continuous,1}`


julia> @show mach.fitresult
mach.fitresult = GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────────────
            Coef.  Std. Error      z  Pr(>|z|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────
x1    -0.125665    0.0366813   -3.43    0.0006   -0.197559    -0.0537714
x2     0.0711207   0.0173717    4.09    <1e-4     0.0370729    0.105169
x3    -0.0094743   0.0768092   -0.12    0.9018   -0.160018     0.141069
x4   -19.0626      4.81712     -3.96    <1e-4   -28.504       -9.62125
x5     3.27676     0.53415      6.13    <1e-9     2.22984      4.32367
x6     0.00663543  0.0163232    0.41    0.6844   -0.0253574    0.0386283
x7    -1.72403     0.254145    -6.78    <1e-10   -2.22215     -1.22592
x8     0.40523     0.0841378    4.82    <1e-5     0.240323     0.570137
x9    -0.0163271   0.00468097  -3.49    0.0005   -0.0255016   -0.00715256
x10   -0.929381    0.169984    -5.47    <1e-7    -1.26254     -0.596217
x11    0.0091109   0.00339639   2.68    0.0073    0.00245409   0.0157677
x12   -0.525458    0.0638109   -8.23    <1e-15   -0.650525    -0.400391
x13   41.6909      6.36005      6.56    <1e-10   29.2254      54.1563
─────────────────────────────────────────────────────────────────────────



A lot of great discussion here. Following the hints contained in it, I switched to GLM. GLM appears to be easier to use than MLJ (at least for this particular use case) and has done everything I need.

Many thanks for all the great input.
