Problems with Lathe TrainTestSplit

I am attempting to use Lathe for linear regression, but unfortunately the documentation is unreachable (the webpage lathe.ai/docs/ returns an error). All the tutorials I have seen about it (e.g. here), as well as the documentation shown when it throws an error, say the same thing, but it doesn’t work, so I am confused. I have a DataFrame called df and do the following:

# Train test split
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df, 0.75)

This gives this error:

ERROR: UndefVarError: dfTrainTestSplit not defined

If I try specifying the "at" parameter, as hinted in the command-line documentation, then I get this error:

train, test = TrainTestSplit(df, at=0.75)
got unsupported keyword argument "at"

If I try anything else (like not specifying the percentage for instance) then I get the following error:

ERROR: MethodError: no method matching TrainTestSplit(::DataFrame)
Closest candidates are:
TrainTestSplit(::DataFrame, ::Float64)
TrainTestSplit(::Array, ::Float64)

So this would suggest that the original call above, specifying my data frame (df) and a percentage (0.75), is correct. I can see the data frame and can do boxplots on it, etc., so I know that is not the problem, but without documentation I am lost.

P.S. I am using Julia 1.4.0 and have tried both Lathe (0.1.1) and Lathe#Butterball (0.1.2), and I get the same result.


Hi @Jasper_Hall.
The most polished ML interface in Julia is MLJ.jl. I recommend you use it.
Their tutorials are a great place to start.
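For instance, the train/test split you were trying to do with Lathe looks roughly like this with MLJ's partition (a rough sketch; df and the 0.75 fraction are from your example, and the shuffle/rng arguments are just illustrative):

using MLJ, DataFrames

# split the row indices of df into 75% train / 25% test
train_rows, test_rows = partition(1:nrow(df), 0.75, shuffle=true, rng=123)

train = df[train_rows, :]
test  = df[test_rows, :]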


Thanks for the tip. There seem to be so many options for ML in Julia that it’s not always clear where to start. I will certainly give MLJ a look.


Here’s an example that does 3 things:

  1. splits the data into train/test samples
  2. trains a linear regression model (as you intended); out-of-sample score rmse=4.4242
  3. trains a LightGBM regression model; out-of-sample score rmse=2.7198
julia> using Pkg
julia> Pkg.add(["MLJ", "MLJLinearModels"])

julia> using MLJ;
julia> X, y = @load_boston;
julia> train, test = partition(eachindex(y), .7, rng=333);

julia> load("LinearRegressor", pkg="MLJLinearModels");
julia> m = LinearRegressor();
julia> mach = machine(m, X, y);
julia> fit!(mach, rows=train)
julia> fitted_params(mach)
julia> ŷ = predict(mach, rows=test)
julia> rms(ŷ, y[test])
4.424214263064273

julia> Pkg.add(["LightGBM"])
julia> load("LGBMRegressor", pkg="LightGBM");
julia> m= LGBMRegressor(num_iterations=100);
julia> mach = machine(m, X, y);
julia> fit!(mach, rows=train);
julia> #fitted_params(mach)  #long and messy
julia> yΜ‚= predict(mach, rows=test)
julia> rms(yΜ‚, y[test])
2.71982122922397

That’s very helpful, thanks. I have got this working on my data after untangling both the need to coerce Float64 data to the Continuous scitype and the fact that X can’t be an array but must be a table, while y, on the other hand, apparently can’t be a table. I’m sure the developers of MLJ have reasons for all of this, but it seems less than intuitive.
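For anyone hitting the same wall, roughly what I ended up doing looked like this (a sketch; the DataFrame and column names are made up):

using MLJ, DataFrames

# made-up example data; assume the target column is :price
df = DataFrame(area  = [50, 80, 120],
               rooms = [2.0, 3.0, 4.0],
               price = [100.0, 160.0, 250.0])

# coerce any non-Continuous numeric columns (e.g. integer/Count) to Continuous
df = coerce(df, Count => Continuous)

# X must be a table (here a DataFrame); y must be a plain vector
y, X = unpack(df, ==(:price), col -> true)

schema(X)   # check the scitypes of the features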

So far so good. I have the coefficients and the intercept, but I want the p-values and the t-statistics. These are available using Lathe but aren’t even mentioned in the MLJ tutorial. Is there some call I can make to get these?

  1. MLJ.jl is just an interface to various ML packages in the Julia ecosystem.
    Currently, the primary focus of MLJ.jl is fitting and predicting, not statistical inference.
    However, after a model is fit you can extract the fitted parameters using fitted_params(mach). See the updated code above.
  2. fit, predict, and fitted_params work for basically any regression model.
    Only a few models have t-stats, standard errors, p-values, etc.
    Currently MLJ.jl doesn’t have a standard method to extract these statistics.
    For this I would use GLM.jl directly (see the sketch after this list).
  3. I’ve seen discussion about MLJ.jl incorporating some of these stats, but I don’t think it’s a priority right now.
    @tlienart @ablaom @dilumaluthge
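Here is a minimal sketch of getting p-values straight from GLM.jl (the DataFrame and formula are made up):

using GLM, DataFrames

# made-up data; replace with your own DataFrame and formula
df = DataFrame(x1 = randn(100), x2 = randn(100))
df.y = 2.0 .* df.x1 .- 0.5 .* df.x2 .+ randn(100)

model = lm(@formula(y ~ x1 + x2), df)
coeftable(model)   # coefficients, std. errors, t-stats, p-values, confidence intervals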

I think incorporating statistics into MLJ.jl properly would take A LOT of work. For example, to correctly estimate standard errors the MLJ.jl user must make assumptions about the covariance matrix of the residuals.

Suppose there existed MLJ.standard_error(mach).
This wouldn’t make much sense for many models such as LightGBM (but let’s ignore this…).
Consider y = X\beta + \varepsilon with \mathrm{Cov}(\varepsilon) = \Omega.
MLJ.fitted_params(mach) would give the usual \hat{\beta}_{ols}.
What should MLJ.standard_error(mach) give?
If you wanna assume the usual \Omega = \sigma^2 I_n, then \hat{\Omega} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n-k} I_n and the standard errors are the square roots of the diagonal of \hat{\sigma}^2 (X'X)^{-1}.
But this is a heroic assumption (rarely used nowadays).
There is a giant literature on robust standard errors: Eicker-Huber-White, clustered (Liang-Zeger), spatially correlated, auto-correlated, etc.
There are a variety of packages: @gragusa’s CovarianceMatrices.jl, @matthieu’s Vcov.jl, and, sadly, I’m working on one too.
Hence, this would require fitting 1. the original model and 2. the standard error model (skedastic function).
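To make the difference concrete, here is a bare-bones sketch of the spherical vs. Eicker-Huber-White variance estimators (plain linear algebra on made-up data, not an existing MLJ or package API):

using LinearAlgebra

# made-up data: n observations, k regressors (first column is the intercept)
n, k = 200, 3
X = hcat(ones(n), randn(n, k - 1))
β = [1.0, 2.0, -0.5]
ε = randn(n) .* (1 .+ abs.(X[:, 2]))     # heteroskedastic errors
y = X * β + ε

β̂ = X \ y                    # OLS estimate
ε̂ = y - X * β̂                # residuals
XtXinv = inv(X' * X)

# spherical (classical) standard errors: assume Ω = σ²Iₙ
σ̂² = dot(ε̂, ε̂) / (n - k)
se_spherical = sqrt.(diag(σ̂² * XtXinv))

# Eicker-Huber-White (HC0) standard errors: allow arbitrary heteroskedasticity
meat = X' * Diagonal(ε̂ .^ 2) * X
se_ehw = sqrt.(diag(XtXinv * meat * XtXinv))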


@ablaom and @tlienart may be able to weigh in further


You can use MLJ to fit a GLM.jl model, and then you can get the p-values for that model via MLJ. I don’t remember the exact syntax; @ablaom will know.


Here is the GLM interface to MLJ: https://github.com/alan-turing-institute/MLJModels.jl/blob/master/src/GLM.jl
I cannot find p-values anywhere.
However, it does have:

glm_report(fitresult) = ( deviance     = GLM.deviance(fitresult),
                          dof_residual = GLM.dof_residual(fitresult),
                          stderror     = GLM.stderror(fitresult),
                          vcov         = GLM.vcov(fitresult) )

This is accessed via report(mach):

julia> report(mach)
(deviance = 8502.480903578471,
 dof_residual = 341.0,
 stderror = [0.036681269553499325, 0.01737166925431697, 0.07680917833638921, 4.81712249226012, 0.5341500317712217, 0.016323170996032966, 0.2541450592739458, 0.0841378254119214, 0.004680968995291761, 0.16998443628436907, 0.0033963935217423864, 0.0638108519702615, 6.360054287151586],
 vcov = [0.0013455155360564766 -6.088746040570593e-5 … -0.00039008348465676903 -0.008230320370078825; -6.088746040570593e-5 0.0003017748926813815 … -8.860618288494182e-5 -0.0014983861144144224; … ; -0.00039008348465676903 -8.860618288494182e-5 … 0.004071824829170625 -0.13449472260664377; -0.008230320370078825 -0.0014983861144144224 … -0.13449472260664377 40.450290535515265],)

This can be used to obtain t-stats, t_{\hat{\beta}} = \frac{\hat{\beta} - \beta_0}{\widehat{\mathrm{s.e.}}(\hat{\beta})}, and then p-values \Pr(|T| > |t|) under H_0.
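For example, a quick sketch of turning the report(mach) output into t-stats and p-values (the numbers below are made up; Distributions.jl supplies the t distribution):

using Distributions

# made-up inputs: coefficient estimates, their standard errors, and the
# residual degrees of freedom (e.g. dof_residual from report(mach))
coefs   = [-0.1257, 0.0711, -0.0095]
stderrs = [ 0.0367, 0.0174,  0.0768]
dof_res = 341.0

tstats  = coefs ./ stderrs                          # tests H₀: β = 0
pvalues = 2 .* ccdf.(TDist(dof_res), abs.(tstats))  # two-sided p-values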

A few observations:

  1. It doesn’t look like the current MLJ-GLM interface exposes p-values directly.
  2. The glm_report standard errors currently assume the covariance matrix of the residuals is spherical, \Omega = \sigma^2 I_n.
    If you’re only interested in out-of-sample prediction, you don’t care about this.
    If you are doing statistical inference and want to submit a paper to a science journal, you need more flexibility.
    I’ve never seen a recent finance/econ paper that assumes \Omega = \sigma^2 I_n.
  3. The only honest solution is for MLJ.jl to add a fit_skedastic!(mach, type) function, where type is the type of standard error the user wants to compute, depending on her assumptions about the unknown \Omega.

It could work as follows:

# step 1: fit a linear model with your favorite MLJ linear regression
julia> load("LinearRegressor", pkg="MLJLinearModels");
julia> m = LinearRegressor();
julia> mach = machine(m, X, y);
julia> fit!(mach)

# step 2: use residuals from the fitted model to estimate the skedastic function
julia> SE1 = fit_skedastic(mach, spherical)             # usual naive standard errors
julia> SE2 = fit_skedastic(mach, EHW)                   # Eicker-Huber-White standard errors
julia> SE3 = fit_skedastic(mach, cluster(state, year))  # SEs clustered by state-year

MLJ.jl can actually be a great place for this because recently there has been a lot of interest in using ML methods to estimate skedastic functions for better statistical inference.

This could be super helpful to the research community (myself included) because currently computing multiple standard errors is a royal pain. Journals often want to see robust standard errors computed in different ways under various assumptions about the unknown data-generating process. In a recent paper I computed standard errors using 11 different methods.


What happens if you fit a GLM.jl model with MLJ, and then you print the fitresult directly?

I.e.

fit!(mach)
@show mach.fitresult

This should show the p-values and standard errors reported by GLM.jl.


Yep, it spits out the same output as GLM.jl, but it assumes spherical errors and doesn’t allow the user to fit a custom skedastic function:

julia> using MLJ;
julia> X, y = @load_boston;
julia> train, test = partition(eachindex(y), .7, rng=333);

julia> load("LinearRegressor", pkg="GLM");
[ Info: Loading into module "Main":
import MLJModels βœ”
import GLM βœ”
import MLJModels.GLM_ βœ”

julia> m= LinearRegressor();

julia> mach = machine(m, X, y);

julia> fit!(mach, rows=train)
[ Info: Training Machine{LinearRegressor} @164.
Machine{LinearRegressor} @164 trained 1 time.
  args:
    1:  Source @649 ⏎ `Table{AbstractArray{Continuous,1}}`
    2:  Source @820 ⏎ `AbstractArray{Continuous,1}`


julia> @show mach.fitresult
mach.fitresult = GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal{Float64},GLM.IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
─────────────────────────────────────────────────────────────────────────
            Coef.  Std. Error      z  Pr(>|z|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────
x1    -0.125665    0.0366813   -3.43    0.0006   -0.197559    -0.0537714
x2     0.0711207   0.0173717    4.09    <1e-4     0.0370729    0.105169
x3    -0.0094743   0.0768092   -0.12    0.9018   -0.160018     0.141069
x4   -19.0626      4.81712     -3.96    <1e-4   -28.504       -9.62125
x5     3.27676     0.53415      6.13    <1e-9     2.22984      4.32367
x6     0.00663543  0.0163232    0.41    0.6844   -0.0253574    0.0386283
x7    -1.72403     0.254145    -6.78    <1e-10   -2.22215     -1.22592
x8     0.40523     0.0841378    4.82    <1e-5     0.240323     0.570137
x9    -0.0163271   0.00468097  -3.49    0.0005   -0.0255016   -0.00715256
x10   -0.929381    0.169984    -5.47    <1e-7    -1.26254     -0.596217
x11    0.0091109   0.00339639   2.68    0.0073    0.00245409   0.0157677
x12   -0.525458    0.0638109   -8.23    <1e-15   -0.650525    -0.400391
x13   41.6909      6.36005      6.56    <1e-10   29.2254      54.1563
─────────────────────────────────────────────────────────────────────────



A lot of great discussion here. Following the hints contained in it, I switched to GLM. GLM appears to be easier to use than MLJ (at least for this particular use case) and has done everything I need.

Many thanks for all the great input.
