Custom XGBoost Loss function w/ Zygote. Julia Computing blog post

Here's what I had in mind, applied to the Iris data:

using AutoMLPipeline, DataFrames
# Get models.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk = sort(sk, lt=(x, y) -> lowercase(x) < lowercase(y));  # case-insensitive sort

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;
#
learners = DataFrame()
for m in sk 
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
@show learners;

This gives the scores (mean, sd) for 49 models. Incompatible models conveniently output NaN.

Following your suggestions, here is code to extract the regression/classification models.

m_reg = sk[occursin.("Regressor", sk)];
m_reg = m_reg ∪ sk[occursin.("Regression", sk)];
m_reg = m_reg ∪ ["SVR", "ElasticNet", "Ridge", "RidgeCV", "BayesianRidge",
    "KernelRidge", "Lars", "Lasso", "LassoLars"];

m_cl = sk[occursin.("Classifier", sk)];
m_cl = m_cl ∪ sk[occursin.("NB", sk)];
m_cl = m_cl ∪ sk[occursin.("SVC", sk)];
m_cl = m_cl ∪ ["LDA", "QDA"];
# Covers 47 out of the 49 models; the two left out are
# "OrthogonalMatchingPursuit" and "NearestCentroid".

I really like your elegant & minimalist use of pipelines.
The Julia community (and the world) would be a better place if there was a way to merge your package w/ MLJ… c'est la vie…


thanks for the sample code!

this is what i hope to achieve with automlpipeline: coding the feature extraction, feature engineering, and modeling in a simple, easy-to-understand manner, so that it becomes easy to attack ML problems using elegant syntax borrowed from unix pipes.
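for example, a multi-stage pipeline reads left to right like a unix pipe. a minimal sketch (the feature-selector and PCA-preprocessor names follow the package README, so treat them as assumptions):

using AutoMLPipeline

numf = NumFeatureSelector()     # pick the numeric columns
pca  = SKPreprocessor("PCA")    # scikit-learn PCA wrapper
svc  = SKLearner("SVC")         # scikit-learn SVC wrapper

# select -> transform -> learn, composed left to right
pipe = AutoMLPipeline.@pipeline numf |> pca |> svc
mn, sd, _ = crossvalidate(pipe, X, Y, "accuracy_score", 10)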

please feel free to do a PR if you want to extend automlpipeline for regression. i'll add an implementation once i finish the code for the openml benchmarks, which is my priority at the moment.

you can try @distributed since each cross-validation run is independent.

using Distributed
nprocs() == 1 && addprocs()
@everywhere using AutoMLPipeline, DataFrames

# Get the models and keep only the classifiers.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk = sort(sk, lt=(x, y) -> lowercase(x) < lowercase(y));
m_cl = sk[occursin.("Classifier", sk)];
m_cl = m_cl ∪ sk[occursin.("NB", sk)];
m_cl = m_cl ∪ sk[occursin.("SVC", sk)];
m_cl = m_cl ∪ ["LDA", "QDA"];

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;

learners = @distributed (vcat) for m in m_cl
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,folds,err = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    DataFrame(name=learner.name,mean=mean,sd=sd,folds=folds,errors=err)
end;
@show learners;

Nice.
Suppose I train/crossvalidate the models on a subset of rows train, is there an easy way to score the best cross-validated model on separate test rows that were not used for training/cross-validation?


the purpose of crossvalidation is to run the model with the same set of parameters several times to get its average performance. each fold starts fresh with a new copy of the model using the same set of parameters. if, for example, SVC is the best among the models, you can then use fit! to train an SVC model on a particular training set, and use transform! to have that trained model predict on new testing data. there is no point for crossvalidation to save the model, because all it does is estimate the average performance of a model over different splits of the data. crossvalidation basically uses fit! and transform! on each training/testing fold.
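a minimal sketch of that train/test workflow (train and test are assumed to be precomputed row indices):

# train the chosen model on the training rows only
svc = SKLearner("SVC")
fit!(svc, X[train, :], Y[train])

# predict on held-out rows the model never saw
pred = transform!(svc, X[test, :])

# out-of-sample accuracy
acc = sum(pred .== Y[test]) / length(test)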

you can use BestLearner to automatically pick the best learner via internal cross-validation among a group of learners; it then uses the best learner's output as its final output:

learners = SKLearner.(["AdaBoostClassifier","BaggingClassifier","SGDClassifier","SVC","LinearSVC"])
blearner = BestLearner(learners)
crossvalidate(blearner,X,Y,"accuracy_score")
fit!(blearner,X,Y)
@show blearner.model[:best_learner];

I probably wasn't being clear about my objective.

Consider a Lasso model, w/ one hyper-parameter λ.
Suppose I have a grid of possible values for λ: G = [0.0, 0.1, …, 1.0].
(Sometimes sklearn includes a default grid, sometimes I make my own grid, sometimes I use another tuning method…)

My objective: find the model w/ the best out-of-sample (OOS) predictive power (where I define "best" as minimum RMSE).
Q: how do I select the optimal hyper-parameter?

  1. Partition the rows into train & test samples.
  2. For each λ in G, compute the average CV RMSE within X[train,:] using K-fold CV or some other resampling technique.
    Note: this means splitting X[train,:] into K folds.
    For each λ I will have a score (CV RMSE).
  3. Select the optimal λ according to some method.
    Many select the λ w/ the lowest CV RMSE.
    Tibshirani et al. recommend the λ corresponding to the most parsimonious model (fewest nonzero predictors) w/ CV RMSE within 1 standard deviation of the minimum.
    A good ML interface allows the user to specify a technique for selecting the optimal λ.
  4. Using the optimal λ, predict y w/ X[test,:] & compute OOS RMSE(y[test], pred).

It might seem a bit redundant, but:
CV RMSE is there to avoid overfitting during model training.
The 1-sd method (for example) is there to avoid overfitting during model selection.
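Here is a sketch of steps 1-4 with the tools from this thread (it assumes a continuous target y; the sklearn metric name "mean_squared_error" and Lasso's :alpha parameter are my assumptions):

using AutoMLPipeline, DataFrames, Random

# 1. partition the rows into train & test samples
Random.seed!(1)
idx   = shuffle(1:nrow(X))
ntr   = floor(Int, 0.8 * nrow(X))
train = idx[1:ntr]
test  = idx[ntr+1:end]

# 2. average CV RMSE within the training rows for each λ in the grid
G = 0.0:0.1:1.0
results = DataFrame(λ = Float64[], cv_rmse = Float64[], cv_sd = Float64[])
for λ in G
    lasso = SKLearner("Lasso", Dict(:impl_args => Dict(:alpha => λ)))
    mn, sd, _ = crossvalidate(lasso, X[train, :], y[train], "mean_squared_error", 10)
    push!(results, (λ, sqrt(mn), sd))   # sqrt of the mean MSE approximates RMSE
end

# 3. 1-sd rule: the largest λ (most parsimonious lasso) whose score is within
#    one sd of the minimum (sd is over MSE folds here; fine for a sketch)
i      = argmin(results.cv_rmse)
cutoff = results.cv_rmse[i] + results.cv_sd[i]
λopt   = maximum(results.λ[results.cv_rmse .<= cutoff])

# 4. refit on all training rows with λopt, score on the held-out test rows
final = SKLearner("Lasso", Dict(:impl_args => Dict(:alpha => λopt)))
fit!(final, X[train, :], y[train])
pred = transform!(final, X[test, :])
oos_rmse = sqrt(sum((y[test] .- pred) .^ 2) / length(test))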

Perhaps this is what you want. I'm using SVC with an RBF kernel, which has two parameters (C and gamma) to optimize. Just replace the @distributed code above that uses parallelism.

# nested @distributed loops: grid-search C and gamma in parallel,
# cross-validating each (C, gamma) pair independently
ftable = @distributed (vcat) for C in 1:5
    gres = @distributed (vcat) for gamma = 1:5
        svcmodel = SKLearner("SVC", Dict(:impl_args => Dict(:kernel => "rbf", :C => C, :gamma => gamma)))
        mn, sd, fld, err = crossvalidate(svcmodel, X, Y)
        DataFrame(name=svcmodel.name, mean=mn, sd=sd, C=C, gamma=gamma, folds=fld, errors=err)
    end
    gres
end
dfsorted = sort(ftable, :mean, rev=true)   # best mean accuracy first
@show dfsorted
25×7 DataFrame
│ Row │ name    │ mean     │ sd        │ C     │ gamma │ folds │ errors │
│     │ String  │ Float64  │ Float64   │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼─────────┼──────────┼───────────┼───────┼───────┼───────┼────────┤
│ 1   │ SVC_G24 │ 0.966667 │ 0.0471405 │ 1     │ 1     │ 10    │ 0      │
│ 2   │ SVC_E9A │ 0.966667 │ 0.0471405 │ 3     │ 1     │ 10    │ 0      │
│ 3   │ SVC_3xr │ 0.96     │ 0.0466137 │ 1     │ 2     │ 10    │ 0      │
│ 4   │ SVC_wcn │ 0.96     │ 0.0466137 │ 2     │ 1     │ 10    │ 0      │
│ 5   │ SVC_RxN │ 0.96     │ 0.0466137 │ 4     │ 2     │ 10    │ 0      │
│ 6   │ SVC_Oor │ 0.96     │ 0.0466137 │ 2     │ 3     │ 10    │ 0      │
│ 7   │ SVC_ATA │ 0.96     │ 0.0344265 │ 3     │ 4     │ 10    │ 0      │
│ 8   │ SVC_NRs │ 0.96     │ 0.0466137 │ 3     │ 5     │ 10    │ 0      │
│ 9   │ SVC_CoV │ 0.953333 │ 0.0449966 │ 1     │ 4     │ 10    │ 0      │
⋮
│ 16  │ SVC_n99 │ 0.953333 │ 0.0706233 │ 1     │ 3     │ 10    │ 0      │
│ 17  │ SVC_Ux0 │ 0.953333 │ 0.0449966 │ 3     │ 2     │ 10    │ 0      │
│ 18  │ SVC_O6M │ 0.953333 │ 0.0706233 │ 4     │ 1     │ 10    │ 0      │
│ 19  │ SVC_otD │ 0.953333 │ 0.0322031 │ 4     │ 4     │ 10    │ 0      │
│ 20  │ SVC_iIi │ 0.953333 │ 0.0322031 │ 5     │ 1     │ 10    │ 0      │
│ 21  │ SVC_XjT │ 0.953333 │ 0.0632456 │ 5     │ 2     │ 10    │ 0      │
│ 22  │ SVC_zEv │ 0.946667 │ 0.068853  │ 5     │ 5     │ 10    │ 0      │
│ 23  │ SVC_ldK │ 0.946667 │ 0.0421637 │ 3     │ 3     │ 10    │ 0      │
│ 24  │ SVC_n0e │ 0.946667 │ 0.0525874 │ 4     │ 3     │ 10    │ 0      │
│ 25  │ SVC_diL │ 0.946667 │ 0.0525874 │ 5     │ 3     │ 10    │ 0      │

the nice thing about this is that if you have a cluster of machines reachable over ssh, you can run this code in parallel across the cluster, with each crossvalidation running on a separate machine. that is ideal if your parameters' search spaces are huge. you can also create a wrapper for sklearn's GridSearchCV and incorporate it in the crossvalidation. the code of crossvalidation is here: https://github.com/IBM/AutoMLPipeline.jl/blob/master/src/crossvalidator.jl
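for example, adding ssh workers is a one-liner with the Distributed stdlib (the hostnames below are placeholders; it assumes passwordless ssh and the same julia project on every machine):

using Distributed

# spawn four workers on each remote host over ssh
addprocs([("user@machine1", 4), ("user@machine2", 4)])

# load the packages on every worker before running @distributed
@everywhere using AutoMLPipeline, DataFrames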


Thank you very much for the Linux/Unix pipe syntax, which is a very helpful foundation and a natural fit for distributed processing. I hope, though, that we can elevate the language syntax to a slightly higher abstraction layer using mathematical notation, while keeping the automatic-vectorization gears hidden/encapsulated (see Automatic vectorization - Wikipedia).

So I also request syntax support for function composition (in the computer-science sense), described here:

"The ability to easily compose functions encourages factoring (breaking apart) functions for maintainability and code reuse. More generally, big systems might be built by composing whole programs."

Generally, allowing function calls to be recomposed/resequenced via functional programming is desirable. For example, I am presently rewriting a lot of my procedural Julia script code into functions and function calls, so that when f(g(h(x))) turns out to be less effective for machine-learning cross-validation etc. than h(g(f(x))), reordering the operations is easy.

Generally, I see computer-language syntax and mathematical notation as critical to expressivity, a major benefit of writing Julia code, and actually at the heart of this thread's subject, "Custom XGBoost Loss function w/ Zygote. Julia Computing blog post": cross-validation of Zygote loss functions will involve notation in terms of the derivatives f'(x) and g'(x) and the derivative of composed functions, f(g(x))' = f'(g(x)) * g'(x),

per Chain rule - Wikipedia

A quick example:

Present Pipe syntax:

#Get models.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;


Proposed Function composition / Functional Programming pseudo-code syntax:

#Get models.
sk = collect(keys(AutoMLPipeline.SKLearners.learner_dict))
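As an aside, both forms already work: the nested-call version is valid base Julia today, and Base also ships the ∘ operator for explicit function composition:

# nested calls: valid base Julia as-is
sk = collect(keys(AutoMLPipeline.SKLearners.learner_dict))

# the same thing via Base's ∘ composition operator
get_models = collect ∘ keys
sk = get_models(AutoMLPipeline.SKLearners.learner_dict)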

Since the nested-call form already works, the point is really one of habit: I believe it is important to write function-composition example code that supports calculus derivative notation, e.g. f(g(x))' = f'(g(x)) * g'(x), so we can most easily write custom XGBoost loss functions the way other boosting methods do, generalizing by allowing optimization of an arbitrary differentiable loss function, as per

Gradient boosting - Wikipedia
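To make that concrete, here is a minimal sketch of getting the gradient and Hessian of an arbitrary differentiable per-sample loss with Zygote, which is what an XGBoost custom objective consumes (the pseudo-Huber loss, δ, and the preds/ys vectors are illustrative assumptions; the XGBoost.jl plumbing is not shown):

using Zygote

# an arbitrary differentiable per-sample loss: pseudo-Huber, δ is illustrative
const δ = 1.0
loss(pred, y) = δ^2 * (sqrt(1 + ((pred - y) / δ)^2) - 1)

# first derivative w.r.t. the prediction, via the chain rule
grad(pred, y) = Zygote.gradient(p -> loss(p, y), pred)[1]

# second derivative by differentiating the gradient again
# (nested Zygote differentiation works for smooth scalar losses like this)
hess(pred, y) = Zygote.gradient(p -> grad(p, y), pred)[1]

# evaluate elementwise over vectors of predictions and targets
g = grad.(preds, ys)
h = hess.(preds, ys)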

HTH

PS: I believe we'll have to get the math notation as compact and expressive as possible to hide/encapsulate complexity, because next up is "How to define a Fitness Function in a Genetic Algorithm?"

One mathematical description is here:
How to define a Fitness Function in a Genetic Algorithm? | by Vijini Mallawaarachchi | Towards Data Science


thanks for the feedback. as i'm still recovering from covid, i'll read these when i'm well and find ways to improve, add to, and extend the pipeline expressions. so far i model this on unix pipes because it is so easy to follow the logic of incrementally adding processing elements to the right, and easy to debug by taking things one at a time from left to right. unix has been a powerful OS partly due to this idea of pipes: build each tool to do one function and do it well, with a well-defined input/output interface. the KISS philosophy. you can then combine those simple tools by placing them in a pipe.


Thank you, take care, and hope you are better soon.


please raise an issue or PR in https://github.com/IBM/AutoMLPipeline.jl so that I can track and update the development there.


After I train all the MLJ models, I create a table w/ TableView.jl showing the score (RMSE) & run time for each model:
[screenshot: TableView table of model scores and run times]
Currently MLJ interfaces to 50 regression models (continuous Y).
The usual suspects (gradient boosting: XGBoost & LightGBM) perform well.
Note how well the pure-julia EvoTrees.jl performs despite being so new!

# score every model, then show the table sorted by the first score column
sc = [train_m(m, X, y, train, test, predict, rms) for m in m_names]
sc = hcat(sc...)';
showtable(hcat(
    m_names[sortperm(sc[:, 1])],
    sc[sortperm(sc[:, 1]), :]
))

Question: how can I save the table/plot outputted by TableView.showtable()?
Is there a better way to create a table showing scores (that is storable) than TableView.showtable()?


Just construct a DataFrame from the names and scores?

julia> using DataFrames

julia> m_names = ["EvoTreeRegressor", "BaggingRegressor"];

julia> sc = [2.5212, 2.5358];

julia> DataFrame(Model = m_names, RMSE = sc)
2×2 DataFrame
│ Row │ Model            │ RMSE    │
│     │ String           │ Float64 │
├─────┼──────────────────┼─────────┤
│ 1   │ EvoTreeRegressor │ 2.5212  │
│ 2   │ BaggingRegressor │ 2.5358  │

You can still display that in Juno with showtable, but also write it to CSV with CSV.write("results.csv", df).

Thanks @nilshg.
I want to create a figure (jpg, pdf, …) w/ the showtable(df) results so I can post the full table here (or in a paper).

Ah right, I misunderstood, sorry. I'm not aware of anything that produces images from tables directly, although you might get somewhere with the PrettyTables LaTeX backend?
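Something like this might work as a sketch (in older PrettyTables versions the backend keyword is backend = :latex rather than Val(:latex)):

using PrettyTables

# write the results DataFrame as a LaTeX table; compile it to PDF separately
open("results_table.tex", "w") do io
    pretty_table(io, df; backend = Val(:latex))
end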
