Custom XGBoost Loss function w/ Zygote. Julia Computing blog post

Here's what I had in mind, applied to the Iris data:

using AutoMLPipeline, DataFrames
# Get models.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk = sort(sk, lt=(x, y) -> lowercase(x) < lowercase(y));  # case-insensitive sort

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;
#
learners = DataFrame()
for m in sk 
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
@show learners;

This gives the scores (mean, sd) for 49 models. Incompatible models conveniently output NaN.

Following your suggestions, here is code to extract the regression/classification models.

m_reg = sk[occursin.("Regressor", sk)];
m_reg = m_reg ∪ sk[occursin.("Regression", sk)];
m_reg = m_reg ∪ ["SVR", "ElasticNet", "Ridge", "RidgeCV", "BayesianRidge",
    "KernelRidge", "Lars", "Lasso", "LassoLars"];

m_cl = sk[occursin.("Classifier", sk)];
m_cl = m_cl ∪ sk[occursin.("NB", sk)];
m_cl = m_cl ∪ sk[occursin.("SVC", sk)];
m_cl = m_cl ∪ ["LDA", "QDA"];
# Covers 47 out of the 49 models; the two left out are
# "OrthogonalMatchingPursuit" and "NearestCentroid".

I really like your elegant & minimalist use of pipelines.
The Julia community (and the world) would be a better place if there was a way to merge your package w/ MLJ… c'est la vie…


thanks for the sample code!

this is what i hope to achieve with automlpipeline: coding the feature extraction, feature engineering, and modeling in a simple, easy-to-understand manner, so that it becomes easy to attack ML problems using elegant syntax borrowed from unix pipes.
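for example, a multi-stage pipeline reads left to right like a unix pipe. a minimal sketch (the feature-selector and PCA-preprocessor names follow the package README, so treat them as assumptions):

using AutoMLPipeline

numf = NumFeatureSelector()     # pick the numeric columns
pca  = SKPreprocessor("PCA")    # scikit-learn PCA wrapper
svc  = SKLearner("SVC")         # scikit-learn SVC wrapper

# select -> transform -> learn, composed left to right
pipe = AutoMLPipeline.@pipeline numf |> pca |> svc
mn, sd, _ = crossvalidate(pipe, X, Y, "accuracy_score", 10)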

please feel free to do a PR if you want to extend automlpipeline for regression. i'll add an implementation once i finish the code for the openml benchmarks, which is my priority at the moment.

you can try @distributed since each cross-validation run is independent.

using Distributed
nprocs() == 1 && addprocs()
@everywhere using AutoMLPipeline, DataFrames

# Get the models and keep only the classifiers.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk = sort(sk, lt=(x, y) -> lowercase(x) < lowercase(y));
m_cl = sk[occursin.("Classifier", sk)];
m_cl = m_cl ∪ sk[occursin.("NB", sk)];
m_cl = m_cl ∪ sk[occursin.("SVC", sk)];
m_cl = m_cl ∪ ["LDA", "QDA"];

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;

learners = @distributed (vcat) for m in m_cl
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,folds,err = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    DataFrame(name=learner.name,mean=mean,sd=sd,folds=folds,errors=err)
end;
@show learners;

Nice.
Suppose I train/crossvalidate the models on a subset of rows train, is there an easy way to score the best cross-validated model on separate test rows that were not used for training/cross-validation?


the purpose of crossvalidation is to run the model with the same set of parameters several times to get its average performance. each fold starts fresh with a new copy of the model using the same set of parameters. if, for example, SVC is the best among the models, you can then use fit! to train an SVC model on a particular training set, and use transform! to have that trained model predict on new testing data. there is no point for crossvalidation to save the model, because all it does is estimate the average performance of a model over different splits of the data. crossvalidation basically uses fit! and transform! on each training/testing fold.
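a minimal sketch of that train/test workflow (train and test are assumed to be precomputed row indices):

# train the chosen model on the training rows only
svc = SKLearner("SVC")
fit!(svc, X[train, :], Y[train])

# predict on held-out rows the model never saw
pred = transform!(svc, X[test, :])

# out-of-sample accuracy
acc = sum(pred .== Y[test]) / length(test)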

you can use BestLearner to automatically pick the best learner via internal cross-validation among a group of learners; it then uses the best learner's output as its final output:

learners = SKLearner.(["AdaBoostClassifier","BaggingClassifier","SGDClassifier","SVC","LinearSVC"])
blearner = BestLearner(learners)
crossvalidate(blearner,X,Y,"accuracy_score")
fit!(blearner,X,Y)
@show blearner.model[:best_learner];

I probably wasn't being clear about my objective.

Consider a Lasso model, w/ one hyper-parameter λ.
Suppose I have a grid of possible values for λ: G = [0.0, 0.1, …, 1.0].
(Sometimes sklearn includes a default grid, sometimes I make my own grid, sometimes I use another tuning method…)

My objective: find the model w/ the best out-of-sample (OOS) predictive power (where I define "best" as minimum RMSE).
Q: how do I select the optimal hyper-parameter?

  1. Partition the rows into train & test samples.
  2. For each λ in G, compute the average CV RMSE within X[train,:] using K-fold CV or some other resampling technique.
    Note: this means splitting X[train,:] into K folds.
    For each λ I will have a score (CV RMSE).
  3. Select the optimal λ according to some method.
    Many select the λ w/ the lowest CV RMSE.
    Tibshirani et al. recommend the λ corresponding to the most parsimonious model (fewest nonzero predictors) w/ CV RMSE within 1 standard deviation of the minimum.
    A good ML interface allows the user to specify a technique for selecting the optimal λ.
  4. Using the optimal λ, predict y w/ X[test,:] & compute OOS RMSE(y[test], pred).

It might seem a bit redundant, but:
CV RMSE is there to avoid overfitting during model training.
The 1-sd method (for example) is there to avoid overfitting during model selection.
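Here is a sketch of steps 1-4 with the tools from this thread (it assumes a continuous target y; the sklearn metric name "mean_squared_error" and Lasso's :alpha parameter are my assumptions):

using AutoMLPipeline, DataFrames, Random

# 1. partition the rows into train & test samples
Random.seed!(1)
idx   = shuffle(1:nrow(X))
ntr   = floor(Int, 0.8 * nrow(X))
train = idx[1:ntr]
test  = idx[ntr+1:end]

# 2. average CV RMSE within the training rows for each λ in the grid
G = 0.0:0.1:1.0
results = DataFrame(λ = Float64[], cv_rmse = Float64[], cv_sd = Float64[])
for λ in G
    lasso = SKLearner("Lasso", Dict(:impl_args => Dict(:alpha => λ)))
    mn, sd, _ = crossvalidate(lasso, X[train, :], y[train], "mean_squared_error", 10)
    push!(results, (λ, sqrt(mn), sd))   # sqrt of the mean MSE approximates RMSE
end

# 3. 1-sd rule: the largest λ (most parsimonious lasso) whose score is within
#    one sd of the minimum (sd is over MSE folds here; fine for a sketch)
i      = argmin(results.cv_rmse)
cutoff = results.cv_rmse[i] + results.cv_sd[i]
λopt   = maximum(results.λ[results.cv_rmse .<= cutoff])

# 4. refit on all training rows with λopt, score on the held-out test rows
final = SKLearner("Lasso", Dict(:impl_args => Dict(:alpha => λopt)))
fit!(final, X[train, :], y[train])
pred = transform!(final, X[test, :])
oos_rmse = sqrt(sum((y[test] .- pred) .^ 2) / length(test))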

Perhaps this is what you want. I'm using SVC with an RBF kernel, which has two parameters (C and gamma) to optimize. Just replace the @distributed code above that uses parallelism.

# nested @distributed loops: grid-search C and gamma in parallel,
# cross-validating each (C, gamma) pair independently
ftable = @distributed (vcat) for C in 1:5
    gres = @distributed (vcat) for gamma = 1:5
        svcmodel = SKLearner("SVC", Dict(:impl_args => Dict(:kernel => "rbf", :C => C, :gamma => gamma)))
        mn, sd, fld, err = crossvalidate(svcmodel, X, Y)
        DataFrame(name=svcmodel.name, mean=mn, sd=sd, C=C, gamma=gamma, folds=fld, errors=err)
    end
    gres
end
dfsorted = sort(ftable, :mean, rev=true)   # best mean accuracy first
@show dfsorted
25×7 DataFrame
│ Row │ name    │ mean     │ sd        │ C     │ gamma │ folds │ errors │
│     │ String  │ Float64  │ Float64   │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼─────────┼──────────┼───────────┼───────┼───────┼───────┼────────┤
│ 1   │ SVC_G24 │ 0.966667 │ 0.0471405 │ 1     │ 1     │ 10    │ 0      │
│ 2   │ SVC_E9A │ 0.966667 │ 0.0471405 │ 3     │ 1     │ 10    │ 0      │
│ 3   │ SVC_3xr │ 0.96     │ 0.0466137 │ 1     │ 2     │ 10    │ 0      │
│ 4   │ SVC_wcn │ 0.96     │ 0.0466137 │ 2     │ 1     │ 10    │ 0      │
│ 5   │ SVC_RxN │ 0.96     │ 0.0466137 │ 4     │ 2     │ 10    │ 0      │
│ 6   │ SVC_Oor │ 0.96     │ 0.0466137 │ 2     │ 3     │ 10    │ 0      │
│ 7   │ SVC_ATA │ 0.96     │ 0.0344265 │ 3     │ 4     │ 10    │ 0      │
│ 8   │ SVC_NRs │ 0.96     │ 0.0466137 │ 3     │ 5     │ 10    │ 0      │
│ 9   │ SVC_CoV │ 0.953333 │ 0.0449966 │ 1     │ 4     │ 10    │ 0      │
⋮
│ 16  │ SVC_n99 │ 0.953333 │ 0.0706233 │ 1     │ 3     │ 10    │ 0      │
│ 17  │ SVC_Ux0 │ 0.953333 │ 0.0449966 │ 3     │ 2     │ 10    │ 0      │
│ 18  │ SVC_O6M │ 0.953333 │ 0.0706233 │ 4     │ 1     │ 10    │ 0      │
│ 19  │ SVC_otD │ 0.953333 │ 0.0322031 │ 4     │ 4     │ 10    │ 0      │
│ 20  │ SVC_iIi │ 0.953333 │ 0.0322031 │ 5     │ 1     │ 10    │ 0      │
│ 21  │ SVC_XjT │ 0.953333 │ 0.0632456 │ 5     │ 2     │ 10    │ 0      │
│ 22  │ SVC_zEv │ 0.946667 │ 0.068853  │ 5     │ 5     │ 10    │ 0      │
│ 23  │ SVC_ldK │ 0.946667 │ 0.0421637 │ 3     │ 3     │ 10    │ 0      │
│ 24  │ SVC_n0e │ 0.946667 │ 0.0525874 │ 4     │ 3     │ 10    │ 0      │
│ 25  │ SVC_diL │ 0.946667 │ 0.0525874 │ 5     │ 3     │ 10    │ 0      │

the nice thing about this is that if you have a cluster of machines reachable over ssh, you can run this code in parallel across the cluster, with each crossvalidation running on a separate machine. that is ideal if your parameters' search spaces are huge. you can also create a wrapper for sklearn's GridSearchCV and incorporate it in the crossvalidation. the code of crossvalidation is here: https://github.com/IBM/AutoMLPipeline.jl/blob/master/src/crossvalidator.jl
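for example, adding ssh workers is a one-liner with the Distributed stdlib (the hostnames below are placeholders; it assumes passwordless ssh and the same julia project on every machine):

using Distributed

# spawn four workers on each remote host over ssh
addprocs([("user@machine1", 4), ("user@machine2", 4)])

# load the packages on every worker before running @distributed
@everywhere using AutoMLPipeline, DataFrames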


Thank you very much for the Linux/Unix pipe syntax, which is a very helpful foundation and a natural fit for distributed processing. I hope, though, that we can elevate the language syntax to a slightly higher abstraction layer using mathematical notation, while keeping the automatic-vectorization gears hidden/encapsulated (see Automatic vectorization - Wikipedia).

So I also request syntax support for function composition (in the computer-science sense), described here:

"The ability to easily compose functions encourages factoring (breaking apart) functions for maintainability and code reuse. More generally, big systems might be built by composing whole programs."

Generally, allowing function calls to be recomposed/resequenced via functional programming is desirable. For example, I am presently rewriting a lot of my procedural Julia script code into functions and function calls, so that when f(g(h(x))) turns out to be less effective for machine-learning cross-validation etc. than h(g(f(x))), reordering the operations is easy.

Generally, I see computer-language syntax and mathematical notation as critical to expressivity, a major benefit of writing Julia code, and actually at the heart of this thread's subject, "Custom XGBoost Loss function w/ Zygote. Julia Computing blog post": cross-validation of Zygote loss functions will involve notation in terms of the derivatives f'(x) and g'(x) and the derivative of composed functions, f(g(x))' = f'(g(x)) * g'(x),

per Chain rule - Wikipedia

A quick example:

Present Pipe syntax:

#Get models.
sk = AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;


Proposed Function composition / Functional Programming pseudo-code syntax:

#Get models.
sk = collect(keys(AutoMLPipeline.SKLearners.learner_dict))
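As an aside, both forms already work: the nested-call version is valid base Julia today, and Base also ships the ∘ operator for explicit function composition:

# nested calls: valid base Julia as-is
sk = collect(keys(AutoMLPipeline.SKLearners.learner_dict))

# the same thing via Base's ∘ composition operator
get_models = collect ∘ keys
sk = get_models(AutoMLPipeline.SKLearners.learner_dict)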

Since the nested-call form already works, the point is really one of habit: I believe it is important to write function-composition example code that supports calculus derivative notation, e.g. f(g(x))' = f'(g(x)) * g'(x), so we can most easily write custom XGBoost loss functions the way other boosting methods do, generalizing by allowing optimization of an arbitrary differentiable loss function, as per

Gradient boosting - Wikipedia
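To make that concrete, here is a minimal sketch of getting the gradient and Hessian of an arbitrary differentiable per-sample loss with Zygote, which is what an XGBoost custom objective consumes (the pseudo-Huber loss, δ, and the preds/ys vectors are illustrative assumptions; the XGBoost.jl plumbing is not shown):

using Zygote

# an arbitrary differentiable per-sample loss: pseudo-Huber, δ is illustrative
const δ = 1.0
loss(pred, y) = δ^2 * (sqrt(1 + ((pred - y) / δ)^2) - 1)

# first derivative w.r.t. the prediction, via the chain rule
grad(pred, y) = Zygote.gradient(p -> loss(p, y), pred)[1]

# second derivative by differentiating the gradient again
# (nested Zygote differentiation works for smooth scalar losses like this)
hess(pred, y) = Zygote.gradient(p -> grad(p, y), pred)[1]

# evaluate elementwise over vectors of predictions and targets
g = grad.(preds, ys)
h = hess.(preds, ys)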

HTH

PS: I believe we'll have to get the math notation as compact and expressive as possible to hide/encapsulate complexity, because next up is "How to define a Fitness Function in a Genetic Algorithm?"

One mathematical description is here:
How to define a Fitness Function in a Genetic Algorithm? | by Vijini Mallawaarachchi | Towards Data Science


thanks for the feedback. as i'm still recovering from covid, i'll read these when i'm well and find ways to improve, add to, and extend the pipeline expressions. so far i model this on unix pipes because it is so easy to follow the logic of incrementally adding processing elements to the right, and easy to debug by taking things one at a time from left to right. unix has been a powerful OS partly due to this idea of pipes: build each tool to do one function and do it well, with a well-defined input/output interface. the KISS philosophy. you can then combine those simple tools by placing them in a pipe.


Thank you, take care, and hope you are better soon.


please raise an issue or PR in https://github.com/IBM/AutoMLPipeline.jl so that I can track and update the development there.


After I train all the MLJ models, I create a table w/ TableView.jl showing the score (RMSE) & run time for each model:
[screenshot: TableView table of model scores and run times]
Currently MLJ interfaces to 50 regression models (continuous Y).
The usual suspects (gradient boosting: XGBoost & LightGBM) perform well.
Note how well the pure-julia EvoTrees.jl performs despite being so new!

# score every model, then show the table sorted by the first score column
sc = [train_m(m, X, y, train, test, predict, rms) for m in m_names]
sc = hcat(sc...)';
showtable(hcat(
    m_names[sortperm(sc[:, 1])],
    sc[sortperm(sc[:, 1]), :]
))

Question: how can I save the table/plot outputted by TableView.showtable()?
Is there a better way to create a table showing scores (that is storable) than TableView.showtable()?


Just construct a DataFrame from the names and scores?

julia> using DataFrames

julia> m_names = ["EvoTreeRegressor", "BaggingRegressor"];

julia> sc = [2.5212, 2.5358];

julia> DataFrame(Model = m_names, RMSE = sc)
2×2 DataFrame
│ Row │ Model            │ RMSE    │
│     │ String           │ Float64 │
├─────┼──────────────────┼─────────┤
│ 1   │ EvoTreeRegressor │ 2.5212  │
│ 2   │ BaggingRegressor │ 2.5358  │

You can still display that in Juno with showtable, but also write it to CSV with CSV.write("results.csv", df).

Thanks @nilshg.
I want to create a figure (jpg, pdf, …) w/ the showtable(df) results so I can post the full table here (or in a paper).

Ah right, I misunderstood, sorry. I'm not aware of anything that produces images from tables directly, although you might get somewhere with the PrettyTables LaTeX backend?
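Something like this might work as a sketch (in older PrettyTables versions the backend keyword is backend = :latex rather than Val(:latex)):

using PrettyTables

# write the results DataFrame as a LaTeX table; compile it to PDF separately
open("results_table.tex", "w") do io
    pretty_table(io, df; backend = Val(:latex))
end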
