Custom XGBoost Loss function w/ Zygote. Julia Computing blog post

i don’t store prediction type info in my model structure. the prediction type is implicit in the behavior of the fit and transform functions.

Here is how I automatically store all (49) learners from sk:

using AutoMLPipeline, DataFrames
sk= AutoMLPipeline.SKLearners.learner_dict;
sk= keys(sk) |> collect |> x-> sort(x,lt=(x,y)->lowercase(x)<lowercase(y))

I realize some of these learners apply to regression and others to classification, but let’s put that aside for a moment.
Suppose I want to train all models in “sk”:

learners = DataFrame()
for m in sk
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,y,"accuracy_score",10)
    global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
@show learners;

It appears that the performance metrics in your package currently work for classification and not regression (such as RMSE) unless I’m missing something?


this is awesome, I’ll try to turn this into a tutorial when I have some time :slight_smile:


yeah. all the problems i’m working on right now are classification tasks. but it’s easy to add a regression metric for crossvalidation. you can look at the skcrossvalidator.jl source (https://github.com/IBM/AutoMLPipeline.jl/blob/master/src/skcrossvalidator.jl). I’m using these metrics: https://scikit-learn.org/stable/modules/model_evaluation.html.

i’ll create an issue to add regression support. I focused on classification because regression is more trivial to implement once the classification workflow works. also, feel free to make a PR ;).
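
For anyone who needs a regression score before it lands in the package, here is a minimal sketch in plain Julia, independent of AutoMLPipeline. The helper name `rmse` and the toy vectors are my own assumptions, not part of the package API.

```julia
using Statistics

# Root-mean-squared error between observed values y and predictions yhat.
# `rmse` is a hypothetical helper, not an AutoMLPipeline function.
rmse(y, yhat) = sqrt(mean(abs2, y .- yhat))

y    = [3.0, 5.0, 7.0]     # toy observations
pred = [2.0, 5.0, 9.0]     # toy predictions
score = rmse(y, pred)      # square root of the mean squared residual
```

The same function could score each held-out fold of a regression crossvalidation in place of an accuracy metric.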

you can differentiate classifiers from regressors because their names contain either Classifier or Regressor as a substring.

Here’s what I had in mind applied to the Iris data:

using AutoMLPipeline, DataFrames
#Get models.
sk= AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk= sk |> x-> sort(x,lt=(x,y)->lowercase(x)<lowercase(y));

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;
#
learners = DataFrame()
for m in sk 
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
@show learners;

Gives the scores (mean, sd) for 49 models. Incompatible models conveniently output NaN.

Following your suggestions here is code to extract regression/classification models.

m_reg= sk[occursin.("Regressor", sk)];
m_reg= m_reg ∪ sk[occursin.("Regression", sk)];
m_reg= m_reg ∪ ["SVR", "ElasticNet", "Ridge", "RidgeCV", "BayesianRidge",
    "KernelRidge", "Lars", "Lasso", "LassoLars"];

m_cl= sk[occursin.("Classifier", sk)];
m_cl= m_cl ∪ sk[occursin.("NB", sk)];
m_cl= m_cl ∪ sk[occursin.("SVC", sk)];
m_cl= m_cl ∪ ["LDA", "QDA"];
#47 out of 49 models.
#"OrthogonalMatchingPursuit", "NearestCentroid"

I really like your elegant & minimalist use of pipelines.
The Julia community (and the world) would be a better place if there was a way to merge your package w/ MLJ… c’est la vie…


thanks for the sample code!

this is what i hope to achieve with automlpipeline, i.e., coding the feature extraction, feature engineering and modeling in a simple easy to understand manner such that it becomes easy to attack ML problems using elegant syntax borrowed from unix pipes.

please feel free to do a PR if you want to extend automlpipeline for regression. i’ll add an implementation once i finish the code for the openml benchmarks, which is my priority at the moment.

you can try @distributed since each crossvalidation is independent.

using Distributed
nprocs() == 1 && addprocs()
@everywhere using AutoMLPipeline, DataFrames

#Get models.
sk= AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;
sk= sk |> x-> sort(x,lt=(x,y)->lowercase(x)<lowercase(y));
m_cl= sk[occursin.("Classifier", sk)];
m_cl= m_cl ∪ sk[occursin.("NB", sk)];
m_cl= m_cl ∪ sk[occursin.("SVC", sk)];
m_cl= m_cl ∪ ["LDA", "QDA"];

iris = AutoMLPipeline.Utils.getiris();
X = iris[:,1:4];
Y = iris[:,end] |> Vector;

learners = @distributed (vcat) for m in m_cl
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,folds,err = crossvalidate(pcmc,X,Y,"accuracy_score",10)
    DataFrame(name=learner.name,mean=mean,sd=sd,folds=folds,errors=err)
end;
@show learners;

Nice.
Suppose I train/crossvalidate the models on a subset of rows (train). Is there an easy way to score the best cross-validated model on separate test rows that were not used for training/cross-validation?


the purpose of crossvalidation is to run a model with the same set of parameters several times to get its average performance. each fold starts fresh with a new copy of the model using the same set of parameters. if, for example, SVC is the best among the models, then you can use fit! to train the SVC model on a particular training set, and transform! to have that model predict on new testing data. there is no point in saving the model from crossvalidation, because all it does is estimate the average performance of a model over different splits of the data. crossvalidation basically calls fit! and transform! for each training/testing fold.
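
To make the fold logic concrete, here is a rough sketch in plain Julia of what a K-fold loop does. It is not the package's implementation: `kfolds`, `fit_majority` and `crossval_accuracy` are hypothetical names, and a toy majority-class predictor stands in for a real learner.

```julia
using Random, Statistics

# Split 1:n into k shuffled folds of row indices (hypothetical helper).
function kfolds(n::Int, k::Int; rng=MersenneTwister(1))
    idx = shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

# Toy "model": remembers the most frequent training label.
function fit_majority(y)
    labels = unique(y)
    counts = [count(==(l), y) for l in labels]
    return labels[argmax(counts)]
end

# Each fold trains a fresh model on the other k-1 folds and scores it
# on the held-out fold; only the scores are kept, not the models.
function crossval_accuracy(y, k)
    scores = Float64[]
    for test in kfolds(length(y), k)
        train = setdiff(1:length(y), test)
        model = fit_majority(y[train])       # plays the role of fit!
        pred = fill(model, length(test))     # plays the role of transform!
        push!(scores, mean(pred .== y[test]))
    end
    return mean(scores), std(scores)
end

y = repeat(["a", "a", "b"], 10)   # 30 labels, two thirds are "a"
m, s = crossval_accuracy(y, 5)
```

Because every fold discards its model after scoring, the final model still has to be trained separately on the rows you choose.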

you can use BestLearner to automatically pick the best learner among a group of learners using internal cross-validation; it uses the best learner’s output as its final output

learners = SKLearner.(["AdaBoostClassifier","BaggingClassifier","SGDClassifier","SVC","LinearSVC"])
blearner = BestLearner(learners)
crossvalidate(blearner,X,Y,"accuracy_score")
fit!(blearner,X,Y)
@show blearner.model[:best_learner];

I probably wasn’t being clear about my objective.

Consider a Lasso model, w/ one hyper-parameter λ.
Suppose I have a grid of possible values for λ: G=[0.0, 0.1, …, 1.0]
(Sometimes sklearn includes a default grid, sometimes I make my own grid, sometimes I use another tuning method…)

My objective: find the model w/ the best out-of-sample (OOS) predictive power (where I define “best” as minimum RMSE),
Q: how do I select the optimal hyper-parameter?

  1. Partition the rows into train & test samples.
  2. For each λ in G, compute the average CV RMSE within X[train,:] using K-fold CV or some other resampling technique.
    Note: this means splitting X[train,:] into K-folds.
    For each λ I will have a score (CV RMSE).
  3. Select the optimal λ according to some method.
    Many select the λ w/ the lowest CV RMSE.
    Tibshirani et al recommend the λ corresponding to the most parsimonious model (fewest nonzero predictors) w/ CV RMSE within 1 standard error of the minimum (the one-standard-error rule, which I’ll loosely call the 1sd method).
    A good ML interface allows the user to specify a technique for selecting the optimal λ.
  4. Using the optimal λ, predict y w/ X[test,:] & compute OOS RMSE(y[test], pred).

It might seem a bit redundant.
CV-RMSE is to avoid overfitting during model training.
The 1sd method (for example) is to avoid overfitting during model selection.
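
The four steps above can be sketched in plain Julia roughly as follows. This is an illustration under stated assumptions, not an AutoMLPipeline or MLJ workflow: ridge regression (closed form) is swapped in for Lasso so the example needs no solver, the data are simulated, and the helper names `rmse`, `ridge` and `cv_rmse` are hypothetical.

```julia
using LinearAlgebra, Random, Statistics

rmse(y, yhat) = sqrt(mean(abs2, y .- yhat))

# Ridge coefficients in closed form; a stand-in for Lasso, which needs an iterative solver.
ridge(X, y, λ) = (X' * X + λ * I) \ (X' * y)

# Mean and standard error of the K-fold CV RMSE for one λ.
function cv_rmse(X, y, λ, folds)
    scores = Float64[]
    for test in folds
        train = setdiff(1:length(y), test)
        β = ridge(X[train, :], y[train], λ)
        push!(scores, rmse(y[test], X[test, :] * β))
    end
    return mean(scores), std(scores) / sqrt(length(folds))
end

# Simulated data (an assumption for the sake of a runnable example).
rng = MersenneTwister(42)
n, p = 200, 5
X = randn(rng, n, p)
y = X * [2.0, -1.0, 0.0, 0.0, 0.5] .+ 0.5 .* randn(rng, n)

# Step 1: hold out test rows that play no part in tuning.
perm = shuffle(rng, 1:n)
train, test = perm[1:150], perm[151:end]
Xtr, ytr = X[train, :], y[train]

# Step 2: CV score for every λ in the grid, within the training rows only.
G = 0.0:0.1:1.0
K = 5
folds = [collect(i:K:length(train)) for i in 1:K]   # positions within `train`
results = [cv_rmse(Xtr, ytr, λ, folds) for λ in G]
means, ses = first.(results), last.(results)

# Step 3: two selection rules.
ibest = argmin(means)
λ_min = G[ibest]
# 1-SE rule: the largest (most regularised) λ whose score is within one SE of the minimum.
λ_1se = maximum(G[means .<= means[ibest] + ses[ibest]])

# Step 4: refit on all training rows, then score once on the held-out rows.
β = ridge(Xtr, ytr, λ_1se)
oos = rmse(y[test], X[test, :] * β)
```

The selection rule lives in one line of step 3, so trying a different rule means changing only that line while the resampling stays the same.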

Perhaps this is what you want. I’m using SVC with RBF kernel having two parameters (C and gamma) to optimize. Just replace the @distributed code that uses parallelism above.

ftable = @distributed (vcat) for C in 1:5
    gres = @distributed (vcat) for gamma = 1:5
        svcmodel  = SKLearner("SVC",Dict(:impl_args=>Dict(:kernel=>"rbf",:C=>C,:gamma=>gamma) ))
        mn,sd,fld,err = crossvalidate(svcmodel,X,Y)
        DataFrame(name=svcmodel.name,mean=mn,sd=sd,C=C,gamma=gamma,folds=fld,errors=err)
    end
    gres
end
dfsorted=sort(ftable,:mean,rev=true)
@show dfsorted
25×7 DataFrame
│ Row │ name    │ mean     │ sd        │ C     │ gamma │ folds │ errors │
│     │ String  │ Float64  │ Float64   │ Int64 │ Int64 │ Int64 │ Int64  │
├─────┼─────────┼──────────┼───────────┼───────┼───────┼───────┼────────┤
│ 1   │ SVC_G24 │ 0.966667 │ 0.0471405 │ 1     │ 1     │ 10    │ 0      │
│ 2   │ SVC_E9A │ 0.966667 │ 0.0471405 │ 3     │ 1     │ 10    │ 0      │
│ 3   │ SVC_3xr │ 0.96     │ 0.0466137 │ 1     │ 2     │ 10    │ 0      │
│ 4   │ SVC_wcn │ 0.96     │ 0.0466137 │ 2     │ 1     │ 10    │ 0      │
│ 5   │ SVC_RxN │ 0.96     │ 0.0466137 │ 4     │ 2     │ 10    │ 0      │
│ 6   │ SVC_Oor │ 0.96     │ 0.0466137 │ 2     │ 3     │ 10    │ 0      │
│ 7   │ SVC_ATA │ 0.96     │ 0.0344265 │ 3     │ 4     │ 10    │ 0      │
│ 8   │ SVC_NRs │ 0.96     │ 0.0466137 │ 3     │ 5     │ 10    │ 0      │
│ 9   │ SVC_CoV │ 0.953333 │ 0.0449966 │ 1     │ 4     │ 10    │ 0      │
⋮
│ 16  │ SVC_n99 │ 0.953333 │ 0.0706233 │ 1     │ 3     │ 10    │ 0      │
│ 17  │ SVC_Ux0 │ 0.953333 │ 0.0449966 │ 3     │ 2     │ 10    │ 0      │
│ 18  │ SVC_O6M │ 0.953333 │ 0.0706233 │ 4     │ 1     │ 10    │ 0      │
│ 19  │ SVC_otD │ 0.953333 │ 0.0322031 │ 4     │ 4     │ 10    │ 0      │
│ 20  │ SVC_iIi │ 0.953333 │ 0.0322031 │ 5     │ 1     │ 10    │ 0      │
│ 21  │ SVC_XjT │ 0.953333 │ 0.0632456 │ 5     │ 2     │ 10    │ 0      │
│ 22  │ SVC_zEv │ 0.946667 │ 0.068853  │ 5     │ 5     │ 10    │ 0      │
│ 23  │ SVC_ldK │ 0.946667 │ 0.0421637 │ 3     │ 3     │ 10    │ 0      │
│ 24  │ SVC_n0e │ 0.946667 │ 0.0525874 │ 4     │ 3     │ 10    │ 0      │
│ 25  │ SVC_diL │ 0.946667 │ 0.0525874 │ 5     │ 3     │ 10    │ 0      │

the nice thing about this is that if you have a cluster of machines reachable over ssh, you can run this code in parallel across the cluster, and each crossvalidation will run on a separate machine, which is ideal if your parameter search spaces are huge. you can also create a wrapper to use the GridSearchCV in sklearn and incorporate it in the crossvalidation. the code of crossvalidation: https://github.com/IBM/AutoMLPipeline.jl/blob/master/src/crossvalidator.jl
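
As an alternative to nesting two @distributed loops, the grid can be flattened into one list of parameter pairs and handed to pmap from the Distributed standard library. The sketch below is only an illustration: the score formula is a made-up placeholder standing in for a crossvalidate call, chosen to peak at C = 3, gamma = 2.

```julia
using Distributed

# One flat list of (C, gamma) pairs instead of two nested loops.
grid = vec(collect(Iterators.product(1:5, 1:5)))

# pmap hands each pair to the next free worker (or runs locally when there are none).
results = pmap(grid) do (C, gamma)
    # Placeholder score standing in for a crossvalidate call.
    acc = 1.0 - 0.01 * ((C - 3)^2 + (gamma - 2)^2)
    (C=C, gamma=gamma, mean=acc)
end

# Pick the pair with the highest mean score.
best = results[argmax([r.mean for r in results])]
```

With a real learner inside the do block, the packages it needs would still have to be loaded on every worker with @everywhere, as in the earlier snippets.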


Thank you very much for the Linux/Unix pipe syntax, which is a very helpful foundation and a natural fit for distributed processing. I hope we can also elevate the language syntax to a slightly higher abstraction layer using mathematical notation, while keeping the automatic vectorization gears hidden/encapsulated >> https://en.wikipedia.org/wiki/Automatic_vectorization

So I’d also request syntax support for function composition (computer science), described here >>

" The ability to easily compose functions encourages factoring (breaking apart) functions for maintainability and code reuse. More generally, big systems might be built by composing whole programs ."

In general, allowing function calls to be recomposed / resequenced via recomposable functional programming is desirable. For example, I am presently rewriting a lot of my basically procedural Julia .jl script code into functions and function calls so that I can easily resequence the order of operations: when f(g(h(x))) is less effective for machine learning cross validation etc. than h(g(f(x))), reordering the operations is very easy.

Generally I see computer language syntax/mathematics notation as critical to expressivity, which is a major benefit of writing Julia code and actually at the heart of this thread’s subject, namely >> “Custom XGBoost Loss function w/ Zygote. Julia Computing blog post” << because cross validation of Zygote loss functions will involve mathematical notation such as the derivatives f'(x) and g'(x) and the chain rule for composed functions: (f(g(x)))' = f'(g(x)) * g'(x)

per https://en.wikipedia.org/wiki/Chain_rule
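
As a quick numerical sanity check of the chain rule, here is a small sketch in plain Julia. It uses a central finite difference rather than Zygote, and `deriv` as well as the choice of f and g are my own assumptions for illustration.

```julia
# Central finite difference; `deriv` is a hypothetical helper, not a Zygote function.
deriv(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

f(u) = u^2       # outer function, f'(u) = 2u
g(x) = sin(x)    # inner function, g'(x) = cos(x)

x = 0.7
lhs = deriv(f ∘ g, x)                 # derivative of the composition f(g(x))
rhs = deriv(f, g(x)) * deriv(g, x)    # f'(g(x)) * g'(x)
exact = 2 * sin(x) * cos(x)           # analytic value for comparison
```

Both sides agree with the analytic value up to finite-difference error; an automatic differentiation tool would return the same quantity without the step size h.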

A quick example:

Present Pipe syntax:

#Get models.
sk= AutoMLPipeline.SKLearners.learner_dict |> keys |> collect;


Proposed Function composition / Functional Programming pseudo-code syntax:

#Get models.
sk= collect(keys(AutoMLPipeline.SKLearners.learner_dict))

BTW it may be that the Functional Programming pseudo-code syntax above already (almost?) works, but even so I believe it’s still important to get in the habit of writing function-composition example code to support calculus derivative notation, e.g. (f(g(x)))' = f'(g(x)) * g'(x), so we can most easily write XGBoost custom loss functions here the way other boosting methods do, where they generalize by allowing optimization of an arbitrary differentiable loss function, as per

https://en.wikipedia.org/wiki/Gradient_boosting
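
For what it's worth, the pipe form, the nested-call form and Julia's built-in composition operator ∘ can be compared directly. The sketch below uses a small stand-in Dict instead of the real learner_dict, and the functions h and g are made up for illustration.

```julia
d = Dict("SVC" => 1, "Lasso" => 2, "LDA" => 3)   # stand-in for learner_dict

piped    = d |> keys |> collect |> sort     # pipe syntax
nested   = sort(collect(keys(d)))           # nested function calls
composed = (sort ∘ collect ∘ keys)(d)       # function composition with ∘

# Reordering a pipeline is a matter of reordering the composed functions.
h = x -> x + 1
g = x -> 2x
g_after_h = g ∘ h    # g(h(x))
h_after_g = h ∘ g    # h(g(x))
```

All three spellings give the same sorted key vector, while `g_after_h(3)` and `h_after_g(3)` differ because the order of application differs.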

HTH
:slightly_smiling_face:

Ps> I believe we'll have to get the math notation as compact and expressive as possible to hide/encapsulate complexity, because next up is “How to define a Fitness Function in a Genetic Algorithm?” :sunglasses: :+1:

One mathematical description here >>

https://www.mathworks.com/help/gads/examples/coding-and-minimizing-a-fitness-function-using-the-genetic-algorithm.html

and here >>
https://towardsdatascience.com/how-to-define-a-fitness-function-in-a-genetic-algorithm-be572b9ea3b4


thanks for the feedback. as i’m still recovering from covid, i’ll read them when i’m well and find ways to improve, add, extend the pipeline expressions. so far, i model this with unix pipes because it is so easy to follow the logic of incrementally adding processing elements to the right and easy to debug by taking things one at a time from left to right. Unix has been a powerful OS due partly to this idea of pipes. build each tool doing one function and doing it well with well-defined input/output interface. KISS philosophy. you can then combine those simple functions by placing them in a pipe.


Thank You, Take care, Hope you are better soon


please raise an issue or PR in https://github.com/IBM/AutoMLPipeline.jl so that I can track and update the development there.


After I train all MLJ models I create a table w/ TableView.jl to show the score (RMSE) & run time for each model:
[image: table of scores (RMSE) and run times for each model]
Currently MLJ interfaces to 50 regression models (continuous Y).
The usual suspects (gradient boosting: XGBoost & LightGBM) perform well.
Note how well the pure-julia EvoTrees.jl performs despite being so new!

sc = [train_m(m, X, y, train, test, predict, rms) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )

Question: how can I save the table/plot outputted by TableView.showtable()?
Is there a better way to create a table showing scores (that is storable) than TableView.showtable()?


Just construct a DataFrame from the names and scores?

julia> using  DataFrames

julia> m_names = ["EvoTreeRegressor", "BaggingRegressor"];

julia> sc = [2.5212, 2.5358];

julia> DataFrame(Model = m_names, RMSE = sc)
2×2 DataFrame
│ Row │ Model            │ RMSE    │
│     │ String           │ Float64 │
├─────┼──────────────────┼─────────┤
│ 1   │ EvoTreeRegressor │ 2.5212  │
│ 2   │ BaggingRegressor │ 2.5358  │

You can still display that in Juno with showtable, but also write it to CSV with CSV.write("results.csv", df) (after loading CSV.jl and assigning the DataFrame to df).

Thanks @nilshg.
I want to create a figure (jpg, pdf …) w/ the showtable(df) results so I can post the full table here, (or in a paper).
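
This does not produce an image, but for posting a full table as text, one option is to render it as a Markdown table. Below is a rough sketch in plain Julia; `markdown_table` is a hypothetical helper, and the names and scores are the two rows from the reply above.

```julia
# Render names and scores as a Markdown table (hypothetical helper).
function markdown_table(names, scores; header=("Model", "RMSE"))
    lines = ["| $(header[1]) | $(header[2]) |", "| --- | --- |"]
    for (n, s) in zip(names, scores)
        push!(lines, "| $n | $s |")
    end
    return join(lines, "\n")
end

m_names = ["EvoTreeRegressor", "BaggingRegressor"]
sc = [2.5212, 2.5358]
tbl = markdown_table(m_names, sc)
println(tbl)
```

The resulting string can be pasted into a post or saved with `write("results.md", tbl)`.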