Custom XGBoost Loss function w/ Zygote. Julia Computing blog post

There is a really cool recent blog post @ Julia Computing about how easily custom loss functions can be implemented in XGBoost w/ Zygote. They use the Titanic survival data from Kaggle & compare scores on the Leaderboard.

I wasn’t able to replicate their results & there was no place to post a comment so I’ll post my code here.

Code
#Download Kaggle Titanic data & go to that directory.
using CSVFiles, DataFrames
df = DataFrame(CSVFiles.load("train.csv"));
names(df)
#only keep: Age, Embarked, Sex, Pclass, SibSp, Parch, Fare
df= select(df, [:Age, :Embarked, :Sex, :Pclass, :SibSp, :Parch, :Fare, :Survived] )
#Num missing val for: Embarked
sum(df[:,:Embarked] .== "")
df[df[:,:Embarked] .== "", :Embarked] .= "S" #Impute w/ most freq val in Column.
#Num obs missing AGE #replace missing w/ average age.
sum(ismissing.(df[!,:Age]))
using Statistics
average_age = mean(df[.!ismissing.(df[!,:Age]), :Age])
df[ismissing.(df[!, :Age]), :Age] .= average_age
#use one-hot encoding for categoricals: Pclass and Embarked
for i in unique(df.Pclass)
	df[:,Symbol("Pclass_"*string(i))] = Int.(df.Pclass .== i)
end
#
for i in unique(df.Embarked)
	df[:,Symbol("Embarked_"*string(i))] = Int.(df.Embarked .== i)
end
#
gender_dict = Dict("male"=>1, "female"=>0);
df[!, :Sex] = map(x->gender_dict[x], df[!,:Sex]);
#
df= select(df, Not([:Pclass, :Embarked]) )
#
x_train = Matrix{Float32}(select(df[1:800, :], Not(:Survived)))
y_train = Float32.(df[1:800, :Survived])
#Validation data
x_val = Matrix{Float32}(select(df[801:end, :], Not(:Survived)))
y_val = Float32.(df[801:end, :Survived])
#
using XGBoost
train_dmat = DMatrix(x_train, label=y_train)
bst_base = xgboost(train_dmat, 2, eta=0.3, objective="binary:logistic", eval_metric="auc") #2 = number of boosting rounds
ŷ = predict(bst_base, x_val)
#function to calculate the accuracy and weighted f score
function evaluate(y, ŷ; threshold=0.5)
	out = zeros(Int64, 2,2)
	ŷ = Int.(ŷ.>=threshold)
	out[1,1]=sum((y.==0).&(ŷ.==0))
	out[2,2]=sum((y.==1).&(ŷ.==1))
	out[2,1]=sum((y.==1).&(ŷ.==0))
	out[1,2]=sum((y.==0).&(ŷ.==1))
	r0 = out[1,1]/(out[1,1]+out[1,2])
	p0 = out[1,1]/(out[1,1]+out[2,1])
	f0 = 2*p0*r0/(p0+r0)
	r1 = out[2,2]/(out[2,2]+out[2,1])
	p1 = out[2,2]/(out[2,2]+out[1,2])
	f1 = 2*r1*p1/(p1+r1)
	println("Weighted f1 = ",
		round(
		(sum(y .== 0.0)/length(y)) * f0 + (sum(y .== 1.0)/length(y)) * f1
		,digits=3) )
	println("Accuracy =", (out[2,2]+out[1,1])/sum(out))
	out
end
evaluate(y_val, ŷ)
#Custom loss function: weigh false negatives higher than false positives in our loss function
function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
		gradients = … #calculate gradients
		hessians = … #calculate hessians
		return gradients, hessians
end
#Derivative by hand (analytical gradient & hessian)
function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
	y = get_info(dtrain, "label") #labels must be pulled from the DMatrix inside the objective
	beta = 1.5
	p = 1 ./ (1 .+ exp.(-preds))
	grad = p .* ((beta - 1) .* y .+ 1) .- beta .* y
	hess = ((beta - 1) .* y .+ 1) .* p .* (1 .- p)
	return grad, hess
end
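#Where grad and hess come from: they are the first and second derivatives of the
#weighted logistic loss with respect to the raw score x. With p = σ(x) and weight β = 1.5
#on the positive class,
#    ℓ(x, y) = -β*y*log(p) - (1 - y)*log(1 - p)
#    ∂ℓ/∂x   = p*((β - 1)*y + 1) - β*y
#    ∂²ℓ/∂x² = ((β - 1)*y + 1)*p*(1 - p)
#which is exactly what the two lines above compute, and what Zygote reproduces automatically below.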
#Define Loss fcn
σ(x) = 1/(1+exp(-x))
weighted_logistic_loss(x, y) = -1.5*y*log(σ(x)) - (1 - y)*log(1 - σ(x))
#Zygote AD MAGIC
using Zygote
grad_logistic(x,y) = gradient(weighted_logistic_loss,x,y)[1]
hess_logistic(x,y) = gradient(grad_logistic,x,y)[1]
#Use Gradient/Hessian to define Custom Loss
function custom_objective(preds::Vector{Float32}, dtrain::DMatrix)
  y = get_info(dtrain, "label")
  grad = grad_logistic.(preds,y)
  hess = hess_logistic.(preds,y)
  return grad,hess
end
bst = xgboost(train_dmat, 2, eta=0.3, eval_metric="auc", obj=custom_objective)
ŷ = predict(bst, x_val)
evaluate(y_val, ŷ)

This small modification moved the authors up 6,400 places on the Kaggle leaderboard.

These kinds of tutorials are valuable for ML users considering Julia, especially as there is discussion about bringing back Julia support for Kaggle. To reduce the risk of users trying ML in Julia & getting discouraged, we should try to get these things right.

If you have suggestions on how to further improve this code, please chime in.

9 Likes

Reminded me of this from the Machine Learning Julia / MLJ community.

So I suggest also taking a look at https://alan-turing-institute.github.io/MLJTutorials/

End to end examples with MLJ

Crabs XGB: simple, classification, xg-boost, tuning

https://alan-turing-institute.github.io/MLJTutorials/end-to-end/crabs-xgb/

King County Houses: simple, regression, scientific type, tuning, xg-boost

https://alan-turing-institute.github.io/MLJTutorials/end-to-end/HouseKingCounty/

NOTE: Tuning helps a fair bit!

Getting started with MLJ

If you want to help, the best you can do is go through the MLJ Tutorials and ask questions on that repo. That should allow you to get a grasp of how things are done in MLJ, and if you have suggestions to improve the tutorials or add functionality to MLJ to make it more user-friendly, that will be very welcome there.

If you would like to comment on MLJ’s design decisions, I’d recommend you read the docs first; they are pretty complete.

Source Link: GitHub - JuliaAI/DataScienceTutorials.jl: A set of tutorials to show how to use Julia for data science (DataFrames, MLJ, ...)

Using RDatasets

The package RDatasets.jl provides access to most of the many datasets listed on this page. These are well-known, standard datasets that can be used to get started with data processing and classical machine learning, for instance iris, crabs, Boston, etc.

Source Link: https://alan-turing-institute.github.io/MLJTutorials/data/loading/index.html
List of all the standard datasets: http://vincentarelbundock.github.io/Rdatasets/datasets.html
The particular dataset you mentioned above, “Survival of passengers on the Titanic”: R: Survival of passengers on the Titanic

2 Likes

Thanks for putting this together! I’m guessing you couldn’t replicate at least in part because it looks like they’re using old DataFrames syntax. A couple of places where the data cleaning code could be clearer (IMO; all of these are purely personal preference):

  • count(==(""), df.Embarked) instead of sum(df[:,:Embarked] .== "")
  • df[df[:,:Embarked] .== "", :Embarked] .= "S" isn’t generic. That is, you’ve hard-coded “S” without showing how you found out what was most prevalent. I like countmap from StatsBase for stuff like this (rough sketch after this list)
  • count(ismissing, df.Age) instead of sum(ismissing.(df[!,:Age]))
  • mean(skipmissing(df.Age)) instead of mean(df[.!ismissing.(df[!,:Age]), :Age])
  • for changing gender to numbers, I’d do something like:
# ensure we have gender for everyone so female isn't imputed
@assert all(s -> in(s, ("male", "female")), df.Sex)
# true becomes 1, false becomes 0
df.Sex = Int.(s == "male" for s in df.Sex)
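
A minimal sketch of the countmap idea for the Embarked imputation (assuming StatsBase is installed and df is the frame from the original post):

using StatsBase
# tally each Embarked value, pick the most frequent one, and use it to fill the blanks
freqs = countmap(df.Embarked)
most_freq = first(sort(collect(freqs), by = last, rev = true)).first
df[df.Embarked .== "", :Embarked] .= most_freq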
2 Likes

Interesting to see (again) how our personal style preferences diverge :slight_smile:

But more to the point: why do you use the generator in the last line, rather than just

df.Sex = df.Sex .== "male"

Is there some benefit in reduced allocations or speed? I can’t measure any, but as usual my benchmarking might be off…

1 Like

Not sure, that’s what I would do too, but in the original they were Ints. I just assumed that was needed for the ML stuff later (and all of it is getting put into a matrix, so I thought they’d want the same type).

Sorry I should have used Int.(df.Sex .== "male"), my point was more around the broadcasted .== comparison vs. the (s == "male" for s in df.Sex) generator syntax.

Oh I see. No, no reason there :slight_smile:. I often avoid broadcasting except in places where it’s super obvious. I think I might have assumed it was faster, since I thought s .== "male" would allocate, and then Int.() would have to allocate again, but maybe julia recognizes and fuses these? Looks like the broadcasting is marginally faster, and allocates less:

julia> using BenchmarkTools

julia> s = rand(["male", "female"], 1_000_000);

julia> @benchmark Int.(map(x-> x == "male", $s))
BenchmarkTools.Trial:
  memory estimate:  8.58 MiB
  allocs estimate:  5
  --------------
  minimum time:     8.278 ms (0.00% GC)
  median time:      8.739 ms (0.00% GC)
  mean time:        9.143 ms (4.32% GC)
  maximum time:     12.423 ms (22.50% GC)
  --------------
  samples:          547
  evals/sample:     1

julia> @benchmark Int.(x == "male" for x in $s)
BenchmarkTools.Trial:
  memory estimate:  8.58 MiB
  allocs estimate:  5
  --------------
  minimum time:     8.525 ms (0.00% GC)
  median time:      9.027 ms (0.00% GC)
  mean time:        9.462 ms (4.25% GC)
  maximum time:     14.442 ms (21.88% GC)
  --------------
  samples:          529
  evals/sample:     1

julia> @benchmark Int.($s .== "male")
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     7.657 ms (0.00% GC)
  median time:      7.957 ms (0.00% GC)
  mean time:        8.491 ms (5.52% GC)
  maximum time:     12.271 ms (25.10% GC)
  --------------
  samples:          589
  evals/sample:     1

But I guess it’s the combination of broadcast and another method that’s the problem:

julia> @benchmark [Int(x=="male") for x in $s]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  3
  --------------
  minimum time:     7.825 ms (0.00% GC)
  median time:      8.244 ms (0.00% GC)
  mean time:        8.839 ms (5.41% GC)
  maximum time:     11.946 ms (27.29% GC)
  --------------
  samples:          566
  evals/sample:     1

julia> @benchmark map(x-> Int(x=="male"), $s)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  3
  --------------
  minimum time:     7.595 ms (0.00% GC)
  median time:      7.964 ms (0.00% GC)
  mean time:        8.512 ms (5.59% GC)
  maximum time:     11.825 ms (26.52% GC)
  --------------
  samples:          588
  evals/sample:     1

I’ll ask the authors of the post to update it with the suggestions in this thread.

@Albert_Zevelev Perhaps it would help if you could also submit a link to your post here in the kaggle forum for Julia support: Will kaggle include Julia? | Data Science and Machine Learning

-viral

2 Likes

Thank you @viralbshah, will do.
Regarding the discussion about support for Julia on Kaggle:

Kaggle has a lot of ML tutorials. For example, their House Price Regression contest has tutorials in Python & R. It could increase adoption if we created Julia tutorials for a few of their famous competitions including the House Price Data & the Titanic survival data. I’ve got some material I can contribute, but I’d want Julia community input to make sure the tutorials showcase the best of Julia.

The reason it might help to keep updated tutorials in an official location is that Julia is still rapidly evolving & many of the Julia ML tutorials I find online have code that no longer works, which could discourage newcomers trying out Julia.

2 Likes

I think that would be great. I tried out the Titanic one back when I was first learning Julia (v0.4 days) and Julia was ostensibly supported on Kaggle. Would be great to see updated versions.

Hi Kevin,

Maybe this is what you’re looking for? See the MLJ Tutorials links in my post above: https://alan-turing-institute.github.io/MLJTutorials/


Hi, Albert,

I think it is a great idea. When I started reading some tutorials (except the MLJ material proposed by @Marc.Cox, which is very recent), I found they no longer worked. More up-to-date examples in different places would be good.

1 Like

Here is my code to train all 49 regression models to predict house prices in Boston.
@tlienart helped w/ training all MLJ models
@mthelm85 helped w/ one_hot_encode

Functions to clean data, load & train all models:

Code

#load packages
using MLJ, RDatasets, TableView, DataFrames
################################################################################
#OHE/AZ(X)/load_m/train_m
#OHE
function one_hot_encode(d::DataFrame)
    encoded = DataFrame()
    for col in names(d), val in unique(d[!, col])
        lab = string(col) * "_" * string(val)
        encoded[!, Symbol(lab) ] = ifelse.(d[!, col] .== val, 1, 0)
    end
    return encoded
end
#AZ: convert Strings & Count to OHE.
function AZ(X)
    sch = schema(X);
    #ty = [CategoricalString{UInt8}, CategoricalString{UInt32}, CategoricalValue{Int64,UInt32}]
    tn = [Int, Float16, Float32, Float64]
    vs = [];
    for (name, type) in zip(sch.names, sch.types)
        if type ∉ tn  #∈ ty #∉ [Int32, Int64, Float64]
            #println(:($name) , "  ", type)
            push!(vs, :($name) )
            #global X = coerce(X, :($name) =>Continuous);
        end
    end
    #
    Xd= DataFrame(X);
    X_ohe = one_hot_encode( Xd[:, vs]  )
    Xd = hcat( X_ohe, select(Xd, Not( vs )) )
    Xd = coerce(Xd, autotype(Xd, :discrete_to_continuous))
    #sch= schema(Xd);
    #@show sch.scitypes;
    #
    X=Xd
    return X
end
#Load & make model list.
@inline function load_m(model_list)
    model_names = Vector{String}(undef, length(model_list))
    @inbounds for (i, model) in enumerate(model_list)
        #load(model.name, pkg=model.package_name)
        load(model.name, pkg=model.package_name, verbosity=0) #
        #@load model pkg=model.package_name verbosity=1
        model_names[i] = model.name
    end
    return model_names
end
#Train & Score.
#NOTE: if we do target engineering we need to transform Y back to compare score.
@inline function train_m(m::String, X, y, train, test, pr, meas; invtrans=identity)
    t1 = time_ns()
    println(m)
    if m =="XGBoostRegressor"
        mdl  = eval(Meta.parse("$(m)(num_round=500)"))
    elseif m=="EvoTreeRegressor"
        mdl  = eval(Meta.parse("$(m)(nrounds = 1500)"))
    else
        mdl  = eval(Meta.parse("$(m)()"))
    end
    #
    mach = machine(mdl, X, y)
    fit!(mach, rows=train, verbosity=0) #, verbosity=0
    #ŷ = MLJ.pr(mach, rows=test)
    ŷ = pr(mach, rows=test)
    ŷ = invtrans.(ŷ)
    y = invtrans.(y)
    #AZ Custom oos-R2
    if meas=="Rsq"
        ê = (ŷ-y[test]) #sse=ê'ê;
        ẽ = ( mean(y[train]) .- y[test] )
        R2 = ( 1 - ( (ê'ê)/(ẽ'ẽ) ) )*100
        s = R2
    elseif meas==rmsl
        s = meas(abs.(ŷ), abs.(y[test]) )  #abs.() for rmsl AMES.
    else
        s = meas(ŷ, y[test])
    end
    t2 = time_ns()
    return [round(s, sigdigits=5), round((t2-t1)/1.0e9, sigdigits=5)]
end

Applied to the Boston housing data:

Code

X, y =  @load_boston;
train, test = partition(eachindex(y), .7, rng=333);
X = AZ(X)
m_match = models(matching(X, y), x -> x.prediction_type == :deterministic);
m_names = load_m(m_match);

#
sc = [train_m(m, X, y, train, test, predict, rms) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, X, log.(y), train, test, predict, rms, invtrans=exp) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, log.(X.+1), y, train, test, predict, rms) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#
sc = [train_m(m, log.(X.+1), log.(y), train, test, predict, rms, invtrans=exp) for m in m_names]
sc =hcat(sc...)';
showtable( hcat(
    m_names[sortperm(sc[:,1])] ,
    sc[sortperm(sc[:,1]), :]
    ) )
#

All 49 models can be trained in seconds & the out-of-sample scores are competitive!

Note a lot more can be done:

  1. Moar Models (not currently in MLJ):
    MLJFlux.jl (WIP)
    @xiaodai’s JLBoost.jl, @joshday’s SparseRegression.jl, @rakeshvar’s AnyBoost.jl
    Also the MLJ Roadmap mentions: Turing.jl, Gen.jl, Soss.jl
  2. HP tuning (I currently use default HP grids):
    MLJTuning.jl (WIP) looks promising; a rough TunedModel sketch follows this list.
    I’d love to use @baggepinnen’s Hyperopt.jl to automatically tune all models w/ Bayesian optimization.
  3. Ensembling:
    MLJ has nice options for ensembles I’d like to automate for a large number of models.
    In addition @ppalmes’s AutoMLPipelines.jl is amazing (see discussion).
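
To give a sense of item 2, here is roughly what tuning a single model looks like with MLJ’s TunedModel. This is only a sketch: I picked EvoTreeRegressor with placeholder ranges for two of its hyper-parameters, and it assumes the same X, y, train, test as above (with the model already loaded by load_m).

using MLJ
#wrap one model in a self-tuning wrapper: grid search over two hyper-parameters, 5-fold CV
mdl = EvoTreeRegressor()
r1 = range(mdl, :nrounds, lower=100, upper=2000)
r2 = range(mdl, :max_depth, lower=3, upper=10)
tuned = TunedModel(model=mdl, tuning=Grid(resolution=8),
                   resampling=CV(nfolds=5), range=[r1, r2], measure=rms)
mach = machine(tuned, X, y)
fit!(mach, rows=train, verbosity=0)
ŷ = predict(mach, rows=test)
rms(ŷ, y[test])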

In addition to the Boston data I’ve run all models on:
Regression: Ames Iowa housing/Diamonds/King County Housing
Classification: Crabs/Iris/Titanic/Pima

MLJ community: please let me know if this works or if you have any feedback

Paulito, would it be possible to use your package to automatically train every model on the Boston housing data? And consider some stacked ensembles of the set of models?

7 Likes

definitely. this is what automlpipeline is designed for. you can do the evaluations in parallel using @distributed. if i have time today, i’ll give a sample code. it should be very short code. unfortunately, since sklearn libs are not thread-safe, you can only use @distributed instead of @threads.
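
A rough sketch of that kind of @distributed loop, borrowing the SKLearner/@pipeline/crossvalidate calls that appear further down this thread (the learner names are just examples, and X, y are assumed to be the prepared features and labels):

using Distributed
addprocs(4) #assumption: 4 local worker processes
@everywhere using AutoMLPipeline, DataFrames

learner_names = ["RandomForestClassifier", "AdaBoostClassifier", "GradientBoostingClassifier"]
results = @distributed (vcat) for m in learner_names
    learner = SKLearner(m)
    pipe = AutoMLPipeline.@pipeline learner
    m_, sd_, _ = crossvalidate(pipe, X, y, "accuracy_score", 10)
    DataFrame(name = m, mean = m_, sd = sd_)
end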

1 Like

for complex ensembles, here’s the doc with examples:

https://ibm.github.io/AutoMLPipeline.jl/dev/man/metaensembles/

2 Likes

Thanks!
In my code above I can automatically create a list of all matching models w/

m_match = models(matching(X, y), x -> x.prediction_type == :deterministic);

Is there an equivalent in your package?

i don’t store prediction type info in my model structure. the behavior of prediction type is implicit in the behavior of the fit and transform functions.

Here is how I automatically collect all (49) learners from sklearn:

using AutoMLPipeline
sk= AutoMLPipeline.SKLearners.learner_dict;
sk= keys(sk) |> collect |> x-> sort(x,lt=(x,y)->lowercase(x)<lowercase(y))

I realize some of these learners apply to regression, others to classification, but let’s put that aside for a moment.
Suppose I want to train all models in “sk”

learners = DataFrame()
for m in sk
    learner = SKLearner(m)
    pcmc = AutoMLPipeline.@pipeline learner
    println(learner.name)
    mean,sd,_ = crossvalidate(pcmc,X,y,"accuracy_score",10)
    global learners = vcat(learners,DataFrame(name=learner.name,mean=mean,sd=sd))
end;
@show learners;

It appears that the performance metrics in your package currently work for classification and not regression (such as RMSE) unless I’m missing something?

2 Likes

this is awesome, I’ll try to turn this into a tutorial when I have some time :slight_smile:

2 Likes

yeah. all the problems i’m working on right now are classification tasks. but it’s easy to add regression metrics to crossvalidation. you can look at the skcrossvalidator.jl source (AutoMLPipeline.jl/skcrossvalidator.jl at master · IBM/AutoMLPipeline.jl · GitHub). I’m using these metrics: 3.3. Metrics and scoring: quantifying the quality of predictions — scikit-learn 1.1.2 documentation.

i’ll create an issue to add regression support. I focus on classification, as regression is more trivial to implement once the classification workflow works. also, feel free to make a PR ;).

you can differentiate classifiers from regressors because their names contain a substring of either Classifier or Regressor.
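
For example, continuing from the sk vector of learner names collected above, a quick way to split them with Base string matching (a sketch; any names containing neither substring would need separate handling):

classifiers = filter(n -> occursin("Classifier", n), sk)
regressors = filter(n -> occursin("Regressor", n), sk)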