There is a really cool recent blog post from Julia Computing about how easily custom loss functions can be implemented in XGBoost with Zygote. They use the Titanic survival data from Kaggle and compare scores on the leaderboard.
I wasn't able to replicate their results, and there was no place to post a comment, so I'll post my code here.
Code
#Download Kaggle Titanic data & go to that directory.
using CSVFiles, DataFrames
df = DataFrame(CSVFiles.load("train.csv"));
names(df)
#only keep: Age, Embarked, Sex, Pclass, SibSp, Parch, Fare
df = select(df, [:Age, :Embarked, :Sex, :Pclass, :SibSp, :Parch, :Fare, :Survived])
#Num missing val for: Embarked
sum(df[:,:Embarked] .== "")
df[df[:,:Embarked] .== "", :Embarked] .= "S" #Impute w/ most freq val in Column.
#Num obs missing AGE #replace missing w/ average age.
sum(ismissing.(df[!,:Age]))
using Statistics
average_age = mean(df[.!ismissing.(df[!,:Age]), :Age])
df[ismissing.(df[!, :Age]), :Age] .= average_age
#use one-hot encoding for categoricals: Pclass and Embarked
for i in unique(df.Pclass)
    df[:, Symbol("Pclass_" * string(i))] = Int.(df.Pclass .== i)
end
#
for i in unique(df.Embarked)
    df[:, Symbol("Embarked_" * string(i))] = Int.(df.Embarked .== i)
end
#
gender_dict = Dict("male"=>1, "female"=>0);
df[!, :Sex] = map(x->gender_dict[x], df[!,:Sex]);
#
df = select(df, Not([:Pclass, :Embarked]))
#
x_train = Matrix{Float32}(select(df[1:800, :], Not(:Survived)))
y_train = Vector{Float32}(df[1:800, :Survived])
#Validation data
x_val = Matrix{Float32}(select(df[801:end, :], Not(:Survived)))
y_val = Vector{Float32}(df[801:end, :Survived])
#
using XGBoost
train_dmat = DMatrix(x_train, label=y_train)
bst_base = xgboost(train_dmat, 2, eta=0.3, objective="binary:logistic", eval_metric="auc")
ŷ = predict(bst_base, x_val)
#function to calculate the accuracy and weighted f score
function evaluate(y, ŷ; threshold=0.5)
    out = zeros(Int64, 2, 2)                  # confusion matrix: rows = actual, cols = predicted
    ŷ = Int.(ŷ .>= threshold)
    out[1,1] = sum((y .== 0) .& (ŷ .== 0))    # true negatives
    out[2,2] = sum((y .== 1) .& (ŷ .== 1))    # true positives
    out[2,1] = sum((y .== 1) .& (ŷ .== 0))    # false negatives
    out[1,2] = sum((y .== 0) .& (ŷ .== 1))    # false positives
    r0 = out[1,1] / (out[1,1] + out[1,2])     # recall, class 0
    p0 = out[1,1] / (out[1,1] + out[2,1])     # precision, class 0
    f0 = 2 * p0 * r0 / (p0 + r0)
    r1 = out[2,2] / (out[2,2] + out[2,1])     # recall, class 1
    p1 = out[2,2] / (out[2,2] + out[1,2])     # precision, class 1
    f1 = 2 * p1 * r1 / (p1 + r1)
    println("Weighted f1 = ",
        round((sum(y .== 0.0) / length(y)) * f0 + (sum(y .== 1.0) / length(y)) * f1, digits=3))
    println("Accuracy = ", (out[2,2] + out[1,1]) / sum(out))
    out
end
evaluate(y_val, ŷ)
#Custom loss function: weigh false negatives higher than false positives in our loss function
function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
    gradients = … #calculate gradients
    hessians = …  #calculate hessians
    return gradients, hessians
end
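For reference (my own derivation, so worth double-checking): with β = 1.5 weighting the false-negative term and p = σ(x) the predicted probability for raw margin x, the weighted loss is L(x, y) = -β·y·log(p) - (1 - y)·log(1 - p). Since ∂p/∂x = p(1 - p), differentiating with respect to x gives
∂L/∂x = p·((β - 1)·y + 1) - β·y
∂²L/∂x² = ((β - 1)·y + 1)·p·(1 - p)
which are exactly the grad and hess expressions in the hand-derived version below.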
#Derivative by hand (analytical)
function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
    beta = 1.5
    y = get_info(dtrain, "label")       # labels come from the DMatrix
    p = 1.0 ./ (1.0 .+ exp.(-preds))    # sigmoid of the raw margins
    grad = p .* ((beta - 1) .* y .+ 1) .- beta .* y
    hess = ((beta - 1) .* y .+ 1) .* p .* (1.0 .- p)
    return grad, hess
end
#Define the loss function
σ(x) = 1 / (1 + exp(-x))
weighted_logistic_loss(x, y) = -1.5 * y * log(σ(x)) - (1 - y) * log(1 - σ(x))
#Zygote AD MAGIC
using Zygote
grad_logistic(x,y) = gradient(weighted_logistic_loss,x,y)[1]
hess_logistic(x,y) = gradient(grad_logistic,x,y)[1]
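Not from the blog post, just a quick sanity check I ran to convince myself that the Zygote derivatives agree with the analytical formulas above (the test point is an arbitrary choice of mine):
# Sanity check: Zygote vs. the hand-derived gradient/hessian at one point
let x = 0.3, y = 1.0, beta = 1.5
    p = σ(x)
    @assert grad_logistic(x, y) ≈ p * ((beta - 1) * y + 1) - beta * y
    @assert hess_logistic(x, y) ≈ ((beta - 1) * y + 1) * p * (1 - p)
end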
#Use Gradient/Hessian to define Custom Loss
function custom_objective(preds::Vector{Float32}, dtrain::DMatrix)
    y = get_info(dtrain, "label")       # labels come from the DMatrix
    grad = grad_logistic.(preds, y)
    hess = hess_logistic.(preds, y)
    return grad, hess
end
bst = xgboost(train_dmat, 2, eta=0.3, eval_metric="auc", obj=custom_objective)
ŷ = predict(bst, x_val)
evaluate(y_val, ŷ)
According to the blog post, this small modification moved the authors up 6,400 places on the Kaggle leaderboard.
These kinds of tutorials are valuable for ML users considering Julia, especially as there is discussion about bringing back Julia support for Kaggle. To reduce the risk of users trying ML in Julia and getting discouraged, we should try to get these things right.
If you have suggestions on how to further improve this code, please chime in.