Help me convert a weird data type into booleans!

Hello there,

I’ve been struggling to learn Julia for the last few days.
I’m especially interested in machine learning.

So, I’ve managed to train a decision tree model, and I’ve been able to make some predictions.

However, they came in the form of categorical distributions!
I’d like to convert them to booleans so I can submit my results to Kaggle (a data-science competition website).

Here’s what they look like:

julia> predictions
3281-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{2}, Bool, UInt8, Float32}:
 UnivariateFinite{Multiclass{2}}(false=>0.473, true=>0.527)
 UnivariateFinite{Multiclass{2}}(false=>0.805, true=>0.195)
 UnivariateFinite{Multiclass{2}}(false=>0.121, true=>0.879)
 UnivariateFinite{Multiclass{2}}(false=>0.359, true=>0.641)
 ...

Instead, I’d like an array of booleans so that my final result looks like this:

PassengerId,Transported
0013_01,False
0018_01,False
0019_01,False
0021_01,False
0023_01,False
0027_01,False

Thanks :pray:

The following should work:

  1. Run pdf.(x, true) to get the probability of true for each prediction.
  2. Then choose a cut-off threshold and classify based on that probability, e.g. pdf.(x, true) .> 0.5 to classify at 0.5; see the sketch below.
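
Here’s a minimal sketch of the full round trip, assuming predictions is the vector above and test is a DataFrame holding the matching PassengerId column (those names are placeholders for your own variables):

using CSV, DataFrames

probs = pdf.(predictions, true)   # probability assigned to `true` per observation
labels = probs .> 0.5             # Bool prediction at a 0.5 cut-off

# Kaggle's sample submission capitalises the booleans Python-style ("True"/"False"):
submission = DataFrame(
    PassengerId = test.PassengerId,
    Transported = ifelse.(labels, "True", "False"),
)
CSV.write("out.csv", submission)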

If you’re happy with 0.5 thresholding, you can also just call mode.(x), or call predict_mode instead of predict in MLJ (which I assume you are using). The CategoricalDistributions.jl README has more, and you may also want to look at the Working with Categorical Data section of the MLJ manual.
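
For instance (a sketch; mach and Xtest stand in for your fitted machine and test features):

using CategoricalArrays  # for unwrap

labels = mode.(predictions)        # most probable class per observation
# or skip the probabilistic step entirely:
labels = predict_mode(mach, Xtest)

# both return CategoricalValue{Bool} elements; unwrap recovers plain Bools:
raw = unwrap.(labels)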


Or you can even wrap your probabilistic model in MLJ’s BinaryThresholdPredictor to get a point predictor, and then optimise the threshold to minimise your loss by wrapping again with TunedModel. There is an example in the More on Probabilistic Predictors section of the MLJ manual.


@bkamins @ablaom Thank you both for your help.

I ended up wrapping my EvoTree with a BinaryThresholdPredictor.

Here’s how the final product looks, if anyone passing by is curious:

using MLJ  # provides @load, BinaryThresholdPredictor, TunedModel, machine, etc.

# Load the probabilistic classifier (EvoTrees must be in your environment):
EvoTreeClassifier = @load EvoTreeClassifier pkg = EvoTrees verbosity = 0

# Wrap it so it makes point (Bool) predictions at a tunable threshold:
prob_predictor = EvoTreeClassifier()
point_predictor = BinaryThresholdPredictor(prob_predictor, threshold=0.5)

balanced = BalancedAccuracy(adjusted=true)

# Tune the threshold over [0.1, 0.9] by random search, 6-fold cross-validated:
r = range(point_predictor, :threshold, lower=0.1, upper=0.9)
tuned_point_predictor = TunedModel(
  point_predictor,
  tuning=RandomSearch(rng=123),
  resampling=CV(nfolds=6),
  range=r,
  measure=balanced,
  n=30,
)
mach2 = machine(tuned_point_predictor, X, y) |> fit!  # X, y: training features/labels
optimized_point_predictor = report(mach2).best_model
optimized_point_predictor.threshold # 0.260
predict(mach2, X)[1:3] # [1, 1, 0]

# Estimate the tuned model's generalisation performance:
e = evaluate!(mach2, resampling=CV(nfolds=6), measure=[balanced, accuracy])
e.measurement[1] # 0.576 ± 0.0263

# Retrain the best point predictor on all of the data:
final_machine = machine(optimized_point_predictor, X, y) |> fit!
evaluate!(final_machine, measure=[balanced, accuracy])

I also uploaded my work to GitHub.


Hey, it’s cool to see MLJ applied to this Kaggle competition.

I see you’ve chosen to optimise balanced accuracy rather than plain accuracy, but note that the competition is scored on the latter. I’m curious whether you’d score any better using accuracy instead.
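
That would just be a one-line change in the tuning wrapper, swapping the measure (a sketch based on your code above):

tuned_point_predictor = TunedModel(
  point_predictor,
  tuning=RandomSearch(rng=123),
  resampling=CV(nfolds=6),
  range=r,
  measure=accuracy,  # the competition's metric, instead of balanced
  n=30,
)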


If you know of other competitions that would be a good fit for MLJ, I’ll gladly take a stab at them! And I’d welcome recommendations for other Julia ML frameworks, too.


I’ve just tried optimising for accuracy instead, and I got a slightly lower score (0.787 vs. 0.788). That seems plausible to me: we don’t actually know whether the test set is balanced, and it might be slightly unbalanced precisely to reward users who think about balancing, even when it isn’t strictly necessary.

I’ve also compared my two out.csv files. About 2% of my answers changed as a result of using the balanced model, while my score increased by only about 0.1%, so I’m not sure whether that’s a real improvement or just chance :slightly_smiling_face:
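
For what it’s worth, here is roughly how I compared the two submissions (the file names are just what I used locally):

using CSV, DataFrames, Statistics

a = CSV.read("out_balanced.csv", DataFrame)
b = CSV.read("out_accuracy.csv", DataFrame)
mean(a.Transported .!= b.Transported)  # fraction of answers that differ (≈ 0.02)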