Help me convert a weird data type into booleans!

clouedoc · January 28, 2023, 6:05pm

Hello there,

I’ve been struggling to learn Julia for the last few days.
I’m especially interested in machine learning.

So, I’ve managed to train a decision tree model, and I’ve been able to make some predictions.

However, they came in the form of categorical distributions!
I’d like to convert them to booleans so I can submit my results to Kaggle (a data science competition website).

Here’s what they look like:

julia> predictions
3281-element CategoricalDistributions.UnivariateFiniteVector{Multiclass{2}, Bool, UInt8, Float32}:
 UnivariateFinite{Multiclass{2}}(false=>0.473, true=>0.527)
 UnivariateFinite{Multiclass{2}}(false=>0.805, true=>0.195)
 UnivariateFinite{Multiclass{2}}(false=>0.121, true=>0.879)
 UnivariateFinite{Multiclass{2}}(false=>0.359, true=>0.641)
 ...

Instead, I’d like an array of booleans, so that my final result looks like this:

PassengerId,Transported
0013_01,False
0018_01,False
0019_01,False
0021_01,False
0023_01,False
0027_01,False

Thanks

bkamins · January 28, 2023, 6:59pm

The following should work:

Run pdf.(x, true) to get the probabilities of true.
Then choose a cut-off threshold to turn each probability into a prediction, e.g. pdf.(x, true) .> 0.5 (if you want to classify at 0.5).
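
As a minimal sketch of those two steps, the snippet below rebuilds the first four predictions from the question by hand and thresholds them. It assumes MLJ is loaded (for UnivariateFinite and pdf); the names probs, x, p_true and is_true are made up for illustration.

```julia
using MLJ  # assumed; provides UnivariateFinite and pdf

# Hand-built stand-in for the first four predictions in the question.
# Each row holds the probabilities of (false, true) for one observation.
probs = [0.473 0.527;
         0.805 0.195;
         0.121 0.879;
         0.359 0.641]
x = UnivariateFinite([false, true], probs, pool=missing)

# Step 1: probability of `true` for every observation
p_true = pdf.(x, true)

# Step 2: apply the cut-off to get one boolean per observation
is_true = p_true .> 0.5
```

Raising or lowering the 0.5 in the last line trades false positives against false negatives.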

ablaom · January 28, 2023, 9:05pm

If you’re happy with 0.5 thresholding you can also just call mode.(x), or call predict_mode instead of predict in MLJ (which I assume you are using). The CategoricalDistributions.jl readme has more. You may also want to look at the Working with Categorical Data section of the MLJ manual.
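
A minimal sketch of the mode route, on hand-built stand-ins for the first four predictions in the question. It assumes MLJ and CategoricalArrays are both installed; unwrap (from CategoricalArrays) turns each categorical value back into a plain Bool. The helper kaggle_label and the variable names are hypothetical, and the capitalised True/False strings simply mirror the CSV shown in the question.

```julia
using MLJ                        # assumed; provides UnivariateFinite and mode
using CategoricalArrays: unwrap  # assumed to be installed alongside MLJ

# Each row holds the probabilities of (false, true) for one observation.
probs = [0.473 0.527;
         0.805 0.195;
         0.121 0.879;
         0.359 0.641]
x = UnivariateFinite([false, true], probs, pool=missing)

# mode picks the most probable class of each distribution
labels = mode.(x)        # categorical values, not yet plain Bools
bools = unwrap.(labels)  # plain Bool values

# Hypothetical helper: render a Bool the way the submission file spells it
kaggle_label(b::Bool) = b ? "True" : "False"
column = kaggle_label.(bools)
```

With a fitted machine, predict_mode(mach, Xnew) should give the same labels as mode.(predict(mach, Xnew)) without the intermediate distributions.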

ablaom · January 29, 2023, 7:20am

Or you can even wrap your probabilistic model using MLJ’s BinaryThresholdPredictor to get a point-predictor and optimise the threshold to minimise your loss by wrapping again using TunedModel. There is an example in the More on Probabilistic Predictors section of the MLJ manual.

clouedoc · January 30, 2023, 10:37am

@bkamins @ablaom Thank you both for your help.

I ended up wrapping my EvoTree with a BinaryThresholdPredictor.

Here’s how the final product looks, if anyone passing here is curious:

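using MLJ

# Load the model type (requires the EvoTrees package to be installed)
EvoTreeClassifier = @load EvoTreeClassifier pkg=EvoTrees verbosity=0

# X (features) and y (binary target) are loaded elsewhere

# Wrap the probabilistic model so that it predicts classes instead of distributions
prob_predictor = EvoTreeClassifier()
point_predictor = BinaryThresholdPredictor(prob_predictor, threshold=0.5)

balanced = BalancedAccuracy(adjusted=true)

# Tune the threshold by random search over [0.1, 0.9], scored with 6-fold CV
r = range(point_predictor, :threshold, lower=0.1, upper=0.9)
tuned_point_predictor = TunedModel(
  point_predictor,
  tuning=RandomSearch(rng=123),
  resampling=CV(nfolds=6),
  range=r,
  measure=balanced,
  n=30,
)
mach2 = machine(tuned_point_predictor, X, y) |> fit!
optimized_point_predictor = report(mach2).best_model
optimized_point_predictor.threshold # 0.260
predict(mach2, X)[1:3] # [1, 1, 0]

# Estimate the performance of the self-tuning model (tuning is repeated within each fold)
e = evaluate!(mach2, resampling=CV(nfolds=6), measure=[balanced, accuracy])
e.measurement[1] # 0.576 ± 0.0263

# Retrain the best model on all the data, then evaluate it with the default resampling
final_machine = machine(optimized_point_predictor, X, y) |> fit!
evaluate!(final_machine, measure=[balanced, accuracy])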

I also uploaded my work to GitHub.

ablaom · January 30, 2023, 7:37pm

Hey, it’s cool to see MLJ applied to this Kaggle competition.

I see you’ve chosen to optimise balanced accuracy rather than accuracy, but note that the competition is scored on the latter. I’m curious whether you score any better if you use accuracy instead.

clouedoc · January 31, 2023, 1:34pm

If you know of other competitions that are a good fit for MLJ, I will gladly take a crack at them! I’d also welcome any Julia ML framework recommendations.

I’ve tried optimizing for accuracy instead just now, and I got a lower score (0.787 < 0.788), which seems pretty logical to me, since we don’t actually know whether the testing dataset is balanced or not. They might be using a slightly unbalanced testing set to reward users who think about balancing, even though it’s not super necessary.

I’ve compared my two out.csv files. About 2% of my answers changed because of using a balanced model. On the other hand, my score increased by only about 0.1%, so I’m uncertain whether there’s really an improvement or just pure chance.
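
A rough back-of-the-envelope check of the "pure chance" worry, as a sketch only. The row count n is an assumption taken from the 3281 predictions printed at the top of the thread; the number of rows Kaggle actually scores may differ. The variable names are made up for illustration.

```julia
# Assumption: n is the number of scored test rows (3281 is the length of the
# prediction vector shown in the question, not a figure published by Kaggle).
n = 3281
frac_changed = 0.02  # share of answers that differ between the two files
gain = 0.001         # score difference, 0.788 - 0.787

changed = frac_changed * n  # number of answers that differ (about 66)
net = gain * n              # net extra correct answers (about 3)

# With two classes, every changed answer moves the count of correct answers
# by +1 or -1. If both directions are equally likely (no real improvement),
# the net change has standard deviation sqrt(changed).
sd_net = sqrt(changed)
z = net / sd_net            # observed gain in units of that standard deviation
```

Under these assumptions the gain comes out well below one standard deviation, so the two submissions cannot be told apart on this evidence alone.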
