How to properly use MLJBase.roc_curve

Hi all, I’m getting used to ML and Julia and am wondering how to properly use MLJ’s API. Specifically, I’m not sure what arguments MLJ.roc_curve expects. Here is the code I use to train and evaluate a model:


using DataFrames, CSV, MLJ, DecisionTree, MLDataUtils, Pipe

df = DataFrame(CSV.File("./data.csv"))
df = @pipe (DataFrames.transform(df, [:DGN, :PRE6] .=> x -> parse.(UInt32, SubString.(x, Ref(4))), renamecols=false) |>
    DataFrames.transform(_, :PRE14  => x -> parse.(UInt32, SubString.(x, Ref(3:4))), renamecols=false))

using MLDataUtils

y, X = unpack(df[!, Not("id")], ==(:Risk1Yr); rng=123);
(X_train, y_train), (X_val, y_val), (X_test, y_test) = stratifiedobs((X, y), p=(0.7, 0.2));

rf = RandomForestClassifier()
DecisionTree.fit!(rf, Matrix(X_train), y_train)

preds = DecisionTree.predict_proba(rf, Matrix(X_val))
MLJ.roc_curve(preds[:,2], y_val)

I get the following error:

MethodError: no method matching pdf(::Float64, ::Bool)
Closest candidates are:
  pdf(!Matched::Distributions.Logistic, ::Real) at ~/.julia/packages/Distributions/Vkexj/src/univariate/continuous/logistic.jl:81
  pdf(!Matched::KernelDensity.BivariateKDE, ::Any, !Matched::Any) at ~/.julia/packages/KernelDensity/bNBAQ/src/interp.jl:32
  pdf(!Matched::Distributions.Truncated, ::Real) at ~/.julia/packages/Distributions/Vkexj/src/truncate.jl:133
  ...

Stacktrace:
 [1] _broadcast_getindex_evalf
   @ ./broadcast.jl:670 [inlined]
 [2] _broadcast_getindex
   @ ./broadcast.jl:643 [inlined]
 [3] getindex
   @ ./broadcast.jl:597 [inlined]
 [4] copy
   @ ./broadcast.jl:899 [inlined]
 [5] materialize
   @ ./broadcast.jl:860 [inlined]
 [6] roc_curve(ŷm::Vector{Float64}, ym::SubArray{Bool, 1, Vector{Bool}, Tuple{Vector{Int64}}, false})
   @ MLJBase ~/.julia/packages/MLJBase/U4Dis/src/measures/roc.jl:48
 [7] top-level scope
   @ ~/code/julia/mltest/test_mlj.ipynb:17

Any advice?

This has definitely bitten me before: roc_curve doesn’t have great documentation and the error is pretty unhelpful. It works if you stay in the MLJ ecosystem. I believe it expects predictions in the form of MLJ’s own UnivariateFinite type, i.e. a distribution over classes (although @ablaom will be able to confirm whether this is true). That would explain the error: judging by the stack trace, roc_curve broadcasts pdf over the predictions, and pdf has no method for a plain Float64.

Here’s a full MWE:

julia> using MLJ, MLJDecisionTreeInterface

julia> X = rand(100, 3); y = rand(Bool, 100);

julia> Tree = @load RandomForestClassifier pkg=DecisionTree
[ Info: For silent loading, specify `verbosity=0`.
import MLJDecisionTreeInterface ✔
RandomForestClassifier

julia> tree = Tree()
RandomForestClassifier(
  max_depth = -1,
  min_samples_leaf = 1,
  min_samples_split = 2,
  min_purity_increase = 0.0,
  n_subfeatures = -1,
  n_trees = 10,
  sampling_fraction = 0.7,
  feature_importance = :impurity,
  rng = Random._GLOBAL_RNG())

julia> mach = machine(tree, (x1 = X[:, 1], x2 = X[:, 2]), categorical(y))
untrained Machine; caches model-specific representations of data
  model: RandomForestClassifier(max_depth = -1, …)
  args:
    1:  Source @514 ⏎ Table{AbstractVector{Continuous}}
    2:  Source @659 ⏎ AbstractVector{Multiclass{2}}


julia> fit!(mach)
[ Info: Training machine(RandomForestClassifier(max_depth = -1, …), …).
trained Machine; caches model-specific representations of data
  model: RandomForestClassifier(max_depth = -1, …)
  args:
    1:  Source @514 ⏎ Table{AbstractVector{Continuous}}
    2:  Source @659 ⏎ AbstractVector{Multiclass{2}}


julia> ŷ = predict(mach);

julia> typeof(ŷ)
UnivariateFiniteVector{Multiclass{2}, Bool, UInt32, Float64}

julia> roc_curve(ŷ, y)
([0.0, 0.034482758620689655, 0.10344827586206896, 0.13793103448275862, 0.1724137931034483, 0.22413793103448276, 0.25862068965517243, 0.3620689655172414, 0.41379310344827586, 0.5, 1.0], [0.0, 0.47619047619047616, 0.5714285714285714, 0.6428571428571429, 0.6428571428571429, 0.7142857142857143, 0.7857142857142857, 0.8095238095238095, 0.8333333333333334, 0.8333333333333334, 1.0], [1.0, 0.9, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
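If you would rather keep calling DecisionTree.jl directly, one option might be to wrap the raw probabilities in UnivariateFinite yourself before passing them to roc_curve. This is only a sketch: y_val and p_true below are made-up stand-ins for your validation labels and for preds[:, 2], and it assumes that column really holds the probability of the true class.

```julia
using MLJ  # provides UnivariateFinite, categorical and roc_curve

# Made-up stand-ins for the validation labels and for `preds[:, 2]`
# (assumed here to be the predicted probability of `true`).
y_val  = [true, false, true, false, true, false]
p_true = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]

# Wrap the raw probabilities as distributions over the classes [false, true].
# With `augment=true`, `p_true` is taken as the probability of the last class
# and the probability of the first class is filled in as 1 - p.
# `pool=missing` creates a fresh categorical pool for the two labels.
ŷ = UnivariateFinite([false, true], p_true; augment=true, pool=missing)

# Pass the distributions first and the (categorical) ground truth second.
fprs, tprs, thresholds = roc_curve(ŷ, categorical(y_val))
```

The three returned vectors are in the same order as in the output above, so fprs and tprs can go straight into a plotting call.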

Thanks, this worked! I’m curious though, what exactly is going on with these lines:

Tree = @load RandomForestClassifier pkg=DecisionTree
tree = Tree()

The code works when I load RandomForestClassifier in that way, but doesn’t work when I do

using DecisionTree
tree = RandomForestClassifier()

EDIT: Here’s the error message I get:

MethodError: no method matching machine(::RandomForestClassifier, ::SubDataFrame{DataFrame, DataFrames.Index, Vector{Int64}}, ::CategoricalArrays.CategoricalVector{Bool, UInt32, Bool, CategoricalArrays.CategoricalValue{Bool, UInt32}, Union{}})
Closest candidates are:
  machine(!Matched::Type{<:Model}, ::Any...; kwargs...) at ~/.julia/packages/MLJBase/U4Dis/src/machines.jl:335
  machine(!Matched::Static, ::Any...; cache, kwargs...) at ~/.julia/packages/MLJBase/U4Dis/src/machines.jl:339
  machine(!Matched::Union{Symbol, Model}, ::Any, ::Any...; scitype_check_level, kwargs...) at ~/.julia/packages/MLJBase/U4Dis/src/machines.jl:363
  ...

Stacktrace:
 [1] top-level scope
   @ ~/code/julia/mltest/test_mlj.ipynb:6

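I’m not an MLJ expert, but I think this is what interface packages like MLJDecisionTreeInterface are for: they expose the models in a way that makes them usable with all the MLJ machinery (machine, TunedModel, fit!, etc.). As far as I can tell, the RandomForestClassifier exported by DecisionTree itself is a different type that isn’t an MLJ Model, which would be why machine has no matching method for it.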
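A quick way to see the difference is to check both types against MLJ’s Model supertype. This is a minimal sketch, assuming MLJ, DecisionTree and MLJDecisionTreeInterface are all installed in the active environment; the variable names are only for illustration.

```julia
using MLJ
import DecisionTree  # `import` avoids name clashes with MLJ's own fit!/predict

# The model type provided by the interface package MLJDecisionTreeInterface.
Tree = @load RandomForestClassifier pkg=DecisionTree verbosity=0

mlj_model    = Tree()
native_model = DecisionTree.RandomForestClassifier()

# Same name, two unrelated types: only the first one is an MLJ model,
# so only the first one can be wrapped in a machine.
is_mlj_model    = mlj_model isa MLJ.Model
is_native_model = native_model isa MLJ.Model
```

Here is_mlj_model should be true and is_native_model false, which matches the MethodError: none of the listed machine methods accepts the native DecisionTree type.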
