Unable to evaluate inputs comprising columns of either String or Int type: convert error

Hi, I’m trying out MLJ to do decision tree regression where the inputs are categorical columns holding either strings or ints. Below is the schema of the input (origin and destination are h3 hashes).

┌─────────────┬───────────────────┬────────────────────────────────────┐
│ names       │ scitypes          │ types                              │
├─────────────┼───────────────────┼────────────────────────────────────┤
│ dow         │ OrderedFactor{7}  │ CategoricalValue{Int64, UInt32}    │
│ start_month │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32}    │
│ start_hour  │ OrderedFactor{24} │ CategoricalValue{Int64, UInt32}    │
│ origin      │ Multiclass{8}     │ CategoricalValue{String15, UInt32} │
│ destination │ Multiclass{7}     │ CategoricalValue{String15, UInt32} │
└─────────────┴───────────────────┴────────────────────────────────────┘

But when I run evaluate on it, i.e. evaluate(DecisionTreeRegressor(), X, y), I get the following error:

┌ Error: Problem fitting the machine machine(DecisionTreeRegressor(max_depth = 0, …), …). 
└ @ MLJBase C:\Users\test\.julia\packages\MLJBase\fEiP2\src\machines.jl:682
[ Info: Running type checks... 
[ Info: Type checks okay. 

MethodError: Cannot `convert` an object of type 
  CategoricalArrays.CategoricalValue{Int64,UInt32{}} to an object of type 
  CategoricalArrays.CategoricalValue{Union{Int64, String15},UInt32{}}

Closest candidates are:
  convert(::Type{T}, ::T) where T
   @ Base Base.jl:64
  (::Type{T})(::T) where T<:CategoricalArrays.CategoricalValue
   @ CategoricalArrays C:\Users\test\.julia\packages\CategoricalArrays\0yLZN\src\value.jl:95
  (::Type{CategoricalArrays.CategoricalValue{T, R}} where {T<:Union{AbstractChar, AbstractString, Number}, R<:Integer})(::Any, ::Any)
   @ CategoricalArrays C:\Users\test\.julia\packages\CategoricalArrays\0yLZN\src\typedefs.jl:80

It seems that because some columns are strings and others are Ints, MLJ fuses their element types into Union{String,Int64}, but for some reason a CategoricalArrays.CategoricalValue{Int64, ...} cannot be converted to a CategoricalArrays.CategoricalValue{Union{Int64,String}, ...}. I confirmed this behavior with a minimal example:

julia> convert(CategoricalArrays.CategoricalValue{Union{String,Int64}, UInt32}, X.dow[1])
ERROR: MethodError: Cannot `convert` an object of type
  CategoricalArrays.CategoricalValue{Int64,UInt32{}} to an object of type
  CategoricalArrays.CategoricalValue{Union{Int64, String},UInt32{}}

Any advice on what I should be doing?

I’m using Julia 1.9.3 and MLJ is version v0.20.1.

Can you please provide a MWE with the input table?

This is a portion of the CSV data file:

"dow","start_month","start_hour","start_minute","origin","destination","speed","dt"
0,10,9,44,AVOCADO,BANANA,4.5509095,452
0,10,9,54,CANATLOUPE,DURIAN,1.6525811,188
0,10,10,53,ELDERBERRY,FIG,4.696476,1739
0,10,14,36,CANATLOUPE,DURIAN,1.8465607,264
0,10,14,35,BANANA,CANATLOUPE,0.6611374,95
0,10,14,34,GUAVA,BANANA,0.0,27
0,10,22,43,CANATLOUPE,DURIAN,1.0265952,151
0,10,22,41,BANANA,CANATLOUPE,0.0,135
0,10,23,47,HAWTHORN,IMBE,5.8344016,420
1,10,9,0,CANATLOUPE,DURIAN,2.8944004,236
1,10,8,59,BANANA,CANATLOUPE,0.0,67
1,10,8,58,GUAVA,BANANA,0.0,20
1,10,9,59,ELDERBERRY,FIG,5.3777676,1521
1,10,11,57,JACKFRUIT,LIME,3.381521,48
1,10,14,18,FIG,MANGO,0.0,88

and the code I used to load:

using MLJ;
using CSV;
using DataFrames;
using DataFramesMeta;

speed_stats = DataFrame(CSV.File("filtered_speed_stats.csv"));
speed_stats_typed = @transform(speed_stats,
    :origin = categorical(:origin, ordered=false),
    :destination = categorical(:destination, ordered=false),
    :dow = categorical(:dow),
    :start_month = categorical(:start_month),
    :start_hour = categorical(:start_hour)
);
y, X = unpack(speed_stats_typed, ==(:dt));
X = coerce(
    select(X, :dow, :start_month, :start_hour, :origin, :destination),
    :dow => OrderedFactor,
    :start_month => OrderedFactor,
    :start_hour => OrderedFactor,
    :origin => Multiclass,
    :destination => Multiclass,
);
y = coerce(y, Continuous);
RandomForestRegressor = @load RandomForestRegressor pkg=BetaML;
model = RandomForestRegressor();
evaluate(model, X, y);

cc @sylvaticus

I’ll look at it (tomorrow)…

I managed to find a workaround by converting the string categorical arrays to their integer level codes:

 X_int = @transform(X,
  :origin = categorical(levelcode.(:origin), ordered=false),
  :destination = categorical(levelcode.(:destination), ordered=false))

resulting in the schema:

julia> schema(X_int)
┌─────────────┬───────────────────┬─────────────────────────────────┐
│ names       │ scitypes          │ types                           │
├─────────────┼───────────────────┼─────────────────────────────────┤
│ dow         │ OrderedFactor{7}  │ CategoricalValue{Int64, UInt32} │
│ start_month │ OrderedFactor{2}  │ CategoricalValue{Int64, UInt32} │
│ start_hour  │ OrderedFactor{24} │ CategoricalValue{Int64, UInt32} │
│ origin      │ Multiclass{8}     │ CategoricalValue{Int64, UInt32} │
│ destination │ Multiclass{7}     │ CategoricalValue{Int64, UInt32} │
└─────────────┴───────────────────┴─────────────────────────────────┘

and this works when evaluated!
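
For anyone unfamiliar with levelcode, here is a minimal sketch of what the workaround above does, on a toy vector rather than the real data (the values are made up for illustration). It only assumes CategoricalArrays is installed. The codes are positions in levels(origin), so the original labels can be recovered from them.

```julia
using CategoricalArrays

# Toy stand-in for the :origin column (hypothetical values).
origin = categorical(["AVOCADO", "BANANA", "AVOCADO"])

# levelcode returns the integer position of each value within levels(origin).
codes = levelcode.(origin)

# Re-wrapping the codes gives a categorical vector with Int values,
# matching the element type of the other columns.
origin_int = categorical(codes)

# The mapping is reversible as long as the original levels are kept around.
recovered = levels(origin)[codes]
```

One thing to keep in mind with this approach is that the integer codes carry no meaning of their own, so the levels need to be stored if the labels are needed later.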

That aside though, shouldn’t the original data (the one with string categorical arrays) work?

Yes, indeed. The model works even without the two coerce calls (in your first post), because the inner BetaML model already treats string values in X as categorical and integer values in Y as continuous by default.

I still have two problems that I need to check:

  • why the coerce call on X leads to the error
  • why I get the MLJ warning, since in theory the BetaML random forest should accept “any” type of input data

EDIT: The error in BetaML is indeed linked to the conversion of the dataframe to a Matrix (Matrix(X)); that is the line that raises the error.
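
To make the Matrix(X) step concrete, here is a rough sketch on a toy table (hypothetical values, not the data from the thread). It assumes DataFrames and CategoricalArrays are installed. Whether the plain Matrix call fails may depend on the package versions, so the sketch records the outcome instead of assuming it. Requesting an element type of Any is one possible way to avoid forcing both columns into a single categorical type; it is not necessarily what BetaML does or should do.

```julia
using DataFrames, CategoricalArrays

# Toy table with one Int-valued and one String-valued categorical column.
df = DataFrame(
    dow = categorical([0, 1, 0]),
    origin = categorical(["AVOCADO", "BANANA", "AVOCADO"]),
)

# Matrix(df) has to pick one element type for all columns. In the thread,
# this is the conversion where the MethodError was raised.
outcome = try
    Matrix(df)
    "converted"
catch err
    "failed with $(typeof(err))"
end
println(outcome)

# Asking for Matrix{Any} skips the promotion to a common categorical type.
M = Matrix{Any}(df)
```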

I’d like to add that one of the reasons we created DataScienceTraits.jl in place of ScientificTypes.jl was that the defaults in ScientificTypes.jl were not very good.

Actually, it defaulted to treating y as Count, so while I think it will still work with the regression, doing models(matching(X,y)) returns nothing.
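
As a small illustration of the Count default mentioned here, the sketch below uses ScientificTypes directly (MLJ re-exports the same functions) on a few made-up integer targets. It shows the default scitype of an integer vector and what coerce(y, Continuous) changes.

```julia
using ScientificTypes

# Integer targets (hypothetical values) are interpreted as Count by default.
y = [452, 188, 1739]
default_scitype = scitype(y)

# Coercing to Continuous converts the values to floating point.
y_cont = coerce(y, Continuous)
coerced_scitype = scitype(y_cont)

println(default_scitype)
println(coerced_scitype)
```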

As I’m just starting out with Julia, the convo went over my head, but if I got it correctly, it’s less an error due to misuse and more a gap in MLJ’s conversion for BetaML?