RandomForestRegressor in Julia

I’m trying to train a RandomForestRegressor using DecisionTree.jl and RandomizedSearchCV (contained in ScikitLearn.jl) in Julia. The primary datasets, such as x_train and y_train, are available on my Google Drive as well, so you can test it on your machine. The code is as follows:

using CSV
using DataFrames

using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree

x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

mod = RandomForestRegressor()

param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])

model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(x_test)

This throws a MethodError like this:

ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17

If you’re curious about the shape of the data:

julia> size(x)
(1550, 70)

julia> size(y_train)
(1550, 10)

How can I solve this problem? Any help would be appreciated.

https://github.com/bensadeghi/DecisionTree.jl/blob/master/src/scikitlearnAPI.jl#L300

fit! expects a vector for its third argument rather than a matrix.
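
As a possible workaround, here is a minimal sketch that trains one independent forest per target column, so that each call to fit! receives a vector. It assumes DecisionTree.jl is installed and that treating the targets independently is acceptable for your problem. The data is synthetic, and the names X, Y, models and Y_hat are made up for the example; this is not multi-output support in the package.

```julia
using DecisionTree   # exports RandomForestRegressor, fit!, predict

# Synthetic stand-in data (assumption): 200 samples, 5 features, 2 targets.
X = rand(200, 5)
Y = hcat(X[:, 1] .+ X[:, 2], 2.0 .* X[:, 3])

# Train one forest per target column; Y[:, j] is a Vector{Float64}.
models = map(1:size(Y, 2)) do j
    m = RandomForestRegressor(n_trees=50, max_depth=5)
    fit!(m, X, Y[:, j])
    m
end

# Predict each target separately and glue the columns back together.
Y_hat = reduce(hcat, [predict(m, X) for m in models])
```

The same idea should carry over to RandomizedSearchCV by running one search per column, at the cost of tuning each target separately.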

Hi, you reminded me of a package I just saw (not yet registered, it is in the General registry queue, but it can still be used):

This package implements the Stable and Interpretable RUle Sets (SIRUS) for classification. Regression is also technically possible but not yet implemented.

The SIRUS algorithm was presented by Bénard et al. in 2020 and 2021. In short, SIRUS combines the predictive accuracy of random forests with the explainability of decision trees while remaining stable. […]

Intriguing, but not too helpful since you’re looking for regression (you could check whether it is already available through e.g. Python/scikit-learn).

I just googled a bit, and you can see how RandomForestRegressor is used here: https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Decision_Tree_Regression_Julia.ipynb

and RandomizedSearchCV here (though with a classifier; I’m unfamiliar with this, but since it also works with regression according to scikit-learn’s docs, it should also be workable from Julia): https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Randomized_Search.ipynb

I’m more curious myself whether, and how, this can be done without ScikitLearn (although using it with Python code is very viable):

I did find this code (for pre-1.0 Julia):

and this for current:

Hmm. Such a shame. In Python you can pass multiple target variables to the RandomForestRegressor class for fit and predict! But in Julia, everything is still at an early stage, I guess.

If the fix is as simple as:

fit!(model, Matrix(x), Vector(DataFrames.dropmissing(y_train)))

then that warrants an additional question: why weren’t the suggestions more helpful?

Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...

Julia seems to assume the first argument is the wrong one (or at least sorts the candidates that way), while it is actually the last one that is wrong, or sort of wrong: a Vector is a kind of Matrix, at least in MATLAB, whereas in Julia Vector{T} <: AbstractVector{T} (an alias for AbstractArray{T,1}), which is not a subtype of AbstractMatrix{T}.
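
To make the type relationship concrete, here is a small sketch using only Base Julia. The function takes_vector is a made-up stand-in for a method that, like the fit! in question, only accepts a vector.

```julia
v = [1.0, 2.0, 3.0]          # Vector{Float64}
M = reshape(v, :, 1)         # 3×1 Matrix{Float64}

# A Vector is not a Matrix in Julia's type system.
vector_is_matrix = v isa AbstractMatrix                              # false

# AbstractVector{T} is just an alias for AbstractArray{T,1}.
alias_holds = AbstractVector{Float64} === AbstractArray{Float64,1}   # true

# Hypothetical function with a vector-only method.
takes_vector(y::AbstractVector) = length(y)

# A one-column matrix does not match the vector method...
matrix_accepted = applicable(takes_vector, M)                        # false

# ...but vec turns it into a vector that does.
flattened = vec(M)
n = takes_vector(flattened)                                          # 3
```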

I suppose Python would be more forgiving if you “cast” to a Matrix (or you wouldn’t even need that cast), rather than to Vector.

In a similar situation (an Int vs Float mix-up), I’ve proposed that packages provide extra methods; here, defining the same method except taking a Matrix would help.

It’s probably not good, as in then having a runtime check (with slowdown), but it could be implemented as a suggestion: fit!(..., Matrix) = not_implemented_suggest("fit!(..., Vector)").
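
A rough sketch of that idea, with a made-up ToyRegressor type standing in for the real model (this is not DecisionTree.jl’s code, and the error text is only an illustration). Since the hint lives in its own method, dispatch selects it and the vector method gains no extra check.

```julia
# Hypothetical model type, used only for illustration.
struct ToyRegressor end

# The "real" method: targets must be a vector (training is elided here).
fit!(model::ToyRegressor, X::AbstractMatrix, y::AbstractVector) = model

# Extra method whose only job is to produce a helpful error for matrix targets.
function fit!(::ToyRegressor, X::AbstractMatrix, y::AbstractMatrix)
    throw(ArgumentError(
        "fit! expects a vector of targets but got a matrix of size $(size(y)); " *
        "pass a single column, e.g. y[:, 1]"))
end

X = rand(10, 3)

fitted = fit!(ToyRegressor(), X, rand(10))   # vector targets: accepted

# Matrix targets: capture the error instead of crashing the script.
err = try
    fit!(ToyRegressor(), X, rand(10, 2))
    nothing
catch e
    e
end
```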

It’s not great to have to do this in packages, so could the order of suggestions for “Closest candidates” in Julia be changed, to take likely mixed-up types into account?

I suggest ordering the suggestions on the assumption that the first argument is correct (as would be likely in the OOP world), maybe starting at the other end: one suggestion where the last argument is wrong, then one where the next-to-last is wrong, and only then, in this case, one where the first argument is wrong.

Does it make sense for the last argument to be an Array? It’s possibly just not implemented yet in the wrapper package ScikitLearn.jl. If I understood you correctly, then that’s even better for my proposal. The not_implemented helper I just posted (EDIT: renamed above) wasn’t meant as “not yet” (rather for cases that would never apply), and in that case there should just be a real method. Can you fix the wrapper by making a PR?

Yes, it does. It’s called the Multi-output model, or Multi-Target model. It’s already applicable in sklearn in python.

Not at this time.

I gave up on using DecisionTree.jl, and ScikitLearn.jl is adequate in my case:

using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames


x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

x_test = reshape(x_test, 1,length(x_test))

mod = RandomForestRegressor()
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(model, x_test)

This works fine for me, but it’s super slow! Much slower than Python.

That’s interesting. Using a Python library from Julia should never be much slower, so I wonder whether something went wrong, and how to profile it.

Note, what I wrote assumed using it directly, e.g. with PyCall.jl or PythonCall.jl which is an option for you. If you use the ScikitLearn.jl wrapper (or any (thin) wrapper), it shouldn’t add overhead.

I also had simple/single-threaded in mind. I don’t know if this changes things:

I looked at all the code files and noticed: ScikitLearn.jl/grid_search.jl at e70bf7208306110d91f1cfe183cb27ccf88e9215 · cstjean/ScikitLearn.jl · GitHub

Is it about something as simple as running Julia like this:

julia --procs auto

It’s quite slow to start that way (at least in Julia 1.8-rc1, with my 16 cores), but after startup it could give up to a 16x speedup (for me), if actually exploited. Maybe you’re measuring the fixed startup overhead (which Python doesn’t have?). That overhead seems far too large (11 sec for me, rather than the usual 0.2 sec startup), and should (and I believe could) be fixed in some Julia version.

I’m not sure Distributed is exploited in the package (i.e. should I also see e.g. @everywhere there?). Was it the plan, and the wrapper incomplete?
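
For reference, this is roughly what explicit use of Distributed looks like in plain Julia. It is a generic sketch with a made-up slow_square function and says nothing about what ScikitLearn.jl does internally.

```julia
using Distributed

addprocs(2)                     # start two local worker processes
n_workers = nworkers()

# Functions used on workers must be defined on every process.
@everywhere slow_square(x) = (sleep(0.1); x^2)

# pmap spreads the calls over the available workers.
results = pmap(slow_square, 1:4)

rmprocs(workers())              # shut the workers down again
```

If a package relies on this mechanism, starting Julia with extra processes can help; if it does not, the extra workers just sit idle.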

What’s your timing, both with Julia and with pure Python code? Can you monitor and see if Python spawns many processes (and Julia does not)?
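
One simple way to get comparable numbers on the Julia side is to time the same call twice, since the first call in a fresh session also includes compilation. A minimal sketch, where work is a made-up placeholder for the fit!/predict calls being measured:

```julia
# Placeholder workload (assumption); substitute the real fit!/predict calls.
work(n) = sum(sqrt(i) for i in 1:n)

t_first  = @elapsed work(10^6)   # includes JIT compilation of work
t_second = @elapsed work(10^6)   # closer to the steady-state run time

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Comparing the second measurement with the Python timing should be fairer than comparing whole-script run times.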

Since Discourse policy is ridiculous and I couldn’t edit my reply, you can find the answer on Stack Overflow.