RandomForestRegressor in Julia

Shayan · July 18, 2022, 8:26pm

I’m trying to train a RandomForestRegressor using DecisionTree.jl and RandomizedSearchCV (from ScikitLearn.jl) in Julia. The primary datasets (x_train, y_train, etc.) are available on my Google Drive, so you can test it on your machine. The code is as follows:

using CSV
using DataFrames

using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree

x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

mod = RandomForestRegressor()

param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])

model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(x_test)

This throws a MethodError like this:

ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...
Stacktrace:
 [1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
 [2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
   @ ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
 [3] top-level scope
   @ c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17

If you’re curious about the shape of the data:

julia> size(x)
(1550, 70)

julia> size(y_train)
(1550, 10)

How can I solve this problem? Any help would be appreciated.

Jeff_Emanuel · July 18, 2022, 9:17pm

https://github.com/bensadeghi/DecisionTree.jl/blob/master/src/scikitlearnAPI.jl#L300

fit! expects a vector for its third argument rather than a matrix.
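
To illustrate the point about the third argument, here is a minimal sketch in plain Julia. ToyRegressor and its fit! method are hypothetical stand-ins, not the DecisionTree.jl implementation; they only mirror a signature that takes a matrix of features and a vector of targets.

```julia
# Hypothetical stand-in for a single-output regressor (not the DecisionTree.jl type).
mutable struct ToyRegressor
    mean::Float64
end
ToyRegressor() = ToyRegressor(0.0)

# Like the method linked above, this only accepts a vector of targets.
function fit!(m::ToyRegressor, X::AbstractMatrix, y::AbstractVector)
    size(X, 1) == length(y) || throw(DimensionMismatch("X and y must have the same number of rows"))
    m.mean = sum(y) / length(y)
    return m
end

X = rand(6, 3)
Y = rand(6, 2)   # two target columns stored as a Matrix

# Passing the Matrix finds no applicable method, as in the error above.
matrix_call_failed = try
    fit!(ToyRegressor(), X, Y)
    false
catch err
    err isa MethodError
end

# Selecting one column yields a Vector, which dispatches correctly.
model = fit!(ToyRegressor(), X, Y[:, 1])
```

With a 10-column y_train as in the question, this means a single call can only fit one target column at a time.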

Palli · July 18, 2022, 9:17pm

Hi, you reminded me, I just saw (not yet registered, is in the General queue; can still be used):

This package implements the Stable and Interpretable RUle Sets (SIRUS) for classification. Regression is also technically possible but not yet implemented.

The SIRUS algorithm was presented by Bénard et al. in 2020 and 2021. In short, SIRUS combines the predictive accuracy of random forests with the explainability of decision trees while remaining stable. […]

Intriguing, but not too helpful since you’re looking for regression (you could check whether it’s already available through e.g. Python/scikit-learn).

I just googled a bit, and you can see how RandomForestRegressor is used here: https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Decision_Tree_Regression_Julia.ipynb

and RandomizedSearchCV here (though with a classifier; I’m unfamiliar with this, but since it also works with regression according to scikit-learn’s docs, it should also be workable from Julia): https://github.com/cstjean/ScikitLearn.jl/blob/master/examples/Randomized_Search.ipynb

I’m more curious myself whether it’s possible, and how it’s done, without ScikitLearn (though using it with Python code is very viable):

I did find this code (for pre-1.0 Julia):

and this for current:

Shayan · July 18, 2022, 9:52pm

Hmm. Such a shame. In Python you can pass multiple target variables to the RandomForestRegressor class to fit and predict! But in Julia, everything is at an early stage, I guess.

Palli · July 18, 2022, 10:08pm

If the fix is as simple as:

fit!(model, Matrix(x), Vector(DataFrames.dropmissing(y_train)))

then that warrants an additional question: why weren’t the suggestions more helpful?

Closest candidates are:
  fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
  fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
  fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
  ...

Julia seems to assume the first argument is wrong (or at least sorts the candidates that way), while it is actually the last one that is wrong, or sort of (a vector is a form of matrix, at least in MATLAB; in Julia, Vector{T} <: AbstractVector{T}, which is an alias for AbstractArray{T,1}).

I suppose Python would be more forgiving if you “cast” to a Matrix (or you wouldn’t even need that cast), rather than to Vector.

In a similar situation (an Int vs. Float mix-up), I proposed that packages provide extra methods; here, defining the same method but for Matrix would help.

It’s probably not good, as in then having a runtime check (with slowdown), but it could be implemented as a suggestion: fit!(..., Matrix) = not_implemented_suggest("fit!(..., Vector)").
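
As a rough sketch of that idea in plain Julia: ToyForest and both fit! methods below are hypothetical and only illustrate the pattern; they are not code from DecisionTree.jl or ScikitLearn.jl. The fallback is a separate method selected by dispatch on the argument type, rather than a check inside the vector method.

```julia
struct ToyForest end   # hypothetical model type, for illustration only

# The supported signature: targets as a vector.
fit!(m::ToyForest, X::AbstractMatrix, y::AbstractVector) = m

# Fallback for the likely mix-up: targets passed as a matrix.
# It is a separate method chosen by dispatch, not a check inside the vector method.
function fit!(m::ToyForest, X::AbstractMatrix, y::AbstractMatrix)
    throw(ArgumentError(
        "fit! expects a vector of targets, got a $(size(y, 1))×$(size(y, 2)) matrix; " *
        "try fit!(model, X, y[:, j]) for a single target column j"))
end

# Calling with a matrix now produces the suggestion instead of a bare MethodError.
message = try
    fit!(ToyForest(), rand(4, 2), rand(4, 3))
    ""
catch err
    err isa ArgumentError ? err.msg : rethrow()
end
println(message)
```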

It’s not great to have to do this in packages, so could the order of suggestions for “Closest candidates” in Julia be changed, to take likely mixed-up types into account?

I suggest ordering the candidates as if the first argument were correct (as would be likely in the OOP world): perhaps start from the other end, with one suggestion for the last argument being wrong, then one for the next to last, and then, here, one for the first argument being wrong.

Palli · July 18, 2022, 10:13pm

Does it make sense for the last argument to be an Array? It’s possibly just not implemented yet in the wrapper package ScikitLearn.jl. If I understood you correctly, then that’s even better for the proposal I just posted: not_implemented (EDIT: renamed above) wasn’t meant as in “yet” (rather for cases that would never apply), and in that case there should just be a method. Can you fix the wrapper by making a PR?

Shayan · July 19, 2022, 11:22am

Yes, it does. It’s called a multi-output (or multi-target) model. It’s already supported in sklearn in Python.

Not at this time.
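
One possible workaround for the multi-output case, sketched in plain Julia, is to fit one single-output model per target column. MeanModel, fit_single, predict_single, fit_multi, and predict_multi are all hypothetical names; MeanModel is a trivial stand-in for any regressor whose fit step accepts only a vector. Fitting the columns independently is an assumption of this sketch and is not necessarily equivalent to what sklearn does internally for multi-output forests.

```julia
using Statistics: mean

# Hypothetical single-output learner: predicts the training mean of its target.
# It stands in for any regressor whose fit step accepts only a vector.
struct MeanModel
    value::Float64
end
fit_single(X::AbstractMatrix, y::AbstractVector) = MeanModel(mean(y))
predict_single(m::MeanModel, X::AbstractMatrix) = fill(m.value, size(X, 1))

# Fit one independent model per target column.
fit_multi(X::AbstractMatrix, Y::AbstractMatrix) =
    [fit_single(X, Y[:, j]) for j in axes(Y, 2)]

# Stack the per-column predictions into an n_samples × n_targets matrix.
predict_multi(models, X::AbstractMatrix) =
    reduce(hcat, [predict_single(m, X) for m in models])

X = rand(8, 3)
Y = rand(8, 2)
models = fit_multi(X, Y)
Y_hat = predict_multi(models, rand(5, 3))
```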

Shayan · July 19, 2022, 1:33pm

I gave up on using DecisionTree.jl, and ScikitLearn.jl is adequate in my case:

using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames


x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)

# Convert the DataFrame to a Matrix first, then reshape it into a single row (one sample)
x_test = reshape(Matrix(x_test), 1, :)

mod = RandomForestRegressor()
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
                  "max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)

fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))

predict(model, x_test)

This works fine for me, but it’s super slow! Much slower than Python.

Palli · July 20, 2022, 3:14pm

That’s interesting. Using a Python library from Julia should never be much slower, so I wonder if you did something wrong, and how to profile it.

Note, what I wrote assumed using it directly, e.g. with PyCall.jl or PythonCall.jl which is an option for you. If you use the ScikitLearn.jl wrapper (or any (thin) wrapper), it shouldn’t add overhead.

I also had simple/single-threaded in mind. I don’t know if this changes things:

I looked at all the code files and noticed: ScikitLearn.jl/grid_search.jl at e70bf7208306110d91f1cfe183cb27ccf88e9215 · cstjean/ScikitLearn.jl · GitHub

Is it something as simple as running Julia like this:

julia --procs auto

It’s quite slow to start that way (at least in Julia 1.8-rc1, with my 16 cores), but after startup it could give up to a 16x speedup (for me), if actually exploited. Maybe you’re measuring the fixed startup overhead (which Python doesn’t have?). It seems way too excessive (11 sec for me, rather than the usual 0.2 sec startup), and should (and I believe could) be fixed in some Julia version.

I’m not sure Distributed is exploited in the package (i.e. should I also see e.g. @everywhere there?). Was that the plan, and is the wrapper incomplete?

What’s your timing, both with Julia and with pure Python code? Can you monitor and see if Python spawns many processes (and Julia does not)?
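
When collecting those timings on the Julia side, it may help to separate one-off compilation from steady-state run time. A minimal sketch, where work is a hypothetical stand-in for the actual fit! call:

```julia
# Hypothetical workload standing in for fit!(model, X, y).
work(n) = sum(sqrt(i) for i in 1:n)

first_call  = @elapsed work(10^6)   # first call also pays for JIT compilation of `work`
second_call = @elapsed work(10^6)   # later calls reuse the compiled code

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

Process startup time (for example with --procs) is not captured by @elapsed inside the session, so it would need to be measured separately from outside Julia.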

Shayan · July 21, 2022, 3:30pm

Since Discourse policy is riddddddddiculous and I couldn’t edit my reply, you can find your answer on Stack Overflow.
