Parallel Random Forest

rushirg · March 2, 2017, 5:32am

I am using the Random Forest algorithm for classification, with build_forest() and then apply_forest().
These operations run on only one process. How can I parallelize them?
And how can I generate a graph for the same?

bicycle1885 · March 2, 2017, 6:50am

I’m not entirely sure what you want to do, but I guess you want to parallelize the training and prediction of a random forest. The easiest way, as far as I know, is Threads.@threads, which runs a loop body in parallel on multiple threads. Example code for training might look like this:

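function train_forest(X, Y, n_trees)
    # make_trees, train_tree! and RandomForest are placeholders for your own code
    trees = make_trees(n_trees)
    # each iteration trains one tree; iterations are split across threads
    Threads.@threads for i in 1:n_trees
        train_tree!(trees[i], X, Y)
    end
    return RandomForest(trees)
end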
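A runnable variant of the loop above, as a minimal sketch only. The Stump type and train_stump function are hypothetical stand-ins for a real tree and its training routine (they ignore labels entirely), and Julia has to be started with more than one thread (for example via the JULIA_NUM_THREADS environment variable) for the loop to run in parallel at all.

```julia
using Random, Statistics

# Hypothetical stand-in for a decision tree: a single split on one feature.
struct Stump
    feature::Int
    threshold::Float64
end

# "Train" a stump on a bootstrap sample. Each call builds its own RNG from
# the seed, so threads never share random state.
function train_stump(X::AbstractMatrix, seed::Integer)
    rng = MersenneTwister(seed)
    n, p = size(X)
    rows = rand(rng, 1:n, n)      # bootstrap sample of row indices
    feature = rand(rng, 1:p)      # pick one feature at random
    return Stump(feature, mean(X[rows, feature]))
end

function train_forest(X::AbstractMatrix, n_trees::Integer)
    trees = Vector{Stump}(undef, n_trees)
    Threads.@threads for i in 1:n_trees
        trees[i] = train_stump(X, i)   # each iteration writes only its own slot
    end
    return trees
end

X = rand(MersenneTwister(0), 100, 5)
forest = train_forest(X, 20)
println(length(forest), " stumps trained on ", Threads.nthreads(), " thread(s)")
```

Preallocating the result vector and seeding one RNG per tree keeps the iterations independent, which is what makes the loop safe to run on several threads.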

rushirg · March 2, 2017, 7:08am

Yes, what I am doing is building a forest model like this:

model = build_forest(yTrain, xTrain, 20, 50, 1.0)

where yTrain is the labels and xTrain is the features, and then applying the model:

predTest = apply_forest(model, xTest)

where xTest is the test matrix.

All of these operations run on a single process, and I want to parallelize them.
How can I do this?

fabiangans · March 2, 2017, 7:20am

I guess you are using the DecisionTree.jl package?

If so, it looks like the forest training is already parallelized through the @parallel macro, so you would only have to run addprocs() before training your model and then build_forest should use multiple workers (Check the link below for the package source code).

https://github.com/bensadeghi/DecisionTree.jl/blob/master/src/regression.jl#L121

ChrisRackauckas · March 2, 2017, 7:48am

Please mention that this is cross-listed:

http://stackoverflow.com/questions/42548249/how-to-perform-parallel-execution-of-random-forest-in-julia

rushirg · March 5, 2017, 6:25pm

Yes, I am using the DecisionTree package. I tried addprocs(4) in my code, but after reading the test data set I got an error like this:

It also reports an error at the build_forest() function call,
and shows an error like:

ERROR (unhandled task failure): On worker 4:

and similarly for worker 3.

bjarthur · March 6, 2017, 1:41pm

After addprocs, you need to load your packages on each worker, like this:

import DecisionTree
@everywhere using DecisionTree
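The ordering matters: workers must exist before packages are loaded on them. Below is a minimal sketch of that pattern using only the standard library. train_stub is a hypothetical stand-in for training one tree, and on Julia 1.0 and later the Distributed standard library has to be loaded explicitly first.

```julia
using Distributed

addprocs(2)                      # start two local worker processes

# Must come after addprocs, so that the workers just started also load these.
@everywhere using Statistics, Random

# Hypothetical stand-in for training one tree; defined on every process.
@everywhere function train_stub(seed)
    rng = MersenneTwister(seed)
    return mean(rand(rng, 1_000))
end

scores = pmap(train_stub, 1:8)   # each call runs on whichever worker is free
println("workers: ", nworkers(), ", results: ", length(scores))

rmprocs(workers())               # shut the workers down again
```

If the @everywhere lines are left out, the calls on the workers fail, because the function and the packages it needs are not defined there.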

rushirg · March 6, 2017, 4:49pm

Thank you @bjarthur, it works!

But now I am getting lower accuracy than before on one process. Is there any way to improve the accuracy and efficiency of the algorithm?

Also, is there any way to store the trained model so that I can load it and use it directly on the test data set? Currently the model is trained every time.

cstjean · March 6, 2017, 5:24pm

That’s probably just a random fluctuation. You can improve accuracy by tweaking the hyperparameters (depth, number of trees, pruning threshold), but you have to be careful about overfitting. You can either set up cross-validation yourself and loop over different combinations of hyperparameter values, or use the ScikitLearn.jl interface along with GridSearchCV to do model selection.

JLD.jl should work for saving pure-Julia structures to disk.
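As an illustration of the save-then-reload workflow, here is a minimal sketch that uses Julia's Serialization standard library instead of JLD.jl, so that it runs without extra packages. The Dict is a hypothetical stand-in for a trained model, and the file name is arbitrary. Note that the serialization format is not guaranteed to stay readable across Julia versions, so it suits short-term caching better than long-term storage.

```julia
using Serialization

# Hypothetical stand-in for a trained model.
model = Dict("n_trees" => 20, "thresholds" => [0.1, 0.5, 0.9])

path = joinpath(mktempdir(), "model.jls")

# Write the model to disk once, after training.
open(io -> serialize(io, model), path, "w")

# Later (for example in a new session), load it instead of retraining.
restored = open(deserialize, path)

println("restored model matches original: ", restored == model)
```

When loading a real model this way, the package that defines its types has to be loaded before deserialize is called.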

Ajaychat3 · September 20, 2018, 5:21am

How can I do parallel computing with modules imported using PyCall and @pyimport? I am trying to do something like this, but it does not work. If I set n_jobs > 1 and remove the 3rd line of code from the top (@everywhere (@pyimport lightgbm as lgb)), it still uses a single processor. I am using Julia 1.0 on 64-bit Windows 10.

using PyCall : @pyimport
@pyimport lightgbm as lgb
@everywhere (@pyimport lightgbm as lgb) 
        model = lgb.LGBMClassifier(colsample_bytree=1.0,
                    learning_rate=0.1, max_depth=-1, min_child_samples=20,
                    min_child_weight=0.001, min_split_gain=0.0, n_estimators=250,
                    n_jobs=1, num_leaves=31, objective="binary", random_state=123,
                    reg_alpha=0.0, reg_lambda=0.0, subsample=1.0)
        
        fit!(model, X, y)

The error displayed is

On worker 6:
LoadError: UndefVarError: @pyimport not defined
top-level scope
eval at .\boot.jl:319
#116 at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\process_messages.jl:276
run_work_thunk at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\process_messages.jl:56
run_work_thunk at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\process_messages.jl:65
#102 at .\task.jl:259
in expression starting at In[37]:13
#remotecall_wait#154(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker, ::Module, ::Vararg{Any,N} where N) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:407
remotecall_wait(::Function, ::Distributed.Worker, ::Module, ::Vararg{Any,N} where N) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:398
#remotecall_wait#157(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64, ::Module, ::Vararg{Any,N} where N) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:419
remotecall_wait(::Function, ::Int64, ::Module, ::Vararg{Any,N} where N) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\remotecall.jl:419
(::getfield(Distributed, Symbol("##163#165")){Module,Expr})() at .\task.jl:259

...and 3 more exception(s).


Stacktrace:
 [1] sync_end(::Array{Any,1}) at .\task.jl:226
 [2] remotecall_eval(::Module, ::Array{Int64,1}, ::Expr) at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\macros.jl:207
 [3] top-level scope at C:\cygwin\home\Administrator\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.0\Distributed\src\macros.jl:190
 [4] top-level scope at In[37]:5

bernhard · September 20, 2018, 6:07am

Maybe it helps if you do
@everywhere using PyCall
so that the package is loaded on all procs.

Ajaychat3 · September 20, 2018, 7:22am

@bernhard
Thanks, it does start parallel processing. However, I am not gaining any speed from it. My data size is about (220K, 28). Without parallel processing, a single run takes about 10.6s. With parallel processing on, it takes about 16s.

I would like to know how I can gain an improvement in speed.

bernhard · September 20, 2018, 7:27am

Well, parallelization is not always trivial. Not all problems benefit from it (it depends on cache, data size, …).

However, I am not quite sure what exactly you are running in your code, because you are using PyCall.
Effectively the model is fitted in Python, right?
If so, I am not sure whether any ‘Distributed Code in Julia’ (or additional Julia procs) will change anything at all, because Python is doing the work here.

Ajaychat3 · September 20, 2018, 7:42am

@bernhard
I have recently moved to Julia and am trying to move my Python models over. Since there is no lightgbm or XGBoost model in Julia, I am trying to use them through PyCall.

I agree that calling Python models from Julia may not be as efficient as native Julia models, but I have no choice, hence my current exploration. In Python, with the help of Cython, I was able to run 10 iterations of 3-fold CV in approximately 95 seconds, and that on a single core.

Just another question: if I were to wrap the whole model fit in a function, what changes would I need to make to the code given in my first post?

bernhard · September 20, 2018, 7:58am

Hm, I do not know if there is a native xgboost or gbm implementation in Julia. It does not seem to be the case. Maybe you can find something on https://pkg.julialang.org/
But I guess you already searched and opted for PyCall (an alternative would of course be RCall).

Have you tried this: https://github.com/bensadeghi/DecisionTree.jl ?
If your problem is binary classification, it might work, right?

I have an experimental Julia package which includes a regression boosting approach, but it is not well documented and will likely not fit your purpose (kafisatz/DecisionTrees.jl on GitHub: Julia Decision Tree Algorithms for Regression).

bernhard · September 20, 2018, 8:00am

Not sure what you mean by that.
How about this:

function myfit(X, y)
    model = lgb.LGBMClassifier(colsample_bytree=1.0,
                learning_rate=0.1, max_depth=-1, min_child_samples=20,
                min_child_weight=0.001, min_split_gain=0.0, n_estimators=250,
                n_jobs=1, num_leaves=31, objective="binary", random_state=123,
                reg_alpha=0.0, reg_lambda=0.0, subsample=1.0)

    fit!(model, X, y)
    return model
end

Ajaychat3 · September 20, 2018, 8:58am

Thanks for your quick reply.

"Not sure what you mean by that"

What I meant was: what changes would be required to use parallel computing inside a function call that fits the model? You have already shared code for wrapping the model fit inside a function. Where do I put the parallel computing code in this case, inside or outside the function, and what should it be?

baggepinnen · September 20, 2018, 9:06am

Ajaychat3 · September 20, 2018, 9:11am

I have looked at it but not able to compile the model.

Ajaychat3 · September 20, 2018, 3:13pm

Thanks, I will look into the models you suggested.
