Parallel Random Forest



I am using the Random Forest algorithm for classification, calling build_forest() and then apply_forest().
As these operations run on only one process, how can I parallelize them?
And how can I generate a graph for the same?


I’m not perfectly sure what you want to do, but I guess you want to parallelize the training and prediction of a random forest. The easiest way, as far as I know, is Threads.@threads, which runs a loop body in parallel across multiple threads. Example code for training might look like this:

function train_forest(X, Y, n_trees)
    trees = make_trees(n_trees)
    Threads.@threads for i in 1:n_trees
        train_tree!(trees[i], X, Y)
    end
    return RandomForest(trees)
end
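(Note that make_trees and train_tree! above are placeholders, not functions from an actual package.) A self-contained sketch of the same Threads.@threads pattern, with a dummy workload standing in for training one tree:

```julia
# Each iteration writes to its own slot of `results`, so no locking is needed.
function parallel_fit(n_models)
    results = Vector{Float64}(undef, n_models)
    Threads.@threads for i in 1:n_models
        results[i] = sum(abs2, rand(1_000))  # stand-in for training one tree
    end
    return results
end

results = parallel_fit(8)
```

Remember to start Julia with multiple threads (e.g. via the JULIA_NUM_THREADS environment variable, or `julia -t 4` on recent versions), otherwise the loop runs serially.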


Yes, what I am doing is building a forest model like

model = build_forest(yTrain, xTrain, 20, 50, 1.0)

where yTrain is the labels and xTrain is the features, and then applying the model:

predTest = apply_forest(model, xTest)

where xTest is the test matrix.

As all these operations run on a single process, what I want is to parallelize this task.
How can I do this?


I guess you are using the DecisionTree.jl package?

If so, it looks like the forest training is already parallelized through the @parallel macro, so you would only have to run addprocs() before training your model, and then build_forest should use multiple workers (check the link below for the package source code).


Please mention that this is cross-listed.


Yes, I am using the DecisionTree package. I tried addprocs(4) in my code, but after reading the test data set I got an error.

It also reports an error at the build_forest() function call, showing:

ERROR (unhandled task failure): On worker 4:

and similarly for worker 3.


After addprocs you need to load your packages on each worker, like this:

import DecisionTree
@everywhere using DecisionTree
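Putting the pieces together, a minimal sketch of the multi-process workflow might look like the following. The hyperparameter values mirror the build_forest call earlier in the thread, and on recent Julia versions addprocs lives in the Distributed standard library:

```julia
using Distributed
addprocs(4)                      # start 4 worker processes first
@everywhere using DecisionTree   # then load the package on every worker

# yTrain, xTrain, and xTest are assumed to exist as in the question
model = build_forest(yTrain, xTrain, 20, 50, 1.0)
predTest = apply_forest(model, xTest)
```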


Thank you @bjarthur :smile: it works!

But now I am getting lower accuracy than previously on one process. Is there any way to improve the accuracy and efficiency of the algorithm?

Also, is there any way to store the trained model so that I can load it and use it directly on the test data set? Currently it retrains the model every time.


That’s probably just random fluctuation. You can improve accuracy by tweaking the hyperparameters (depth, number of trees, pruning threshold), but you have to be careful about overfitting. You can either set up cross-validation yourself and loop over different combinations of hyperparameter values, or use the ScikitLearn.jl interface, along with GridSearchCV, to do model selection.
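If you prefer to stay within DecisionTree.jl, a rough sketch of a manual grid search using its built-in nfoldCV_forest could look like this. The grid values are purely illustrative, and the argument order of nfoldCV_forest has changed across package versions, so check the documentation for your installed version:

```julia
using DecisionTree
using Statistics  # for mean

best_acc = 0.0
best_params = nothing
for n_subfeatures in (10, 20), n_trees in (50, 100)
    # 3-fold cross-validated accuracies for this parameter combination
    accs = nfoldCV_forest(yTrain, xTrain, n_subfeatures, n_trees, 3)
    acc = mean(accs)
    if acc > best_acc
        best_acc, best_params = acc, (n_subfeatures, n_trees)
    end
end
```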

JLD.jl should work for saving pure-Julia structures to disk.
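A minimal save/load sketch with JLD.jl (the file name is arbitrary; JLD2.jl offers a nearly identical API):

```julia
using JLD

# after training:
JLD.save("forest_model.jld", "model", model)

# later, in a fresh session, skip retraining:
model = JLD.load("forest_model.jld", "model")
predTest = apply_forest(model, xTest)
```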