I implemented Decision Tree and Random Forest classifiers/regressors, based originally on Josh Gordon's code in Python.
The problem is that building the forest is very slow… It is said everywhere that one of the advantages of random forests is that they are computationally cheap, but my implementation takes around 10 minutes to train a 30-tree forest on just 5k records… Where did I go wrong? Too many small functions?
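For reference, this is roughly how I measure it (a minimal sketch: `xtrain` and `ytrain` are just placeholders standing in for my 5k-record dataset):

```julia
using Profile

# Placeholder data with roughly the same shape as my real dataset (5k records, 10 features)
xtrain = rand(5000, 10)
ytrain = rand(["a", "b", "c"], 5000)

# Time a single forest build (30 trees is the default)
@time buildForest(xtrain, ytrain, 30)

# Profile it to see where the time is actually spent
@profile buildForest(xtrain, ytrain, 30)
Profile.print(format=:flat, sortedby=:count)
```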
These are the two main functions; the full code is here:
"""
buildTree(x, y, depth; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)
Builds (define and train) a Decision Tree.
Given a dataset of features `x` and the corresponding dataset of labels `y`, it recursively builds a decision tree by finding at each node the best question to split the data, until either the dataset is fully separated or a terminal condition is reached.
The resulting tree is then returned.
# Parameters:
- `x`: The dataset's features (N × D)
- `y`: The dataset's labels (N × 1)
- `depth`: The current tree's depth. Used when calling the function recursively [def: `1`]
- `maxDepth`: The maximum depth the tree is allowed to reach. When this is reached the node is forced to become a leaf [def: `N`, i.e. no limits]
- `minGain`: The minimum information gain to allow for a node's partition [def: `0`]
- `minRecords`: The minimum number of records a node must hold to be considered for partitioning [def: `2`]
- `maxFeatures`: The maximum number of (random) features to consider at each partitioning [def: `D`, i.e. look at all features]
- `splittingCriterion`: Either `gini`, `entropy` or `variance` (see [`infoGain`](@ref) ) [def: `gini` for categorical labels (classification task) and `variance` for numerical labels (regression task)]
- `forceClassification`: Whether to force a classification task even if the labels are numerical (typically when labels are integers encoding some feature rather than representing a real cardinal measure) [def: `false`]
# Notes:
Missing data (in the feature dataset) are supported.
"""
function buildTree(x, y, depth=1; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=size(x,2), splittingCriterion = eltype(y) <: Number ? "variance" : "gini", forceClassification=false)
# Force what would be a regression task into a classification task
if forceClassification && eltype(y) <: Number
y = string.(y)
end
# Check whether this branch has no more than the minimum number of records required, or has reached the maximum allowed depth. In either case, declare it a leaf
if size(x,1) <= minRecords || depth >= maxDepth return Leaf(y, depth) end
# Try partitioning the dataset on each of the unique attribute values,
# calculate the information gain,
# and return the question that produces the highest gain.
gain, question = findBestSplit(x,y;maxFeatures=maxFeatures,splittingCriterion=splittingCriterion)
# Base case: no further info gain
# Since we can ask no further questions,
# we'll return a leaf.
if gain <= minGain return Leaf(y, depth) end
# If we reach here, we have found a useful feature / value
# to partition on.
trueIdx, falseIdx = partition(question,x)
# Recursively build the true branch.
trueBranch = buildTree(x[trueIdx,:], y[trueIdx], depth+1, maxDepth=maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion=splittingCriterion, forceClassification=forceClassification)
# Recursively build the false branch.
falseBranch = buildTree(x[falseIdx,:], y[falseIdx], depth+1, maxDepth=maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion=splittingCriterion, forceClassification=forceClassification)
# Return a Question node.
# This records the best feature / value to ask at this point,
# as well as the branches to follow
# depending on the answer.
return DecisionNode(question, trueBranch, falseBranch, depth)
end
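For what it's worth, this is roughly how a single tree gets called (a minimal sketch with made-up toy data; `Leaf`, `DecisionNode`, `findBestSplit` and `partition` are defined in the full code linked above):

```julia
# Toy placeholder dataset: 5 records, 2 numerical features, categorical labels
xtoy = [1.0 10.5;
        2.0 10.2;
        1.5 18.8;
        3.0 18.1;
        2.5 19.0]
ytoy = ["apple", "apple", "lemon", "lemon", "lemon"]

# Classification tree: string labels make splittingCriterion default to "gini"
tree = buildTree(xtoy, ytoy, maxDepth=3)

# Regression tree: numerical labels make it default to "variance"
treeReg = buildTree(xtoy, [1.1, 0.9, 2.2, 2.0, 2.1])
```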
"""
buildForest(x, y, nTrees; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)
Builds (define and train) a "forest" of Decision Trees.
See [`buildTree`](@ref). The parameters are exactly the same, except that here we have `nTrees` to define the number of trees in the forest [def: `30`] and `maxFeatures` defaults to `√D` instead of `D`.
Each individual decision tree is built using a bootstrap of the data, i.e. sampling N records with replacement (hence some records appear multiple times and some do not appear at all in the training of that specific tree). The `maxFeatures` default of the square root of the number of features injects further variability and reduces the correlation between the forest's trees.
The predictions of the "forest" are then the aggregated predictions of the individual trees (from which the name "bagging": **b**ootstrap **agg**regat**ing**).
The function returns a tuple whose first element is the forest itself (an array of trees) and whose second element is an array with the ids of the records that have _not_ been used to train each specific tree.
"""
function buildForest(x, y, nTrees=30; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=Int(round(sqrt(size(x,2)))), splittingCriterion = eltype(y) <: Number ? "variance" : "gini", forceClassification=false)
forest = Union{DecisionNode,Leaf}[]
errors = Float64[]
notSampledByTree = Array{Int64,1}[] # to later compute the Out of Bag Error
#jobIsRegression = (forceClassification || !(eltype(y) <: Number)) ? false : true # we don't need the ternary operator here, but it is clearer with it...
(N,D) = size(x)
for i in 1:nTrees
toSample = rand(1:N,N)
notToSample = setdiff(1:N,toSample)
bootstrappedx = x[toSample,:] # "boosted" is different from "bootstrapped": https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f
bootstrappedy = y[toSample]
#controlx = x[notToSample,:]
#controly = y[notToSample]
tree = buildTree(bootstrappedx, bootstrappedy; maxDepth = maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion = splittingCriterion, forceClassification=forceClassification)
#ŷ = predict(tree,controlx)
push!(forest,tree)
push!(notSampledByTree,notToSample)
end
return (forest,notSampledByTree)
end
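And this is roughly how I intend to use the returned ids afterwards (a rough sketch: it assumes `predict(tree, x)` from the full code returns one predicted label per record for a classification tree, and it computes a simple per-tree out-of-bag accuracy rather than the properly aggregated OOB estimate):

```julia
forest, notSampledByTree = buildForest(xtrain, ytrain)

# Per-tree out-of-bag accuracy: each tree is scored only on the records
# it never saw during its own (bootstrapped) training
oobAccuracies = Float64[]
for (tree, oobIds) in zip(forest, notSampledByTree)
    isempty(oobIds) && continue
    ŷ = predict(tree, xtrain[oobIds, :])   # predict() is defined in the full code
    push!(oobAccuracies, sum(ŷ .== ytrain[oobIds]) / length(oobIds))
end
meanOobAccuracy = sum(oobAccuracies) / length(oobAccuracies)
```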