I implemented Decision Tree and Random Forest classifiers/regressors, based originally on Josh Gordon's code in Python.
The problem is that building the forest is very slow… It is said everywhere that one of the advantages of random forests is that they are computationally cheap, but my implementation takes around 10 minutes to train a 30-tree forest on just 5k records… Where did I go wrong? Too many small functions?
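For reference, this is roughly how I measure it (a minimal sketch: `xtrain` and `ytrain` are just placeholders standing in for my 5k-record dataset):

```julia
using Profile

# Placeholder data with roughly the same shape as my real dataset (5k records, 10 features)
xtrain = rand(5000, 10)
ytrain = rand(["a", "b", "c"], 5000)

# Time a single forest build (30 trees is the default)
@time buildForest(xtrain, ytrain, 30)

# Profile it to see where the time is actually spent
@profile buildForest(xtrain, ytrain, 30)
Profile.print(format=:flat, sortedby=:count)
```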
These are the two main functions; the full code is here:
"""
buildTree(x, y, depth; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)
Builds (define and train) a Decision Tree.
Given a dataset of features `x` and the corresponding dataset of labels `y`, it recursively builds a decision tree by finding at each node the best question to split the data, until either the dataset is fully separated or a terminal condition is reached.
The resulting tree is then returned.
# Parameters:
- `x`: The dataset's features (N × D)
- `y`: The dataset's labels (N × 1)
- `depth`: The current tree's depth. Used when calling the function recursively [def: `1`]
- `maxDepth`: The maximum depth the tree is allowed to reach. When this is reached the node is forced to become a leaf [def: `N`, i.e. no limits]
- `minGain`: The minimum information gain to allow for a node's partition [def: `0`]
- `minRecords`: The minimum number of records a node must hold to be considered for partitioning [def: `2`]
- `maxFeatures`: The maximum number of (random) features to consider at each partitioning [def: `D`, i.e. look at all features]
- `splittingCriterion`: Either `gini`, `entropy` or `variance` (see [`infoGain`](@ref) ) [def: `gini` for categorical labels (classification task) and `variance` for numerical labels (regression task)]
- `forceClassification`: Whether to force a classification task even if the labels are numerical (typically when labels are integers encoding some feature rather than representing a real cardinal measure) [def: `false`]
# Notes:
Missing data (in the feature dataset) are supported.
"""
function buildTree(x, y, depth=1; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=size(x,2), splittingCriterion = eltype(y) <: Number ? "variance" : "gini", forceClassification=false)
# Force what would be a regression task into a classification task
if forceClassification && eltype(y) <: Number
y = string.(y)
end
# Check whether this branch has no more than the minimum number of records required, or has reached the maximum allowed depth. In either case, declare it a leaf
if size(x,1) <= minRecords || depth >= maxDepth return Leaf(y, depth) end
# Try partitioning the dataset on each of the unique attribute values,
# calculate the information gain,
# and return the question that produces the highest gain.
gain, question = findBestSplit(x,y;maxFeatures=maxFeatures,splittingCriterion=splittingCriterion)
# Base case: no further info gain
# Since we can ask no further questions,
# we'll return a leaf.
if gain <= minGain return Leaf(y, depth) end
# If we reach here, we have found a useful feature / value
# to partition on.
trueIdx, falseIdx = partition(question,x)
# Recursively build the true branch.
trueBranch = buildTree(x[trueIdx,:], y[trueIdx], depth+1, maxDepth=maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion=splittingCriterion, forceClassification=forceClassification)
# Recursively build the false branch.
falseBranch = buildTree(x[falseIdx,:], y[falseIdx], depth+1, maxDepth=maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion=splittingCriterion, forceClassification=forceClassification)
# Return a Question node.
# This records the best feature / value to ask at this point,
# as well as the branches to follow
# depending on the answer.
return DecisionNode(question, trueBranch, falseBranch, depth)
end
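For what it's worth, this is roughly how a single tree gets called (a minimal sketch with made-up toy data; `Leaf`, `DecisionNode`, `findBestSplit` and `partition` are defined in the full code linked above):

```julia
# Toy placeholder dataset: 5 records, 2 numerical features, categorical labels
xtoy = [1.0 10.5;
        2.0 10.2;
        1.5 18.8;
        3.0 18.1;
        2.5 19.0]
ytoy = ["apple", "apple", "lemon", "lemon", "lemon"]

# Classification tree: string labels make splittingCriterion default to "gini"
tree = buildTree(xtoy, ytoy, maxDepth=3)

# Regression tree: numerical labels make it default to "variance"
treeReg = buildTree(xtoy, [1.1, 0.9, 2.2, 2.0, 2.1])
```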
"""
buildForest(x, y, nTrees; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)
Builds (define and train) a "forest" of Decision Trees.
See [`buildTree`](@ref). The parameters are exactly the same, except that here we have `nTrees` to define the number of trees in the forest [def: `30`] and `maxFeatures` defaults to `√D` instead of `D`.
Each individual decision tree is built using a bootstrap of the data, i.e. sampling N records with replacement (hence some records appear multiple times and some do not appear at all in the training of that specific tree). The `maxFeatures` default of the square root of the number of features injects further variability and reduces the correlation between the forest's trees.
The predictions of the "forest" are then the aggregated predictions of the individual trees (from which the name "bagging": **b**ootstrap **agg**regat**ing**).
The function returns a tuple whose first element is the forest itself (an array of trees) and whose second element is an array with the ids of the records that have _not_ been used to train each specific tree.
"""
function buildForest(x, y, nTrees=30; maxDepth = size(x,1), minGain=0.0, minRecords=2, maxFeatures=Int(round(sqrt(size(x,2)))), splittingCriterion = eltype(y) <: Number ? "variance" : "gini", forceClassification=false)
forest = Union{DecisionNode,Leaf}[]
errors = Float64[]
notSampledByTree = Array{Int64,1}[] # to later compute the Out of Bag Error
#jobIsRegression = (forceClassification || !(eltype(y) <: Number)) ? false : true # we don't need the ternary operator here, but it is clearer with it...
(N,D) = size(x)
for i in 1:nTrees
toSample = rand(1:N,N)
notToSample = setdiff(1:N,toSample)
bootstrappedx = x[toSample,:] # "boosted" is different from "bootstrapped": https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f
bootstrappedy = y[toSample]
#controlx = x[notToSample,:]
#controly = y[notToSample]
tree = buildTree(bootstrappedx, bootstrappedy; maxDepth = maxDepth, minGain=minGain, minRecords=minRecords, maxFeatures=maxFeatures, splittingCriterion = splittingCriterion, forceClassification=forceClassification)
#ŷ = predict(tree,controlx)
push!(forest,tree)
push!(notSampledByTree,notToSample)
end
return (forest,notSampledByTree)
end
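And this is roughly how I intend to use the returned ids afterwards (a rough sketch: it assumes `predict(tree, x)` from the full code returns one predicted label per record for a classification tree, and it computes a simple per-tree out-of-bag accuracy rather than the properly aggregated OOB estimate):

```julia
forest, notSampledByTree = buildForest(xtrain, ytrain)

# Per-tree out-of-bag accuracy: each tree is scored only on the records
# it never saw during its own (bootstrapped) training
oobAccuracies = Float64[]
for (tree, oobIds) in zip(forest, notSampledByTree)
    isempty(oobIds) && continue
    ŷ = predict(tree, xtrain[oobIds, :])   # predict() is defined in the full code
    push!(oobAccuracies, sum(ŷ .== ytrain[oobIds]) / length(oobIds))
end
meanOobAccuracy = sum(oobAccuracies) / length(oobAccuracies)
```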