The BetaML.Trees Module

BetaML.Trees (Module)
BetaML.Trees module

Implement the functionality required to build a Decision Tree or a whole Random Forest, predict on new data and assess the model's performance.

Both Decision Trees and Random Forests can be used for regression or classification problems, based on the type of the labels (numerical or not). You can override the automatic selection with the parameter forceClassification=true, typically if your labels are integers representing some categories rather than true numerical values. For classification problems the output of predictSingle is a dictionary whose keys are the labels with non-zero probability and whose values are the corresponding probabilities; for regression it is a numerical value.

Please be aware that, differently from most other implementations, the Random Forest algorithm collects and averages the probabilities from the trees, rather than just reporting the mode, i.e. no information is lost and the output of the forest classifier is still a PMF.

Missing data in the features are supported, both in training and in prediction.
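
As a minimal sketch of classification usage (the toy fruit data below is purely illustrative), a tree trained on categorical labels returns, for each record, a dictionary of label probabilities, and the missing feature value is handled natively:

```julia
using BetaML.Trees

# Toy feature matrix (nRecords × nDimensions), with one missing value
xtrain = [ "Green"  3.0;
           "Yellow" 3.0;
           "Red"    1.0;
           "Red"    missing;
           "Yellow" 3.0 ]
ytrain = ["Apple", "Apple", "Grape", "Grape", "Lemon"]   # categorical labels → classification task

myTree = buildTree(xtrain, ytrain)
ŷ      = predict(myTree, xtrain)   # vector of Dict(label => probability), one entry per record
```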

The module provides the following functions. Use ?[type or function] to access their full signature and detailed documentation:

Model definition and training:

  • buildTree(xtrain,ytrain): Build a single Decision Tree
  • buildForest(xtrain,ytrain): Build a "forest" of Decision Trees

Model predictions and assessment:

  • predict(tree or forest, x): Return the prediction given the feature matrix
  • oobError(forest,x,y): Return the out-of-bag error estimate
  • Utils.accuracy(ŷ,y): Categorical output accuracy
  • Utils.meanRelError(ŷ,y,p): L-p norm based error

Features are expected to be in the standard format (nRecords × nDimensions matrices) and the labels (either categorical or numerical) as an nRecords column vector.
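
For example, a complete regression workflow (a hedged sketch on randomly generated data, using the functions listed above) could look like:

```julia
using BetaML.Trees, BetaML.Utils

xtrain = rand(100, 3)                                            # 100 records × 3 features
ytrain = 2 .* xtrain[:, 1] .- xtrain[:, 2] .+ 0.05 .* randn(100) # numerical labels → regression task

myForest = buildForest(xtrain, ytrain, 30)    # a "forest" of 30 Decision Trees
ŷ        = predict(myForest, xtrain)          # numerical predictions
err      = meanRelError(ŷ, ytrain)            # mean relative error of the predictions
```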

Acknowledgements: originally based on Josh Gordon's code


Detailed API

BetaML.Trees.AbstractQuestion (Type)

Question

A question used to partition a dataset.

This struct just records a 'column number' and a 'column value' (e.g., Green).

BetaML.Trees.DecisionNode (Type)

DecisionNode(question,trueBranch,falseBranch, depth)

A tree's non-terminal node.

Constructor's arguments and struct members:

  • question: The question asked in this node
  • trueBranch: A reference to the "true" branch of the tree
  • falseBranch: A reference to the "false" branch of the tree
  • depth: The node's depth in the tree

BetaML.Trees.Forest (Type)
Forest{Ty}

Type representing a Random Forest.

Individual trees are stored in the array trees. The "type" of the forest is given by the type of the labels on which it has been trained.

Struct members:

  • trees: The individual Decision Trees
  • isRegression: Whether the forest is to be used for regression jobs or classification
  • oobData: For each tree, the row numbers of the data that have not been used to train that specific tree
  • oobError: The out of bag error (if it has been computed)
  • weights: A weight for each tree depending on the tree's score on the oobData (see buildForest)
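
Continuing with a trained forest (a brief sketch based only on the members listed above), the fields can be inspected directly:

```julia
length(myForest.trees)    # number of individual Decision Trees in the forest
myForest.isRegression     # true when trained on numerical labels, false otherwise
myForest.weights          # per-tree weights (an array of ones unless β > 0 was used)
```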

BetaML.Trees.Leaf (Type)

Leaf(y,depth)

A tree's leaf (terminal) node.

Constructor's arguments:

  • y: The labels associated to each record (either numerical or categorical)
  • depth: The node's depth in the tree

Struct members:

  • predictions: Either the relative counts of the labels (i.e. a PMF) or their mean
  • depth: The node's depth in the tree

Base.print (Function)

print(node)

Print a Decision Tree (textual)

BetaML.Trees.buildForest (Method)

buildForest(x, y, nTrees; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)

Builds (define and train) a "forest" of Decision Trees.

Parameters:

See buildTree. The function has all the parameters of buildTree (with maxFeatures defaulting to √D instead of D) plus the following parameters:

  • nTrees: Number of trees in the forest [def: 30]
  • β: Parameter that regulates the weights of the scoring of each tree, to be (optionally) used in prediction (see later) [def: 0, i.e. uniform weights]
  • oob: Whether to compute the out-of-bag error, an estimation of the generalization accuracy [def: false]

Output:

  • The function returns a Forest object (see Forest).
  • The forest weights default to an array of ones if β ≤ 0 and the oob error to +Inf if oob == false.

Notes:

  • Each individual decision tree is built using bootstrapping over the data, i.e. "sampling N records with replacement" (hence some records appear multiple times and some do not appear at all in the training of a specific tree). The maxFeatures parameter injects further variability and reduces the correlation between the forest's trees.
  • The predictions of the "forest" (using the function predict()) are then the aggregated predictions of the individual trees (hence the name "bagging": bootstrap aggregating).
  • This function optionally computes a weight distribution based on the performance of each individual tree, as measured on the records it has not been trained with. These weights can then be (optionally) used in the predict function. The parameter β ≥ 0 regulates the distribution of these weights: the larger β is, the greater the importance (hence the weight) attached to the best-performing trees compared to the low-performing ones. Using these weights can significantly improve the forest's performance (especially with small forests); however, the correct value of β depends on the problem at hand (and on the chosen characteristics of the random forest estimator) and should be cross-validated to avoid over-fitting. See the sketch after this list.
  • Note that this function uses multiple threads if these are available. You can check the number of available threads with Threads.nthreads(). To set the number of threads in Julia either set the environment variable JULIA_NUM_THREADS (before starting Julia) or start Julia with the command line option --threads (most integrated development editors for Julia already set the number of threads to 4).
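
A hedged sketch of these options (the data is illustrative; the parameter names are those documented above):

```julia
using BetaML.Trees

xtrain = rand(300, 4)
ytrain = xtrain * [1.0, -2.0, 0.5, 0.0] .+ 0.1 .* randn(300)

# 100 trees, out-of-bag error computed, trees weighted by their oob performance (β > 0)
myForest = buildForest(xtrain, ytrain, 100; β=50, oob=true)

myForest.oobError    # estimate of the generalization error (available because oob=true)
myForest.weights     # per-tree weights, used by predict() since β > 0
```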

BetaML.Trees.buildTree (Method)

buildTree(x, y, depth; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)

Builds (defines and trains) a Decision Tree.

Given a dataset of features x and the corresponding dataset of labels y, recursively build a decision tree by finding at each node the best question to split the data until either all the dataset is separated or a terminal condition is reached. The resulting tree is then returned.

Parameters:

  • x: The dataset's features (N × D)
  • y: The dataset's labels (N × 1)
  • maxDepth: The maximum depth the tree is allowed to reach. When this is reached the node is forced to become a leaf [def: N, i.e. no limits]
  • minGain: The minimum information gain to allow for a node's partition [def: 0]
  • minRecords: The minimum number of records a node must hold to be considered for partitioning [def: 2]
  • maxFeatures: The maximum number of (random) features to consider at each partitioning [def: D, i.e. look at all features]
  • splittingCriterion: Either gini, entropy or variance (see infoGain) [def: gini for categorical labels (classification task) and variance for numerical labels (regression task)]
  • forceClassification: Whether to force a classification task even if the labels are numerical (typically when labels are integers encoding some feature rather than representing a real cardinal measure) [def: false]

Notes:

Missing data (in the feature dataset) are supported.
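
For instance, a hedged sketch of a tree forced to a classification task on integer labels that encode categories:

```julia
using BetaML.Trees

xtrain = rand(50, 2)
ytrain = rand([1, 2, 3], 50)   # integers encoding categories, not cardinal measures

# Without forceClassification these integer labels would lead to a regression tree
myTree = buildTree(xtrain, ytrain; maxDepth=5, minRecords=4, forceClassification=true)
ŷ      = predict(myTree, xtrain)   # vector of Dict(label => probability)
```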

BetaML.Trees.findBestSplit (Method)

findBestSplit(x,y;maxFeatures,splittingCriterion)

Find the best possible split of the dataset.

Find the best question to ask by iterating over every feature / value and calculating the information gain.

Parameters:

  • x: The feature dataset
  • y: The labels dataset
  • maxFeatures: Maximum number of (random) features to look up for the "best split"
  • splittingCriterion: The metric to define the "impurity" of the labels

BetaML.Trees.infoGain (Method)

infoGain(left, right, parentUncertainty; splittingCriterion)

Compute the information gain of a specific partition.

Compare the "information gain" my measuring the difference betwwen the "impurity" of the labels of the parent node with those of the two child nodes, weighted by the respective number of items.

Parameters:

  • leftY: Child #1 labels
  • rightY: Child #2 labels
  • parentUncertainty: "Impurity" of the labels of the parent node
  • splittingCriterion: Metric to adopt to determine the "impurity" (see below)

You can use your own function as the metric. We provide the following built-in metrics:

  • gini (categorical)
  • entropy (categorical)
  • variance (numerical)
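
As a conceptual sketch (not the library's internal code), with the gini criterion the gain is the parent impurity minus the children's impurities weighted by their share of the records:

```julia
# Gini impurity of a labels vector: 1 - Σ p(class)²
gini(y) = 1.0 - sum((count(==(c), y) / length(y))^2 for c in unique(y))

parentY = ["A", "A", "B", "B", "B"]
leftY   = ["A", "A"]
rightY  = ["B", "B", "B"]

nl, nr = length(leftY), length(rightY)
gain = gini(parentY) - (nl / (nl + nr)) * gini(leftY) - (nr / (nl + nr)) * gini(rightY)
# gain == 0.48: this perfect split removes all the impurity of the parent node
```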

BetaML.Trees.match (Method)

match(question, x)

Return a dichotomous answer to a question when applied to a given feature record.

It compares the feature value in the given record to the value stored in the question. Numerical features are compared in terms of inequality (">="), while categorical features are compared in terms of equality ("==").
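
For example, assuming the default constructor Question(column, value) built from the two fields described above (a hypothetical sketch):

```julia
q = Question(2, 3.0)            # "Is the value of feature 2 >= 3.0 ?"
match(q, ["Green", 3.0])        # true  (numerical value: ">=" comparison)
match(q, ["Red",   1.0])        # false

qcat = Question(1, "Green")     # "Is the value of feature 1 == Green ?"
match(qcat, ["Red", 1.0])       # false (categorical value: "==" comparison)
```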

BetaML.Trees.oobError (Method)

oobError(forest,x,y)

Compute the Out-Of-Bag error, an estimation of the validation error.

This function is called when training the forest if the parameter oob is true, or it can be used later to get the oob error on an already trained forest.

BetaML.Trees.partition (Method)

partition(question,x)

Dichotomously partition a dataset x given a question.

For each row in the dataset, check whether it matches the question. If so, add it to the 'true rows', otherwise add it to the 'false rows'. Rows with missing values on the question column are assigned randomly, proportionally to the assignment of the non-missing rows.

BetaML.Trees.predict (Method)

predict(forest,x)

Predict the labels of a feature dataset.

For each record of the dataset and each tree of the "forest", recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the trees have been trained with are numeric, the prediction is also numeric (the mean of the predictions of the different trees, each in turn the mean of the labels of the training records that ended up in that leaf node). If the labels were categorical, the prediction is a dictionary with the probabilities of each item, and in that case the probabilities of the different trees are averaged to compose the forest prediction. This differs a bit from most other implementations, where the mode is reported instead.

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error, in the second case you can use accuracy(ŷ,y).

BetaML.Trees.predict (Method)

predict(tree,x)

Predict the labels of a feature dataset.

For each record of the dataset, recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the tree has been trained with are numeric, the prediction is also numeric. If the labels were categorical, the prediction is a dictionary with the probabilities of each item.

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error, in the second case you can use accuracy(ŷ,y).

BetaML.Trees.updateTreesWeights! (Method)

updateTreesWeights!(forest,x,y;β)

Update the weights of each tree (used in the forest's predictions) based on the error of the individual tree computed on the records on which it has not been trained. As training a forest is expensive, this function can be used to just update the tree weights using different values of β, without retraining the model.
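
For example (a hedged sketch, reusing a forest and the training data built as in the buildForest example above):

```julia
# Try a different β without re-training: only the tree weights are recomputed
updateTreesWeights!(myForest, xtrain, ytrain; β=100)
ŷ = predict(myForest, xtrain)   # predictions now use the updated weights
```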