The BetaML.Trees Module

This module implements decision trees and forests of trees for both classification and regression tasks. Missing data in the features are supported.


Detailed API

Base.print (Function)

print(node)

Print a textual representation of a decision tree.

BetaML.Trees.buildTree (Function)

buildTree(x, y, depth; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)

Build (i.e. define and train) a decision tree.

Given a dataset of features x and the corresponding dataset of labels y, recursively build a decision tree by finding at each node the best question to split the data, until either the whole dataset is separated or a terminal condition is reached. The resulting tree is then returned.

Parameters:

  • x: The dataset's features (N × D)
  • y: The dataset's labels (N × 1)
  • depth: The current tree depth. Used when calling the function recursively [def: 1]
  • maxDepth: The maximum depth the tree is allowed to reach. When it is reached the node is forced to become a leaf [def: N, i.e. no limit]
  • minGain: The minimum information gain required to allow a node's partition [def: 0]
  • minRecords: The minimum number of records a node must hold to be considered for partitioning [def: 2]
  • maxFeatures: The maximum number of (random) features to consider at each partitioning [def: D, i.e. look at all features]
  • splittingCriterion: Either gini, entropy or variance (see infoGain) [def: gini for categorical labels (classification task) and variance for numerical labels (regression task)]
  • forceClassification: Whether to force a classification task even if the labels are numerical (typically when labels are integers encoding some feature rather than representing a real cardinal measure) [def: false]

Notes:

Missing data (in the feature dataset) are supported.
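
A minimal usage sketch (the dataset is a made-up toy; all keyword defaults are as documented above):

```julia
using BetaML.Trees

# Toy dataset: 5 records, 2 features (one numerical, one categorical)
x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"; 2.7 "Green"]
y = ["apple", "cherry", "apple", "cherry", "apple"]

tree = buildTree(x, y, maxDepth=3)   # limit the tree to 3 levels
print(tree)                          # textual representation (see Base.print above)
```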

BetaML.Trees.predict (Method)

predict(forest,x)

Predict the labels of a feature dataset.

For each record of the dataset and each tree of the "forest", recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the trees have been trained with are numeric, the prediction is also numeric: the mean of the individual trees' predictions, each in turn being the mean of the labels of the training records that ended up in that leaf node. If the labels were categorical, the prediction is a dictionary with the probabilities of each item (again, the probabilities of the individual trees are averaged to compose the forest prediction).

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error; in the second case you can use accuracy(ŷ,y).
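
A sketch for the categorical case. buildForest is assumed here as the module's forest constructor (it is not documented in this section), with the number of trees as third argument; accuracy is referenced above and assumed to come from the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide accuracy

x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"; 2.7 "Green"]
y = ["apple", "cherry", "apple", "cherry", "apple"]

forest = buildForest(x, y, 5)   # assumption: builds a forest of 5 trees
ŷ      = predict(forest, x)     # one Dict(label => probability) per record
accuracy(ŷ, y)                  # categorical labels: assess with accuracy
```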

BetaML.Trees.predict (Method)

predict(tree,x)

Predict the labels of a feature dataset.

For each record of the dataset, recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the tree has been trained with are numeric, the prediction is also numeric. If the labels were categorical, the prediction is a dictionary with the probabilities of each item.

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error; in the second case you can use accuracy(ŷ,y).
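
A sketch for the numerical (regression) case; meanRelError is referenced above and assumed to come from the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide meanRelError

# Numerical labels: a regression task, so variance is the default criterion
x = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
y = [1.1, 2.2, 2.9, 4.1]

tree = buildTree(x, y)
ŷ    = predict(tree, x)   # numeric predictions (the mean of each leaf's labels)
meanRelError(ŷ, y)
```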

BetaML.Trees.DecisionNode (Type)

DecisionNode(question,trueBranch,falseBranch, depth)

A tree's non-terminal node.

Constructor's arguments and struct members:

  • question: The question asked in this node
  • trueBranch: A reference to the "true" branch of the tree
  • falseBranch: A reference to the "false" branch of the tree
  • depth: The node's depth in the tree

BetaML.Trees.Leaf (Type)

Leaf(y,depth)

A tree's leaf (terminal) node.

Constructor's arguments:

  • y: The labels associated to each record (either numerical or categorical)
  • depth: The node's depth in the tree

Struct members:

  • rawPredictions: Either the labels' counts or the numerical labels of the node's members
  • predictions: Either the labels' relative counts (i.e. a PMF) or their mean
  • depth: The node's depth in the tree

BetaML.Trees.Question (Type)

Question

A question used to partition a dataset.

This struct just records a 'column number' and a 'column value' (e.g., Green).
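
A sketch, assuming a positional constructor over the two recorded fields (the exact field names are not documented here):

```julia
# Hypothetical construction: ask whether column 2 holds the value "Green"
q = Question(2, "Green")
```

See match below for how a question is evaluated against a record.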

BetaML.Trees.findBestSplit (Method)

findBestSplit(x,y;maxFeatures,splittingCriterion)

Find the best possible split of the dataset.

Find the best question to ask by iterating over every feature / value and calculating the information gain.

Parameters:

  • x: The feature dataset
  • y: The labels dataset
  • maxFeatures: The maximum number of (random) features to consider when searching for the best split
  • splittingCriterion: The metric used to define the "impurity" of the labels
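
This is mainly an internal helper called by buildTree. A direct-call sketch, under the assumption that it returns the best gain together with the corresponding Question, and that gini is the impurity function provided by the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide gini

x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"]
y = ["apple", "cherry", "apple", "cherry"]

# Assumed return value: (bestGain, bestQuestion)
bestGain, bestQuestion = findBestSplit(x, y; maxFeatures=size(x,2), splittingCriterion=gini)
```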

BetaML.Trees.infoGain (Method)

infoGain(left, right, parentUncertainty; splittingCriterion)

Compute the information gain of a specific partition.

Compare the "information gain" my measuring the difference betwwen the "impurity" of the labels of the parent node with those of the two child nodes, weighted by the respective number of items.

Parameters:

  • left: Child #1 labels
  • right: Child #2 labels
  • parentUncertainty: "Impurity" of the labels of the parent node
  • splittingCriterion: Metrics to adopt to determine the "impurity" (see below)

Three "impurity" metrics are supported:

  • gini (categorical)
  • entropy (categorical)
  • variance (numerical)
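
A sketch of the computation on a perfectly separating categorical split, assuming gini is the impurity function (taking a vector of labels) provided by the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide gini

parentY = ["a", "a", "b", "b"]
leftY   = ["a", "a"]    # both children are pure...
rightY  = ["b", "b"]

parentUncertainty = gini(parentY)   # 1 - (0.5^2 + 0.5^2) = 0.5
# ...so the gain equals the full parent impurity (0.5):
infoGain(leftY, rightY, parentUncertainty, splittingCriterion=gini)
```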

BetaML.Trees.match (Method)

match(question, x)

Return the dichotomous (true/false) answer of a question when applied to a given feature record.

It compares the feature value in the given record to the value stored in the question. Numerical features are compared in terms of inequality (">="), while categorical features are compared in terms of equality ("==").
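
Continuing the hypothetical Question example from above; a record here is a single row of the features:

```julia
q = Question(1, 2.0)       # hypothetical constructor: "is feature 1 >= 2.0 ?"
match(q, [2.5, "Green"])   # true:  2.5 >= 2.0 (numerical, ">=")
match(q, [1.0, "Red"])     # false: 1.0 <  2.0
```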

BetaML.Trees.partition (Method)

partition(question,x)

Dichotomously partition a dataset x given a question.

For each row in the dataset, check whether it matches the question: if so, add it to the "true" rows, otherwise add it to the "false" rows. Rows with missing values in the question's column are assigned randomly, in proportion to the assignment of the non-missing rows.
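
A sketch, reusing the hypothetical Question above and assuming the function returns the two groups of rows:

```julia
x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"]
q = Question(1, 2.0)                    # hypothetical constructor, as above

# Assumed return value: the matching ("true") and non-matching ("false") rows
trueRows, falseRows = partition(q, x)
```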