The BetaML.Trees Module

This module implements decision trees and forests of trees for both classification and regression tasks. Missing data in the features are supported.


Detailed API

Base.print (Function)

print(node)

Print a textual representation of a decision tree.

BetaML.Trees.buildTree (Function)

buildTree(x, y, depth; maxDepth, minGain, minRecords, maxFeatures, splittingCriterion, forceClassification)

Build (i.e. define and train) a decision tree.

Given a dataset of features x and the corresponding dataset of labels y, recursively build a decision tree by finding at each node the best question to split the data, until either the whole dataset is separated or a terminal condition is reached. The resulting tree is then returned.

Parameters:

  • x: The dataset's features (N × D)
  • y: The dataset's labels (N × 1)
  • depth: The current tree depth. Used when calling the function recursively [def: 1]
  • maxDepth: The maximum depth the tree is allowed to reach. When it is reached the node is forced to become a leaf [def: N, i.e. no limit]
  • minGain: The minimum information gain required to allow a node's partition [def: 0]
  • minRecords: The minimum number of records a node must hold to be considered for partitioning [def: 2]
  • maxFeatures: The maximum number of (random) features to consider at each partitioning [def: D, i.e. look at all features]
  • splittingCriterion: Either gini, entropy or variance (see infoGain) [def: gini for categorical labels (classification task) and variance for numerical labels (regression task)]
  • forceClassification: Whether to force a classification task even if the labels are numerical (typically when labels are integers encoding some feature rather than representing a real cardinal measure) [def: false]

Notes:

Missing data (in the feature dataset) are supported.
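
A minimal usage sketch (the dataset is a made-up toy; all keyword defaults are as documented above):

```julia
using BetaML.Trees

# Toy dataset: 5 records, 2 features (one numerical, one categorical)
x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"; 2.7 "Green"]
y = ["apple", "cherry", "apple", "cherry", "apple"]

tree = buildTree(x, y, maxDepth=3)   # limit the tree to 3 levels
print(tree)                          # textual representation (see Base.print above)
```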

BetaML.Trees.predict (Method)

predict(forest,x)

Predict the labels of a feature dataset.

For each record of the dataset and each tree of the "forest", recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the trees have been trained with are numeric, the prediction is also numeric: the mean of the individual trees' predictions, each in turn being the mean of the labels of the training records that ended up in that leaf node. If the labels were categorical, the prediction is a dictionary with the probabilities of each item (again, the probabilities of the individual trees are averaged to compose the forest prediction).

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error; in the second case you can use accuracy(ŷ,y).
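
A sketch for the categorical case. buildForest is assumed here as the module's forest constructor (it is not documented in this section), with the number of trees as third argument; accuracy is referenced above and assumed to come from the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide accuracy

x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"; 2.7 "Green"]
y = ["apple", "cherry", "apple", "cherry", "apple"]

forest = buildForest(x, y, 5)   # assumption: builds a forest of 5 trees
ŷ      = predict(forest, x)     # one Dict(label => probability) per record
accuracy(ŷ, y)                  # categorical labels: assess with accuracy
```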

BetaML.Trees.predict (Method)

predict(tree,x)

Predict the labels of a feature dataset.

For each record of the dataset, recursively traverse the tree to find the most appropriate prediction for the given record. If the labels the tree has been trained with are numeric, the prediction is also numeric. If the labels were categorical, the prediction is a dictionary with the probabilities of each item.

In the first case (numerical predictions) use meanRelError(ŷ,y) to assess the mean relative error; in the second case you can use accuracy(ŷ,y).
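
A sketch for the numerical (regression) case; meanRelError is referenced above and assumed to come from the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide meanRelError

# Numerical labels: a regression task, so variance is the default criterion
x = [1.0 10.0; 2.0 20.0; 3.0 30.0; 4.0 40.0]
y = [1.1, 2.2, 2.9, 4.1]

tree = buildTree(x, y)
ŷ    = predict(tree, x)   # numeric predictions (the mean of each leaf's labels)
meanRelError(ŷ, y)
```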

BetaML.Trees.DecisionNode (Type)

DecisionNode(question,trueBranch,falseBranch, depth)

A tree's non-terminal node.

Constructor's arguments and struct members:

  • question: The question asked in this node
  • trueBranch: A reference to the "true" branch of the tree
  • falseBranch: A reference to the "false" branch of the tree
  • depth: The node's depth in the tree

BetaML.Trees.Leaf (Type)

Leaf(y,depth)

A tree's leaf (terminal) node.

Constructor's arguments:

  • y: The labels associated to each record (either numerical or categorical)
  • depth: The node's depth in the tree

Struct members:

  • rawPredictions: Either the labels' counts or the numerical labels of the node's members
  • predictions: Either the labels' relative counts (i.e. a PMF) or their mean
  • depth: The node's depth in the tree

BetaML.Trees.Question (Type)

Question

A question used to partition a dataset.

This struct just records a 'column number' and a 'column value' (e.g., Green).
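
A sketch, assuming a positional constructor over the two recorded fields (the exact field names are not documented here):

```julia
# Hypothetical construction: ask whether column 2 holds the value "Green"
q = Question(2, "Green")
```

See match below for how a question is evaluated against a record.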

BetaML.Trees.findBestSplit (Method)

findBestSplit(x,y;maxFeatures,splittingCriterion)

Find the best possible split of the dataset.

Find the best question to ask by iterating over every feature / value and calculating the information gain.

Parameters:

  • x: The feature dataset
  • y: The labels dataset
  • maxFeatures: The maximum number of (random) features to consider when searching for the best split
  • splittingCriterion: The metric used to define the "impurity" of the labels
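
This is mainly an internal helper called by buildTree. A direct-call sketch, under the assumption that it returns the best gain together with the corresponding Question, and that gini is the impurity function provided by the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide gini

x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"; 0.9 "Red"]
y = ["apple", "cherry", "apple", "cherry"]

# Assumed return value: (bestGain, bestQuestion)
bestGain, bestQuestion = findBestSplit(x, y; maxFeatures=size(x,2), splittingCriterion=gini)
```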

BetaML.Trees.infoGain (Method)

infoGain(left, right, parentUncertainty; splittingCriterion)

Compute the information gain of a specific partition.

Compare the "information gain" my measuring the difference betwwen the "impurity" of the labels of the parent node with those of the two child nodes, weighted by the respective number of items.

Parameters:

  • left: Child #1 labels
  • right: Child #2 labels
  • parentUncertainty: "Impurity" of the labels of the parent node
  • splittingCriterion: Metrics to adopt to determine the "impurity" (see below)

Three "impurity" metrics are supported:

  • gini (categorical)
  • entropy (categorical)
  • variance (numerical)
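
A sketch of the computation on a perfectly separating categorical split, assuming gini is the impurity function (taking a vector of labels) provided by the Utils submodule:

```julia
using BetaML.Trees, BetaML.Utils   # Utils assumed to provide gini

parentY = ["a", "a", "b", "b"]
leftY   = ["a", "a"]    # both children are pure...
rightY  = ["b", "b"]

parentUncertainty = gini(parentY)   # 1 - (0.5^2 + 0.5^2) = 0.5
# ...so the gain equals the full parent impurity (0.5):
infoGain(leftY, rightY, parentUncertainty, splittingCriterion=gini)
```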

BetaML.Trees.match (Method)

match(question, x)

Return the dichotomous (true/false) answer of a question when applied to a given feature record.

It compares the feature value in the given record to the value stored in the question. Numerical features are compared in terms of inequality (">="), while categorical features are compared in terms of equality ("==").
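
Continuing the hypothetical Question example from above; a record here is a single row of the features:

```julia
q = Question(1, 2.0)       # hypothetical constructor: "is feature 1 >= 2.0 ?"
match(q, [2.5, "Green"])   # true:  2.5 >= 2.0 (numerical, ">=")
match(q, [1.0, "Red"])     # false: 1.0 <  2.0
```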

BetaML.Trees.partition (Method)

partition(question,x)

Dichotomously partition a dataset x given a question.

For each row in the dataset, check whether it matches the question: if so, add it to the "true" rows, otherwise add it to the "false" rows. Rows with missing values in the question's column are assigned randomly, in proportion to the assignment of the non-missing rows.
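
A sketch, reusing the hypothetical Question above and assuming the function returns the two groups of rows:

```julia
x = [1.5 "Green"; 2.0 "Red"; 3.1 "Green"]
q = Question(1, 2.0)                    # hypothetical constructor, as above

# Assumed return value: the matching ("true") and non-matching ("false") rows
trueRows, falseRows = partition(q, x)
```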