A Comprehensive Guide to Tree-Based Methods

Thomas Shaw
12 min readMay 16


Photo by Adarsh Kummur on Unsplash

Tree-based methods are powerful algorithms widely used in machine learning and data science. They provide an intuitive and interpretable approach to solving regression and classification problems. In this blog post, we will delve into the basics of decision trees, examine their advantages and disadvantages, and explore the various ensemble techniques that enhance their performance.

Basics of Decision Trees

Decision trees are powerful hierarchical models that effectively partition data into subsets based on input feature values. By recursively applying a set of rules or conditions, decision trees provide a structured approach to solving regression and classification problems. To illustrate their functionality, let’s explore a scenario where we need to determine whether or not to go for a picnic based on weather conditions, and how a decision tree can assist us in making this decision.

A decision tree for determining whether to go on a picnic based on weather conditions

Root Node: The root node is the topmost node in a decision tree. In this example, the root node is “Weather”. It represents the initial question about the weather conditions and serves as the starting point for the decision tree.

Internal Node: The internal nodes are the non-terminal nodes in the decision tree. They represent the decision points where the data is split based on the conditions. In this example, the internal nodes are “Weather”, “Sunny”, “Cloudy”, “Humid”, “Dry”, “Windy”, and “Rainy.”

Branch: The branches connect the internal and leaf nodes, representing the possible paths or outcomes based on the conditions. In this example, we have four branches:

  • From the root node “Weather,” we have two branches going to “Sunny” and “Cloudy” based on the weather conditions.
  • From the “Sunny” node, we have two branches going to “Humid” and “Dry” based on the humidity level.
  • From the “Cloudy” node, we have two branches going to “Windy” and “Rainy” based on the wind conditions.
  • From the “Windy” and “Rainy” nodes, we have branches leading to the respective “No” leaf nodes.

Terminal Node: The terminal nodes or leaf nodes are the bottom-most nodes of the tree. They do not split further and represent the final decisions or outcomes. In this case, the leaf nodes are “Yes (Picnic)” and three instances of “No”, indicating whether or not to go for a picnic based on the weather conditions.

Regression Trees

Regression trees are used when the target variable is continuous. They are constructed using a top-down, greedy approach called recursive binary splitting. Each tree starts with a root node representing the entire dataset. At each step, the algorithm selects the best feature and threshold to split the data, aiming to minimize the residual sum of squares (RSS). The splitting process continues until a stopping criterion is met, such as reaching a maximum depth or having a minimum number of samples in a leaf node.

Residual Sum of Squares (RSS) formula

Let’s see how the regression trees can be implemented in practice. We are going to use the California Housing Dataset from the scikit-learn library. The California Housing dataset comprises housing information for various regions in California. It includes features like average income, average house age, and proximity to the ocean. The target variable is the median house value, expressed in hundreds of thousands of dollars ($100,000).

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)
df = data['data']
df['MedValue'] = data['target']

X = df.iloc[:, :-1]
y = df['MedValue']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a regression tree model
regression_tree = DecisionTreeRegressor(random_state=42)

# Fit the model on the training data
regression_tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, regression_tree.predict(X_train))
test_mse = mean_squared_error(y_test, regression_tree.predict(X_test))

print(f"Training MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}")
Training MSE: 0.00, Test MSE: 0.50

We can also visualize the tree using plot_tree method. We use the parameter max_depth=2 to limit the depth of the tree that will be visualized. Only the first two levels of the tree will be displayed.

# Visualize the regression tree
plot_tree(regression_tree, filled=True, feature_names=X.columns, max_depth=2)

The process describes above has the potential to yield accurate predictions on the training set, but is likely to overfit the data, leading to poor test set performance. Overfitting often occurs when the resulting tree becomes overly complex, attempting to capture noise in the training data. To address the issue of overfitting, tree pruning is commonly applied. Pruning involves removing or collapsing nodes in the tree to simplify its structure and improve generalization performance. We can prune the tree by setting the max_depth parameter of DecisionTreeRegressor .

It’s important to note that selecting the appropriate max_depth value requires finding a balance between model simplicity and predictive power. Setting max_depth too low may result in an overly simplified model that underfits the data, while setting it too high may lead to an overly complex model that overfits the data.

Test MSE reaches its minimum at a tree size of 9, indicating the optimal complexity for the model, beyond which performance starts to deteriorate

By pruning a regression tree, we can control its complexity and prevent it from fitting the noise in the training data. Pruning helps to strike a balance between bias and variance, leading to better generalization performance. It encourages the tree to focus on the most informative features and capture the underlying patterns rather than the noise.

Classification Trees

A classification tree is very similar to a regression tree, except that it is used when the target is categorical instead of continuous. In a classification tree, the goal is to assign each observation to the most frequently occurring class among the training observations within its region. This means that the predicted class for a given observation is determined based on the majority class of similar observations in its region. This approach allows us to classify new data points based on the learned patterns from the training set. The process of growing a classification tree follows a similar recursive binary splitting approach as seen in regression trees. At each step, the algorithm selects the best feature and threshold to split the data, aiming to maximize the separation of the classes. However, unlike in regression trees where the residual sum of squares (RSS) is used as the criterion, classification trees utilize the classification error rate. The classification error rate represents the fraction of training observations within a specific region that does not belong to the most common class.

Formula for classification error rate

However, it turns out that classification error is not sufficiently sensitive for tree-growing. Therefore, alternative measures such as the Gini index and entropy are often preferred due to their enhanced sensitivity and ability to handle various scenarios.

Formula for Gini index

The Gini index measures how likely it is to misclassify a randomly chosen element in a dataset based on the distribution of class labels. It ranges from 0 to 1, where lower values indicate higher purity and better splits. The Gini index prefers partitions that have clear class distinctions and maximize the separation between different classes.

Formula for Entropy

Entropy is a measure that tells us how much disorder or uncertainty exists in a dataset. It quantifies the average amount of information needed to determine the class label of a randomly chosen data point. The values of entropy range from 0 to 1, where lower values indicate higher purity or more certainty. In the context of decision trees, entropy helps identify the splits that reduce randomness and create subsets with similar class labels. The goal is to find the splits that result in more homogeneous subsets, leading to better classification performance.

Let’s see how the classification trees can be implemented in practice. We are going to use the famous Iris dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Build the classification tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the testing set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Plot the decision tree
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)

Advantages and Disadvantages of Trees


  • Decision trees are easy to understand and interpret. Their visual representation provides insights into the decision-making process.
  • They handle both numerical and categorical features without the need for extensive data preprocessing.
  • Decision trees can capture nonlinear relationships between features and target variables.
  • They can handle missing values by assigning them to the most probable class or value.


  • Decision trees are prone to overfitting, especially when the tree becomes deep and complex.
  • They are sensitive to small variations in the training data and can produce different trees for slightly different datasets.

Ensemble Methods

Ensemble methods are powerful techniques that leverage the collective strength of multiple simple models, often referred to as weak learners. By combining these models, ensemble methods aim to create a single, robust model with powerful predictive capabilities. In this article, we will look into four popular ensemble methods: bagging, random forest, boosting, and Bayesian additive regression trees.


The basic idea behind bagging is to create multiple subsets of the original training data through a process called bootstrapping. Bootstrapping involves randomly sampling the training data with replacement, which means that each subset can contain duplicate observations as well as missing some of the original ones. These subsets are then used to train individual models, typically using the same learning algorithm.

Once the models are trained, bagging combines their predictions through an averaging process. For regression problems, the predictions are averaged to obtain the final predicted value. In classification problems, a majority vote is often used to determine the final predicted class.

By training multiple models on different subsets of the data and aggregating their predictions, bagging reduces the impact of outliers and noise in the training set. It helps to improve the stability and robustness of the model, leading to more reliable predictions.

Illustration of Bagging: Subsets of data are randomly sampled with replacement to create multiple subsets

Random Forests

Random Forests improve on the concept of bagging by introducing additional randomization in the tree-growing process. While bagging uses the same base model (e.g., Decision Tree) on bootstrap samples of the data, Random Forests introduce random feature selection during each split.

The main idea behind Random Forests is that by introducing random feature selection, the individual trees in the ensemble become more diverse and less correlated with each other.

In bagging, each tree is trained on a bootstrap sample of the data. In Random Forests, during each split in the tree-growing process, a random subset of features is selected. This means that instead of considering all the available features at each split, only a subset is used. By using a smaller set of features, Random Forests introduce more diversity among the trees, as they can focus on different aspects of the data.

The introduction of random subspace selection helps to decorrelate the trees in the ensemble. Each tree is trained on a different set of features, which reduces the correlation between their predictions.

Finally, Random Forests employ an ensemble voting mechanism to make predictions. For classification tasks, majority voting is used, where the class that receives the most votes from the trees is selected as the final prediction. For regression tasks, the predictions from all the trees are averaged. This aggregation of predictions from multiple trees helps to reduce the variance and increase the accuracy of the ensemble model.

Illustration of Random Forest: Multiple decision trees are trained independently on different subsets of the data using random feature subsets


The key idea behind boosting is to sequentially train weak models and give more attention to the samples that were misclassified or have high residual errors. The subsequent weak models are then built to focus on these difficult samples, learning from the mistakes made by the previous models.

Boosting algorithms, such as AdaBoost and Gradient Boosting, assign weights to the training samples. Initially, all samples are assigned equal weights, and the first weak model is trained on the weighted data. The model’s performance is evaluated, and the weights are adjusted to give more weight to misclassified samples. This ensures that subsequent models focus on these difficult samples during training.

Unlike bagging, which focuses on reducing variance by building independent models in parallel, boosting aims to reduce bias by iteratively improving the model’s performance on the training data. This sequential nature allows boosting to adapt and improve its performance as the iterations progress.

Illustration of Boosting: Weak learners are sequentially trained on the data, with each subsequent learner focusing on the misclassified samples from the previous iteration

Bayesian Additive Regression Tree

Bayesian Additive Regression Tree (BART) is based on the concept of boosting, where weak learners are combined to form a strong learner. However, BART takes a Bayesian approach and incorporates a prior distribution to model the uncertainty in the parameters.

The BART algorithm starts by initializing the model with a single decision tree. Then, it iteratively grows the ensemble by adding more decision trees. In each iteration, a new decision tree is added by considering the residuals (the differences between the true values and the current ensemble’s predictions). The new tree is trained to capture the patterns and relationships in the residuals that have not yet been captured by the existing ensemble.

One of the key advantages of BART is its ability to handle complex interactions and non-linear relationships in the data. The ensemble of decision trees allows for capturing intricate interactions between predictors and can model non-linearities in a flexible manner.

BART also incorporates a Bayesian framework, which enables the modeling of uncertainty in the parameters. It assigns prior distributions to the parameters and updates these distributions based on the observed data. This Bayesian approach allows for quantifying uncertainty in the model predictions and provides a measure of confidence for the estimated parameters.

In addition, BART performs automatic variable selection by determining the relevance of each predictor in the ensemble. It assigns weights to the predictors based on their importance in the model, which can help identify the most influential variables in the data.

Summary of Ensemble Methods

  • Bagging: Bagging combines multiple models trained on different subsets of the training data through bootstrapping. It reduces the impact of outliers and noise, improving model stability and robustness.
  • Random Forest: Random Forest is an ensemble method that combines multiple decision trees. It introduces randomness by selecting a random subset of features for each tree and averaging their predictions. It improves accuracy and handles high-dimensional data well.
  • Boosting: Boosting is an ensemble method that builds models sequentially, where each new model focuses on correcting the mistakes made by the previous models. It combines weak learners to create a strong learner and improves prediction accuracy.
  • BART (Bayesian Additive Regression Trees): BART is a Bayesian ensemble method that combines decision trees. It incorporates a Bayesian framework to model uncertainty in the parameters and allows for capturing complex interactions and non-linear relationships. It performs automatic variable selection and provides measures of confidence for predictions.


In this blog post, we explored the basics of tree-based methods and their advantages and disadvantages. Decision trees provide an interpretable approach to solving regression and classification problems, allowing us to understand the decision-making process. However, they are prone to overfitting and can be sensitive to variations in the training data.

To enhance the performance of decision trees, we explored ensemble methods: bagging, random forests, boosting, and Bayesian additive regression trees (BART). These techniques leverage the collective strength of multiple models to create a single, robust model with improved predictive capabilities.

Tree-based methods and their ensemble techniques offer powerful tools for solving a wide range of machine learning and data science problems. By understanding their principles and applying them effectively, we can build robust models that provide accurate predictions and valuable insights into our data.

Thank you for taking the time to read my article! If you found this article helpful or informative, don’t forget to press that clap icon as many times as you want. If you like my works and want to support me, please consider following me on Medium for more content like this in the future, or buy me a cup of coffee☕. Your support motivates me to continue creating valuable content for my readers.



Thomas Shaw

Artificial Intelligence | Machine Learning | Reinforcement Learning