Week 9 - Patterns that Modify Model Training: Checkpoints, Transfer Learning and Distribution Strategy

Lecture recording here.

Lab recording here.

Introduction

This week we will look at three patterns that modify model training: checkpoints, transfer learning, and distribution strategy. The checkpoints pattern periodically stores the full state of the model, so that partially trained models are available and training can resume from an intermediate point instead of starting from scratch. The transfer learning pattern takes parts of a previously trained model, freezes the weights, and uses these nontrainable layers in a new model that solves a similar problem; this is useful when the large datasets needed to train complex machine learning models from scratch are not available. The distribution strategy pattern carries out the training loop at scale over multiple workers, taking advantage of caching, hardware acceleration, and parallelization.

Video(s)

Checkpoints Design Patterns
  Checkpointing for long running Machine Learning Tasks - Jonas Eppelt
  Logging Deep Learning Checkpoints and Resuming Training from Checkpoint with DVC
  Handling Multi-Terabyte LLM Checkpoints // Simon Karasik

Transfer Learning Design Patterns
  Deep Learning Design Patterns - Jr Data Scientist - Part 7 - Transfer Learning
  Transfer Learning

Distribution Strategy
  Distributed Machine Learning with Python
  A friendly introduction to distributed training (ML Tech Talks)
  Inside TensorFlow: tf.distribute.Strategy

Assignment(s)

Assignment 4 - Multi-View Machine Learning Predictor
Assignment 5 - Multimodal Input: An Autonomous Driving System

The Checkpoints Design Pattern

The Rationale

The Checkpoints design pattern in machine learning ensures that progress during long or iterative training processes is safely preserved and recoverable. It involves periodically saving the model's state, parameters, and relevant metadata so training can resume from the most recent or best-performing point instead of restarting from scratch after interruptions or failures. This approach improves fault tolerance, enables reproducibility, supports early stopping and model selection, and saves computation time—making it an essential pattern for maintaining reliability and efficiency in real-world ML pipelines.

The UML

Checkpoints Design Pattern

  1. Trainer. Controls the main training loop. Coordinates learning and delegates persistence to the CheckpointManager so training logic stays clean and restartable.
  2. CheckpointManager. Manages creation, loading, and organization of checkpoints. Encapsulates when and where to save or restore progress, applies checkpoint policies, and communicates with storage backends.
  3. ModelState. Represents everything needed to resume training. Captures the minimal sufficient state for reproducibility and resumption, serialized and deserialized by the manager.
  4. Storage (Interface). Abstracts the persistence mechanism. Allows switching between different storage options without changing higher-level logic.
  5. LocalStorage & CloudStorage. Implement the Storage interface. Same interface, different backends - makes it easy to switch from local experiments to distributed training.

Code Example - Checkpoints

The sample code demonstrates the Checkpoints design pattern, which allows a long-running process (like model training) to save and resume progress safely. It defines a ModelState class to hold the model's state (epoch and performance metric), a Storage interface for saving and loading data, and a LocalStorage class that implements this interface using simple file I/O. The CheckpointManager tracks the best metric achieved so far and saves a checkpoint only when performance improves, while the Trainer class simulates a training loop that updates the model state and delegates checkpointing duties to the manager. Together, these components show how checkpointing separates progress persistence from training logic, improving reliability and fault tolerance in long-running computations.
C++: Checkpoints.cpp,
C#: Checkpoints.cs,
Java: Checkpoints.java,
Python: Checkpoints.py.
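
To make the moving parts concrete before opening the linked files, here is a minimal, self-contained C++ sketch of the same five components. The class names follow the description above, but the member signatures and the on-disk format are illustrative assumptions rather than the course implementation:

  #include <fstream>
  #include <iostream>
  #include <string>

  // Everything needed to resume training (kept deliberately tiny here).
  struct ModelState {
      int epoch = 0;
      double metric = 0.0;  // e.g., validation accuracy
  };

  // Abstracts the persistence mechanism; higher-level code never touches files.
  class Storage {
  public:
      virtual ~Storage() = default;
      virtual void save(const std::string& key, const ModelState& s) = 0;
      virtual bool load(const std::string& key, ModelState& s) = 0;
  };

  // One concrete backend; a CloudStorage class would implement the same interface.
  class LocalStorage : public Storage {
  public:
      void save(const std::string& key, const ModelState& s) override {
          std::ofstream out(key);
          out << s.epoch << ' ' << s.metric;
      }
      bool load(const std::string& key, ModelState& s) override {
          std::ifstream in(key);
          return static_cast<bool>(in >> s.epoch >> s.metric);
      }
  };

  // Decides when to checkpoint: here, only when the metric improves.
  class CheckpointManager {
      Storage& storage_;
      double best_ = -1.0;
  public:
      explicit CheckpointManager(Storage& s) : storage_(s) {}
      void maybeSave(const ModelState& s) {
          if (s.metric > best_) {
              best_ = s.metric;
              storage_.save("best.ckpt", s);
              std::cout << "checkpoint saved at epoch " << s.epoch << '\n';
          }
      }
      bool restore(ModelState& s) {
          if (!storage_.load("best.ckpt", s)) return false;
          best_ = s.metric;
          return true;
      }
  };

  // The training loop delegates persistence so the learning logic stays clean.
  class Trainer {
      ModelState state_;
      CheckpointManager& ckpt_;
  public:
      explicit Trainer(CheckpointManager& c) : ckpt_(c) { ckpt_.restore(state_); }
      void train(int epochs) {
          for (int e = state_.epoch + 1; e <= epochs; ++e) {
              state_.epoch = e;
              state_.metric = 1.0 - 1.0 / e;  // simulated improvement
              ckpt_.maybeSave(state_);
          }
      }
  };

  int main() {
      LocalStorage storage;
      CheckpointManager manager(storage);
      Trainer trainer(manager);
      trainer.train(5);
  }

Running the program twice shows the payoff: the second run restores epoch 5 from best.ckpt and the loop body never executes, exactly the resume-instead-of-restart behavior the pattern is for.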

Common Usage

The Checkpoints design pattern is a foundational reliability and recovery mechanism across many domains, especially in large-scale computation, machine learning, and data processing. Here are five real-world examples of where it's used in industry:

  1. Machine Learning Model Training (TensorFlow, PyTorch). Frameworks like TensorFlow, PyTorch, and Keras use checkpoints to periodically save model weights, optimizer states, and configurations during training. If training is interrupted (e.g., hardware failure, preempted GPU instance), it can resume from the last checkpoint instead of restarting. This is essential for large models like GPT, BERT, or ResNet that can take days or weeks to train.
  2. Big Data Processing (Apache Spark, Hadoop). Systems like Apache Spark and Hadoop MapReduce use checkpointing to save intermediate computation states during distributed processing. If a node fails, tasks can resume from the last checkpoint rather than recomputing the entire data flow, ensuring fault tolerance and reducing recomputation time for iterative algorithms like PageRank or K-Means.
  3. Streaming Data Pipelines (Apache Flink, Kafka Streams). In real-time data processing systems (e.g., Apache Flink, Kafka Streams), checkpoints are used to maintain exactly-once processing guarantees. The system periodically saves operator states and message offsets so that, in case of failure, it can restore the pipeline to a consistent state without data loss or duplication.
  4. High-Performance Computing (HPC) and Simulations. Scientific computing systems (e.g., NASA, CERN, weather forecasting centers) use checkpointing in long-running numerical simulations that can last days on supercomputers. Checkpointing allows restarting simulations from recent states after hardware faults, software crashes, or maintenance downtime, avoiding the loss of expensive compute time.
  5. Database and Distributed Systems Recovery. Modern databases (PostgreSQL, MySQL) and distributed systems (Google Spanner, Amazon Aurora) use checkpoints (or similar write-ahead logging concepts) to capture consistent snapshots of in-memory data. After crashes or restarts, the database recovers quickly to a consistent state without replaying all historical operations from scratch.

In short, the Checkpoints pattern is widely used anywhere reliability, fault recovery, or reproducibility is critical: from deep learning and big data to real-time analytics, scientific computing, and distributed storage systems.

Code Problem - Logistic Regression

This C++ program demonstrates an in-memory Checkpoints design pattern applied to logistic regression training. It builds a small machine learning model that performs gradient descent on synthetic data while automatically saving and restoring checkpoints entirely in memory (no files). A CheckpointManager tracks the latest and best model states, managed by an InMemoryStorage class that acts like a volatile database. When the loss diverges (spikes or becomes NaN/∞), the system rolls back to the last best checkpoint, reduces the learning rate, and continues training — ensuring resilience and stability. This example illustrates how checkpointing improves fault tolerance, adaptive learning rate control, and recovery in iterative optimization without relying on external storage.
ModelState.h, the model state structure,
Storage.h, storage interface class,
InMemoryStorage.h, concrete storage,
CheckpointPolicy.h, the checkpoint policy structure,
CheckpointManager.h, the checkpoint manager,
Dataset.h, the dataset structure (complex),
Trainer.h, the trainer structure (complex),
Utilities.h,
Utilities.cpp, some mathematical utilities,
LRegress.cpp, the main function.
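
The recovery logic at the heart of that program can be boiled down to a few lines. The following condensed C++ sketch uses a toy quadratic loss, a deliberately oversized learning rate, and illustrative names (the linked files are the full, class-based version), but it exercises the same detect-rollback-cool-down cycle:

  #include <cmath>
  #include <iostream>
  #include <vector>

  // Minimal in-memory checkpoint: the parameters plus the loss they achieved.
  struct Snapshot {
      std::vector<double> w;
      double loss;
  };

  static double lossOf(const std::vector<double>& w) {
      return w[0] * w[0] + w[1] * w[1];  // toy convex objective
  }

  int main() {
      std::vector<double> w = {5.0, -3.0};
      double lr = 1.5;  // too large on purpose: the first step diverges
      Snapshot best{w, lossOf(w)};

      for (int step = 1; step <= 20; ++step) {
          // Gradient of w0^2 + w1^2 is 2w; plain gradient descent update.
          for (double& wi : w) wi -= lr * 2.0 * wi;
          double loss = lossOf(w);

          if (!std::isfinite(loss) || loss > 2.0 * best.loss) {
              // Divergence detected: roll back to the best checkpoint
              // and halve the learning rate before continuing.
              w = best.w;
              lr *= 0.5;
              std::cout << "step " << step << ": rollback, lr=" << lr << '\n';
              continue;
          }
          if (loss < best.loss) best = {w, loss};  // keep the best state
          std::cout << "step " << step << ": loss=" << loss << '\n';
      }
  }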

Memento vs. ML Checkpoints

  Primary goal. Memento: undo/redo and safe state rollback for a single object. ML Checkpoints: fault tolerance, resumability, and reproducibility of training.
  Scope of state. Memento: typically one object's internal fields. ML Checkpoints: the full training pipeline, including model weights, optimizer state, schedulers, RNG seeds, epoch/step, and data-loader cursors.
  Lifetime. Memento: short-lived, in-memory, often per user action. ML Checkpoints: long-lived, on disk or in the cloud; may be versioned and shared.
  Encapsulation. Memento: the caretaker cannot access internals (a black-box snapshot). ML Checkpoints: not about encapsulation; artifacts are intentionally inspectable and portable.
  Granularity. Memento: fine-grained (per action). ML Checkpoints: coarse or batched (e.g., every N steps/epochs or on metric plateaus).
  Performance concerns. Memento: minimal (an object copy). ML Checkpoints: heavy I/O and large artifacts; trade-offs around frequency, format, and compression.
  Concurrency/distribution. Memento: usually single-process. ML Checkpoints: commonly multi-GPU/multi-node; saves must be atomic and consistent across workers.
  Failure model. Memento: logical errors/UX (revert a recent change). ML Checkpoints: system faults and preemption; recover from crashes or move training between hosts.
  Tooling. Memento: a pure OO pattern. ML Checkpoints: framework support (e.g., PyTorch state_dict, TF Checkpoint, orchestration hooks).

The Transfer Learning Design Pattern

Transfer learning, as used in machine learning, is the reuse of a pre-trained model on a new problem: the machine exploits the knowledge gained from a previous task to improve generalization on another. For example, a classifier trained to predict whether an image contains food could reuse the knowledge it gained during training to help recognize drinks. For more information see What Is Transfer Learning?.

The Rationale

The rationale behind the transfer learning design pattern stems from the observation that deep learning models trained on large-scale datasets can learn generic features that are useful for a wide range of tasks. Transfer learning leverages this idea by reusing pre-trained models as a starting point for new tasks.

The UML

Here is the UML diagram for the transfer learning pattern:

  +---------------------------------+
  |           Pre-trained Model     |
  +---------------------------------+
  |                                 |
  | - Trained on large-scale dataset|
  | - Captures generic features     |
  +---------------------------------+
               ^
               |
  +---------------------------------+
  |        New Task-Specific Model  |
  +---------------------------------+
  |                                 |
  | - Reuses pre-trained model      |
  | - Freezes pre-trained layers    |
  | - Adds new task-specific layers |
  | - Fine-tunes the model          |
  +---------------------------------+


Here are the components of the transfer learning design pattern:
  1. Pre-trained Model: This component represents a deep learning model that has been pre-trained on a large-scale dataset (e.g., ImageNet). It captures generic features and knowledge learned from the original task. The pre-trained model serves as a starting point for the transfer learning process.
  2. New Task-Specific Model: This component represents the model that is created for the new task using transfer learning. It consists of the pre-trained model as the base, with its layers frozen to preserve the learned features. New task-specific layers are added on top of the pre-trained layers. These additional layers are designed specifically for the new task and are trainable. The model is then fine-tuned on the new task-specific data to adapt the knowledge from the pre-trained layers to the target domain.

Code Example - Transfer Learning

In this code, we have two classes: PretrainedModel and NewTaskModel. The PretrainedModel class represents the pre-trained model; it has methods to load the pre-trained model weights and to extract features using the pre-trained layers. The NewTaskModel class represents the model for the new task; it has methods to add new task-specific layers and to perform fine-tuning on the new task-specific data. In the main() function, we create an instance of PretrainedModel and load the pre-trained model weights. Then, we create an instance of NewTaskModel. The transfer learning process involves extracting features from the pre-trained model using pretrainedModel.extractFeatures(), adding task-specific layers using newTaskModel.addTaskSpecificLayers(), and fine-tuning the model on the new task-specific data using newTaskModel.fineTune(). Finally, the trained task-specific model can be used for inference or any other desired tasks.

Note that this code is a basic representation to illustrate the concept of transfer learning in C++. In practice, you would need to adapt it to the specific deep learning framework or library you are using and incorporate additional functionalities as needed:
C++: TransferLearning.cpp.
C#: TransferLearning.cs.
Java: TransferLearning.java.
Python: TransferLearning.py.
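
As a complement to the linked files, here is a compact, framework-free C++ sketch of the flow just described. The layer logic is stubbed out with print statements and placeholder arithmetic, so treat the method bodies as assumptions about shape rather than real training code:

  #include <iostream>
  #include <string>
  #include <vector>

  // Stand-in for a network trained on a large dataset; its weights stay frozen.
  class PretrainedModel {
  public:
      void loadWeights(const std::string& path) {
          std::cout << "loading pre-trained weights from " << path << '\n';
      }
      // Frozen layers act as a fixed feature extractor.
      std::vector<double> extractFeatures(const std::vector<double>& input) const {
          return input;  // placeholder: a real model would transform the input
      }
  };

  class NewTaskModel {
      const PretrainedModel& base_;
  public:
      explicit NewTaskModel(const PretrainedModel& base) : base_(base) {}
      void addTaskSpecificLayers() {
          std::cout << "adding a trainable head for the new task\n";
      }
      void fineTune(const std::vector<std::vector<double>>& data) {
          for (const auto& sample : data) {
              auto features = base_.extractFeatures(sample);  // frozen base
              (void)features;  // only the new head would be updated here
          }
          std::cout << "fine-tuned on " << data.size() << " samples\n";
      }
  };

  int main() {
      PretrainedModel pretrained;
      pretrained.loadWeights("imagenet_weights.bin");  // hypothetical file
      NewTaskModel model(pretrained);
      model.addTaskSpecificLayers();
      model.fineTune({{0.1, 0.2}, {0.3, 0.4}});
  }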

Common Usage

The transfer learning design pattern is commonly used in various scenarios to leverage the knowledge gained from pre-trained models and apply it to new tasks or domains. The following are some common usages of the transfer learning design pattern:

  1. Image Classification: Transfer learning is extensively used in image classification tasks. Pre-trained models trained on large-scale datasets, such as ImageNet, are used as a starting point. The learned features from the pre-trained model are extracted and used as input to train a new classifier for a specific set of classes or a different dataset.
  2. Object Detection: Transfer learning can be applied to object detection tasks. Pre-trained models, such as Faster R-CNN or SSD, are used as feature extractors. The pre-trained model's convolutional layers are used to extract features from input images, and then additional layers are added to perform object detection on specific classes or datasets.
  3. Natural Language Processing (NLP): In NLP tasks, transfer learning can be applied using pre-trained models like BERT or GPT. The pre-trained models are fine-tuned on specific NLP tasks, such as sentiment analysis, named entity recognition, or machine translation. The pre-trained models' language understanding capabilities are leveraged, and additional layers are added for task-specific fine-tuning.
  4. Speech Recognition: Transfer learning can be used in speech recognition tasks. Pre-trained models, such as DeepSpeech or Listen, Attend and Spell (LAS), can be used to extract high-level features from audio data. These features are then used to train a new classifier or decoder for specific speech recognition tasks or datasets.
  5. Anomaly Detection: Transfer learning can be applied to anomaly detection tasks, where pre-trained models are used to capture the normal behavior of a system or dataset. The pre-trained models' learned representations are used to detect deviations or anomalies from the normal behavior in new data.
  6. Recommendation Systems: Transfer learning can be used in recommendation systems to leverage knowledge from pre-trained models. For example, pre-trained models trained on large-scale user behavior data can be used to initialize embeddings or feature representations in recommendation models, allowing them to benefit from the learned user preferences and patterns.

Code Problem - Sentiment Analysis

In this example, we have a PretrainedWordEmbeddings class that represents pre-trained word embeddings. It has methods to load the pre-trained embeddings from a file and retrieve the word embeddings for specific words. The SentimentAnalysisModel class represents the sentiment analysis model. It has a dependency on the PretrainedWordEmbeddings class. The model loads the pre-trained word embeddings and uses them for training and prediction. In the main() function, we create an instance of SentimentAnalysisModel, load the pre-trained word embeddings, and train the model using transfer learning by calling trainModel() with the training data file path. Finally, we use the trained model for sentiment prediction by calling predictSentiment() with an input text. The predicted sentiment class is returned and printed to the console.
PretrainedWordEmbeddings.h,
SentimentAnalysisModel.h,
SentimentAnalysisModel.cpp,
Sentiment.cpp.
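
A stripped-down version of the same idea, assuming a tiny hard-coded embedding table in place of a real embeddings file and a fixed (untrained) linear head, might look like this in C++:

  #include <iostream>
  #include <map>
  #include <numeric>
  #include <sstream>
  #include <string>
  #include <vector>

  // Toy stand-in for embeddings that would normally be loaded from a file.
  class PretrainedWordEmbeddings {
      std::map<std::string, std::vector<double>> table_;
  public:
      void load() {  // hard-coded here; the course code reads a file instead
          table_["good"] = {1.0, 0.2};
          table_["bad"]  = {-1.0, 0.1};
      }
      std::vector<double> lookup(const std::string& word) const {
          auto it = table_.find(word);
          return it != table_.end() ? it->second
                                    : std::vector<double>{0.0, 0.0};
      }
  };

  // Averages the frozen embeddings of a sentence, then applies a linear head.
  class SentimentAnalysisModel {
      PretrainedWordEmbeddings emb_;
      std::vector<double> head_{1.0, 0.0};  // would be learned in trainModel()
  public:
      void loadEmbeddings() { emb_.load(); }
      int predictSentiment(const std::string& text) const {
          std::istringstream words(text);
          std::vector<double> avg(2, 0.0);
          std::string w;
          int n = 0;
          while (words >> w) {
              auto v = emb_.lookup(w);
              for (size_t i = 0; i < avg.size(); ++i) avg[i] += v[i];
              ++n;
          }
          if (n > 0) for (double& x : avg) x /= n;
          double score = std::inner_product(avg.begin(), avg.end(),
                                            head_.begin(), 0.0);
          return score >= 0.0 ? 1 : 0;  // 1 = positive, 0 = negative
      }
  };

  int main() {
      SentimentAnalysisModel model;
      model.loadEmbeddings();
      std::cout << "sentiment: " << model.predictSentiment("good movie") << '\n';
  }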

Code Problem - Image Classification

In the following example, a pre-trained model is used to help with image classification.
PretrainedModel.h,
TaskSpecificModel.h,
ImageClassificationModel.h,
ImageClassificationMain.cpp.

The Distribution Strategy Design Pattern

As the size of neural networks increases, so do their computation and memory demands. The longer it takes to train a model, the more resources are expended. To address these challenges, a distribution strategy is employed that spreads the training process across multiple workers, typically GPUs. This enables efficient parallelization of the training loop, allowing larger models and datasets to be handled while minimizing training time.

The Rationale

The rationale behind the distribution strategy design pattern is rooted in efficiency and scalability. By partitioning data into batches and distributing the workload across multiple GPUs or machines, this approach significantly reduces training time, allows training of models that exceed single-GPU memory capacity, and enables efficient processing of massive datasets. The strategy not only maximizes hardware utilization but also scales effectively as model complexity and data volume increase, ultimately facilitating the development of more sophisticated deep learning models that can tackle increasingly complex problems.
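
The efficiency claim can be made precise for the common synchronous data-parallel case: averaging per-worker gradients reproduces the full-batch gradient exactly, so adding workers shrinks the wall-clock time per step without changing the update. A sketch in LaTeX, assuming a batch of size B split evenly across K workers, loss \ell, parameters \theta, and learning rate \eta:

  g_k = \frac{K}{B} \sum_{i \in \mathcal{S}_k} \nabla_\theta \, \ell(x_i; \theta),
  \qquad
  g = \frac{1}{K} \sum_{k=1}^{K} g_k
    = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta \, \ell(x_i; \theta),
  \qquad
  \theta \leftarrow \theta - \eta \, g

Here \mathcal{S}_k is worker k's shard of B/K examples; the middle equality is what makes gradient averaging (e.g., via all-reduce) equivalent to single-device training on the full batch.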

The UML

Distribution Strategy Design Pattern

Components:

  1. CentralServer. Coordinates training rounds: broadcasts the current model to the workers and aggregates their results into a global update.
  2. Worker. Performs local computation on its assigned portion of the data and reports results back to the server.
  3. DataPartitioner. Splits the dataset into shards and assigns one to each worker.

Code Example - Distribution Strategy

This sample program is a simple skeleton illustrating the Machine Learning Distribution Strategy design pattern. It simulates a distributed training setup with three key components: a data partitioner that splits the workload, a set of workers that perform local computation, and a central coordinator that distributes the model and aggregates the results.

The program does not perform real computation; instead, it uses print statements to show the sequence of steps in a distributed training round — data partitioning, model distribution, local computation, gradient aggregation, and global model update.
C++: DistStrategy.cpp.
C#: DistStrategy.cs.
Java: DistStrategy.java.
Python: DistStrategy.py.
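
A one-round sketch of that sequence in C++, with a single scalar standing in for the model and shard means standing in for gradients (all names and numbers are illustrative), could look like this:

  #include <iostream>
  #include <numeric>
  #include <vector>

  // One synchronous round: partition -> broadcast -> local compute ->
  // aggregate -> update.
  int main() {
      const int numWorkers = 3;
      std::vector<int> data(12);
      std::iota(data.begin(), data.end(), 1);  // 1..12
      double model = 0.0;  // one "parameter" stands in for the whole model

      // 1. Partition the data evenly across workers.
      const int shard = data.size() / numWorkers;

      // 2-3. Broadcast the model; each worker computes a local "gradient"
      // (here simply the mean of its shard).
      std::vector<double> localGrads;
      for (int w = 0; w < numWorkers; ++w) {
          double sum = 0.0;
          for (int i = w * shard; i < (w + 1) * shard; ++i) sum += data[i];
          localGrads.push_back(sum / shard);
          std::cout << "worker " << w << " local result "
                    << localGrads.back() << '\n';
      }

      // 4. Aggregate (average) the local results.
      double g = std::accumulate(localGrads.begin(), localGrads.end(), 0.0)
                 / numWorkers;

      // 5. Update the global model once per round.
      model -= 0.1 * g;
      std::cout << "aggregated " << g << ", updated model " << model << '\n';
  }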

Common Usage

The Distribution Strategy pattern finds widespread application across various sectors of the software industry:

  1. Production Machine Learning Pipelines: Large tech companies use this pattern to train and update massive language models and recommendation systems, enabling them to process vast amounts of user data and maintain model accuracy in real-time environments.
  2. Automated Machine Learning (AutoML) Systems: AutoML platforms leverage distribution strategies to parallelize hyperparameter tuning, model selection, and feature engineering tasks, significantly reducing the time required to develop and optimize machine learning models.
  3. Data Preprocessing for Deep Learning Models: In computer vision and natural language processing, this pattern is crucial for efficiently handling the preprocessing of enormous datasets, including image resizing, augmentation, text tokenization, and embedding generation.
  4. Feature Engineering in Data Science Workflows: Data scientists employ distribution strategies to accelerate complex feature extraction, large-scale data transformations, and iterative feature selection processes, enabling them to work with larger datasets and develop more sophisticated features.

Code Problem - Distributed Training Concept Simulator

This C++ program is a simple multithreaded demonstration of the Distribution Strategy design pattern used in distributed machine learning systems. It conceptually shows how a CentralServer coordinates with multiple Workers to perform parallel tasks without involving any mathematical computations. The DataPartitioner divides a total workload (e.g., data batches) among workers, the CentralServer broadcasts a model version to all workers, and each Worker runs in its own thread, "processing" its assigned batches by sleeping briefly to simulate work before sending a completion signal back to the server. The server collects these acknowledgements, aggregates them, and increments the model version for the next round—illustrating the key steps of distribute, parallel process, aggregate, and update.
Worker.h,
DataPartitioner.h,
CentralServer.h,
DistTraining.cpp.
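
The thread choreography is easier to see in code than in prose. Here is a condensed, runnable C++ version of one round; the linked files separate these responsibilities into the Worker, DataPartitioner, and CentralServer classes, while this sketch inlines them (compile with -pthread):

  #include <atomic>
  #include <chrono>
  #include <iostream>
  #include <mutex>
  #include <thread>
  #include <vector>

  int main() {
      const int numWorkers = 4, totalBatches = 20;
      std::atomic<int> acks{0};   // completion signals sent to the server
      std::mutex printMtx;
      int modelVersion = 1;

      // DataPartitioner role: divide the workload evenly.
      const int perWorker = totalBatches / numWorkers;

      // CentralServer role: broadcast the model version and launch workers.
      std::vector<std::thread> workers;
      for (int id = 0; id < numWorkers; ++id) {
          workers.emplace_back([&, id] {
              // Worker role: "process" assigned batches by sleeping briefly.
              std::this_thread::sleep_for(
                  std::chrono::milliseconds(50 * (id + 1)));
              {
                  std::lock_guard<std::mutex> lk(printMtx);
                  std::cout << "worker " << id << " finished " << perWorker
                            << " batches of model v" << modelVersion << '\n';
              }
              ++acks;  // completion signal back to the server
          });
      }
      for (auto& t : workers) t.join();

      // CentralServer role: aggregate acknowledgements, advance the model.
      if (acks == numWorkers) ++modelVersion;
      std::cout << "all workers done; model advanced to v"
                << modelVersion << '\n';
  }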

What it demonstrates: the key steps of a distributed training round; the workload is partitioned across workers, the model version is broadcast, each worker processes its batches in a parallel thread, and the server collects the completion signals, aggregates them, and advances the global model version.

Distribution Strategy vs Parallel Processing

1. Core Idea

  Parallel Processing. Breaks a task into smaller independent pieces and runs them simultaneously for speed. Each process works independently and may not share learned state.
  Distribution Strategy Pattern. Coordinates model training across multiple devices or machines so that parameters are synchronized and convergence is consistent.

2. Goal and Context

  Goal. Parallel processing: faster computation by running tasks concurrently. Distribution strategy: scalable and fault-tolerant model training across devices.
  Context. Parallel processing: generic computing or data processing. Distribution strategy: machine learning systems (e.g., TensorFlow, PyTorch).
  Shared State. Parallel processing: usually none. Distribution strategy: workers share gradients, parameters, and updates.
  Coordination. Parallel processing: minimal or none. Distribution strategy: central orchestration via parameter servers or all-reduce.

3. Communication Model

  Data Flow. Parallel processing: independent inputs produce independent outputs. Distribution strategy: a shared global model is updated collectively.
  Synchronization. Parallel processing: optional (join at the end). Distribution strategy: essential; frequent gradient aggregation and synchronization.
  Communication Cost. Parallel processing: low (simple message passing). Distribution strategy: high; large tensors are exchanged between nodes.

4. Typical Example

Parallel processing: transforming millions of independent records across CPU cores, where no task depends on another's results. Distribution strategy: data-parallel training of a single neural network across several GPUs, with gradients aggregated after every step.

5. Analogy

Parallel processing = many chefs cooking different dishes independently.

Distribution strategy = many chefs cooking parts of the same dish, constantly sharing ingredients and results.

6. Summary

  Purpose. Parallel processing: speed up independent tasks. Distribution strategy: coordinate shared model training.
  Coordination. Parallel processing: low. Distribution strategy: high.
  Shared State. Parallel processing: none. Distribution strategy: shared model parameters.
  Communication. Parallel processing: minimal. Distribution strategy: frequent and structured.
  Failure Recovery. Parallel processing: task-level retry. Distribution strategy: requires checkpointing and synchronization.
  Frameworks. Parallel processing: OpenMP, multiprocessing, MapReduce. Distribution strategy: TensorFlow tf.distribute, PyTorch DistributedDataParallel, Horovod.