Lecture recording (Nov 5, 2024) here.
Lab recording (Nov 7, 2024) here.
This week we study problem representation patterns. The specific patterns we will look at are the ensembles and rebalancing patterns. The ensembles pattern combines multiple machine learning models and aggregates their results to navigate the bias-variance trade-off on small- and medium-scale problems. The rebalancing pattern uses downsampling, upsampling, or a weighted loss function to handle heavily imbalanced data.
Ensemble Learning | Design Patterns for Machine Learning Projects By Anupama Natarajan (43:37-end)
The Rebalancing Pattern | Machine Learning Design Patterns | Dr Ebin Deni Raj (59:30-1:12:50)
Machine Learning Design Patterns | Michael Munn, Google (14:10-end)
Assignment 4 - Multi-View Machine Learning Predictor
Assignment 5 - Multimodal Input: An Autonomous Driving System
The Rationale
The ensemble design pattern combines multiple models or algorithms, which often results in better overall performance than any single model or algorithm used alone. Ensembling leverages the idea of "collective wisdom": aggregating predictions from diverse models yields more accurate and robust predictions.
The UML
Here is a rough UML diagram for the Ensembles pattern:
+-----------------------------------------+
|                Ensemble                 |
+-----------------------------------------+
| - models: List&lt;BaseModel&gt;               |
+-----------------------------------------+
| + addModel(model: BaseModel)            |
| + removeModel(model: BaseModel)         |
| + predict(input: InputData): OutputData |
+-----------------------------------------+
                    ^
                    |
          +-------------------+
          |     BaseModel     |
          +-------------------+
          |                   |
          |                   |
          +-------------------+
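A minimal C++ sketch of this structure is given below. The class and method names follow the diagram, while the InputData and OutputData types and the averaging rule are simplifying assumptions (real ensembles may vote, average, or weight their members):

#include <memory>
#include <vector>

using InputData  = std::vector<double>;  // simplified stand-in types
using OutputData = double;

// Abstract base class: every ensemble member implements predict().
class BaseModel {
public:
    virtual ~BaseModel() = default;
    virtual OutputData predict(const InputData& input) const = 0;
};

// The ensemble holds a list of models and aggregates their predictions.
class Ensemble {
public:
    void addModel(std::shared_ptr<BaseModel> model) { models_.push_back(std::move(model)); }
    void removeModel(const std::shared_ptr<BaseModel>& model) { std::erase(models_, model); }  // C++20
    OutputData predict(const InputData& input) const {
        double sum = 0.0;
        for (const auto& m : models_) sum += m->predict(input);
        return models_.empty() ? 0.0 : sum / models_.size();  // average the members' votes
    }
private:
    std::vector<std::shared_ptr<BaseModel>> models_;
};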
Code Example - Ensembles
The following is a simple example of the ensembles design pattern:
C++: Ensembles.cpp.
C#: EnsembleMain.cs.
Java: EnsembleMain.java.
Python: Ensemble.py.
Common Usage
In the field of machine learning, the ensembles design pattern is commonly used to improve the performance and robustness of predictive models. The following are some common usages of the ensembles design pattern:
- Bagging (bootstrap aggregating), as in random forests, where many decision trees trained on resampled data are averaged to reduce variance.
- Boosting, as in gradient boosting, where models are trained sequentially so that each new model corrects the errors of its predecessors, reducing bias.
- Stacking, where a meta-model learns how best to combine the predictions of several heterogeneous base models.
- Model averaging and majority voting across independently trained models.
Code Problem - Random Forest and Gradient Boosting
In this example, we define a BaseModel interface that specifies the common behavior for all models. We then implement the DecisionTreeModel class as a concrete implementation of the base model. Next, we define two ensemble classes: RandomForestEnsemble and GradientBoostingEnsemble. The RandomForestEnsemble class represents a random forest ensemble, which internally contains multiple decision tree models. The GradientBoostingEnsemble class represents a gradient boosting ensemble, which can accept multiple models, including decision trees.
In the main() function, we create instances of the ensembles and perform predictions on a randomly generated input vector. The RandomForestEnsemble predicts the output by averaging the predictions from all decision trees, while the GradientBoostingEnsemble predicts the output by summing the predictions from all models.
The code is listed below, followed by a condensed sketch of the two aggregation strategies:
BaseModel.h,
DecisionTreeModel.h,
RandomForestEnsemble.h,
GradientBoostingEnsemble.h,
EnsPrediction.cpp.
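As a rough, self-contained illustration of the difference between the two ensembles (the class names mirror the linked files, but the decision "trees" are reduced to single-split stubs; real trees would be learned from data):

#include <iostream>
#include <memory>
#include <vector>

class BaseModel {
public:
    virtual ~BaseModel() = default;
    virtual double predict(const std::vector<double>& x) const = 0;
};

// Single-split stub; a real DecisionTreeModel would be trained on data.
class DecisionTreeModel : public BaseModel {
public:
    explicit DecisionTreeModel(double threshold) : threshold_(threshold) {}
    double predict(const std::vector<double>& x) const override {
        return x[0] > threshold_ ? 1.0 : 0.0;
    }
private:
    double threshold_;
};

// Random forest: average the trees' predictions.
class RandomForestEnsemble {
public:
    void add(std::shared_ptr<BaseModel> m) { models_.push_back(std::move(m)); }
    double predict(const std::vector<double>& x) const {
        double sum = 0.0;
        for (const auto& m : models_) sum += m->predict(x);
        return sum / models_.size();
    }
private:
    std::vector<std::shared_ptr<BaseModel>> models_;
};

// Gradient boosting: sum the (residual-fitting) models' predictions.
class GradientBoostingEnsemble {
public:
    void add(std::shared_ptr<BaseModel> m) { models_.push_back(std::move(m)); }
    double predict(const std::vector<double>& x) const {
        double sum = 0.0;
        for (const auto& m : models_) sum += m->predict(x);
        return sum;
    }
private:
    std::vector<std::shared_ptr<BaseModel>> models_;
};

int main() {
    RandomForestEnsemble forest;
    GradientBoostingEnsemble boosting;
    for (double t : {0.2, 0.5, 0.8}) {
        forest.add(std::make_shared<DecisionTreeModel>(t));
        boosting.add(std::make_shared<DecisionTreeModel>(t));
    }
    std::vector<double> input{0.6};
    std::cout << "forest (average): " << forest.predict(input)   << '\n'   // 2/3
              << "boosting (sum): "   << boosting.predict(input) << '\n';  // 2
}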
Code Problem - Multiple Decision Trees
The following example uses multiple decision trees to make a prediction. There is only one ensemble class and two decision tree models.
BaseModel.h,
DecisionTree.h,
DecisionTree2.h,
Ensemble.h,
EnsPrediction2.cpp.
The Rationale
The rebalancing design pattern, also known as class rebalancing or data rebalancing, is employed to address class imbalance in datasets. Class imbalance refers to a classification problem in which the number of samples differs significantly between classes, with one class having far more instances than the others.
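As a concrete illustration of the weighted-loss remedy mentioned in the introduction, here is a small sketch that computes inverse-frequency class weights; the helper name and the weighting scheme are assumptions for this sketch, not part of the linked code:

#include <iostream>
#include <unordered_map>
#include <vector>

// Hypothetical helper: inverse-frequency class weights, one common
// choice for a weighted loss on imbalanced data.
std::unordered_map<int, double> classWeights(const std::vector<int>& labels) {
    std::unordered_map<int, double> counts;
    for (int y : labels) counts[y] += 1.0;
    std::unordered_map<int, double> weights;
    for (const auto& [label, count] : counts)
        weights[label] = static_cast<double>(labels.size()) / (counts.size() * count);
    return weights;
}

int main() {
    // 9:1 imbalance: the minority class receives a proportionally larger weight.
    std::vector<int> labels{0, 0, 0, 0, 0, 0, 0, 0, 0, 1};
    for (const auto& [label, w] : classWeights(labels))
        std::cout << "class " << label << " weight " << w << '\n';
    // class 0 weight ~0.556, class 1 weight 5
}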
The UML Diagram
Here is a simple UML diagram for the rebalancing design pattern:
 ______________                 ________________
| Dataset      |<>------------>| Rebalancer     |
|______________|               |________________|
| - data       |               | - rebalance()  |
| - labels     |               | - get_data()   |
| - num_classes|               | - get_labels() |
|______________|               |________________|
        ^
        |
 ___________________
| BaseModel         |
|___________________|
| - train()         |
| - predict()       |
| - evaluate()      |
|___________________|
        ^
        |
 ___________________
| RebalancedModel   |
|___________________|
| - rebalancer      |
| - train()         |
| - predict()       |
| - evaluate()      |
|___________________|
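A minimal C++ skeleton of the same structure, with the data types simplified so that each row is a feature vector with an integer label; the method names follow the diagram, while the exact signatures are assumptions:

#include <vector>

using Row = std::vector<double>;

// Holds the (possibly imbalanced) data.
class Dataset {
public:
    std::vector<Row> data;
    std::vector<int> labels;
    int num_classes = 0;
};

// Produces a rebalanced copy of a dataset (by over/undersampling, SMOTE, ...).
class Rebalancer {
public:
    virtual ~Rebalancer() = default;
    virtual Dataset rebalance(const Dataset& d) const = 0;
};

// Common model interface.
class BaseModel {
public:
    virtual ~BaseModel() = default;
    virtual void train(const Dataset& d) = 0;
    virtual int predict(const Row& x) const = 0;
    virtual double evaluate(const Dataset& d) const = 0;
};

// Decorator-style wrapper: rebalances the data before delegating training.
class RebalancedModel : public BaseModel {
public:
    RebalancedModel(BaseModel& base, const Rebalancer& rebalancer)
        : base_(base), rebalancer_(rebalancer) {}
    void train(const Dataset& d) override { base_.train(rebalancer_.rebalance(d)); }
    int predict(const Row& x) const override { return base_.predict(x); }
    double evaluate(const Dataset& d) const override { return base_.evaluate(d); }
private:
    BaseModel& base_;
    const Rebalancer& rebalancer_;
};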
Code Example - Rebalancing
The following is a simple code example of the rebalancing design pattern:
C++: Rebalancing.cpp.
C#: Rebalancing.cs.
Java: Rebalancing.java.
Python: Rebalancing.py.
Common Usage
The rebalancing design pattern is commonly used in domains within the software industry where imbalanced datasets are the norm. The following are some common usages of the rebalancing design pattern:
- Fraud detection, where fraudulent transactions are a tiny fraction of all transactions.
- Medical diagnosis, where positive cases of a disease are rare relative to healthy cases.
- Spam and intrusion detection, where malicious events are heavily outnumbered by benign ones.
- Defect prediction and quality control, where faulty items are much rarer than good ones.
Code Problem - SMOTE rebalancer (simple)
The rebalancing design pattern in machine learning involves adjusting the class distribution in the training data to handle imbalanced datasets. Here is an example in C++ that demonstrates the rebalancing design pattern using the Synthetic Minority Over-sampling Technique (SMOTE) to handle imbalanced data. In a real-world scenario, you would integrate a more sophisticated SMOTE implementation or other methods to effectively rebalance the data before training your machine learning models.
The code is listed below, followed by a condensed sketch of the oversampling step:
DataSample.h,
Rebalancer.h,
RebalanceMain.cpp.
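For orientation, here is a self-contained sketch of the simplest possible rebalancing step, random oversampling by duplication; the DataSample struct echoes the linked header, while the function name is an assumption. A full SMOTE implementation would interpolate new points rather than duplicate existing ones, as in the next problem:

#include <random>
#include <vector>

struct DataSample {
    std::vector<double> features;
    int label;
};

// Grow the given minority class to targetCount samples by duplicating
// randomly chosen members. Requires at least one minority sample.
// Duplication is a crude stand-in for SMOTE's synthetic interpolation.
std::vector<DataSample> oversample(std::vector<DataSample> samples,
                                   int minorityLabel, std::size_t targetCount) {
    std::vector<std::size_t> minorityIdx;
    for (std::size_t i = 0; i < samples.size(); ++i)
        if (samples[i].label == minorityLabel) minorityIdx.push_back(i);

    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<std::size_t> pick(0, minorityIdx.size() - 1);
    for (std::size_t n = minorityIdx.size(); n < targetCount; ++n)
        samples.push_back(samples[minorityIdx[pick(rng)]]);
    return samples;
}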
Code Problem - SMOTE algorithm
The following code contains pseudocode for the SMOTE algorithm; a condensed C++ sketch of the core interpolation step is given after the listings:
Sample.h,
Smote.h,
SmoteMain.cpp.
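The heart of SMOTE is: for each minority sample, find a nearby neighbour within the minority class and create a synthetic point on the line segment between the two. A minimal sketch of that step, using the single nearest neighbour (k = 1) where a full implementation would pick randomly among the k nearest:

#include <limits>
#include <random>
#include <vector>

using Sample = std::vector<double>;

double squaredDistance(const Sample& a, const Sample& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Generate one synthetic sample per minority sample (SMOTE with k = 1).
// Requires at least two minority samples.
std::vector<Sample> smote(const std::vector<Sample>& minority) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> gap(0.0, 1.0);
    std::vector<Sample> synthetic;

    for (std::size_t i = 0; i < minority.size(); ++i) {
        // Brute-force nearest neighbour within the minority class.
        std::size_t nn = i;
        double best = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < minority.size(); ++j) {
            if (j == i) continue;
            double d = squaredDistance(minority[i], minority[j]);
            if (d < best) { best = d; nn = j; }
        }
        // Place the synthetic point a random fraction of the way to the neighbour.
        double t = gap(rng);
        Sample s(minority[i].size());
        for (std::size_t f = 0; f < s.size(); ++f)
            s[f] = minority[i][f] + t * (minority[nn][f] - minority[i][f]);
        synthetic.push_back(std::move(s));
    }
    return synthetic;
}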
Code Problem - SMOTE rebalancer (complex)
As above, the Rebalancer class uses the SMOTE algorithm to generate synthetic samples for the minority classes in the dataset. The rebalanced dataset is then used to train the base model (DecisionTreeModel) using the RebalancedModel class. The rebalanced model can then be used for predictions.
The code is listed below, followed by a hypothetical driver sketching the training flow:
Dataset.h,
Rebalancer.h,
Rebalancer.cpp,
BaseModel.h,
DecisionTreeModel.h,
RebalancedModel.h,
RebalPredictor.cpp.
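To show how the pieces fit together, here is a hypothetical driver in the spirit of RebalPredictor.cpp. It assumes the Dataset/Rebalancer/BaseModel/RebalancedModel interfaces sketched in the UML section above; SmoteRebalancer, loadImbalancedDataset(), and the sample input are illustrative placeholders, not part of the linked code:

// Hypothetical driver; assumes the skeleton interfaces sketched in the
// UML section above plus illustrative concrete classes.
int main() {
    Dataset raw = loadImbalancedDataset();    // placeholder loader

    SmoteRebalancer rebalancer;               // concrete Rebalancer using SMOTE
    DecisionTreeModel tree;                   // concrete BaseModel
    RebalancedModel model(tree, rebalancer);  // rebalances before training

    model.train(raw);                         // trains on the rebalanced copy
    int label = model.predict({0.3, 0.7});    // predictions work as usual
    (void)label;
}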