Lecture recording here.
This week we start design patterns for machine learning. We will look at one data representation pattern - the embeddings pattern. We will also look at one problem representation pattern - the rebalancing pattern. The embeddings pattern is for high-cardinality features where closeness relationships are important to preserve. It learns a data representation that maps high-cardinality data into a lower-dimensional space in such a way that the information relevant to the learning problem is preserved. The rebalancing pattern uses downsampling, upsampling, or a weighted loss function for heavily imbalanced data.
| Topic | Videos |
| --- | --- |
| Machine Learning, Supervised Learning | #4 Machine Learning Specialization; #5 Machine Learning Specialization |
| Machine Learning, Unsupervised Learning | #6 Machine Learning Specialization; #7 Machine Learning Specialization |
| Machine Learning Design Patterns | ML Design Patterns by Lak (1 hour lecture); Machine Learning Design Patterns (1 hour 20 minute lecture) |
| The Embeddings Pattern | Machine Learning Design Patterns Embeddings (7:05-12:40); Machine Learning Design Patterns (Dr Ebin Deni Raj) Embeddings (43:42-59:35) |
| The Rebalancing Pattern | Machine Learning Design Patterns (Dr Ebin Deni Raj) Rebalancing (59:30-1:12:50); Machine Learning Design Patterns, Michael Munn, Google (14:10-) |
Design patterns for machine learning can be broken into six categories: data representation, problem representation,
patterns that modify model training, resilience, reproducibility, and responsible AI. Data representation patterns
focus on efficient and effective ways to represent and organize data for use in machine learning algorithms.
Problem representation patterns focus on formulating machine learning problems in ways that facilitate effective
learning and modeling. Patterns that modify model training enhance the training process to improve performance,
convergence, and generalization. Resilience patterns improve the robustness and fault tolerance of machine learning
systems. Reproducibility patterns ensure that machine learning experiments and results can be reproduced consistently.
Responsible AI patterns ensure that machine learning systems are developed and deployed in an ethical and responsible
manner. These are summarized in the image:

These are also summarized in the second half of Common Patterns.docx.
The bolded patterns are the patterns we will cover in class. The number in brackets shows the popularity rank of
a particular pattern. The patterns in red were covered last year but due to declining popularity will not be covered in
this year's class. The patterns in green will be covered for the first time this year due to increasing popularity.
Note that we cover the 16 most popular machine learning design patterns.
For a full course on machine learning, see the playlist Stanford CS229: Machine Learning Full Course taught by Andrew Ng, Autumn 2018. Of interest to our study of machine learning design patterns is the second lecture on linear regression and gradient descent. See Stanford CS229: Machine Learning - Linear Regression and Gradient Descent.
For a shorter course on machine learning, see the playlist Machine Learning Specialization by Andrew Ng. For shorter videos on training data, see the following videos on supervised learning: #4 Machine Learning Specialization and #5 Machine Learning Specialization. See also the following videos on unsupervised learning: #6 Machine Learning Specialization and #7 Machine Learning Specialization.
The Rationale
The rationale for the embeddings design pattern in machine learning is to represent high-dimensional categorical or discrete features in a lower-dimensional continuous vector space. Embeddings are learned representations that capture meaningful relationships and semantic information between different categories or entities present in the data.
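As a minimal sketch of this idea (the cardinality, dimensions, and random initialization below are illustrative assumptions, not taken from the linked examples), an embedding is essentially a trainable lookup table from category IDs to dense vectors, in which closeness can be measured:

```python
import numpy as np

# Hypothetical setup: 10,000 distinct category IDs embedded into 8 dimensions.
num_categories = 10_000
embedding_dim = 8

# The embedding matrix is initialized randomly; in practice it is learned
# during training so that the geometry reflects the learning problem.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(num_categories, embedding_dim))

def get_embedding(category_id: int) -> np.ndarray:
    """Look up the dense vector for a high-cardinality categorical value."""
    return embedding_matrix[category_id]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Closeness in the embedding space, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1, v2 = get_embedding(42), get_embedding(4242)
print(cosine_similarity(v1, v2))
```

Note how the high-cardinality input (10,000 categories) is reduced to an 8-dimensional continuous space while still allowing similarity comparisons between categories.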
The UML
Here is a very rough UML diagram for the embeddings pattern:
+---------------------+          +----------------------------------+
| EmbeddingLayer      |<>--------| Model                            |
+---------------------+          +----------------------------------+
| - inputDim: int     |          | - embeddingLayer: EmbeddingLayer |
| - embeddingDim: int |          +----------------------------------+
+---------------------+          | + predict()                      |
| + getEmbedding()    |          +----------------------------------+
+---------------------+
Code Example - Embeddings Data Pattern
The following is a simple example of the embeddings data pattern:
C++: Embedding.cpp.
C#: Embeddings.cs.
Java: Embeddings.java.
Python: Embeddings.py.
Common Usage
The following are some common usages of the embeddings pattern:
Code Problem - Movie Recommendations
We want to implement a system that recommends movies to a user based on a list of
watched movies. We need an EmbeddingLayer class responsible for generating and
retrieving embeddings. We need a Movie class to represent a movie with an ID and
a title. We need a RecommenderSystem class that calls a recommendMovie
function for a specific user, passing their ID and the list of movies they've already
watched. The recommendMovie function takes a user ID and a list of watched movies
and recommends a movie based on the user's embeddings and a similarity metric. The code is seen
below.
Movie.h,
EmbeddingLayer.h,
RecommenderSystem.h,
MovieMain.cpp.
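The linked C++ files carry the full implementation; the following is a rough Python sketch of the same idea (the catalog size, random embeddings, and mean-of-watched taste vector are illustrative assumptions). recommendMovie averages the embeddings of the watched movies and returns the unwatched movie whose embedding is most similar:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical catalog: movie ID -> learned embedding (random here for illustration).
movie_embeddings = {movie_id: rng.normal(size=4) for movie_id in range(10)}

def recommend_movie(user_id: int, watched: list[int]) -> int:
    """Recommend the unwatched movie closest to the user's taste vector."""
    # Represent the user by the mean of the embeddings of the watched movies.
    taste = np.mean([movie_embeddings[m] for m in watched], axis=0)

    def similarity(movie_id: int) -> float:
        v = movie_embeddings[movie_id]
        return float(np.dot(taste, v) / (np.linalg.norm(taste) * np.linalg.norm(v)))

    # Only recommend movies the user has not already watched.
    candidates = [m for m in movie_embeddings if m not in watched]
    return max(candidates, key=similarity)

print(recommend_movie(user_id=7, watched=[0, 3, 5]))
```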
Code Problem - Predicting Financial Data
The following program uses historical prices as well as weights to predict a stock price for a given day.
The result is a dot product of the two vectors (historical prices, weights).
VectorOperations.h, vector dot product
FinancialData.h,
StockPredictionModel.h, contains the embedded data
FinancialDataMain.cpp.
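A minimal sketch of the prediction step described above (the prices and weights are made-up numbers, not taken from the linked files): the predicted price is simply the dot product of the historical-price vector and the weight vector:

```python
# Hypothetical five-day price history and learned weights (most recent day last).
historical_prices = [101.2, 102.5, 101.8, 103.0, 104.1]
weights = [0.05, 0.10, 0.15, 0.30, 0.40]  # heavier weight on recent days

# Predicted price = dot product of the two vectors.
predicted = sum(p * w for p, w in zip(historical_prices, weights))
print(round(predicted, 2))  # 103.12
```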
The Rationale
The rebalancing machine learning design pattern, also known as class rebalancing or data rebalancing, is employed in machine learning to address class imbalance issues in datasets. Class imbalance refers to a situation where the number of samples in different classes of a classification problem is significantly imbalanced, with one class having a much larger number of instances than the others.
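To make the three common remedies concrete, here is a small sketch (the 95/5 split and the inverse-frequency weighting scheme are invented for illustration) of downsampling the majority class, upsampling the minority class, and computing per-class weights for a weighted loss:

```python
import random

random.seed(0)

# Invented imbalanced dataset: 95 negative samples, 5 positive samples.
majority = [("neg", i) for i in range(95)]
minority = [("pos", i) for i in range(5)]

# Downsampling: keep only as many majority samples as there are minority samples.
downsampled = random.sample(majority, len(minority)) + minority

# Upsampling: repeat minority samples (with replacement) to match the majority.
upsampled = majority + random.choices(minority, k=len(majority))

# Weighted loss: weight each class inversely proportional to its frequency.
total = len(majority) + len(minority)
class_weights = {
    "neg": total / (2 * len(majority)),
    "pos": total / (2 * len(minority)),
}

print(len(downsampled), len(upsampled), class_weights)
```

Downsampling discards information, upsampling duplicates it, and a weighted loss keeps the data intact while making minority errors more costly; which to use depends on dataset size and the learning algorithm.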
The UML Diagram
Here is a simple UML diagram for the rebalancing design pattern:
 _________________                  _________________
| Dataset         |<>------------->| Rebalancer      |
|_________________|                |_________________|
| - data          |                | - rebalance()   |
| - labels        |                | - get_data()    |
| - num_classes   |                | - get_labels()  |
|_________________|                |_________________|
        ^
        |
        |
 _________________
| BaseModel       |
|_________________|
| - train()       |
| - predict()     |
| - evaluate()    |
|_________________|
        ^
        |
        |
 _________________
| RebalancedModel |
|_________________|
| - rebalancer    |
| - train()       |
| - predict()     |
| - evaluate()    |
|_________________|
Code Example - Rebalancing design pattern
The following is a simple code example of the rebalancing design pattern:
C++: Rebalancing.cpp.
C#: Rebalancing.cs.
Java: Rebalancing.java.
Python: Rebalancing.py.
Common Usage
The rebalancing design pattern is commonly used in various domains within the software industry where dealing with imbalanced datasets is a challenge. The following are some common usages of the rebalancing design pattern:
Code Problem - SMOTE rebalancer (simple)
The rebalancing design pattern in machine learning involves adjusting the class distribution in the training data to handle imbalanced datasets. Here is an example in C++ that demonstrates the rebalancing design pattern using the Synthetic Minority Over-sampling Technique (SMOTE) to handle imbalanced data. In a real-world scenario, you would integrate a more sophisticated SMOTE implementation, or other methods, to effectively rebalance the data before training your machine learning models.
The code is seen below:
DataSample.h,
Rebalancer.h,
RebalanceMain.cpp.
Code Problem - SMOTE algorithm
The following code contains pseudocode for the SMOTE algorithm:
Sample.h,
Smote.h,
SmoteMain.cpp.
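The linked headers give the pseudocode; the following is an illustrative Python sketch of the core SMOTE step (the value of k and the sample data are invented). Each synthetic sample is an interpolation between a minority sample and one of its k nearest minority-class neighbors:

```python
import math
import random

random.seed(0)

def smote(minority: list[list[float]], n_synthetic: int, k: int = 3) -> list[list[float]]:
    """Generate synthetic minority samples by interpolating toward nearest neighbors."""
    synthetic = []
    for _ in range(n_synthetic):
        x = random.choice(minority)
        # Find the k nearest minority neighbors of x (excluding x itself).
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbor = random.choice(neighbors)
        # Interpolate: x + gap * (neighbor - x), with gap uniform in [0, 1].
        gap = random.random()
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
print(smote(minority, n_synthetic=2))
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies rather than being arbitrary noise.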
Code Problem - SMOTE rebalancer (complex)
As above, the Rebalancer class uses the SMOTE algorithm to generate synthetic samples for the minority classes in the dataset. The rebalanced dataset is then used to train the base model (DecisionTreeModel) using the RebalancedModel class. The rebalanced model can then be used for predictions.
The code is seen below:
Dataset.h,
Rebalancer.h,
Rebalancer.cpp,
BaseModel.h,
DecisionTreeModel.h,
RebalancedModel.h,
RebalPredictor.cpp.
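Putting the pieces together, here is a rough Python sketch of the same flow. The class names mirror the C++ files, but the rebalancer below is a simple upsampler rather than SMOTE, and the base model is a trivial majority-class stub standing in for DecisionTreeModel:

```python
class Rebalancer:
    """Upsamples minority classes so every class matches the largest one."""
    def rebalance(self, data, labels):
        by_class = {}
        for x, y in zip(data, labels):
            by_class.setdefault(y, []).append(x)
        target = max(len(xs) for xs in by_class.values())
        new_data, new_labels = [], []
        for y, xs in by_class.items():
            for i in range(target):
                new_data.append(xs[i % len(xs)])  # repeat minority samples
                new_labels.append(y)
        return new_data, new_labels

class MajorityClassModel:
    """Stand-in for DecisionTreeModel: always predicts the most common label."""
    def train(self, data, labels):
        self.prediction = max(set(labels), key=labels.count)
    def predict(self, x):
        return self.prediction

class RebalancedModel:
    """Rebalances the training data before delegating to the base model."""
    def __init__(self, base_model, rebalancer):
        self.base_model, self.rebalancer = base_model, rebalancer
    def train(self, data, labels):
        data, labels = self.rebalancer.rebalance(data, labels)
        self.base_model.train(data, labels)
    def predict(self, x):
        return self.base_model.predict(x)

data = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
model = RebalancedModel(MajorityClassModel(), Rebalancer())
model.train(data, labels)
print(model.predict([0.5]))
```

The design point is the composition: RebalancedModel wraps any base model and applies the rebalancer transparently at training time, so prediction code does not need to know the data was rebalanced.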