Week 8 - Data Representation Patterns: Embeddings; Problem Representation Patterns: Rebalancing

Lecture recording here.

Introduction

This week we begin design patterns for machine learning. We will look at one data representation pattern, the embeddings pattern, and one problem representation pattern, the rebalancing pattern. The embeddings pattern is for high-cardinality features where closeness relationships are important to preserve. It learns a data representation that maps high-cardinality data into a lower-dimensional space in such a way that the information relevant to the learning problem is preserved. The rebalancing pattern uses downsampling, upsampling, or a weighted loss function to handle heavily imbalanced data.

Videos

Machine Learning, Supervised Learning: #4 Machine Learning Specialization and #5 Machine Learning Specialization.
Machine Learning, Unsupervised Learning: #6 Machine Learning Specialization and #7 Machine Learning Specialization.
Machine Learning Design Patterns: ML Design Patterns by Lak (1 hour lecture) and Machine Learning Design Patterns (1 hour 20 minute lecture).
The Embeddings Pattern: Machine Learning Design Patterns Embeddings (7:05-12:40) and Machine Learning Design Patterns | Dr Ebin Deni Raj Embeddings (43:42-59:35).
The Rebalancing Pattern: Machine Learning Design Patterns | Dr Ebin Deni Raj Rebalancing (59:30-1:12:50) and Machine Learning Design Patterns | Michael Munn, Google (14:10-).

Assignment(s)

Assignment 4 - Multi-View Machine Learning Predictor

Common Categories of Machine Learning Design Patterns

Design patterns for machine learning can be broken into six categories:

1. Data Representation

These patterns are about how we prepare and organize data so a machine can understand it. They make sure the data is in the right form - clean, structured, and meaningful - before training a model.
Example: Using Embeddings to turn words into numbers that capture their meaning.

2. Problem Representation

These patterns deal with how we define the problem for the model to solve. They help us choose the best way to express the question - classification, regression, ranking, etc. - so the model can learn effectively.
Example: Turning "recommend a movie" into a ranking problem instead of a yes/no question.

3. Patterns That Modify Model Training

These patterns focus on how we train the model - improving accuracy, speed, and generalization. They might change how the model learns from data or how we combine multiple models.
Example: Using Ensemble Learning (combining several models) to get better results.

4. Resilience

These patterns help make ML systems strong and reliable in the real world. They handle unexpected inputs, missing data, or system failures, so the model doesn't break easily.
Example: Using Checkpoints to save progress during training, so it can restart if something fails.

5. Reproducibility

These patterns ensure that we can repeat the same experiment and get the same results. They standardize how data, models, and code are tracked and shared.
Example: Keeping exact versions of data and model configurations so results can be verified later.

6. Responsible AI

These patterns focus on doing AI the right way - fairly, safely, and ethically. They make sure models are transparent, unbiased, and respect privacy.
Example: Using a Fairness Lens pattern to detect and reduce bias in predictions.

Summary Table

Category                Simple Meaning                     Example
Data Representation     Getting data ready for ML          Embeddings
Problem Representation  Defining the problem clearly       Ranking, classification
Modify Model Training   Improving how models learn         Ensembles
Resilience              Making systems reliable            Checkpoints
Reproducibility         Ensuring results can be repeated   Version tracking
Responsible AI          Building fair, ethical systems     Fairness Lens

These are summarized in the image:

ML Design Patterns

These are also summarized in the second half of Common Patterns.docx. The bolded patterns are the patterns we will cover in class. The number in brackets shows the popularity rank of a particular pattern. The patterns in red were covered last year but due to declining popularity will not be covered in this year's class. The patterns in green will be covered for the first time this year due to increasing popularity. Note that we cover the 16 most popular machine learning design patterns.

Becoming More Popular

1. Responsible AI

There is growing pressure for fairness, transparency, interpretability, privacy, and ethical AI use. Organizations must meet internal governance and external regulations such as GDPR and the EU AI Act.
Implication: Patterns like "fairness lens", "explainable predictions", and "heuristic benchmark" are gaining attention.

2. Resilience

As ML systems move from research to production (MLOps), robustness, monitoring, drift detection, and reliability have become critical. Systems must operate reliably at scale in changing environments.
Implication: Patterns such as "continued model evaluation", "stateless serving function", and "two-phase predictions" are becoming more common.

3. Reproducibility

With complex models and collaborative teams, it's vital to reproduce experiments and results. This is also important for audits and compliance.
Implication: Patterns like "workflow pipeline", "model versioning", and "feature store" are increasingly standard in production systems.

Becoming Less Popular or Plateauing

1. Data Representation

Foundational techniques such as embeddings, feature crosses, and hashing are well established. The focus is shifting from creating representations to managing operational and ethical challenges.

2. Problem Representation

Problem formulation patterns (classification, regression, ranking) are mature and standardized. There is less emphasis on new approaches in this area compared to production and governance patterns.

3. Patterns That Modify Model Training

While still useful (transfer learning, hyperparameter tuning, distributed training), most of these are handled by existing frameworks. Focus is moving toward deployment and lifecycle management.

Reasons for These Shifts

The reasons behind these shifts are summarized in the Key Reasons column of the table below.

Summary Table

Category                Trend       Key Reasons
Data Representation     Plateauing  Foundational techniques are mature
Problem Representation  Plateauing  Standard formulations widely known
Modify Model Training   Stable      Frameworks automate common techniques
Resilience              Increasing  Focus on reliability, drift detection, and MLOps
Reproducibility         Increasing  Auditability and versioning demands
Responsible AI          Increasing  Ethical and regulatory requirements

Machine Learning Lectures - Stanford University

For a full course on machine learning, see the playlist Stanford CS229: Machine Learning Full Course taught by Andrew Ng, Autumn 2018. Of interest to our study of machine learning design patterns is the second lecture on linear regression and gradient descent. See Stanford CS229: Machine Learning - Linear Regression and Gradient Descent.

For a shorter course on machine learning, see the playlist Machine Learning Specialization by Andrew Ng. For shorter videos on training data, see the following videos on supervised learning: #4 Machine Learning Specialization and #5 Machine Learning Specialization. See also the following videos on unsupervised learning: #6 Machine Learning Specialization and #7 Machine Learning Specialization.

The Embeddings Design Pattern

The Rationale

The rationale for the embeddings design pattern is to represent high-cardinality categorical or discrete features in a lower-dimensional continuous vector space. Embeddings are learned representations that capture meaningful relationships and semantic information between the categories or entities present in the data.

The embeddings pattern translates many categories (words, users, products) into small, meaningful numeric vectors that a model can use to make predictions. There are two main parts:

  1. Embedding Layer – does the translating.
  2. Model – uses those translations to make predictions.

The UML

Here is a very rough UML diagram for the embeddings pattern:

  +----------------------+          +----------------------------------+
  |    EmbeddingLayer    |--------<>|              Model               |
  +----------------------+          +----------------------------------+
  | - inputDim: int      |          | - embeddingLayer: EmbeddingLayer |
  | - embeddingDim: int  |          +----------------------------------+
  +----------------------+          | + predict()                      |
  | + getEmbedding()     |          +----------------------------------+
  +----------------------+

1) Embedding Layer (the translator)

Turns each category ID into a short numeric vector. Example: instead of “word #37,” the layer outputs something like [0.2, -0.7, 0.5, ...].

2) Model (the learner)

Takes those vectors as input and learns to make predictions from them.

Simplified Analogy. Teaching a computer about movies:
the Embedding Layer converts each movie into a short numeric description;
the Model uses those descriptions to predict which movie a user will like next.

It may also contain the following components:

3) Training Data Class: The training data class holds the data the model learns from: the input categorical features, the target variables (the correct answers), and any other data relevant to training and adjusting the embeddings. It serves as the input to the model during the training phase.

4) Inference Data Class: The inference data class holds new input data used after training so the model can generate embeddings and make predictions. It may contain the categorical features for which embeddings are generated, as well as any additional input data required for prediction.
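
To make the pattern concrete before the full examples below, here is a minimal C++ sketch of the two core classes, assuming a plain lookup-table embedding with randomly initialized vectors. The class and member names mirror the UML above; the initialization and the toy predict() are illustrative only (real embedding values are learned during training):

  #include <cstdlib>
  #include <vector>

  // Lookup-table embedding: maps a category id to a dense vector.
  class EmbeddingLayer {
  public:
      EmbeddingLayer(int inputDim, int embeddingDim)
          : table(inputDim, std::vector<float>(embeddingDim)) {
          // Random initialization; in practice these values are learned.
          for (auto& row : table)
              for (auto& v : row)
                  v = static_cast<float>(std::rand()) / RAND_MAX - 0.5f;
      }
      const std::vector<float>& getEmbedding(int id) const { return table[id]; }
  private:
      std::vector<std::vector<float>> table;  // inputDim rows x embeddingDim columns
  };

  // Model that consumes the embedding of an input to produce a score.
  class Model {
  public:
      explicit Model(EmbeddingLayer& layer) : embeddingLayer(layer) {}
      float predict(int id) const {
          // Toy prediction: sum the embedding components.
          float score = 0.0f;
          for (float v : embeddingLayer.getEmbedding(id)) score += v;
          return score;
      }
  private:
      EmbeddingLayer& embeddingLayer;
  };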

Code Example - Embeddings Data Pattern

The following is a simple example of the embeddings data pattern:
C++: Embedding.cpp.
C#: Embeddings.cs.
Java: Embeddings.java.
Python: Embeddings.py.

Common Usage

The following are some common usages of the embeddings pattern:

  1. Natural Language Processing (NLP): Embeddings are widely used in NLP tasks such as text classification, sentiment analysis, named entity recognition, machine translation, and document similarity. Word embeddings, such as Word2Vec, GloVe, and fastText, capture semantic relationships between words and are used to represent textual data in a dense vector space.
  2. Recommender Systems: Embeddings play a crucial role in recommender systems to capture user preferences and item characteristics. User embeddings and item embeddings are learned from historical user-item interactions and used to generate personalized recommendations. Embeddings enable the system to find similar users or items based on their embedding vectors.
  3. Image and Video Processing: In computer vision tasks, embeddings are used to represent images and videos. Techniques like convolutional neural networks (CNNs) are used to learn image embeddings, which can be used for image classification, object detection, image retrieval, and more. Video embeddings can capture temporal information and are useful for tasks like action recognition and video summarization.
  4. Anomaly Detection: Embeddings can be used to detect anomalies in data. By learning embeddings that capture normal patterns, deviations from the normal behavior can be identified as anomalies. This approach is commonly used in fraud detection, network intrusion detection, and outlier detection.
  5. Knowledge Graphs: Embeddings are employed in knowledge graph applications to represent entities and relationships in a graph structure. Graph embeddings enable efficient similarity calculations and can be used for tasks like entity linking, link prediction, and graph-based recommendation systems.
  6. Sequence Modeling: Embeddings are utilized in sequence modeling tasks, such as natural language generation, machine translation, and speech recognition. Sequence embeddings capture dependencies and context in sequential data, enabling the model to understand and generate meaningful sequences.

Code Problem - Movie Recommendations

We want to implement a system that recommends movies to a user based on a list of watched movies. We need an EmbeddingLayer class responsible for generating and retrieving embeddings, a Movie class to represent a movie with an ID and a title, and a RecommenderSystem class that calls a recommendMovie function for a specific user, passing their ID and the list of movies they've already watched. The recommendMovie function takes a user ID and a list of watched movies and recommends a movie based on the user's embeddings and a similarity metric. The code is seen below, followed by a sketch of the similarity step.
Movie.h,
EmbeddingLayer.h,
RecommenderSystem.h,
MovieMain.cpp.
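
The core of recommendMovie is comparing embedding vectors with a similarity metric. Here is a hedged sketch of that step, assuming cosine similarity; the function name is illustrative and need not match the linked headers:

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Cosine similarity between two embedding vectors: 1 means identical
  // direction, 0 means unrelated, -1 means opposite.
  float cosineSimilarity(const std::vector<float>& a,
                         const std::vector<float>& b) {
      float dot = 0.0f, normA = 0.0f, normB = 0.0f;
      for (std::size_t i = 0; i < a.size(); ++i) {
          dot   += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
      }
      return dot / (std::sqrt(normA) * std::sqrt(normB) + 1e-8f);
  }

A typical recommender averages the embeddings of the watched movies into a user profile vector and recommends the unwatched movie whose embedding is most similar to that profile.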

Code Problem - Predicting Financial Data

The following program uses historical prices and a vector of learned weights to predict a stock price for a given day. The prediction is the dot product of the two vectors (historical prices and weights); a minimal sketch of that computation follows the file list.
VectorOperations.h, vector dot product
FinancialData.h,
StockPredictionModel.h, contains the embedded data
FinancialDataMain.cpp.
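
A minimal sketch of the dot-product prediction (illustrative, not the exact contents of the linked headers):

  #include <cstddef>
  #include <vector>

  // Predicted price = dot product of historical prices and learned weights.
  float predictPrice(const std::vector<float>& prices,
                     const std::vector<float>& weights) {
      float prediction = 0.0f;
      for (std::size_t i = 0; i < prices.size(); ++i)
          prediction += prices[i] * weights[i];
      return prediction;
  }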

The Rebalancing Design Pattern

The Rationale

The rebalancing machine learning design pattern, also known as class rebalancing or data rebalancing, is employed to address class imbalance in datasets. Class imbalance refers to a situation where the number of samples in the different classes of a classification problem is significantly skewed, with one class having far more instances than the others.
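
One lightweight form of rebalancing mentioned in the introduction is a weighted loss function, where each class c is weighted by N / (K * n_c), with N the total number of samples, K the number of classes, and n_c the count of class c. A minimal sketch of that computation, under those assumptions:

  #include <cstddef>
  #include <vector>

  // Inverse-frequency class weights: rare classes receive larger weights,
  // so their misclassifications count more in the loss.
  // Assumes every class has at least one sample.
  std::vector<float> classWeights(const std::vector<int>& counts) {
      int total = 0;
      for (int c : counts) total += c;
      std::vector<float> weights(counts.size());
      for (std::size_t i = 0; i < counts.size(); ++i)
          weights[i] = static_cast<float>(total) /
                       (counts.size() * counts[i]);
      return weights;
  }

For example, class counts of {990, 10} yield weights of roughly {0.51, 50.0}, so each minority-class error costs about a hundred times more than a majority-class error.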

The UML Diagram

Here is a simple UML diagram for the rebalancing design pattern:

  +----------------+               +-----------------+
  |    Dataset     |<>------------>|   Rebalancer    |
  +----------------+               +-----------------+
  | - data         |               | + rebalance()   |
  | - labels       |               | + get_data()    |
  | - num_classes  |               | + get_labels()  |
  +----------------+               +-----------------+
          ^
          | uses
  +----------------+
  |   BaseModel    |
  +----------------+
  | + train()      |
  | + predict()    |
  | + evaluate()   |
  +----------------+
          ^
          | inherits
  +-------------------+
  |  RebalancedModel  |
  +-------------------+
  | - rebalancer      |
  | + train()         |
  | + predict()       |
  | + evaluate()      |
  +-------------------+


Here are the components of the rebalancing design pattern:
  1. The Dataset class represents the original dataset with its associated features (data) and labels (labels). It also maintains information about the number of classes in the dataset (num_classes).
  2. The Rebalancer class is responsible for rebalancing the dataset. It contains methods such as rebalance() to perform the rebalancing operation, and get_data() and get_labels() to retrieve the rebalanced data and labels, respectively.
  3. The BaseModel class represents the base machine learning model that can be trained, used for prediction, and evaluated. It encapsulates common functionalities like train(), predict(), and evaluate().
  4. The RebalancedModel class extends the BaseModel class and introduces a rebalancer object, which is an instance of the Rebalancer class. It utilizes the rebalanced data and labels obtained from the rebalancer during the training, prediction, and evaluation processes.
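
As a concrete illustration of the Rebalancer component, here is a minimal sketch that rebalances by duplicating minority-class samples until every class matches the majority count (simple oversampling). The interface mirrors the UML above; the implementation is illustrative, not the linked code:

  #include <algorithm>
  #include <cstddef>
  #include <map>
  #include <vector>

  class Rebalancer {
  public:
      void rebalance(const std::vector<std::vector<float>>& data,
                     const std::vector<int>& labels) {
          // Group sample indices by class label.
          std::map<int, std::vector<std::size_t>> byClass;
          for (std::size_t i = 0; i < labels.size(); ++i)
              byClass[labels[i]].push_back(i);
          // The majority class sets the target size for every class.
          std::size_t target = 0;
          for (const auto& kv : byClass)
              target = std::max(target, kv.second.size());
          // Rebuild the dataset, cycling through each class's samples
          // until the target count is reached.
          data_.clear();
          labels_.clear();
          for (const auto& kv : byClass) {
              for (std::size_t n = 0; n < target; ++n) {
                  std::size_t idx = kv.second[n % kv.second.size()];
                  data_.push_back(data[idx]);
                  labels_.push_back(kv.first);
              }
          }
      }
      const std::vector<std::vector<float>>& get_data() const { return data_; }
      const std::vector<int>& get_labels() const { return labels_; }
  private:
      std::vector<std::vector<float>> data_;
      std::vector<int> labels_;
  };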

Code Example - Rebalancing design pattern

The following is a simple code example of the rebalancing design pattern:
C++: Rebalancing.cpp.
C#: Rebalancing.cs.
Java: Rebalancing.java.
Python: Rebalancing.py.

Common Usage

The rebalancing design pattern is commonly used in various domains within the software industry where dealing with imbalanced datasets is a challenge. The following are some common usages of the rebalancing design pattern:

  1. Fraud detection: In fraud detection systems, the number of fraudulent instances is typically significantly lower than the number of non-fraudulent instances. Rebalancing techniques can be applied to ensure that the model is trained on a balanced dataset, improving the accuracy of fraud detection.
  2. Medical diagnosis: Medical datasets often suffer from class imbalance, where the number of instances belonging to certain rare medical conditions is much smaller than others. Rebalancing the dataset can help prevent the model from being biased towards the majority class and improve the accuracy of diagnosis for rare conditions.
  3. Anomaly detection: Anomaly detection involves identifying rare events or outliers in a dataset. Rebalancing can be useful in scenarios where the anomalies are significantly underrepresented compared to normal instances. By rebalancing the dataset, the model can be trained to better detect and classify anomalies.
  4. Credit risk assessment: When evaluating credit risk, the occurrence of default events is usually low compared to non-default events. By rebalancing the dataset, credit risk models can be trained to account for the imbalanced nature of default instances, leading to more accurate risk assessment.

Code Problem - SMOTE rebalancer (simple)

The rebalancing design pattern involves adjusting the class distribution in the training data to handle imbalanced datasets. Here is an example in C++ that demonstrates the pattern using the Synthetic Minority Over-sampling Technique (SMOTE). In a real-world scenario, you would integrate a more sophisticated SMOTE implementation or other methods to effectively rebalance the data before training your machine learning models.

The code is seen below:
DataSample.h,
Rebalancer.h,
RebalanceMain.cpp.

Code Problem - SMOTE algorithm

The following code contains pseudocode for the SMOTE algorithm:
Sample.h,
Smote.h,
SmoteMain.cpp.
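
The heart of SMOTE is a single interpolation step: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. A hedged sketch of just that step (the linked pseudocode adds the neighbor search and bookkeeping around it):

  #include <cstddef>
  #include <cstdlib>
  #include <vector>

  // One synthetic sample between x and a minority-class neighbor:
  // synthetic = x + gap * (neighbor - x), with gap drawn from [0, 1].
  std::vector<float> smoteSample(const std::vector<float>& x,
                                 const std::vector<float>& neighbor) {
      float gap = static_cast<float>(std::rand()) / RAND_MAX;
      std::vector<float> synthetic(x.size());
      for (std::size_t i = 0; i < x.size(); ++i)
          synthetic[i] = x[i] + gap * (neighbor[i] - x[i]);
      return synthetic;
  }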

Code Problem - SMOTE rebalancer (complex)

As above, the Rebalancer class uses the SMOTE algorithm to generate synthetic samples for the minority classes in the dataset. The rebalanced dataset is then used to train the base model (DecisionTreeModel) using the RebalancedModel class. The rebalanced model can then be used for predictions.

The code is seen below:
Dataset.h,
Rebalancer.h,
Rebalancer.cpp,
BaseModel.h,
DecisionTreeModel.h,
RebalancedModel.h,
RebalPredictor.cpp.