Week 12 - Reproducibility Patterns: Workflow Pipeline Patterns, Feature Store and Model Versioning

Lecture recording here.

Lab recording here.

Introduction

We look at three more reproducibility patterns this week: the workflow pipeline pattern, the feature store design pattern, and the model versioning pattern. The workflow pipeline pattern makes each step of the workflow a separate, containerized service; these services can be chained together into a pipeline that can be run with a single call. The feature store design pattern ensures that features used by machine learning models can be consistently and accurately reproduced, which is essential for maintaining the reliability and consistency of ML models over time. The model versioning pattern deploys a changed model as a microservice with a different endpoint, preserving backward compatibility for deployed models.

Videos

The Workflow Pipeline Pattern: ML Design Patterns (2:19-4:53)
The Feature Store Pattern: What is a Feature Store for Machine Learning?
Machine Learning Design Patterns (11:15-14:35)
The Model Versioning Pattern: Machine Learning Design Patterns | Dr Ebin Deni Raj (1:32:30-1:44:05)

Assignment(s)

Assignment 6 - Investigation of Design Patterns for Machine Learning

The Workflow Pipeline Design Pattern

The Rationale

The Workflow Pipeline design pattern is a software design concept that streamlines the processing of a series of sequential steps or tasks in a system. Many systems, particularly those involving data processing or batch jobs, execute a series of tasks that must be performed in a specific order. The Workflow Pipeline pattern provides a structured and organized way to define, manage, and execute these sequential tasks, ensuring they run in a predefined order and allowing for efficient, controlled execution of complex workflows.

The UML

Here is a UML diagram for the workflow pipeline pattern:
The Workflow Pipeline Pattern
The Workflow Pipeline design pattern typically consists of the following components:

  1. Steps or Tasks: Steps represent the individual units of work or tasks in the workflow. Each step performs a specific operation or process on the input data and produces an output. Examples of steps can include data preprocessing, feature extraction, transformation, analysis, or any other operation necessary for the workflow.
  2. Input and Output: Each step takes input data from the previous step or an external source and produces output data that is passed to the next step. The input and output can take various forms, such as objects, data structures, files, or messages, depending on the requirements of the workflow.
  3. Pipeline Orchestration: The pipeline orchestration component manages the coordination and sequencing of the steps in the workflow. It ensures that the steps are executed in the desired order, passing the appropriate data between them. The orchestration component can be implemented using various mechanisms, such as function calls, method chaining, configuration files, or specialized workflow management systems.
  4. Control Flow: The control flow component defines the flow of execution between the steps in the workflow. It determines the conditions or rules for progressing from one step to the next. For example, a control flow mechanism might include conditional branching, looping, error handling, or parallel execution, depending on the requirements of the workflow.
  5. Error Handling and Exception Handling: The workflow pipeline should incorporate error handling and exception handling mechanisms to ensure robustness and fault tolerance. This includes handling errors or exceptions that may occur during step execution, such as invalid input data, resource unavailability, or unexpected failures. Proper error handling ensures the workflow can gracefully handle exceptions and recover from failures.
  6. Logging and Monitoring: Logging and monitoring components help track the execution of the workflow pipeline. They provide visibility into the progress of each step, capture relevant metrics or statistics, and allow for debugging and performance analysis. Logging and monitoring facilitate troubleshooting, performance optimization, and overall system health monitoring.
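As a rough illustration of how these components fit together, here is a minimal Python sketch (the class and step names are invented for illustration, and the "work" in each step is trivial) showing steps, sequential orchestration, error handling, and logging:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

class Step:
    """Base class: each step transforms its input and returns an output."""
    def run(self, data):
        raise NotImplementedError

class Clean(Step):
    def run(self, data):
        return [x for x in data if x is not None]

class Scale(Step):
    def run(self, data):
        top = max(data)
        return [x / top for x in data]

class Pipeline:
    """Orchestration: executes steps in order, passing each output onward."""
    def __init__(self, steps):
        self.steps = steps

    def run(self, data):
        for step in self.steps:
            name = type(step).__name__
            try:
                data = step.run(data)  # control flow: strictly sequential
                log.info("%s ok -> %s", name, data)
            except Exception:
                log.exception("%s failed; aborting pipeline", name)
                raise                  # error handling: fail fast
        return data

result = Pipeline([Clean(), Scale()]).run([2, None, 4, 8])
print(result)  # [0.25, 0.5, 1.0]
```

A real pipeline would replace the fail-fast `raise` with retries or compensating actions where the workflow allows it.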

Code Example - Workflow Pipeline

Below is a simple code example with the workflow pipeline pattern:
C++: WorkflowPipeline.cpp.
C#: WorkflowPipelineMain.cs.
Java: WorkflowPipelineMain.java.
Python: WorkflowPipeline.py.

Common Usage

The Workflow Pipeline design pattern, also known as the Pipeline pattern or the Pipes and Filters pattern, is a popular design pattern in the software industry. It is commonly used in various domains and scenarios to process and transform data in a series of sequential steps or stages. Here are some common usages of the Workflow Pipeline design pattern:

  1. Data Processing and Transformation: Workflow pipelines are extensively used for data processing tasks such as data ingestion, data cleaning, data transformation, and data enrichment. Each stage of the pipeline represents a specific processing step, and the data flows through the pipeline, undergoing various transformations and manipulations.
  2. ETL (Extract, Transform, Load) Processes: ETL processes involve extracting data from various sources, transforming it into a desired format or structure, and loading it into a target system. Workflow pipelines provide a structured approach to defining and executing these ETL processes, where each stage of the pipeline represents a specific operation or transformation on the data.
  3. Batch Processing: Workflow pipelines are commonly used in batch processing scenarios, where large volumes of data need to be processed in a systematic manner. The pipeline stages can include tasks like data validation, filtering, sorting, aggregation, and generating reports. The pipeline enables efficient and scalable batch processing by breaking down the overall task into smaller, manageable steps.
  4. Data Integration and Orchestration: Workflow pipelines are used to integrate and orchestrate various systems and services. For example, in a service-oriented architecture or microservices environment, a workflow pipeline can coordinate the execution of multiple services, each performing a specific task, to achieve an overall business process.
  5. Continuous Integration and Delivery (CI/CD) Pipelines: In software development and DevOps practices, workflow pipelines are employed for CI/CD processes. Each stage of the pipeline represents a specific step in the software development lifecycle, such as code compilation, testing, code quality checks, deployment, and monitoring. CI/CD pipelines automate the software delivery process, ensuring consistency, reliability, and efficiency.
  6. Image and Signal Processing: Workflow pipelines are used in domains like image processing and signal processing. Each stage of the pipeline can perform operations like noise reduction, filtering, feature extraction, classification, and visualization, enabling complex data analysis and manipulation.
  7. Data Streaming and Real-time Processing: Workflow pipelines can be adapted for streaming data processing scenarios, where data is processed in real-time or near real-time. Each stage of the pipeline can perform operations on streaming data, such as filtering, aggregation, pattern matching, and anomaly detection.

Code Problem - Data Pipeline

In this example, we have a base class Step representing the individual steps in the workflow pipeline. We have concrete implementations of the steps:
DataPreparationStep, FeatureExtractionStep, ModelTrainingStep, and PredictionStep. Each step implements the execute method to perform its specific tasks.

The WorkflowPipeline class represents the pipeline itself. It has a collection of steps and provides methods to add steps to the pipeline and execute the pipeline.

In the main function, we create an instance of the WorkflowPipeline and instances of the concrete steps. We add the steps to the pipeline using the addStep method. Finally, we execute the pipeline using the execute method.
Step.h,
DataPreparationStep.h,
FeatureExtractionStep.h,
ModelTrainingStep.h,
PredictionStep.h,
WorkflowPipeline.h,
DataPipeline.cpp.
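The linked C++ files are not reproduced here, but the structure described above can be sketched in Python (the class and method names follow the description; the work done in each step is a trivial stand-in):

```python
class Step:
    def execute(self, data):
        raise NotImplementedError

class DataPreparationStep(Step):
    def execute(self, data):
        return sorted(data)                     # stand-in for real cleaning

class FeatureExtractionStep(Step):
    def execute(self, data):
        return {"mean": sum(data) / len(data)}  # stand-in feature

class ModelTrainingStep(Step):
    def execute(self, features):
        return {"model": "trained", **features}

class PredictionStep(Step):
    def execute(self, model):
        return model["mean"] * 2                # stand-in prediction

class WorkflowPipeline:
    def __init__(self):
        self.steps = []

    def addStep(self, step):
        self.steps.append(step)

    def execute(self, data):
        for step in self.steps:
            data = step.execute(data)
        return data

pipeline = WorkflowPipeline()
for step in (DataPreparationStep(), FeatureExtractionStep(),
             ModelTrainingStep(), PredictionStep()):
    pipeline.addStep(step)
result = pipeline.execute([3, 1, 2])
print(result)  # 4.0
```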

Code Problem - Image Pipeline

The image pipeline loads an image, pre-processes the image, extracts features from the image, trains a model, evaluates the model, and deploys the model. The code can be seen below.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
FeatureExtractor.h,
ModelTrainer.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
ImagePipelineMain.cpp.

Code Problem - Second Image Pipeline

This image pipeline is similar to the above. It loads an image, preprocesses the image, extracts CNN (convolutional neural network) features from the image, trains a CNN model, applies transfer learning, tunes hyperparameters, evaluates the model, and deploys the model.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
CNNFeatureExtractor.h,
CNNModelTrainer.h,
TransferLearning.h,
HyperparameterTuner.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
Image2PipelineMain.cpp.

The Feature Store Design Pattern

The Feature Store design pattern simplifies the management and reuse of features across projects by decoupling the feature creation process from the development of models using those features.

The Rationale

Good feature engineering is crucial for the success of many machine learning solutions. However, it is also one of the most time-consuming parts of model development. Some features require significant domain knowledge to calculate correctly, and changes in the business strategy can affect how a feature should be computed. To ensure such features are computed in a consistent way, it's better for these features to be under the control of domain experts rather than ML engineers.

The UML

The following is a basic UML diagram of the feature store design pattern.

  +-----------------------------------------------+
  |              Raw Data Source                  |
  |  - Collect data from various sources          |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |           Data Ingestion Layer                |
  |  - Ingest and store raw data                  |
  |  - Ensure immutability                        |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |       Data Transformation Engine              |
  |  - Apply transformations to data              |
  |  - Ensure transformations are deterministic   |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |            Feature Store Layer                |
  |  - Store transformed features                 |
  |  - Version control for features               |
  |  - Ensure feature immutability                |
  |  - Track feature lineage                      |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |  Feature Serving Interface                    |
  |  - Provide features for training and inference|
  |  - Consistent access to feature versions      |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |    Machine Learning Model Training            |
  |  - Retrieve versioned features                |
  |  - Ensure reproducible training process       |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |       Model Serving and Deployment            |
  |  - Deploy trained models                      |
  |  - Use versioned features for prediction      |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |     Monitoring and Feedback Loop              |
  |  - Monitor model performance                  |
  |  - Collect feedback for feature improvement   |
  +-----------------------------------------------+
Raw Data Source: Collects data from various sources.
Data Ingestion Layer: Ingests raw data and ensures its immutability.
Data Transformation Engine: Applies deterministic transformations to data. Ensures the transformations are reproducible.
Feature Store Layer: Stores transformed features. Ensures features are versioned and immutable. Tracks feature lineage for traceability.
Feature Serving Interface: Provides features for training and inference. Ensures consistent access to feature versions.
Machine Learning Model Training: Retrieves versioned features for training. Ensures the training process is reproducible.
Model Serving and Deployment: Deploys trained models. Uses versioned features for making predictions.
Monitoring and Feedback Loop: Monitors model performance. Collects feedback to continuously improve features and model performance.
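The store and serving layers above can be illustrated with a toy Python sketch (the class, method, and feature names are invented, and the in-memory dictionary stands in for real persistent, lineage-tracked storage):

```python
class FeatureStore:
    """Stores immutable, versioned features and serves them consistently."""
    def __init__(self):
        self._store = {}   # (name, version) -> feature values
        self._latest = {}  # name -> latest version number

    def put(self, name, values):
        version = self._latest.get(name, 0) + 1
        self._store[(name, version)] = tuple(values)  # tuple: immutable
        self._latest[name] = version
        return version

    def get(self, name, version=None):
        """Serving interface: training and inference request the same version."""
        if version is None:
            version = self._latest[name]
        return self._store[(name, version)]

store = FeatureStore()
v1 = store.put("age_bucket", [1, 2, 2, 3])
v2 = store.put("age_bucket", [1, 2, 3, 3])          # recomputed: new version
assert store.get("age_bucket", v1) == (1, 2, 2, 3)  # old version still intact
print(store.get("age_bucket"))  # latest: (1, 2, 3, 3)
```

Because every `put` creates a new version rather than overwriting, a training run pinned to a version can be reproduced exactly later.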

Code Example - Feature Store

Below is a simple example of using the feature store design pattern for a machine learning workflow. In this example, we'll create a basic feature store to manage and retrieve features for training and inference.
C++: FeatureStore.cpp.
C#: FeatureStore.cs.
Java: FeatureStore.java, FeatureStoreMain.java.
Python: FeatureStore.py.

Common Usage

The feature store design pattern offers a centralized repository for storing, sharing, and managing features. Here are some common usages:

  1. Centralized Feature Management
    Single Source of Truth: Acts as a centralized repository where features are stored, ensuring consistency and reliability across different teams and projects. Versioning and Lineage: Tracks the history and lineage of features, enabling reproducibility and traceability of feature engineering processes.
  2. Feature Sharing and Reusability
    Cross-Team Collaboration: Facilitates sharing of features across different teams, promoting reuse and reducing duplication of effort. Standardization: Ensures that features are standardized and comply with organizational guidelines, improving the quality and consistency of features used in models.
  3. Online and Offline Feature Serving
    Real-Time Feature Serving: Provides low-latency access to features for online model inference, enabling real-time predictions. Batch Feature Serving: Supplies features for offline model training and batch inference, ensuring that the same features are used in both training and production.
  4. Efficient Feature Engineering
    Pre-Computed Features: Stores pre-computed features that can be quickly retrieved and used, reducing the computational overhead during model training and inference. Feature Transformation Pipelines: Allows complex feature transformations to be defined, stored, and reused, streamlining the feature engineering process.
  5. Consistency and Synchronization
    Training-Serving Skew Mitigation: Ensures that the features used during model training are identical to those used during inference, preventing discrepancies that can degrade model performance. Temporal Consistency: Manages time-sensitive features, ensuring that features are consistently aligned with the correct time windows.
  6. Feature Monitoring and Governance
    Quality Monitoring: Continuously monitors the quality and distribution of features to detect data drift and anomalies. Access Control and Auditing: Implements access control and auditing to manage who can access and modify features, ensuring compliance with security and privacy policies.
  7. Scalability and Performance Optimization
    Scalable Storage: Utilizes scalable storage solutions to handle large volumes of feature data efficiently. Optimized Retrieval: Implements indexing and caching mechanisms to optimize the retrieval of features, improving the performance of model training and inference.
  8. Integration with ML Pipelines
    Seamless Integration: Integrates with existing ML pipelines and tools, providing a seamless workflow for data scientists and engineers. Automated Feature Extraction: Automates the extraction and computation of features from raw data, reducing the manual effort required in feature engineering.
  9. Experimentation and A/B Testing
    Consistent Features for Experiments: Ensures that the same features are used across different experimental setups and A/B tests, providing reliable and comparable results. Feature Experimentation: Allows data scientists to experiment with new features and quickly evaluate their impact on model performance.
  10. Compliance and Data Privacy
    Data Masking and Anonymization: Implements data masking and anonymization techniques to protect sensitive information in features. Regulatory Compliance: Ensures that feature storage and usage comply with relevant regulations and standards, such as GDPR or CCPA.

Code Problem - Predictions based on Features

The following code focuses on creating a feature store, adding features, and using these features for predictions: MultiFeature.cpp.
The FeatureStore class manages feature generators.
The addFeature() method adds a feature generator.
The getFeatures() method retrieves features for a single data point.
The getBatchFeatures() method retrieves features for a batch of data points.
The following example feature generators compute features from a data point: sumFeature(), meanFeature(), and maxFeature().
The main function defines a batch of data points, initializes the feature store and adds feature generators, retrieves features for the data batch, defines model coefficients and intercept, initializes the linear regression model and sets feature indexes, and performs batch predictions and outputs the results.
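The same idea can be sketched in Python (the C++ code is linked above; the function names here follow the description, while the linear-model coefficients and intercept are made-up values for illustration):

```python
class FeatureStore:
    def __init__(self):
        self.generators = []

    def addFeature(self, generator):
        self.generators.append(generator)

    def getFeatures(self, point):
        return [g(point) for g in self.generators]

    def getBatchFeatures(self, batch):
        return [self.getFeatures(p) for p in batch]

def sumFeature(point):  return sum(point)
def meanFeature(point): return sum(point) / len(point)
def maxFeature(point):  return max(point)

store = FeatureStore()
for gen in (sumFeature, meanFeature, maxFeature):
    store.addFeature(gen)

batch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
features = store.getBatchFeatures(batch)

# Toy linear regression: prediction = intercept + sum(coef * feature)
coefficients = [0.5, 1.0, 0.25]
intercept = 1.0
predictions = [intercept + sum(c * f for c, f in zip(coefficients, row))
               for row in features]
print(predictions)  # [6.75, 15.0]
```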

The Model Versioning Design Pattern

The Rationale

The Model Versioning Pattern is a software design concept that focuses on managing and evolving machine learning models over time. Machine learning models are typically developed through an iterative process of training, evaluation, and refinement. As new data becomes available or new insights are gained, models need to be updated and improved. The Model Versioning Pattern enables the management of different versions of the model, allowing for easy tracking, comparison, and deployment of new iterations.

The UML

Here is a UML diagram for the model versioning pattern:

  +------------------+
  |   ModelVersion   |
  +------------------+
  | - Version Number |
  | - Model          |
  | - Training Data  |
  +------------------+
  | + train()        |
  | + predict()      |
  +------------------+
In this UML diagram, the Model Versioning Design Pattern consists of a single component:
  1. ModelVersion: The ModelVersion component represents a specific version of a machine learning model. It encapsulates the version number, the trained model, and the associated training data used to train the model.
The ModelVersion component can be instantiated multiple times to represent different versions of the model. Each instance of ModelVersion corresponds to a specific trained model with its own version number and associated data. For more information, see [WIP] Data model versioning pattern.
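The component in the diagram can be sketched in Python (the "model" here is just a stored mean, an invented stand-in for a real trained model):

```python
class ModelVersion:
    """One trained model, its version number, and the data it was trained on."""
    def __init__(self, version_number, training_data):
        self.version_number = version_number
        self.training_data = training_data
        self.model = None

    def train(self):
        # Stand-in for real training: remember the mean of the training data.
        self.model = sum(self.training_data) / len(self.training_data)

    def predict(self, x):
        return x + self.model

v1 = ModelVersion(1, [1.0, 2.0, 3.0])
v2 = ModelVersion(2, [1.0, 2.0, 3.0, 10.0])  # retrained on new data
v1.train(); v2.train()
print(v1.predict(0.0), v2.predict(0.0))  # each version predicts independently
```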

Code Example - Model Versioning

Below is a simple code example with the model versioning pattern:
C++: ModelVersioning.cpp.
C#: ModelVersioning.cs.
Java: ModelVersioning.java.
Python: ModelVersioning.py.

Common Usage

The Model Versioning pattern is commonly used in the software industry to manage and control the deployment, usage, and evolution of machine learning models and other types of models. Here are some common usages of the Model Versioning pattern:

  1. A/B Testing: Model versioning is often used in A/B testing scenarios, where different versions of a model are deployed simultaneously to measure their performance and compare their effectiveness. By maintaining multiple model versions and switching between them for different users or segments, organizations can evaluate the impact of model changes on key metrics and make informed decisions about model selection and improvements.
  2. Continuous Deployment and Rollback: Model versioning allows for continuous deployment and rollback of models in production environments. By assigning version numbers to models, organizations can easily switch between different versions, deploy new models, and roll back to previous versions if issues or regressions are detected. This enables iterative development and rapid experimentation with models while maintaining the ability to revert to a stable state if necessary.
  3. Model Governance and Auditing: Model versioning helps establish a governance framework for managing models throughout their lifecycle. Organizations can track and document changes to models, record performance metrics for different versions, and maintain an audit trail of model deployments. This supports compliance requirements, enables reproducibility, and facilitates model monitoring and analysis.
  4. Model Ensembling and Stacking: Model versioning is utilized in ensemble modeling techniques, where multiple models are combined to improve prediction accuracy. By maintaining different versions of models with varying architectures, hyperparameters, or training data, organizations can create diverse ensembles that leverage the strengths of different models and enhance overall predictive power.
  5. Model Retraining and Evolution: Model versioning enables the management of model evolution over time. By incrementing version numbers and maintaining a history of model versions, organizations can iterate on models, retrain them with updated data, fine-tune hyperparameters, and introduce new features or algorithms. This allows models to adapt to changing business requirements and improve their performance and relevance.
  6. Model Serving and Deployment: Model versioning facilitates the deployment and serving of models in production environments. Organizations can maintain different versions of models concurrently, manage their deployment configurations, and handle traffic routing to specific versions. This enables seamless updates and can prevent disruption to critical systems relying on model predictions.
  7. Collaboration and Experimentation: Model versioning supports collaborative model development and experimentation. Teams can work on different versions of models in parallel, compare their performance, share findings, and collaborate on model improvements. Versioning also allows for controlled experimentation, where different teams or individuals can propose and test modifications to models while maintaining isolation and avoiding interference with production systems.

Code Problem - Improved Model Versioning

In the main function, we create instances of ModelVersion1, ModelVersion2, and ModelVersion3. We register these model versions with the ModelVersionManager by calling the registerModelVersion method. We then demonstrate loading and prediction using specific model versions.

Later in the example, we create improved versions of Model Version 2 and Model Version 3, represented by ModelVersion2Improved and ModelVersion3Improved. We register these improved model versions with the manager, effectively replacing the previous versions.

Finally, we again demonstrate loading and prediction using the improved model versions.
ModelVersion.h,
ModelVersion1.h,
ModelVersion2.h,
ModelVersion3.h,
ModelVersionManager.h,
ModelVersionManager.cpp,
ImprovedModelVersioning.cpp.
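The registration-and-replacement flow described above can be sketched in Python (the class and method names follow the description; the models themselves are trivial stand-ins):

```python
class ModelVersion:
    def __init__(self, name, offset):
        self.name = name
        self.offset = offset  # stand-in for learned parameters

    def load(self):
        print(f"loading {self.name}")

    def predict(self, x):
        return x + self.offset

class ModelVersionManager:
    def __init__(self):
        self.versions = {}

    def registerModelVersion(self, version_id, model):
        # Registering an existing id replaces the old model for that version.
        self.versions[version_id] = model

    def predict(self, version_id, x):
        model = self.versions[version_id]
        model.load()
        return model.predict(x)

manager = ModelVersionManager()
manager.registerModelVersion(2, ModelVersion("v2", 1.0))
first = manager.predict(2, 10.0)                          # 11.0
manager.registerModelVersion(2, ModelVersion("v2-improved", 1.5))
second = manager.predict(2, 10.0)                         # 11.5, same endpoint
```

Callers keep addressing version 2; only the model behind that identifier changes.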

Code Problem - Algorithm Models

The following code represents a scenario where you have multiple algorithms for the same task, and you want to version each algorithm independently. We'll use a factory pattern for creating instances of models, and each model will have its own versioning. Additionally, we'll introduce a model manager to coordinate the training and prediction processes.
MLModel.h,
VersionedMLModel.h,
DecisionTreeModel.h,
RandomForestModel.h,
ModelFactory.h,
ModelManager.h,
AlgorithmModelMain.cpp.
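The factory-based arrangement can be sketched in Python (the class names follow the description; the algorithm behavior is stubbed with simple threshold rules for illustration):

```python
class VersionedMLModel:
    version = "0.0"
    def train(self, data):
        raise NotImplementedError
    def predict(self, x):
        raise NotImplementedError

class DecisionTreeModel(VersionedMLModel):
    version = "1.2"
    def train(self, data):
        self.threshold = sum(data) / len(data)  # stand-in "tree": mean split
    def predict(self, x):
        return "high" if x > self.threshold else "low"

class RandomForestModel(VersionedMLModel):
    version = "2.0"
    def train(self, data):
        self.threshold = sorted(data)[len(data) // 2]  # stand-in: median split
    def predict(self, x):
        return "high" if x > self.threshold else "low"

class ModelFactory:
    registry = {"tree": DecisionTreeModel, "forest": RandomForestModel}
    @classmethod
    def create(cls, kind):
        return cls.registry[kind]()

class ModelManager:
    """Coordinates training and prediction across independently versioned models."""
    def __init__(self, kinds):
        self.models = {k: ModelFactory.create(k) for k in kinds}
    def train_all(self, data):
        for m in self.models.values():
            m.train(data)
    def predict_all(self, x):
        return {k: (m.version, m.predict(x)) for k, m in self.models.items()}

manager = ModelManager(["tree", "forest"])
manager.train_all([1.0, 2.0, 3.0, 10.0])
result = manager.predict_all(3.5)
print(result)  # each algorithm reports its own version alongside its prediction
```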

Standard Design Patterns Applied to Machine Learning (optional)

Below are examples of standard design patterns applied to machine learning problems. These implement the command pattern, the observer pattern, and the strategy pattern.
CommandML.cpp, ObserverML.cpp and StrategyML.cpp.