Week 12 - Reproducibility Patterns: Repeatable Splitting, Workflow Pipeline Patterns and Feature Store

Lecture recording (Nov 26, 2024) here.

Lab recording (Nov 28, 2024) here.

Introduction

We look at three more reproducibility patterns this week, and one next week. This week we cover the repeatable splitting pattern, the workflow pipeline pattern, and the feature store design pattern. The repeatable splitting pattern calls for a lightweight, repeatable method of creating data splits that does not depend on the programming language or random seeds. The workflow pipeline pattern makes each step of the workflow a separate, containerized service that can be chained into a pipeline and run with a single call. The feature store design pattern ensures that the features used by machine learning models can be consistently and accurately reproduced, which is essential for maintaining the reliability and consistency of ML models over time.

Videos

The Workflow Pipeline Pattern: ML Design Patterns (2:19-4:53)
The Feature Store Pattern: What is a Feature Store for Machine Learning?
Machine Learning Design Patterns (11:15-14:35)

Assignment(s)

Assignment 6 - Investigation of Design Patterns for Machine Learning

The Repeatable Splitting Design Pattern

The Rationale

The repeatable splitting design pattern divides a system, application, or dataset into smaller, more manageable components. These components can be split and replicated multiple times to distribute the workload and increase scalability and resilience. Each split component can be processed or analyzed independently, enabling parallel execution and distributed processing. This pattern is particularly useful when dealing with large datasets or when the system needs to handle a high volume of requests: by distributing the workload across multiple components, the overall system can achieve improved performance and throughput.

In machine learning, repeatable splitting captures the way data is divided among training, validation, and test datasets, ensuring that an example used for training is never used for evaluation or testing, even as the dataset grows.
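As a minimal illustration of this machine-learning use of the pattern (not taken from the linked course files), the sketch below assigns each record to a split by hashing a stable key, so the assignment does not change across runs or as new data arrives. The choice of key, the 80/10/10 proportions, and the use of std::hash are assumptions for illustration; a production implementation would use a fixed, language-independent hash such as a fingerprint function, since std::hash is implementation-defined.

  #include <functional>
  #include <iostream>
  #include <string>
  #include <vector>

  // Assign a record to train/validation/test by hashing a stable key field.
  // Hashing the key (rather than drawing random numbers) makes the split
  // repeatable across runs and as the dataset grows. Note: std::hash is
  // implementation-defined; a real system would use a fixed hash function.
  std::string assignSplit(const std::string& key) {
      std::size_t bucket = std::hash<std::string>{}(key) % 10;  // 10 equal buckets
      if (bucket < 8) return "train";        // ~80%
      if (bucket == 8) return "validation";  // ~10%
      return "test";                         // ~10%
  }

  int main() {
      std::vector<std::string> recordKeys = {"2024-01-03", "2024-01-04",
                                             "2024-01-05", "2024-01-06"};
      for (const auto& key : recordKeys)
          std::cout << key << " -> " << assignSplit(key) << '\n';
      return 0;
  }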

The UML

Here is the UML diagram for the repeatable splitting pattern:

  +----------------------------------+
  |          Split Component         |
  +----------------------------------+
  | - Split Logic                    |
  | - Data/Workload                  |
  |                                  |
  | + process(): Result              |
  |                                  |
  +----------------------------------+
                  |
                  |
                  V
  +----------------------------------+
  |          Split Component         |
  +----------------------------------+
  | - Split Logic                    |
  | - Data/Workload                  |
  |                                  |
  | + process(): Result              |
  |                                  |
  +----------------------------------+
                  |
                  |
                  V
  +----------------------------------+
  |          Split Component         |
  +----------------------------------+
  | - Split Logic                    |
  | - Data/Workload                  |
  |                                  |
  | + process(): Result              |
  |                                  |
  +----------------------------------+

In this rough UML representation, the Split Component represents the individual components resulting from the splitting of the system or dataset. Each component has its own split logic and is responsible for processing a portion of the data or workload. The process() method represents the functionality specific to each split component, which may vary based on the requirements.
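A minimal sketch of this structure, assuming a simple summing workload, is shown below (the class is illustrative only and not one of the linked course files).

  #include <iostream>
  #include <numeric>
  #include <utility>
  #include <vector>

  // Minimal sketch of the Split Component above: each component owns a portion
  // of the workload and can process it independently (and thus in parallel).
  class SplitComponent {
  public:
      explicit SplitComponent(std::vector<int> portion) : data_(std::move(portion)) {}
      // process(): the component-specific work; here, simply summing the portion.
      int process() const { return std::accumulate(data_.begin(), data_.end(), 0); }
  private:
      std::vector<int> data_;  // the Data/Workload owned by this component
  };

  int main() {
      std::vector<int> workload = {1, 2, 3, 4, 5, 6};
      // Split logic: divide the workload into two halves, one per component.
      SplitComponent first{std::vector<int>(workload.begin(), workload.begin() + 3)};
      SplitComponent second{std::vector<int>(workload.begin() + 3, workload.end())};
      std::cout << first.process() << " " << second.process() << '\n';  // 6 15
      return 0;
  }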

Code Example - Repeatable Splitting

Below is a simple code example with the repeatable splitting pattern:
C++: RepeatableSplitting.cpp.
C#: RepeatableSplitting.cs.
Java: RepeatableSplitting.java.
Python: RepeatableSplitting.py.

Common Usage

The Repeatable Splitting design pattern does not have a widely recognized or established presence in the software industry under that name. Here are some established concepts in the software industry that apply a similar idea:

  1. Modularization: Breaking down a complex system into smaller, independent modules or components that can be developed, tested, and maintained separately. Each module can have a well-defined interface and encapsulate a specific functionality or feature.
  2. Microservices Architecture: Designing a system as a collection of loosely coupled and independently deployable services. Each microservice focuses on a specific business capability and can be developed and deployed independently, allowing for scalability, fault tolerance, and ease of maintenance.
  3. Service-Oriented Architecture (SOA): Decomposing a system into reusable and autonomous services that communicate with each other through well-defined interfaces. Services can be developed and deployed independently, promoting interoperability and flexibility.
  4. Function-based or Serverless Architecture: Designing applications as a composition of small, stateless functions that perform specific tasks. Functions are independently deployable and can be triggered by events or invoked as needed. This approach enables scalability and cost optimization by only running the necessary functions.
  5. Component-Based Development: Building software systems by assembling reusable components. Components encapsulate specific functionality and can be combined and configured to create complex systems. This promotes code reuse, maintainability, and flexibility.
These are just a few examples of common design patterns and architectural styles that involve the splitting or decomposition of software systems into smaller, more manageable units. While the term "Repeatable Splitting" may not be widely used, the underlying concept of breaking down tasks into smaller, repeatable components is a fundamental principle in software engineering.

Code Problem - Odd/Even Splitter

In this example, we have a base class SplittingAlgorithm representing the splitting algorithms, and a concrete implementation OddEvenSplittingAlgorithm that performs the odd-even splitting of data. The SplittingAlgorithm class defines the splitData method that is implemented by concrete algorithms.

The DataSplitter class represents the context that uses the splitting algorithm. It has a setSplittingAlgorithm method to set the desired splitting algorithm and a splitData method that delegates the splitting operation to the chosen algorithm.

In the main function, we create an instance of DataSplitter and an instance of OddEvenSplittingAlgorithm. We set the splitting algorithm to be the OddEvenSplittingAlgorithm using setSplittingAlgorithm. Finally, we call the splitData method to split the input data using the chosen algorithm.

In this complex example, we demonstrate the Repeatable Splitting design pattern by allowing different splitting algorithms to be used interchangeably. By decoupling the splitting logic from the context class, we achieve flexibility and maintainability in handling different splitting requirements.
SplittingAlgorithm.h,
OddEvenSplittingAlgorithm.h,
OddEvenSplittingAlgorithm.cpp,
DataSplitter.h,
OddEvenSplitter.cpp.
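For reference, here is a condensed, single-file sketch of these classes; the actual course files split them across the headers listed above, and the details below are assumptions based on the description.

  #include <iostream>
  #include <memory>
  #include <utility>
  #include <vector>

  // Base class for splitting algorithms.
  class SplittingAlgorithm {
  public:
      virtual ~SplittingAlgorithm() = default;
      // Split the input into two groups; the meaning of the groups is algorithm-specific.
      virtual std::pair<std::vector<int>, std::vector<int>>
      splitData(const std::vector<int>& data) const = 0;
  };

  // Concrete algorithm: separates odd values from even values.
  class OddEvenSplittingAlgorithm : public SplittingAlgorithm {
  public:
      std::pair<std::vector<int>, std::vector<int>>
      splitData(const std::vector<int>& data) const override {
          std::pair<std::vector<int>, std::vector<int>> result;  // {odd, even}
          for (int value : data)
              (value % 2 != 0 ? result.first : result.second).push_back(value);
          return result;
      }
  };

  // Context class: delegates splitting to whichever algorithm is installed.
  class DataSplitter {
  public:
      void setSplittingAlgorithm(std::shared_ptr<SplittingAlgorithm> algorithm) {
          algorithm_ = std::move(algorithm);
      }
      std::pair<std::vector<int>, std::vector<int>>
      splitData(const std::vector<int>& data) const {
          return algorithm_->splitData(data);
      }
  private:
      std::shared_ptr<SplittingAlgorithm> algorithm_;
  };

  int main() {
      DataSplitter splitter;
      splitter.setSplittingAlgorithm(std::make_shared<OddEvenSplittingAlgorithm>());
      auto [odd, even] = splitter.splitData({1, 2, 3, 4, 5, 6});
      std::cout << "odd: " << odd.size() << ", even: " << even.size() << '\n';
      return 0;
  }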

Code Problem - Node Splitter

The following program creates a tree with one type of concrete node (concrete node 1). If split is called, two more concrete nodes are added to the tree - concrete node 1 and concrete node 2.
TreeNode.h,
ConcreteNode1.h,
ConcreteNode2.h,
CompositeNode.h,
NodeSplittingMain.cpp.
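A condensed sketch of how these classes might fit together is shown below; the names follow the linked headers, but the behaviour shown is an assumption based on the description above.

  #include <iostream>
  #include <memory>
  #include <utility>
  #include <vector>

  class TreeNode {
  public:
      virtual ~TreeNode() = default;
      virtual void print() const = 0;
  };

  class ConcreteNode1 : public TreeNode {
  public:
      void print() const override { std::cout << "ConcreteNode1\n"; }
  };

  class ConcreteNode2 : public TreeNode {
  public:
      void print() const override { std::cout << "ConcreteNode2\n"; }
  };

  // Composite node: holds children and can be split, which adds one
  // ConcreteNode1 and one ConcreteNode2 to the tree.
  class CompositeNode : public TreeNode {
  public:
      void add(std::shared_ptr<TreeNode> child) { children_.push_back(std::move(child)); }
      void split() {
          add(std::make_shared<ConcreteNode1>());
          add(std::make_shared<ConcreteNode2>());
      }
      void print() const override {
          std::cout << "CompositeNode with " << children_.size() << " children:\n";
          for (const auto& child : children_) child->print();
      }
  private:
      std::vector<std::shared_ptr<TreeNode>> children_;
  };

  int main() {
      CompositeNode root;
      root.add(std::make_shared<ConcreteNode1>());  // initial tree with one concrete node
      root.split();                                 // adds ConcreteNode1 and ConcreteNode2
      root.print();
      return 0;
  }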

Code Problem - Department Splitting

The following code splits a department recursively into two.
OrganizationalUnit.h,
Department.h,
DepartmentSplittingMain.cpp.
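A condensed sketch of the idea is shown below; the class names follow the linked headers, but the headcount-based split rule is an assumption for illustration.

  #include <iostream>
  #include <memory>
  #include <string>
  #include <utility>
  #include <vector>

  class OrganizationalUnit {
  public:
      virtual ~OrganizationalUnit() = default;
      virtual void print(int indent) const = 0;
  };

  class Department : public OrganizationalUnit {
  public:
      Department(std::string name, int headcount)
          : name_(std::move(name)), headcount_(headcount) {}

      // Recursively split the department in two until each part is small enough.
      void split(int maxHeadcount) {
          if (headcount_ <= maxHeadcount) return;
          auto left  = std::make_shared<Department>(name_ + "-A", headcount_ / 2);
          auto right = std::make_shared<Department>(name_ + "-B", headcount_ - headcount_ / 2);
          left->split(maxHeadcount);
          right->split(maxHeadcount);
          subUnits_ = {left, right};
      }

      void print(int indent) const override {
          std::cout << std::string(indent, ' ') << name_ << " (" << headcount_ << ")\n";
          for (const auto& unit : subUnits_) unit->print(indent + 2);
      }

  private:
      std::string name_;
      int headcount_;
      std::vector<std::shared_ptr<Department>> subUnits_;
  };

  int main() {
      Department engineering("Engineering", 40);
      engineering.split(10);  // split recursively until each unit has <= 10 people
      engineering.print(0);
      return 0;
  }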

The Workflow Pipeline Design Pattern

The Rationale

The Workflow Pipeline design pattern is a software design concept that streamlines the processing of a series of sequential steps or tasks in a system. It provides a structured and efficient approach to orchestrating complex workflows, often involving data processing or batch jobs. Many systems execute a series of tasks that must be performed in a specific order; the pattern provides an organized way to define, manage, and execute these tasks, ensuring they run in a predefined order for efficient and controlled execution.
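As a minimal sketch of the idea (illustrative only; the string-based steps below are assumptions), a pipeline can be modelled as an ordered list of steps, each consuming the previous step's output, with the whole chain run by a single call:

  #include <functional>
  #include <iostream>
  #include <string>
  #include <vector>

  // Minimal sketch: a pipeline is an ordered list of steps, each consuming the
  // previous step's output, and the whole chain runs in one pass.
  int main() {
      using Step = std::function<std::string(const std::string&)>;
      std::vector<Step> pipeline = {
          [](const std::string& in) { return in + " -> cleaned"; },   // preprocessing
          [](const std::string& in) { return in + " -> features"; },  // feature extraction
          [](const std::string& in) { return in + " -> model"; },     // training
      };

      std::string artifact = "raw data";
      for (const auto& step : pipeline)   // orchestration: run steps in order
          artifact = step(artifact);

      std::cout << artifact << '\n';  // raw data -> cleaned -> features -> model
      return 0;
  }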

The UML

Here is a UML diagram for the workflow pipeline pattern:
The Workflow Pipeline Pattern
The Workflow Pipeline design pattern typically consists of the following components:

  1. Steps or Tasks: Steps represent the individual units of work or tasks in the workflow. Each step performs a specific operation or process on the input data and produces an output. Examples of steps can include data preprocessing, feature extraction, transformation, analysis, or any other operation necessary for the workflow.
  2. Input and Output: Each step takes input data from the previous step or an external source and produces output data that is passed to the next step. The input and output can take various forms, such as objects, data structures, files, or messages, depending on the requirements of the workflow.
  3. Pipeline Orchestration: The pipeline orchestration component manages the coordination and sequencing of the steps in the workflow. It ensures that the steps are executed in the desired order, passing the appropriate data between them. The orchestration component can be implemented using various mechanisms, such as function calls, method chaining, configuration files, or specialized workflow management systems.
  4. Control Flow: The control flow component defines the flow of execution between the steps in the workflow. It determines the conditions or rules for progressing from one step to the next. For example, a control flow mechanism might include conditional branching, looping, error handling, or parallel execution, depending on the requirements of the workflow.
  5. Error Handling and Exception Handling: The workflow pipeline should incorporate error handling and exception handling mechanisms to ensure robustness and fault tolerance. This includes handling errors or exceptions that may occur during step execution, such as invalid input data, resource unavailability, or unexpected failures. Proper error handling ensures the workflow can gracefully handle exceptions and recover from failures.
  6. Logging and Monitoring: Logging and monitoring components help track the execution of the workflow pipeline. They provide visibility into the progress of each step, capture relevant metrics or statistics, and allow for debugging and performance analysis. Logging and monitoring facilitate troubleshooting, performance optimization, and overall system health monitoring.

Code Example - Workflow Pipeline

Below is a simple code example with the workflow pipeline pattern:
C++: WorkflowPipeline.cpp.
C#: WorkflowPipelineMain.cs.
Java: WorkflowPipelineMain.java.
Python: WorkflowPipeline.py.

Common Usage

The Workflow Pipeline design pattern, also known as the Pipeline pattern or the Pipes and Filters pattern, is a popular design pattern in the software industry. It is commonly used in various domains and scenarios to process and transform data in a series of sequential steps or stages. Here are some common usages of the Workflow Pipeline design pattern:

  1. Data Processing and Transformation: Workflow pipelines are extensively used for data processing tasks such as data ingestion, data cleaning, data transformation, and data enrichment. Each stage of the pipeline represents a specific processing step, and the data flows through the pipeline, undergoing various transformations and manipulations.
  2. ETL (Extract, Transform, Load) Processes: ETL processes involve extracting data from various sources, transforming it into a desired format or structure, and loading it into a target system. Workflow pipelines provide a structured approach to defining and executing these ETL processes, where each stage of the pipeline represents a specific operation or transformation on the data.
  3. Batch Processing: Workflow pipelines are commonly used in batch processing scenarios, where large volumes of data need to be processed in a systematic manner. The pipeline stages can include tasks like data validation, filtering, sorting, aggregation, and generating reports. The pipeline enables efficient and scalable batch processing by breaking down the overall task into smaller, manageable steps.
  4. Data Integration and Orchestration: Workflow pipelines are used to integrate and orchestrate various systems and services. For example, in a service-oriented architecture or microservices environment, a workflow pipeline can coordinate the execution of multiple services, each performing a specific task, to achieve an overall business process.
  5. Continuous Integration and Delivery (CI/CD) Pipelines: In software development and DevOps practices, workflow pipelines are employed for CI/CD processes. Each stage of the pipeline represents a specific step in the software development lifecycle, such as code compilation, testing, code quality checks, deployment, and monitoring. CI/CD pipelines automate the software delivery process, ensuring consistency, reliability, and efficiency.
  6. Image and Signal Processing: Workflow pipelines are used in domains like image processing and signal processing. Each stage of the pipeline can perform operations like noise reduction, filtering, feature extraction, classification, and visualization, enabling complex data analysis and manipulation.
  7. Data Streaming and Real-time Processing: Workflow pipelines can be adapted for streaming data processing scenarios, where data is processed in real-time or near real-time. Each stage of the pipeline can perform operations on streaming data, such as filtering, aggregation, pattern matching, and anomaly detection.

Code Problem - Data Pipeline

In this example, we have a base class Step representing the individual steps in the workflow pipeline. We have concrete implementations of the steps:
DataPreparationStep, FeatureExtractionStep, ModelTrainingStep, and PredictionStep. Each step implements the execute method to perform its specific tasks.

The WorkflowPipeline class represents the pipeline itself. It has a collection of steps and provides methods to add steps to the pipeline and execute the pipeline.

In the main function, we create an instance of the WorkflowPipeline and instances of the concrete steps. We add the steps to the pipeline using the addStep method. Finally, we execute the pipeline using the execute method.
Step.h,
DataPreparationStep.h,
FeatureExtractionStep.h,
ModelTrainingStep.h,
PredictionStep.h,
WorkflowPipeline.h,
DataPipeline.cpp.
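For reference, here is a condensed, single-file sketch of these classes; the actual course files split them across the headers listed above, and the step bodies below are placeholders.

  #include <iostream>
  #include <memory>
  #include <utility>
  #include <vector>

  // Base class for a pipeline step.
  class Step {
  public:
      virtual ~Step() = default;
      virtual void execute() = 0;
  };

  class DataPreparationStep : public Step {
  public:
      void execute() override { std::cout << "Preparing data\n"; }
  };

  class FeatureExtractionStep : public Step {
  public:
      void execute() override { std::cout << "Extracting features\n"; }
  };

  class ModelTrainingStep : public Step {
  public:
      void execute() override { std::cout << "Training model\n"; }
  };

  class PredictionStep : public Step {
  public:
      void execute() override { std::cout << "Making predictions\n"; }
  };

  // The pipeline holds the steps and runs them in the order they were added.
  class WorkflowPipeline {
  public:
      void addStep(std::shared_ptr<Step> step) { steps_.push_back(std::move(step)); }
      void execute() {
          for (const auto& step : steps_) step->execute();
      }
  private:
      std::vector<std::shared_ptr<Step>> steps_;
  };

  int main() {
      WorkflowPipeline pipeline;
      pipeline.addStep(std::make_shared<DataPreparationStep>());
      pipeline.addStep(std::make_shared<FeatureExtractionStep>());
      pipeline.addStep(std::make_shared<ModelTrainingStep>());
      pipeline.addStep(std::make_shared<PredictionStep>());
      pipeline.execute();  // run the whole workflow with a single call
      return 0;
  }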

Code Problem - Image Pipeline

The image pipeline loads an image, pre-processes it, extracts features from it, trains a model, evaluates the model, and deploys it. The code can be seen below.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
FeatureExtractor.h,
ModelTrainer.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
ImagePipelineMain.cpp.

Code Problem - Second Image Pipeline

This image pipeline is similar to the above. It loads an image, preprocesses the image, extracts CNN (convolutional neural network) features from the image, trains a CNN model, applies transfer learning, tunes hyperparameters, evaluates the model, and deploys the model.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
CNNFeatureExtractor.h,
CNNModelTrainer.h,
TransferLearning.h,
HyperparameterTuner.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
Image2PipelineMain.cpp.

The Feature Store Design Pattern

The Feature Store design pattern simplifies the management and reuse of features across projects by decoupling the feature creation process from the development of models using those features.

The Rationale

Good feature engineering is crucial for the success of many machine learning solutions. However, it is also one of the most time-consuming parts of model development. Some features require significant domain knowledge to calculate correctly, and changes in the business strategy can affect how a feature should be computed. To ensure such features are computed in a consistent way, it's better for these features to be under the control of domain experts rather than ML engineers.

The UML

The following is a basic UML diagram of the feature store design pattern.

  +-----------------------------------------------+
  |              Raw Data Source                  |
  |  - Collect data from various sources          |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |           Data Ingestion Layer                |
  |  - Ingest and store raw data                  |
  |  - Ensure immutability                        |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |       Data Transformation Engine              |
  |  - Apply transformations to data              |
  |  - Ensure transformations are deterministic   |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |            Feature Store Layer                |
  |  - Store transformed features                 |
  |  - Version control for features               |
  |  - Ensure feature immutability                |
  |  - Track feature lineage                      |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |  Feature Serving Interface                    |
  |  - Provide features for training and inference|
  |  - Ensure consistent access to feature versions|
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |    Machine Learning Model Training            |
  |  - Retrieve versioned features                |
  |  - Ensure reproducible training process       |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |       Model Serving and Deployment            |
  |  - Deploy trained models                      |
  |  - Use versioned features for prediction      |
  +-----------------------------------------------+
                     |
                     v
  +-----------------------------------------------+
  |     Monitoring and Feedback Loop              |
  |  - Monitor model performance                  |
  |  - Collect feedback for feature improvement   |
  +-----------------------------------------------+
Raw Data Source: Collects data from various sources.
Data Ingestion Layer: Ingests raw data and ensures its immutability.
Data Transformation Engine: Applies deterministic transformations to data. Ensures the transformations are reproducible.
Feature Store Layer: Stores transformed features. Ensures features are versioned and immutable. Tracks feature lineage for traceability.
Feature Serving Interface: Provides features for training and inference. Ensures consistent access to feature versions.
Machine Learning Model Training: Retrieves versioned features for training. Ensures the training process is reproducible.
Model Serving and Deployment: Deploys trained models. Uses versioned features for making predictions.
Monitoring and Feedback Loop: Monitors model performance. Collects feedback to continuously improve features and model performance.

Code Example - Feature Store

Below is a simple example of using the feature store design pattern for a machine learning workflow. In this example, we'll create a basic feature store to manage and retrieve features for training and inference.
C++: FeatureStore.cpp.
C#: FeatureStore.cs.
Java: FeatureStore.java, FeatureStoreMain.java.
Python: FeatureStore.py.
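The linked files are not reproduced here; as a minimal inline sketch of the Feature Store Layer and Feature Serving Interface described above (the in-memory map, the (name, version) key, and the example feature are assumptions for illustration), a store that serves identical, versioned features to both training and inference might look like this:

  #include <iostream>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Minimal sketch: features are stored under an explicit version so that
  // training and serving can request exactly the same values.
  class FeatureStore {
  public:
      // Store an immutable, versioned copy of a feature vector.
      void putFeature(const std::string& name, int version, std::vector<double> values) {
          store_[{name, version}] = std::move(values);
      }
      // Serve a specific version, for either training or inference.
      const std::vector<double>& getFeature(const std::string& name, int version) const {
          return store_.at({name, version});
      }
  private:
      std::map<std::pair<std::string, int>, std::vector<double>> store_;
  };

  int main() {
      FeatureStore store;
      store.putFeature("avg_purchase_amount", 1, {12.5, 30.0, 7.25});

      // Training and model serving both read version 1, so they see identical values.
      const auto& trainingView = store.getFeature("avg_purchase_amount", 1);
      const auto& servingView  = store.getFeature("avg_purchase_amount", 1);
      std::cout << trainingView[0] << " == " << servingView[0] << '\n';
      return 0;
  }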

Common Usage

The feature store design pattern offers a centralized repository for storing, sharing, and managing features. Here are some common usages:

  1. Centralized Feature Management
    Single Source of Truth: Acts as a centralized repository where features are stored, ensuring consistency and reliability across different teams and projects.
    Versioning and Lineage: Tracks the history and lineage of features, enabling reproducibility and traceability of feature engineering processes.
  2. Feature Sharing and Reusability
    Cross-Team Collaboration: Facilitates sharing of features across different teams, promoting reuse and reducing duplication of effort.
    Standardization: Ensures that features are standardized and comply with organizational guidelines, improving the quality and consistency of features used in models.
  3. Online and Offline Feature Serving
    Real-Time Feature Serving: Provides low-latency access to features for online model inference, enabling real-time predictions.
    Batch Feature Serving: Supplies features for offline model training and batch inference, ensuring that the same features are used in both training and production.
  4. Efficient Feature Engineering
    Pre-Computed Features: Stores pre-computed features that can be quickly retrieved and used, reducing the computational overhead during model training and inference.
    Feature Transformation Pipelines: Allows complex feature transformations to be defined, stored, and reused, streamlining the feature engineering process.
  5. Consistency and Synchronization
    Training-Serving Skew Mitigation: Ensures that the features used during model training are identical to those used during inference, preventing discrepancies that can degrade model performance.
    Temporal Consistency: Manages time-sensitive features, ensuring that features are consistently aligned with the correct time windows.
  6. Feature Monitoring and Governance
    Quality Monitoring: Continuously monitors the quality and distribution of features to detect data drift and anomalies.
    Access Control and Auditing: Implements access control and auditing to manage who can access and modify features, ensuring compliance with security and privacy policies.
  7. Scalability and Performance Optimization
    Scalable Storage: Utilizes scalable storage solutions to handle large volumes of feature data efficiently.
    Optimized Retrieval: Implements indexing and caching mechanisms to optimize the retrieval of features, improving the performance of model training and inference.
  8. Integration with ML Pipelines
    Seamless Integration: Integrates with existing ML pipelines and tools, providing a seamless workflow for data scientists and engineers.
    Automated Feature Extraction: Automates the extraction and computation of features from raw data, reducing the manual effort required in feature engineering.
  9. Experimentation and A/B Testing
    Consistent Features for Experiments: Ensures that the same features are used across different experimental setups and A/B tests, providing reliable and comparable results.
    Feature Experimentation: Allows data scientists to experiment with new features and quickly evaluate their impact on model performance.
  10. Compliance and Data Privacy
    Data Masking and Anonymization: Implements data masking and anonymization techniques to protect sensitive information in features.
    Regulatory Compliance: Ensures that feature storage and usage comply with relevant regulations and standards, such as GDPR or CCPA.

Code Problem - Predictions based on Features

The following code focuses on creating a feature store, adding features, and using these features for predictions: MultiFeature.cpp.
The FeatureStore class manages feature generators.
The addFeature() method adds a feature generator.
The getFeatures() method retrieves features for a single data point.
The getBatchFeatures() method retrieves features for a batch of data points.
The following example feature generators compute features from a data point: sumFeature(), meanFeature(), and maxFeature().
The main function defines a batch of data points, initializes the feature store and adds feature generators, retrieves features for the data batch, defines model coefficients and intercept, initializes the linear regression model and sets feature indexes, and performs batch predictions and outputs the results.
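A condensed sketch matching this description is shown below; the exact signatures used in MultiFeature.cpp may differ, and the feature-generator set, coefficients, and intercept are illustrative assumptions.

  #include <algorithm>
  #include <iostream>
  #include <numeric>
  #include <string>
  #include <vector>

  using DataPoint = std::vector<double>;
  using FeatureGenerator = double (*)(const DataPoint&);

  // Example feature generators, each computing one feature from a data point.
  double sumFeature(const DataPoint& p)  { return std::accumulate(p.begin(), p.end(), 0.0); }
  double meanFeature(const DataPoint& p) { return sumFeature(p) / static_cast<double>(p.size()); }
  double maxFeature(const DataPoint& p)  { return *std::max_element(p.begin(), p.end()); }

  // The feature store manages named feature generators.
  class FeatureStore {
  public:
      void addFeature(const std::string& name, FeatureGenerator generator) {
          names_.push_back(name);
          generators_.push_back(generator);
      }
      // Features for a single data point, in the order the generators were added.
      std::vector<double> getFeatures(const DataPoint& point) const {
          std::vector<double> features;
          for (auto generator : generators_) features.push_back(generator(point));
          return features;
      }
      // Features for a batch of data points.
      std::vector<std::vector<double>> getBatchFeatures(const std::vector<DataPoint>& batch) const {
          std::vector<std::vector<double>> all;
          for (const auto& point : batch) all.push_back(getFeatures(point));
          return all;
      }
  private:
      std::vector<std::string> names_;
      std::vector<FeatureGenerator> generators_;
  };

  int main() {
      std::vector<DataPoint> batch = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};

      FeatureStore store;
      store.addFeature("sum", sumFeature);
      store.addFeature("mean", meanFeature);
      store.addFeature("max", maxFeature);

      auto features = store.getBatchFeatures(batch);

      // Toy linear model over the generated features: y = w . x + b.
      std::vector<double> coefficients = {0.5, 1.0, 0.25};
      double intercept = 2.0;
      for (const auto& x : features) {
          double prediction = intercept +
              std::inner_product(x.begin(), x.end(), coefficients.begin(), 0.0);
          std::cout << "prediction: " << prediction << '\n';
      }
      return 0;
  }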

Standard Design Patterns Applied to Machine Learning (optional)

Below are examples of standard design patterns applied to machine learning problems. These implement the command pattern, the observer pattern, and the strategy pattern.
CommandML.cpp, ObserverML.cpp and StrategyML.cpp.