Lecture recording (Nov 26, 2024) here.
Lab recording (Nov 28, 2024) here.
We look at three more reproducibility patterns this week, and one next week. This week we cover the repeatable splitting pattern, the workflow pipeline pattern, and the feature store pattern. The repeatable splitting pattern calls for a lightweight, repeatable method of creating data splits that does not depend on the programming language or on random seeds. The workflow pipeline pattern makes each step of the workflow a separate, containerized service; these services are chained into a pipeline that can be run with a single call. The feature store pattern ensures that the features used by machine learning models can be consistently and accurately reproduced, which is essential for maintaining the reliability and consistency of ML models over time.
The Workflow Pipeline Pattern | ML Design Patterns (2:19-4:53)
The Feature Store Pattern | What is a Feature Store for Machine Learning?
Machine Learning Design Patterns (11:15-14:35)
Assignment 6 - Investigation of Design Patterns for Machine Learning
The Rationale
The repeatable splitting design pattern addresses how a dataset is divided into training, validation, and test sets so that the division can be reproduced exactly: by anyone, in any programming language, and on any run. Random splits controlled by a seed are fragile; a different library, a different row ordering, or newly arrived data can silently change which examples land in which split. The pattern instead derives the split from the data itself, typically by hashing a well-chosen column and assigning each example to a split based on its hash value. Because the assignment depends only on the example, it is deterministic, lightweight, and independent of the environment in which it is computed.
Repeatable Splitting can capture the way data is split among training, validation, and test datasets to ensure that a training example that is used in training is never used for evaluation or testing even as the dataset grows.
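As a hedged illustration of this idea (separate from the linked code examples below), the sketch assigns each record to train, validation, or test by hashing a stable key column with a hand-rolled FNV-1a hash, so the assignment depends only on the data rather than on a random seed or a particular library. The key values, the 80/10/10 proportions, and the function names are illustrative assumptions.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// FNV-1a: a small, fully deterministic hash, used here instead of std::hash,
// whose results may vary between compilers or program runs.
std::uint64_t fnv1a(const std::string& key) {
    std::uint64_t hash = 14695981039346656037ULL;
    for (unsigned char c : key) {
        hash ^= c;
        hash *= 1099511628211ULL;
    }
    return hash;
}

enum class Split { Train, Validation, Test };

// Assign a record to a split using only a stable key column (illustrative),
// giving roughly an 80/10/10 division that never changes between runs.
Split assignSplit(const std::string& key) {
    std::uint64_t bucket = fnv1a(key) % 10;
    if (bucket < 8) return Split::Train;
    if (bucket == 8) return Split::Validation;
    return Split::Test;
}

int main() {
    std::vector<std::string> keys = {"2024-01-01", "2024-01-02", "2024-01-03"};
    for (const auto& key : keys) {
        Split s = assignSplit(key);
        std::cout << key << " -> "
                  << (s == Split::Train ? "train"
                      : s == Split::Validation ? "validation" : "test")
                  << "\n";
    }
}

Because the bucket comes from the key itself, re-running the program, porting it to another language with the same hash, or appending new records never moves an existing record to a different split.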
The UML
Here is the UML diagram for the repeatable splitting pattern:
+----------------------------------+
| Split Component                  |
+----------------------------------+
| - Split Logic                    |
| - Data/Workload                  |
|                                  |
| + process(): Result              |
|                                  |
+----------------------------------+
                 |
                 V
+----------------------------------+
| Split Component                  |
+----------------------------------+
| - Split Logic                    |
| - Data/Workload                  |
|                                  |
| + process(): Result              |
|                                  |
+----------------------------------+
                 |
                 V
+----------------------------------+
| Split Component                  |
+----------------------------------+
| - Split Logic                    |
| - Data/Workload                  |
|                                  |
| + process(): Result              |
|                                  |
+----------------------------------+
Code Example - Repeatable Splitting
Below is a simple code example with the repeatable splitting pattern:
C++: RepeatableSplitting.cpp.
C#: RepeatableSplitting.cs.
Java: RepeatableSplitting.java.
Python: RepeatableSplitting.py.
Common Usage
The Repeatable Splitting design pattern, under that name, does not have a widely recognized or established presence in the software industry. Here are some potential usages of the same underlying idea of hash-based, deterministic assignment:
Splitting data into training, validation, and test sets in ML pipelines by hashing a key column rather than drawing random numbers.
Bucketing users into A/B test groups by hashing a user ID, so that a user always lands in the same group.
Sharding records across storage or processing nodes with consistent hashing, so the mapping from record to shard is reproducible.
Code Problem - Odd/Even Splitter
In this example, we have a base class SplittingAlgorithm representing the splitting algorithms, and a concrete implementation OddEvenSplittingAlgorithm that performs the odd-even splitting of data. The SplittingAlgorithm class defines the splitData method that is implemented by concrete algorithms.
The DataSplitter class represents the context that uses the splitting algorithm. It has a setSplittingAlgorithm method to set the desired splitting algorithm and a splitData method that delegates the splitting operation to the chosen algorithm.
In the main function, we create an instance of DataSplitter and an instance of OddEvenSplittingAlgorithm. We set the splitting algorithm to be the OddEvenSplittingAlgorithm using setSplittingAlgorithm. Finally, we call the splitData method to split the input data using the chosen algorithm.
This example demonstrates the Repeatable Splitting design pattern by allowing different splitting algorithms to be used
interchangeably. Decoupling the splitting logic from the context class gives us flexibility and maintainability in
handling different splitting requirements.
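For reference, here is a condensed single-file sketch of the design described above; the exact interfaces in the linked headers may differ (for instance, the signature of splitData), and splitting by element index rather than by value is an assumption.

#include <cstddef>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

// Abstract splitting algorithm; concrete algorithms implement splitData.
class SplittingAlgorithm {
public:
    virtual ~SplittingAlgorithm() = default;
    virtual std::pair<std::vector<int>, std::vector<int>>
    splitData(const std::vector<int>& data) const = 0;
};

// Concrete algorithm: even-index elements go to one partition, odd-index to the other.
class OddEvenSplittingAlgorithm : public SplittingAlgorithm {
public:
    std::pair<std::vector<int>, std::vector<int>>
    splitData(const std::vector<int>& data) const override {
        std::pair<std::vector<int>, std::vector<int>> parts;
        for (std::size_t i = 0; i < data.size(); ++i) {
            (i % 2 == 0 ? parts.first : parts.second).push_back(data[i]);
        }
        return parts;
    }
};

// Context: delegates splitting to whichever algorithm has been set.
class DataSplitter {
public:
    void setSplittingAlgorithm(std::unique_ptr<SplittingAlgorithm> algorithm) {
        algorithm_ = std::move(algorithm);
    }
    std::pair<std::vector<int>, std::vector<int>>
    splitData(const std::vector<int>& data) const {
        return algorithm_->splitData(data);
    }
private:
    std::unique_ptr<SplittingAlgorithm> algorithm_;
};

int main() {
    DataSplitter splitter;
    splitter.setSplittingAlgorithm(std::make_unique<OddEvenSplittingAlgorithm>());
    auto [evens, odds] = splitter.splitData({10, 11, 12, 13, 14});
    std::cout << "even-index partition: " << evens.size()
              << " elements, odd-index partition: " << odds.size() << " elements\n";
}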
SplittingAlgorithm.h,
OddEvenSplittingAlgorithm.h,
OddEvenSplittingAlgorithm.cpp,
DataSplitter.h,
OddEvenSplitter.cpp.
Code Problem - Node Splitter
The following program creates a tree with one type of concrete node (ConcreteNode1). If split is called, two more concrete
nodes are added to the tree: a ConcreteNode1 and a ConcreteNode2.
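One possible shape of this program is sketched below; the linked headers are not shown here, so the class responsibilities (in particular, where split is defined) are assumptions.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Abstract tree node.
class TreeNode {
public:
    virtual ~TreeNode() = default;
    virtual std::string name() const = 0;
};

class ConcreteNode1 : public TreeNode {
public:
    std::string name() const override { return "ConcreteNode1"; }
};

class ConcreteNode2 : public TreeNode {
public:
    std::string name() const override { return "ConcreteNode2"; }
};

// Composite that owns child nodes; split() grows the tree by adding
// one ConcreteNode1 and one ConcreteNode2.
class CompositeNode : public TreeNode {
public:
    std::string name() const override { return "CompositeNode"; }
    void add(std::unique_ptr<TreeNode> child) { children_.push_back(std::move(child)); }
    void split() {
        add(std::make_unique<ConcreteNode1>());
        add(std::make_unique<ConcreteNode2>());
    }
    void print() const {
        for (const auto& child : children_) {
            std::cout << "  " << child->name() << "\n";
        }
    }
private:
    std::vector<std::unique_ptr<TreeNode>> children_;
};

int main() {
    CompositeNode root;
    root.add(std::make_unique<ConcreteNode1>());  // the tree starts with one ConcreteNode1
    root.split();                                 // split adds a ConcreteNode1 and a ConcreteNode2
    std::cout << root.name() << " children:\n";
    root.print();
}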
TreeNode.h,
ConcreteNode1.h,
ConcreteNode2.h,
CompositeNode.h,
NodeSplittingMain.cpp.
Code Problem - Department Splitting
The following code recursively splits a department into two sub-departments.
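A hedged sketch of one way the recursion might look is shown below; the headcount threshold, constructor parameters, and naming scheme are illustrative assumptions rather than the contents of the linked files.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Abstract organizational unit in the hierarchy.
class OrganizationalUnit {
public:
    virtual ~OrganizationalUnit() = default;
    virtual void print(int indent) const = 0;
};

// A department that can recursively split itself into two sub-departments.
class Department : public OrganizationalUnit {
public:
    Department(std::string name, int headcount)
        : name_(std::move(name)), headcount_(headcount) {}

    // Keep splitting in half until each unit is small enough (illustrative threshold).
    void split(int maxHeadcount) {
        if (headcount_ <= maxHeadcount) return;
        int half = headcount_ / 2;
        auto left = std::make_unique<Department>(name_ + "-A", half);
        auto right = std::make_unique<Department>(name_ + "-B", headcount_ - half);
        left->split(maxHeadcount);
        right->split(maxHeadcount);
        children_.push_back(std::move(left));
        children_.push_back(std::move(right));
    }

    void print(int indent) const override {
        std::cout << std::string(indent, ' ') << name_ << " (" << headcount_ << ")\n";
        for (const auto& child : children_) child->print(indent + 2);
    }

private:
    std::string name_;
    int headcount_;
    std::vector<std::unique_ptr<Department>> children_;
};

int main() {
    Department engineering("Engineering", 40);
    engineering.split(10);   // recursively split into two until each unit has at most 10 people
    engineering.print(0);
}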
OrganizationalUnit.h,
Department.h,
DepartmentSplittingMain.cpp.
The Rationale
The Workflow Pipeline design pattern streamlines the processing of a series of sequential steps or tasks in a system. It provides a structured and efficient approach for orchestrating complex workflows, often involving data processing or batch jobs. Many systems must execute a series of tasks in a specific order; the Workflow Pipeline pattern provides an organized way to define, manage, and execute these tasks, ensuring they run in a predefined order and allowing for efficient, controlled execution.
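As a minimal illustration of the idea (separate from the linked code examples in the next sections), the sketch below models each stage as a function and runs the stages strictly in the order they were added; the stage names are purely illustrative.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A stage transforms the data and hands the result to the next stage.
using Stage = std::function<std::string(const std::string&)>;

// Run the stages strictly in the order they appear in the pipeline.
std::string runPipeline(const std::vector<Stage>& stages, std::string data) {
    for (const auto& stage : stages) {
        data = stage(data);
    }
    return data;
}

int main() {
    std::vector<Stage> pipeline = {
        [](const std::string& d) { return d + " -> cleaned"; },
        [](const std::string& d) { return d + " -> transformed"; },
        [](const std::string& d) { return d + " -> loaded"; },
    };
    std::cout << runPipeline(pipeline, "raw data") << "\n";
}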
The UML
Here is a UML diagram for the workflow pipeline pattern:
+----------------------------------+
| WorkflowPipeline                 |
+----------------------------------+
| - steps: ordered list of Step    |
|                                  |
| + addStep(step)                  |
| + execute()                      |
+----------------------------------+
                 |
                 v
+----------------------------------+
| Step (abstract)                  |
+----------------------------------+
| + execute()                      |
+----------------------------------+
                 ^
                 |
+----------------------------------+
| Concrete steps:                  |
| DataPreparationStep,             |
| FeatureExtractionStep,           |
| ModelTrainingStep,               |
| PredictionStep                   |
+----------------------------------+
The Workflow Pipeline design pattern typically consists of the following components:
A step (or stage) abstraction that defines a common execute operation.
Concrete steps that implement that operation for each task in the workflow.
A pipeline class that stores the steps in order, lets clients add steps, and runs them one after another.
The data or context that flows from one step to the next.
Code Example - Workflow Pipeline
Below is a simple code example with the workflow pipeline pattern:
C++: WorkflowPipeline.cpp.
C#: WorkflowPipelineMain.cs.
Java: WorkflowPipelineMain.java.
Python: WorkflowPipeline.py.
Common Usage
The Workflow Pipeline design pattern, also known as the Pipeline pattern or the Pipes and Filters pattern, is a popular design pattern in the software industry. It is commonly used in various domains and scenarios to process and transform data in a series of sequential steps or stages. Here are some common usages of the Workflow Pipeline design pattern:
ETL (extract, transform, load) and other data-processing pipelines.
Compilers, which pass source code through lexing, parsing, optimization, and code generation stages.
Continuous integration and deployment pipelines that build, test, and release software in fixed stages.
Image and signal processing chains, where each stage applies one transformation to the data.
Machine learning workflows that prepare data, extract features, train, evaluate, and deploy models.
Code Problem - Data Pipeline
In this example, we have a base class Step representing the individual steps in the workflow pipeline. We have
concrete implementations of the steps:
DataPreparationStep, FeatureExtractionStep, ModelTrainingStep,
and PredictionStep. Each step implements the execute method to perform its specific tasks.
The WorkflowPipeline class represents the pipeline itself. It has a collection of steps and provides methods to add steps to the pipeline and execute the pipeline.
In the main function, we create an instance of the WorkflowPipeline and instances of the concrete steps. We add
the steps to the pipeline using the addStep method. Finally, we execute the pipeline using the execute method.
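The following condensed, single-file sketch mirrors the structure described above; the linked headers may differ in detail, for example execute might accept and return data rather than printing a message.

#include <iostream>
#include <memory>
#include <vector>

// Base class for a pipeline step.
class Step {
public:
    virtual ~Step() = default;
    virtual void execute() = 0;
};

class DataPreparationStep : public Step {
public:
    void execute() override { std::cout << "Preparing data\n"; }
};

class FeatureExtractionStep : public Step {
public:
    void execute() override { std::cout << "Extracting features\n"; }
};

class ModelTrainingStep : public Step {
public:
    void execute() override { std::cout << "Training model\n"; }
};

class PredictionStep : public Step {
public:
    void execute() override { std::cout << "Making predictions\n"; }
};

// The pipeline holds the steps and executes them in the order they were added.
class WorkflowPipeline {
public:
    void addStep(std::unique_ptr<Step> step) { steps_.push_back(std::move(step)); }
    void execute() {
        for (const auto& step : steps_) step->execute();
    }
private:
    std::vector<std::unique_ptr<Step>> steps_;
};

int main() {
    WorkflowPipeline pipeline;
    pipeline.addStep(std::make_unique<DataPreparationStep>());
    pipeline.addStep(std::make_unique<FeatureExtractionStep>());
    pipeline.addStep(std::make_unique<ModelTrainingStep>());
    pipeline.addStep(std::make_unique<PredictionStep>());
    pipeline.execute();
}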
Step.h,
DataPreparationStep.h,
FeatureExtractionStep.h,
ModelTrainingStep.h,
PredictionStep.h,
WorkflowPipeline.h,
DataPipeline.cpp.
Code Problem - Image Pipeline
The image pipeline loads an image, preprocesses it, extracts features from it, trains a model, evaluates the model,
and deploys the model. The code is linked below.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
FeatureExtractor.h,
ModelTrainer.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
ImagePipelineMain.cpp.
Code Problem - Second Image Pipeline
This image pipeline is similar to the above. It loads an image, preprocesses the image, extracts CNN (convolutional neural network)
features from the image, trains a CNN model, applies transfer learning, tunes hyperparameters, evaluates the model,
and deploys the model.
Image.h,
ImageLoader.h,
ImagePreprocessor.h,
CNNFeatureExtractor.h,
CNNModelTrainer.h,
TransferLearning.h,
HyperparameterTuner.h,
ModelEvaluator.h,
ModelDeployer.h,
MLWorkflowPipeline.h,
Image2PipelineMain.cpp.
The Feature Store design pattern simplifies the management and reuse of features across projects by decoupling the feature creation process from the development of models using those features.
The Rationale
Good feature engineering is crucial for the success of many machine learning solutions. However, it is also one of the most time-consuming parts of model development. Some features require significant domain knowledge to calculate correctly, and changes in the business strategy can affect how a feature should be computed. To ensure such features are computed in a consistent way, it's better for these features to be under the control of domain experts rather than ML engineers.
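As a hedged sketch of this rationale (not the linked FeatureStore files), the example below registers a feature definition once, under a name, so that training code and serving code compute it identically; the feature name, the raw-data representation, and the sample values are illustrative assumptions.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A feature is computed from raw values by a function registered once,
// so training and serving both use exactly the same definition.
class FeatureStore {
public:
    using FeatureFn = std::function<double(const std::vector<double>&)>;

    void registerFeature(const std::string& name, FeatureFn fn) {
        features_[name] = std::move(fn);
    }
    double compute(const std::string& name, const std::vector<double>& raw) const {
        return features_.at(name)(raw);
    }
private:
    std::map<std::string, FeatureFn> features_;
};

int main() {
    FeatureStore store;
    // The domain expert owns this definition; model code only refers to it by name.
    store.registerFeature("avg_purchase", [](const std::vector<double>& purchases) {
        double sum = 0.0;
        for (double p : purchases) sum += p;
        return purchases.empty() ? 0.0 : sum / purchases.size();
    });

    std::vector<double> trainingRecord = {12.0, 30.0, 18.0};
    std::vector<double> servingRecord = {25.0, 15.0};

    // Same feature definition at training time and at prediction time.
    std::cout << "training feature: " << store.compute("avg_purchase", trainingRecord) << "\n";
    std::cout << "serving feature:  " << store.compute("avg_purchase", servingRecord) << "\n";
}

Because both call sites look the feature up by name, a domain expert can change the definition in one place and every consumer picks up the corrected computation.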
The UML
The following is a basic UML diagram of the feature store design pattern.
+-----------------------------------------------+
| Raw Data Source                               |
| - Collect data from various sources           |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Data Ingestion Layer                          |
| - Ingest and store raw data                   |
| - Ensure immutability                         |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Data Transformation Engine                    |
| - Apply transformations to data               |
| - Ensure transformations are deterministic    |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Feature Store Layer                           |
| - Store transformed features                  |
| - Version control for features                |
| - Ensure feature immutability                 |
| - Track feature lineage                       |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Feature Serving Interface                     |
| - Provide features for training and inference |
| - Ensure consistent access to feature versions|
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Machine Learning Model Training               |
| - Retrieve versioned features                 |
| - Ensure reproducible training process        |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Model Serving and Deployment                  |
| - Deploy trained models                       |
| - Use versioned features for prediction       |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
| Monitoring and Feedback Loop                  |
| - Monitor model performance                   |
| - Collect feedback for feature improvement    |
+-----------------------------------------------+
Raw Data Source: Collects data from various sources.
Code Example - Feature Store
Below is a simple example of using the feature store design pattern for a machine learning workflow. In this
example, we'll create a basic feature store to manage and retrieve features for training and inference.
C++: FeatureStore.cpp.
C#: FeatureStore.cs.
Java: FeatureStore.java,
FeatureStoreMain.java.
Python: FeatureStore.py.
Common Usage
The feature store design pattern offers a centralized repository for storing, sharing, and managing features. Here are some common usages:
Keeping training and serving consistent, so the same feature definitions are used offline and online and training/serving skew is avoided.
Sharing and reusing features across teams, projects, and models instead of re-implementing them.
Serving precomputed features at low latency for online prediction.
Versioning features and tracking their lineage so that past training runs can be reproduced.
Code Problem - Predictions based on Features
The following code focuses on creating a feature store, adding features, and using these features for predictions:
MultiFeature.cpp.
The FeatureStore class manages feature generators.
The addFeature() method adds a feature generator.
The getFeatures() method retrieves features for a single data point.
The getBatchFeatures() method retrieves features for a batch of data points.
The following example feature generators compute features from a data point: sumFeature(), meanFeature(), and maxFeature().
The main function defines a batch of data points, initializes the feature store and adds feature generators, retrieves features for the data batch,
defines model coefficients and intercept,
initializes the linear regression model and sets feature indexes, and
performs batch predictions and outputs the results.
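Below is a condensed sketch of that structure; MultiFeature.cpp may differ in detail (for instance, it wraps the model in a LinearRegression class with explicit feature indexes), and the coefficients, intercept, and data points here are illustrative.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

using DataPoint = std::vector<double>;
// A feature generator turns a raw data point into one feature value.
using FeatureGenerator = std::function<double(const DataPoint&)>;

class FeatureStore {
public:
    void addFeature(FeatureGenerator generator) { generators_.push_back(std::move(generator)); }

    // Features for a single data point.
    std::vector<double> getFeatures(const DataPoint& point) const {
        std::vector<double> features;
        for (const auto& g : generators_) features.push_back(g(point));
        return features;
    }

    // Features for a batch of data points.
    std::vector<std::vector<double>> getBatchFeatures(const std::vector<DataPoint>& batch) const {
        std::vector<std::vector<double>> all;
        for (const auto& point : batch) all.push_back(getFeatures(point));
        return all;
    }

private:
    std::vector<FeatureGenerator> generators_;
};

// Example feature generators.
double sumFeature(const DataPoint& p) { return std::accumulate(p.begin(), p.end(), 0.0); }
double meanFeature(const DataPoint& p) { return p.empty() ? 0.0 : sumFeature(p) / p.size(); }
double maxFeature(const DataPoint& p) { return p.empty() ? 0.0 : *std::max_element(p.begin(), p.end()); }

int main() {
    std::vector<DataPoint> batch = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};

    FeatureStore store;
    store.addFeature(sumFeature);
    store.addFeature(meanFeature);
    store.addFeature(maxFeature);

    // Illustrative linear model: prediction = intercept + coefficients . features
    std::vector<double> coefficients = {0.5, 1.0, 0.25};
    double intercept = 2.0;

    for (const auto& features : store.getBatchFeatures(batch)) {
        double prediction = intercept;
        for (std::size_t i = 0; i < features.size(); ++i) prediction += coefficients[i] * features[i];
        std::cout << "prediction: " << prediction << "\n";
    }
}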
Below are examples of standard design patterns applied to machine learning problems. These
implement the command pattern, the observer pattern, and the strategy pattern.
CommandML.cpp,
ObserverML.cpp and
StrategyML.cpp.