Machine Learning Platform considerations
Core capabilities and features to achieve MLOps goals
In this article, I describe the core capabilities of Machine Learning platforms that support the end-to-end DS/ML lifecycle. Based on several years of experience developing platforms and tracking their benefits, I also share insights into which features provide greater value than others.
There’s been a lot of emphasis on MLOps in recent years, and rightly so. As AI/ML grew popular, many companies invested significantly in hiring data scientists and implementing models to increase revenue or reduce costs. But they struggled to realize the benefits because productionizing models took much longer than expected. MLOps improves collaboration and model-deployment processes, making it easier to deploy and monitor models and to realize value faster.
MLOps was conceived as a parallel to the DevOps process used in software development; both focus on reducing the time it takes to realize value and to test and learn. However, if you look at the overall Data Science lifecycle below, MLOps supports only a portion of the end-to-end Data Science process, just as DevOps covers only part of the software development lifecycle. Unlike software development, though, the Data Science lifecycle also depends on the availability of different types and sources of data to explore and experiment with. In addition, there’s a wide variety of algorithms, libraries, and hardware to support. Without an automated way to meet these needs, you are not going to realize the goals of MLOps.
Source: https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle
What you need is a platform that also automates the processes used for gathering and transforming data and for modeling. The core capabilities of such a platform can be organized into the following four major components.
1. Data/Feature Engineering
Data/features are at the heart of implementing ML Models. Key features of this capability include:
Engineering Pipelines
Support for creation of pipelines for data/feature engineering: Ability to quickly create pipelines to read data from various sources and join, transform, aggregate, and engineer features.
Configuring and Enabling automated execution/orchestration of pipelines: Once data/feature engineering pipelines are created, in most cases they need to be executed continuously (for streaming/real-time data) or at scheduled times (for batch data). In addition, there will often be dependencies (e.g., features that depend on other features or on model outputs). This feature supports configuring and automatically orchestrating pipeline execution while honoring those dependencies, as the sketch below illustrates.
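Here is a minimal sketch in plain Python/pandas of what such a pipeline might look like; the sources, paths, and feature logic are all hypothetical, and in practice an orchestrator (e.g., Airflow or Dagster) would schedule the steps and enforce the dependency ordering that is explicit in this code:

```python
import pandas as pd

# Hypothetical sources; in practice these would be warehouse tables or streams.
def load_transactions() -> pd.DataFrame:
    return pd.read_parquet("s3://bucket/transactions/")  # assumed path

def load_customers() -> pd.DataFrame:
    return pd.read_parquet("s3://bucket/customers/")  # assumed path

def build_spend_features(txns: pd.DataFrame) -> pd.DataFrame:
    # Aggregate raw transactions into per-customer spend features.
    return (txns.groupby("customer_id")
                .agg(total_spend=("amount", "sum"),
                     txn_count=("amount", "count"))
                .reset_index())

def build_customer_features(customers: pd.DataFrame,
                            spend: pd.DataFrame) -> pd.DataFrame:
    # Depends on the output of build_spend_features; an orchestrator
    # must run that step first.
    features = customers.merge(spend, on="customer_id", how="left")
    features["avg_txn_value"] = features["total_spend"] / features["txn_count"]
    return features

def run_pipeline() -> pd.DataFrame:
    spend = build_spend_features(load_transactions())
    return build_customer_features(load_customers(), spend)
```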
Feature Store
A Feature Store is a common data repository that houses and serves features for various ML models, during both the training and execution phases. Feature stores have gained a lot of traction in recent years, and many open-source and commercial feature store products are now available in the market. featurestores.org has good information on what they are and the options available.
Some of the key benefits of a feature store include:
- Saving time in sourcing features needed by a particular model
- Avoiding duplicative work in generating the same features for different use cases
- Having a central source for metadata and lineage information about the features
- Another key benefit, one that is often overlooked: a feature store enables domain SMEs to identify and add features and metadata relevant to the business domain, freeing data scientists (who might lack that domain knowledge) to focus on building and testing models
It also aligns very well with the concept of Data-centric AI (https://datacentricai.org/), which postulates that, given how far ML algorithms have advanced, you get better performance by focusing on improving the data rather than the model. Despite the recent dramatic advances in foundation models and Large Language Models (LLMs), this still holds for the overwhelming majority of companies.
“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” — Andrew Ng, CEO and Founder of LandingAI
A feature store can be an important capability for supporting a data-centric AI approach. In addition to ensuring the availability of features relevant to the problem domain (features that data scientists working on specific use cases might not be aware of), you could also add metadata about balance/bias in feature groups. You could go a step further and address data shortcomings via data augmentation in the feature store.
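To make this concrete, here is a minimal sketch using Feast, one of the open-source feature store options; the repository layout, the customer_features feature view, and the customer_id entity are all assumed for illustration:

```python
from datetime import datetime
import pandas as pd
from feast import FeatureStore

# Assumes a Feast repo defining a "customer_features" feature view
# keyed on a "customer_id" entity; all names here are illustrative.
store = FeatureStore(repo_path=".")

# Training: point-in-time-correct historical features for labeled entities.
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": [datetime(2023, 6, 1), datetime(2023, 6, 1)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["customer_features:total_spend",
              "customer_features:txn_count"],
).to_df()

# Serving: the same features fetched from the online store at low latency.
online = store.get_online_features(
    features=["customer_features:total_spend",
              "customer_features:txn_count"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```

The key point is that training and serving read from the same feature definitions, which eliminates a whole class of training/serving skew.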
Based on my experience, investments in a feature store have the potential to provide the most return on investment.
Data Lineage
Data lineage is another key feature, related to data pipelines and feature stores. It refers to the ability to trace the origin of the data used to train and test models, including the source of the data; the preprocessing, cleaning, and feature engineering performed; and the specific version of the data used. Data lineage is also important for ensuring that the correct data is used for model predictions.
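A lineage record does not have to start out sophisticated. Here is a minimal, hand-rolled sketch (all names are illustrative) that captures the source, the transformations applied, and a content fingerprint identifying the exact version of the data; dedicated lineage tools capture the same information with much richer tracing:

```python
import hashlib
import json
from datetime import datetime, timezone

import pandas as pd

def dataset_fingerprint(df: pd.DataFrame) -> str:
    # Content hash so the exact data version can be identified later.
    return hashlib.sha256(
        pd.util.hash_pandas_object(df, index=True).values.tobytes()
    ).hexdigest()

def record_lineage(df: pd.DataFrame, source: str, transforms: list,
                   path: str = "lineage.json") -> dict:
    record = {
        "source": source,                       # where the data came from
        "transforms": transforms,               # processing applied, in order
        "fingerprint": dataset_fingerprint(df), # specific data version
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record
```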
2. ML Workbench
This is the core component used by data scientists to perform their day-to-day activities. Key features include:
Support for different compute engines/clusters
Different use cases need data of different sizes (for feature engineering and training) and different compute architectures (e.g., CPU- or GPU-based) for different modeling algorithms. Correspondingly, there’s a need to provision clusters of different sizes and types (e.g., CPUs vs. GPUs, number of CPUs/GPUs, memory-, storage-, or processing-optimized) on demand within a few minutes.
Ideally, you want to provide recommendations or create policies on when to use which configuration so that clusters are not over- or under-sized. You might also want to limit the options available in order to manage costs and enable long-term support.
Support for different libraries, algorithms, and frameworks
Similar to the ability to choose compute engines, this feature enables data scientists to use any of the popular ML libraries (e.g., scikit-learn, PyTorch, TensorFlow). This means that agreed-upon versions of the popular libraries are preloaded for data scientists to use and are supported by the platform team.
Again, it’s highly beneficial to have recommendations and policies around which versions of libraries should be used, along with processes to upgrade to newer versions.
Collaboration, Standardization and Reproducibility
This includes adopting standard software engineering practices, such as integrating notebooks/code with version control tools (e.g., GitHub) to enable collaboration.
One considerable challenge with scaling Data Science at organizations is establishing standards for how models are developed, similar to having developers follow specific frameworks and standards. A best practice to address this is to create templates with examples of the practices and standards the team decides to follow (e.g., modular, unit-tested code for the different steps of the ML pipeline: feature engineering, transformation, training, etc.). This has been invaluable for 1) enabling collaboration between data scientists by making each other’s work easier to understand, 2) onboarding new data scientists, and 3) drastically reducing the time to deploy models to production. This is another investment that provides inordinate returns.
Additionally, the workbench should enable multiple data scientists to collaborate on projects and reproduce results by reviewing the details of experiments (parameters, accuracy metrics, tags, graphs, etc.), including the data and code used to run each experiment.
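As a sketch of what this looks like in practice, here is a minimal experiment-tracking example using MLflow, one common choice; the experiment name, tag value, and model are illustrative:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-model")  # illustrative experiment name
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    # Log everything a colleague needs to reproduce and compare this run.
    mlflow.log_params(params)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.set_tag("git_commit", "abc123")  # hypothetical; ties the run to code
    mlflow.sklearn.log_model(model, "model")
```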
Fairness/Bias Detection
The workbench should also include tools and algorithms to easily evaluate developed models against the organization’s standards for fairness and bias.
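For example, here is a minimal sketch using the open-source fairlearn library to compute one common fairness metric; the data, the sensitive attribute, and the 0.1 threshold are all hypothetical, and real standards should be set per organization and use case:

```python
import numpy as np
from fairlearn.metrics import demographic_parity_difference

# Toy example: predictions plus a sensitive attribute (groups A and B).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# Difference in selection rates between groups; 0 means parity.
dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=group)

if dpd > 0.1:  # hypothetical organizational threshold
    print(f"Model fails the fairness gate: demographic parity diff = {dpd:.2f}")
```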
3. Model Deployment
This is where the scope of MLOps traditionally starts. It begins with ML Engineers reviewing the notebooks/code created by data scientists and helping standardize them: organizing the code into pipelines, refactoring to follow coding best practices, etc. Key features include:
Continuous Integration (CI)
Enabling automated execution of unit and integration tests for the different components of the ML pipeline. This helps ensure that changes to individual components can be tested independently.
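For instance, a unit test for the hypothetical build_spend_features step from the pipeline sketch earlier might look like this (assuming it lives in a features module); CI runs such tests automatically on every commit:

```python
# test_features.py: run with `pytest` as part of the CI pipeline.
import pandas as pd

from features import build_spend_features  # hypothetical module under test

def test_spend_features_aggregates_per_customer():
    txns = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "amount": [10.0, 5.0, 7.0],
    })
    result = build_spend_features(txns)

    assert set(result["customer_id"]) == {1, 2}
    assert result.loc[result.customer_id == 1, "total_spend"].item() == 15.0
    assert result.loc[result.customer_id == 2, "txn_count"].item() == 1
```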
Continuous Deployment (CD) pipeline
Enabling the creation and execution of a deployment pipeline, including steps to get approval. The pipeline should support deploying changes to the model or the model-development pipeline, as well as changes to the code that uses/exposes the model (e.g., via REST endpoints or batch execution), either independently or together as needed.
Automated re-training pipeline
Support for automated retraining of models by stitching together and orchestrating modular ML pipeline components.
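Here is a minimal sketch of such a retraining step, with a quality gate before the new model is promoted; the label column and accuracy floor are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def retrain(features_df, label_col="churned", accuracy_floor=0.80):
    # Each step here is a modular, testable function that a scheduler or
    # a drift alert can invoke; names and threshold are hypothetical.
    X = features_df.drop(columns=[label_col])
    y = features_df[label_col]
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    accuracy = accuracy_score(y_val, model.predict(X_val))

    # Gate promotion on a quality bar before registering/deploying.
    if accuracy < accuracy_floor:
        raise RuntimeError(f"Retrained model below floor: {accuracy:.2f}")
    return model, accuracy
```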
Support for different modes of deployment
The platform should be capable of deploying models in batch, streaming, or real-time mode depending on the needs of specific use cases.
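As an example of the real-time mode, here is a minimal sketch that exposes a model behind a REST endpoint using FastAPI; the model artifact, feature names, and endpoint path are assumptions. Batch mode would instead score a whole table on a schedule, and streaming mode would consume events from a message queue.

```python
# Run with: uvicorn serve:app --port 8000
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # assumed artifact from training

class Features(BaseModel):
    total_spend: float
    txn_count: int

@app.post("/predict")
def predict(features: Features):
    # Score a single request synchronously (real-time mode).
    score = model.predict_proba([[features.total_spend,
                                  features.txn_count]])[0][1]
    return {"churn_probability": float(score)}
```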
Versioning
Support for versioning of models, including deploying and executing multiple model versions concurrently.
Deployment Strategies
Support for some or all of the common deployment strategies: Blue/Green, Canary, A/B Testing, and Shadow.
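To illustrate the idea, here is a stripped-down sketch of canary and shadow routing in application code; in practice this is usually handled by the serving infrastructure (load balancer, service mesh) rather than hand-rolled:

```python
import random

def route_request(payload, stable_model, canary_model, canary_share=0.05):
    # Canary: send a small, adjustable slice of live traffic to the new
    # model while the rest stays on the stable version.
    if random.random() < canary_share:
        return "canary", canary_model.predict(payload)
    return "stable", stable_model.predict(payload)

def shadow_request(payload, stable_model, shadow_model, log):
    # Shadow: always answer from the stable model, but also score with
    # the new model and log its output for offline comparison.
    log.append(("shadow", shadow_model.predict(payload)))
    return stable_model.predict(payload)
```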
4. Monitoring/Observability
Key features of this capability include:
Execution Monitoring
Continuous monitoring of data/feature engineering pipelines, model execution pipelines, and REST endpoints for availability, throughput, and responsiveness.
Model Performance Monitoring
This includes monitoring and alerting for model performance degradation resulting from data or concept drift. It could also include fairness/bias monitoring.
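Here is a minimal sketch of one common drift check: a two-sample Kolmogorov-Smirnov test per feature using SciPy. The significance level and data are illustrative; production systems typically combine several such statistics with automated alerting:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(training_values, live_values, alpha=0.01):
    # Two-sample KS test: flags a feature whose live distribution has
    # shifted away from the distribution seen at training time.
    statistic, p_value = ks_2samp(training_values, live_values)
    return p_value < alpha, statistic

# Toy example: live data drawn from a shifted distribution.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
live = rng.normal(0.5, 1.0, size=5000)

drifted, stat = check_feature_drift(train, live)
if drifted:
    print(f"Drift alert: KS statistic = {stat:.3f}")
```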
Security/Vulnerability Monitoring
Automated execution of security checks and generation of alerts to identify vulnerabilities in code and infrastructure. This is very important given the wide variety of third-party/open-source code in use, as well as the rapid pace at which algorithms, tools, and services evolve and new ones are explored.
Ideally, this should be handled by the infrastructure team as part of overall security.
Cost Monitoring
Includes automated monthly reporting and alerting on costs (including breakdowns by project, resource, etc.) as well as actual vs. planned spend. Advanced features can include generating recommendations to reduce costs.
This is especially important if your platform runs in the cloud. Without proper monitoring and management of costs, you’ll be in for rude surprises.
Model Registry
A model registry is a central repository that tracks models developed and deployed across the organization. The information tracked includes the model’s stage (e.g., dev, test, prod), how it was trained (data, algorithm, etc.), the features needed for predictions, performance metrics, and the uses/users of the model.
A model registry can be very useful for reuse, avoiding duplication, and collaborating on similar problems. For organization leaders, it provides a bird’s-eye view of the overall landscape, including collaboration/reuse, time to value, and overall value realized/return on investment (if the registry also tracks benefits).
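For illustration, here is a minimal sketch against MLflow’s model registry; the model name is illustrative and the run id is a placeholder for a real tracked run:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register a model logged during training; "<run_id>" is a placeholder.
mlflow.register_model("runs:/<run_id>/model", "churn-model")

# Anyone in the organization can now discover models, versions, and stages.
client = MlflowClient()
for m in client.search_registered_models():
    print(m.name, [(v.version, v.current_stage) for v in m.latest_versions])
```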
Conclusion
I hope this article helps you think about MLOps more holistically, not just as something that starts once models are developed and ready to be deployed. I also hope it helps you think about the ML platform capabilities and features you should be considering and evaluating, whether you have an existing platform you’re looking to enhance or are embarking on a journey to build or buy one. You will find it very advantageous to start by thinking about the end-to-end experience you want to enable for your data scientists and ML Engineers, and how to minimize friction, hand-offs, and time to value.