LLMOps: Bridging the Gap from PoCs to Production
The past year has been all about building and demonstrating fascinating LLM-based PoCs to leaders. As these PoCs get approved as use cases, many teams are facing the reality that implementing production-grade LLM solutions is much harder than building and demonstrating impressive-looking PoCs. One key reason is the lack of established standard processes, tooling, and automated pipelines for developing, evaluating, experimenting with, and monitoring solutions for the wide variety of use cases enabled by LLMs; in other words, an LLMOps strategy and solution. In this article, I go over the key considerations and decisions teams and organizations need to make to devise an effective LLMOps strategy and implementation. The LLMOps strategy should include support for:
- Developing LLM-based solutions
- Evaluating the solutions
- Experimentation and Collaboration
- Continuous Integration/Continuous Delivery
- Monitoring solutions in production
1. Developing LLM-based solutions
Choosing the right LLM framework(s) is the first step toward efficient, standardized development of LLM-based solutions. Aligning on a single framework flattens the learning curve and simplifies support, but it's fine to have more than one as long as the reasons and implications are thought through. So, how do you choose the LLM framework(s)? One approach is to identify the LLM-based use cases you plan to implement over the next six months to a year, map them to corresponding solution patterns (or LangChain's "cognitive architectures"), and then decide on one or more frameworks that are good at supporting those solution patterns/cognitive architectures. Listed below is one way to categorize the solution patterns and some of the frameworks that support them.
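To make this concrete, here is a minimal sketch of one such pattern (a "retrieve then generate" RAG step) wired up in LangChain; the package layout and class names reflect recent LangChain releases and may differ across versions, and the model choice is only an example.

```python
# Minimal RAG-style "retrieve then generate" pattern expressed in LangChain's
# LCEL; a real solution would plug an actual retriever in front of this chain.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n"
    "Context: {context}\n"
    "Question: {question}"
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# prompt -> LLM -> plain-string output
chain = prompt | llm | StrOutputParser()

print(chain.invoke({
    "context": "Refunds are accepted within 30 days of purchase.",
    "question": "How long do customers have to request a refund?",
}))
```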
In addition, with so many LLMs on the market and the rapid clip at which new models and versions are introduced, it's extremely helpful to maintain a curated list of approved LLMs based on enterprise constraints and strategy (e.g., certain LLM APIs are only available on a specific cloud provider), along with cost information and guidelines on the suitability of different LLMs for different use cases. However, the decision on the specific LLM within the approved list should be driven by the use case's needs in terms of accuracy, cost, and latency. The most common recommendation is to start with the best-performing LLMs (currently Claude 3 Opus or GPT-4o) to validate the use case, and then choose the LLM for development and deployment based on the factors above. With most LLM frameworks/products, it's fairly easy to swap the underlying LLM.
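As one illustration of how a curated list could be operationalized, here is a minimal sketch that keeps the approved models in a small registry so swapping the underlying LLM is a configuration change rather than a code change; the registry structure, tier names, and model identifiers are illustrative assumptions, not a prescribed standard.

```python
# A small registry of enterprise-approved LLMs; the model a use case calls is
# selected by tier, making it easy to swap models as needs or approvals change.
from openai import OpenAI
from anthropic import Anthropic

APPROVED_LLMS = {
    # tier -> (provider, model); tiers and model names are illustrative
    "validate": ("anthropic", "claude-3-opus-20240229"),  # best quality, highest cost
    "default": ("openai", "gpt-4o"),                      # balanced quality/cost/latency
    "cheap": ("openai", "gpt-3.5-turbo"),                 # lowest cost/latency
}

def generate(tier: str, user_prompt: str) -> str:
    """Route the request to the approved model configured for this tier."""
    provider, model = APPROVED_LLMS[tier]
    if provider == "openai":
        resp = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_prompt}],
        )
        return resp.choices[0].message.content
    resp = Anthropic().messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return resp.content[0].text

print(generate("cheap", "Summarize our refund policy in one sentence."))
```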
2. Evaluating the solution
This is perhaps the biggest impediment to going from PoC to production. It's easy to showcase realistic-looking results in PoCs, but you need hard metrics showing acceptable performance before you can be comfortable deploying a solution to production. Compared to traditional ML, evaluating LLM solutions is more complex because the use cases are more diverse and the outputs are generally non-numeric. Correspondingly, there is a wide range of evaluation mechanisms, metrics, and tooling/products available. Jane Huang et al. have a great article on the challenges and best practices for evaluating LLM-based solutions.
A key part of the LLMOps strategy is deciding on the evaluation metrics (both Responsible AI and use-case-category-specific), datasets, and tools/products to use for evaluation.
a. Responsible AI categories and metrics
Determine which Responsible AI categories you want to focus on based on your use cases. Some common categories are harmful content (hate, sexual, violence, etc.), regulation (copyright, privacy and security, etc.), and hallucination (non-factual content).
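As one concrete example of screening for harmful content, here is a minimal sketch using OpenAI's moderation endpoint; how the flagged categories map to your Responsible AI policy is an organizational decision, and other providers offer comparable safety classifiers.

```python
# Screen a candidate model output for harmful-content categories before it is
# returned to the user, using OpenAI's moderation endpoint.
from openai import OpenAI

client = OpenAI()

def screen_output(text: str) -> dict:
    result = client.moderations.create(input=text).results[0]
    # result.categories includes hate, sexual, violence, self-harm, etc.
    return {"flagged": result.flagged, "categories": result.categories.model_dump()}

print(screen_output("Draft response generated by the LLM solution."))
```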
b. Use case specific metrics
Evaluation metrics for LLM solutions vary widely based on the use case. Researching the pros and cons of different metrics and providing guidance on metrics to use for evaluating different use case categories will help standardize the evaluation process and also speed up use case implementations.
Listed below are some of the most common use case categories and commonly used metrics.
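As a concrete example of a use-case-specific metric, here is a minimal sketch that computes ROUGE (commonly used for summarization) with Hugging Face's evaluate library; the predictions and references are placeholders, and other use case categories would swap in metrics such as BLEU, pass@k, or RAG-specific measures.

```python
# Compute ROUGE for a summarization use case with Hugging Face's `evaluate`
# library; predictions and references below are placeholders.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The quarterly revenue grew 10% driven by cloud sales."],
    references=["Revenue rose 10% in the quarter, led by growth in cloud sales."],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```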
c. Deciding the product/tool for LLM solution evaluation
There are so many open-source and commercial tools that selecting the right one is a big challenge. Most of these tools support multiple solution patterns and use case categories, with some focusing more heavily on a subset. For example, there are a lot of tools/products that focus heavily on evaluating and helping troubleshoot RAG solutions. Some of the popular ones are shared below; this is by no means a comprehensive list.
This is perhaps one of the most active areas of research and innovation, ranging from research into the most suitable metrics for different use case categories to tools and products that make it simple to perform the evaluation. Note that a lot of evaluation and metrics focus on intermediate results as well (e.g., how relevant were the chunks retrieved in a RAG solution), which helps greatly in troubleshooting and improving the performance of the solution.
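To illustrate an intermediate metric, here is a minimal sketch that scores the relevance of each retrieved chunk to the question using an LLM as a judge; the judge prompt, the 0-to-1 scale, and the model choice are illustrative assumptions, and tools such as Ragas and DeepEval package similar checks as ready-made metrics.

```python
# Score how relevant each retrieved chunk is to the question with an LLM-as-a-
# judge; a low average score points to retrieval (not generation) problems.
from openai import OpenAI

client = OpenAI()

def chunk_relevance(question: str, chunk: str) -> float:
    judge_prompt = (
        "On a scale from 0 to 1, how relevant is the passage to the question? "
        "Reply with only the number.\n"
        f"Question: {question}\nPassage: {chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return float(resp.choices[0].message.content.strip())

retrieved = [
    "Refunds are accepted within 30 days of purchase.",
    "Our headquarters relocated to Austin in 2021.",
]
print([chunk_relevance("What is the refund window?", c) for c in retrieved])
```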
3. Experimentation and Collaboration
Just like in traditional ML, you want the ability to experiment, track parameters, hyperparameters, and performance metrics, and collaborate with your colleagues. For LLM solutions, some of the things you want to experiment with are different LLMs (e.g., how is accuracy impacted if you use a smaller Mistral model like Mixtral 8x7B Instruct instead of the much larger and costlier GPT-4?), different prompts (prompt engineering), and, for RAG-based use cases, different text-splitting strategies and chunk sizes.
Fortunately, most of the evaluation products mentioned above have the capability to track experiments and corresponding metrics and enable collaboration. Also, most of them offer both managed and self-hosted options.
While the innovation happening in the evaluation space is positive and very helpful, one unfortunate implication is that the metrics for most use case categories are not standardized: different products are coming up with different metrics (both final and intermediate) to differentiate themselves. Because of this, if you use different products for different use case categories, you can't track the metrics and experiments in one uniform location the way you could with MLflow for traditional ML use cases. A way around this is to feed the experiment info and the evaluation results from these tools into MLflow Tracking as custom metrics, as sketched below.
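A minimal sketch of that workaround, assuming the evaluation tool hands back a dictionary of scores, might look like this; the parameter and metric names are placeholders.

```python
# Log experiment parameters and evaluation results from any tool as custom
# metrics in MLflow Tracking so runs stay comparable in one place.
import mlflow

with mlflow.start_run(run_name="rag-mixtral-8x7b-chunk500"):
    mlflow.log_params({"llm": "mixtral-8x7b-instruct", "chunk_size": 500, "top_k": 4})
    # values below would come from the evaluation tool used for this use case
    mlflow.log_metrics({"faithfulness": 0.87, "answer_relevancy": 0.91, "latency_s": 2.3})
```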
4. Continuous Integration/Continuous Delivery (CI/CD)
Interestingly, the CI/CD for most LLM solutions looks more like the CI/CD for software applications than for typical ML solutions. This is because, in the vast majority of cases, you are calling an LLM API instead of training a model (unless the solution involves fine-tuning). Traditional software CI/CD tools like Jenkins, GitHub Actions, Bamboo, and CircleCI can be used, integrating with the evaluation tools above to run unit/functional tests.
One consideration for the CI portion is cost. In many cases, LLMs themselves are used to test and evaluate the solution's performance against test cases. Depending on the size of the test suite and the frequency of updates, cost can become a factor that needs to be optimized. One option is to run cheap, code-based tests (e.g., checking that a RAG solution declines to answer a question it has no data for, or that responses include expected keywords) on feature-branch commits, and execute the complete test suite only when merging to the main/release branch. DeepEval is a great option for executing unit/functional tests in your CI/CD cycle.
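For example, code-based checks for feature-branch commits could look like the following sketch; rag_answer and the expected phrases are hypothetical stand-ins for your solution's entry point and test data.

```python
# Cheap, deterministic pytest checks for feature-branch commits; no LLM-as-a-
# judge calls, so they add no evaluation cost per run.
from my_rag_app import rag_answer  # hypothetical entry point to the solution under test

def test_declines_when_no_data():
    answer = rag_answer("What is the CEO's home address?").lower()
    assert any(p in answer for p in ("don't have", "cannot find", "no information"))

def test_includes_expected_keywords():
    answer = rag_answer("How long is the refund window?")
    assert "30 days" in answer
```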
Another factor to consider is that neither the LLM-based solution nor the LLM-based evaluation is guaranteed to return exactly the same result on every run. There are ways to make them more deterministic (e.g., with OpenAI, using the same seed), but determinism is not guaranteed.
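For instance, with the OpenAI API you can pin a seed and set temperature to zero to make repeated runs more consistent, as in this minimal sketch; identical outputs are still not guaranteed.

```python
# Pin a seed and zero out temperature to make repeated OpenAI calls more
# consistent across CI runs; identical outputs are still not guaranteed.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    seed=42,  # same seed + same request parameters => more repeatable outputs
    messages=[{"role": "user", "content": "Name three LLM evaluation metrics."}],
)
print(resp.choices[0].message.content)
print(resp.system_fingerprint)  # a changed fingerprint can explain differing outputs
```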
Finally, this is a great place to incorporate security checks specific to LLM solutions. Similar to adding CodeQL scans to the CI/CD pipeline for software development, enterprise- and solution-specific security checks could be added as part of the CI/CD process.
5. Monitoring solutions in production
In addition to evaluating metrics to confirm adequate performance before signing off on a production deployment, it's also important to monitor performance in production. The considerations highlighted in "Evaluating the solution" above also apply here: the same metrics can be used, and often the same tools, since most of them also offer production monitoring capabilities.
In addition, it’s valuable to monitor and learn from user engagement metrics like number of users using the application, how often/how much they use it, etc. as well as direct feedback from users like Thumbs up/Thumbs down metrics and comments/suggestions.
Summary
Given the pace of advancements in LLMs, frameworks for building LLM applications, and tools/products for evaluation, it might feel like focusing on LLMOps is premature or will slow down value realization. On the contrary, thinking through the types of use cases in your pipeline, categorizing them, and working through the different aspects of LLMOps described above will help you craft a tailored LLMOps strategy, reducing time to value and increasing the throughput of use case implementations.
References:
- Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices, by Jane Huang (Data Science at Microsoft, March 2024), Medium
- OpenAI’s Bet on a Cognitive Architecture
- Ragas Introduction
- Traditional Versus Neural Metrics for Machine Translation Evaluation
- Out of the BLEU: how should we assess quality of the Code Generation models?
- LLM Evaluation Metrics: Everything You Need for LLM Evaluation
- LLM Testing in 2024: Top Methods and Strategies
- OWASP Top 10 for LLM Applications