Reduce LLM costs effectively
Practical Strategies for Cutting Costs While Maximizing Performance
AI engineers and startups often encounter an unexpected and daunting challenge: the high cost of running LLMs. Stories abound of startups receiving unexpectedly large bills due to unforeseen usage patterns or overly complex models, and realizing that their cost per user is too high to sustain a profitable business model.
Managing these costs effectively is crucial for building a viable AI product. In this post, I will discuss practical strategies to reduce LLM costs without compromising performance or user experience.
Strategies to Reduce LLM Costs
1. Select the Right Model for Each Task
Choosing the appropriate model for each specific task is essential. The cost differences between models can be dramatic, and leveraging these differences can lead to substantial savings.
LLM cost comparison
Here’s how you can strategically select and use the right models:
- Use High-Performance Models Sparingly: Start with a powerful model (like GPT-4 or GPT-4 Turbo) for initial development and data collection. These models offer superior performance and accuracy, making them ideal for gathering the high-quality data needed to train your application effectively. However, due to their high costs, it’s crucial to use these models only when absolutely necessary, such as when building your initial dataset or handling tasks that require complex reasoning or nuanced understanding.
- Switch to Smaller Models for Specific Tasks: Once you have sufficient data, fine-tune a smaller, cheaper model (such as Mistral 7B or LLaMA) for specific, repetitive tasks. Smaller models can perform nearly as well as larger ones for narrow domains like data extraction, sentiment analysis, or specific customer service inquiries. Fine-tuning involves training the model further on a particular dataset to enhance its performance on a specific task. By leveraging this approach, you can significantly cut costs while maintaining acceptable levels of accuracy and performance.
- For example, if your AI product needs to categorize customer emails or extract key details from standardized documents, a smaller model that has been fine-tuned for these tasks will suffice. This approach allows you to reserve the expensive, high-performance models for more complex or less predictable tasks, thereby optimizing overall costs.
- Implement a Model Cascade: A model cascade involves setting up a system where multiple models are used in sequence, starting with the cheapest and simplest model and escalating to more complex and expensive models only if necessary. For instance, a smaller model (like Mistral or LLaMA) can handle initial queries. If this model is uncertain or cannot confidently provide a satisfactory answer, the query is escalated to a more sophisticated model, such as GPT-4.
- This cascading strategy leverages the fact that the cost difference between models can be enormous, sometimes over 100 times. By using the cheaper models first, you ensure that the high-cost models are only utilized when absolutely necessary, reducing the overall expenses while maintaining a high level of accuracy and user satisfaction. Moreover, setting confidence thresholds for when to escalate can further fine-tune this process, balancing cost efficiency and performance.
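Below is a minimal sketch of such a two-tier cascade in Python, assuming an OpenAI-compatible client. The model names, the "reply ESCALATE if unsure" convention, and the prompt wording are illustrative assumptions rather than a prescribed implementation; a production system might instead use log-probabilities or a trained verifier to decide when to escalate.

```python
# Minimal two-tier model cascade: try a cheap model first and escalate to a
# larger model only when the cheap model signals low confidence.
# Model names and the escalation convention are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CHEAP_MODEL = "gpt-4o-mini"   # stand-in for any small, inexpensive model
EXPENSIVE_MODEL = "gpt-4o"    # stand-in for a large, high-accuracy model

CASCADE_SYSTEM_PROMPT = (
    "Answer the user's question. If you are not confident you can answer "
    "correctly, reply with exactly the single word ESCALATE."
)

def answer_with_cascade(question: str) -> str:
    # First pass: the cheap model handles the query.
    first = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[
            {"role": "system", "content": CASCADE_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    draft = first.choices[0].message.content.strip()

    # Escalate only when the cheap model declines to answer.
    if draft != "ESCALATE":
        return draft

    second = client.chat.completions.create(
        model=EXPENSIVE_MODEL,
        messages=[{"role": "user", "content": question}],
    )
    return second.choices[0].message.content

print(answer_with_cascade("What is the capital of France?"))
```

In practice, you would tune the escalation rule against a labeled sample of real queries so that the cheap model resolves as much traffic as possible without hurting answer quality.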
2. Optimize Token Usage
Every token (roughly a word or part of a word) processed by a model contributes to your costs. Therefore, minimizing the number of tokens used is a vital strategy for controlling expenses. Here’s how you can optimize token usage effectively:
- Pre-Process Inputs to Minimize Tokens: Before sending data to a large and expensive model, use smaller models or simpler algorithms to clean and summarize the input. For example, Microsoft’s LLMLingua method reduces token usage by stripping away unnecessary words and focusing on the core content that matters most to the query. By pre-processing data, you can cut down the tokens that the expensive model needs to process, potentially reducing token usage by a substantial factor.
- For instance, if your AI application needs to summarize long-form text, instead of directly sending the entire document to an LLM, use a smaller model to extract only the most relevant sentences or paragraphs. Then, send this pre-processed, condensed version to the LLM. This method reduces the token count significantly, saving costs while still achieving the desired output quality.
- Improve Memory Management for AI Agents: AI agents that engage in multi-turn conversations can accumulate a large amount of context over time. Many developers use a “conversation buffer memory” that stores the entire conversation history, which can lead to ballooning token usage as the conversation lengthens. A more cost-effective approach is to use “conversation summary memory,” where the conversation history is summarized periodically to keep the context manageable.
- For example, instead of storing every word of a customer service chat, the system can periodically summarize what has been discussed. This keeps the token count lower, reducing the cost for generating subsequent responses. Another alternative is the “summary buffer memory” technique, where the most recent part of the conversation is stored in detail, while older parts are summarized. This approach maintains essential context while minimizing token usage, striking a balance between memory and cost.
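As an illustration of the summary-buffer idea, here is a small Python sketch that keeps the most recent turns verbatim and folds older turns into a running summary produced by a cheap model. The model name, the six-turn window, and the summarization prompt are assumptions for illustration; LangChain's ConversationSummaryBufferMemory offers a ready-made version of the same pattern.

```python
# Sketch of "summary buffer memory": keep the most recent turns verbatim and
# fold older turns into a running summary produced by a cheap model.
# The model name and the six-turn window are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
SUMMARIZER_MODEL = "gpt-4o-mini"
RECENT_TURNS_TO_KEEP = 6

class SummaryBufferMemory:
    def __init__(self):
        self.summary = ""   # condensed description of older turns
        self.recent = []    # recent turns kept verbatim as {"role", "content"} dicts

    def add_turn(self, role: str, content: str) -> None:
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > RECENT_TURNS_TO_KEEP:
            # Fold the oldest turns into the summary instead of dropping them.
            overflow = self.recent[:-RECENT_TURNS_TO_KEEP]
            self.recent = self.recent[-RECENT_TURNS_TO_KEEP:]
            transcript = "\n".join(f"{t['role']}: {t['content']}" for t in overflow)
            response = client.chat.completions.create(
                model=SUMMARIZER_MODEL,
                messages=[{
                    "role": "user",
                    "content": (
                        "Update this running summary of a conversation.\n\n"
                        f"Current summary:\n{self.summary or '(empty)'}\n\n"
                        f"New turns to fold in:\n{transcript}"
                    ),
                }],
            )
            self.summary = response.choices[0].message.content

    def build_context(self) -> list[dict]:
        # Context sent to the main model: a short summary plus recent turns only.
        messages = []
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Conversation so far (summarized): {self.summary}",
            })
        return messages + self.recent
```

The token count sent to the expensive model now grows with the size of the summary plus a fixed window of recent turns, rather than with the full length of the conversation.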
3. Monitor and Analyze Costs Regularly
To manage costs effectively, continuous monitoring and analysis are crucial. By understanding where and how expenses are incurred, developers can identify opportunities for optimization. Several tools and platforms can help with this:
- Track Model Performance and Costs: Tools like LangSmith from LangChain offer detailed insights into the costs associated with each model call. They log every task-completion attempt, track how long it takes and how many tokens it consumes, and provide a breakdown of token usage per model. This data is invaluable for identifying which tasks or models are driving up costs.
- Identify Cost-Intensive Tasks: Regularly reviewing the logs and performance metrics allows you to pinpoint tasks that are unusually expensive or models that are not cost-effective. For example, you might find that a certain task, such as generating lengthy responses or processing large amounts of unstructured data, is consuming more tokens than anticipated. By understanding this, you can make informed decisions on whether to switch models, pre-process data differently, or change the way tasks are handled.
- Experiment with Optimization Strategies: Armed with data on model performance and costs, you can experiment with various optimization strategies. This might include swapping to cheaper models for certain tasks, implementing token-efficient pre-processing methods, or adjusting your model cascade thresholds. Continuous experimentation and iteration will help refine your approach, leading to more effective cost management over time.
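If you are not using a dedicated tracing tool, a lightweight cost logger is easy to roll yourself. The sketch below wraps each chat-completion call, reads token counts from the response's usage field, and appends the estimated cost to a CSV file; the per-1K-token prices are placeholders that you would replace with your provider's current rates.

```python
# Minimal per-call cost logger: record tokens and estimated cost for every
# model call so that expensive tasks can be identified later.
import csv
import time
from openai import OpenAI

client = OpenAI()

# (input, output) USD per 1K tokens; placeholder values, substitute your
# provider's current pricing.
PRICES_PER_1K = {
    "gpt-4o": (0.0025, 0.010),
    "gpt-4o-mini": (0.00015, 0.0006),
}

def tracked_completion(model: str, messages: list, task: str,
                       log_path: str = "llm_costs.csv") -> str:
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.time() - start

    usage = response.usage
    in_price, out_price = PRICES_PER_1K[model]
    cost = (usage.prompt_tokens / 1000) * in_price \
         + (usage.completion_tokens / 1000) * out_price

    # Columns: task label, model, prompt tokens, completion tokens, cost, latency.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([task, model, usage.prompt_tokens,
                                usage.completion_tokens,
                                round(cost, 6), round(elapsed, 2)])
    return response.choices[0].message.content
```

Aggregating this log by task label quickly reveals which parts of your product account for most of the spend, which is the starting point for any of the optimizations above.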
Real-World Cost Optimization Techniques
To better understand how to reduce costs when using large language models (LLMs), let’s explore two practical approaches that many AI developers and companies have successfully implemented: Model Cascades and Routers, and Pre-Summarizing Input Data.
1. Model Cascades and Routers
The concept of model cascades and routers is built on the principle that different tasks require different levels of complexity, and not every query or task demands the most powerful model available. By strategically using a sequence of models with increasing levels of sophistication, companies can handle most queries with cheaper models and escalate to more expensive ones only when necessary. Here’s a closer look at how this works:
- How Model Cascades Work: Imagine a system designed to answer customer service queries. Instead of using a single, powerful model (like GPT-4) for every query, you first deploy a simpler, cheaper model (such as GPT-3.5 Turbo or Mistral 7B) to handle the initial response. If this model can provide a confident answer based on the query’s complexity and context, the system stops there, resulting in a low-cost interaction.
- However, if the initial model is unsure or cannot provide an adequate response (e.g., it lacks context or encounters a more nuanced question), the system automatically escalates the query to a more advanced model, like GPT-4. This cascade approach ensures that high-cost models are used sparingly, only when absolutely necessary. The cost savings can be substantial, especially when many interactions can be resolved by the cheaper models; a routing sketch at the end of this subsection shows one way to implement this.
- Benefits of Model Cascades and Routers:
- Cost Efficiency: This approach leverages the massive cost differential between simpler models and more advanced ones. For example, running a single query on GPT-4 could cost up to 100 times more than running it on Mistral 7B. By handling most queries with a cheaper model, total operational costs are significantly reduced.
- Maintained Accuracy: Properly tuned, this system maintains a high level of accuracy for end-users. Simple questions are resolved quickly by the smaller models, while complex ones still benefit from the depth and sophistication of more advanced models when needed.
- Improved Performance and Response Time: Since simpler models typically require less computational power and time to produce results, the initial response time for many queries can be faster. This contributes to a better user experience, particularly in time-sensitive applications like customer service or real-time chatbots.
- Real-World Examples:
- HuggingGPT: The HuggingGPT research system demonstrated the idea of using a primary LLM as a controller that routes tasks to specialized models hosted on Hugging Face, each handling a specific subtask more efficiently. For example, a user request to analyze an image might first involve converting the image to text, then applying a sentiment analysis model, and finally summarizing the results. This approach optimizes the chain of model calls so that only the necessary models are engaged.
- Multi-Agent Systems in Research: Multi-agent systems (including work presented at venues such as NeurIPS) have been used to coordinate tasks among different AI agents, where less complex agents handle simpler tasks and escalate to more powerful agents only when required. This dynamic allocation of tasks distributes the computational load efficiently and minimizes overall costs.
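One way to implement a router, as opposed to the escalate-on-failure cascade shown earlier, is to let a cheap model classify each query first and then dispatch it to an appropriately sized model. The sketch below assumes an OpenAI-compatible client; the two-label scheme and the model names are illustrative assumptions.

```python
# Sketch of a router: a cheap classifier model labels each query as "simple"
# or "complex", and the query is then sent to a correspondingly sized model.
# Model names and the two-label scheme are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

ROUTES = {
    "simple": "gpt-4o-mini",   # cheap model for routine queries
    "complex": "gpt-4o",       # expensive model for nuanced queries
}

def classify_query(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify the following query as 'simple' (factual, routine, or "
                "short) or 'complex' (multi-step reasoning, ambiguous, or "
                f"open-ended). Reply with one word.\n\nQuery: {query}"
            ),
        }],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in ROUTES else "complex"   # default to the safe route

def route_and_answer(query: str) -> str:
    model = ROUTES[classify_query(query)]
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return answer.choices[0].message.content
```

The classification call itself costs a few tokens on the cheapest model, which is usually a small price to pay for keeping the bulk of traffic off the expensive model.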
2. Pre-Summarizing Input Data
Another highly effective strategy for cost reduction is to minimize the number of tokens that an LLM needs to process by pre-summarizing input data. This technique involves using smaller models or even simpler algorithms to clean, condense, and summarize the input before sending it to the more expensive, larger model for the final output.
- How Pre-Summarizing Input Works: Suppose your AI product involves processing long documents to answer user queries. Instead of directly sending the entire document to a costly LLM like GPT-4, a smaller, more efficient model (like GPT-3.5 Turbo or Mistral 7B) can first analyze the text to identify key points, summarize the content, and remove unnecessary information.
- This summarized version, now containing far fewer tokens, is then sent to the expensive model. The larger model processes a much smaller input, which drastically reduces token usage and thus the overall cost. This method is particularly useful when dealing with large datasets or when the input data includes a lot of noise (irrelevant information); a minimal sketch at the end of this subsection shows the two-step flow.
- Benefits of Pre-Summarizing Input Data:
- Reduced Token Usage: By condensing the input data before processing it with a larger model, you significantly lower the number of tokens being sent and received. This reduction directly translates to cost savings, as LLMs charge based on the total number of tokens processed.
- Improved Accuracy and Relevance: Pre-summarizing input data helps to eliminate irrelevant information that might confuse the model, leading to clearer, more accurate outputs. This is especially beneficial in applications like document summarization, customer support, or content generation, where precision is critical.
- Enhanced Processing Speed: Smaller models require less time and computational power to generate a summarized output, speeding up the overall response time of your application. This is particularly useful for real-time applications where latency can significantly impact user experience.
- Real-World Examples:
- Microsoft’s LLMLingua Approach: Microsoft developed an approach that uses smaller models to pre-process and compress inputs before sending them to the larger LLM. For example, in the context of summarizing lengthy meeting transcripts, a smaller model can identify and extract the most relevant sentences, condensing thousands of tokens into a much shorter version. This reduced input is then fed into a more powerful model for generating a polished summary or answering specific questions, leading to substantial cost savings and faster response times.
- Commercial AI Platforms Using Pre-Summarization: Many commercial AI platforms that offer document analysis, like contract review tools or content summarization services, employ a similar technique. They utilize a smaller model or a rule-based engine to filter out non-essential content and focus only on key sections, thereby reducing the token load sent to more sophisticated models. This process has been shown to save costs while maintaining high-quality outputs.
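A minimal version of this two-step flow looks like the following sketch: a small model extracts only the passages relevant to the user's question, and the large model answers from the condensed text. The model names and prompts are assumptions for illustration; a rule-based extractor or an embedding-based retriever could fill the same role as the small model.

```python
# Sketch of pre-summarizing input: a cheap model condenses a long document
# before the expensive model answers a question about it.
# Model names are illustrative; any small/large pair would work.
from openai import OpenAI

client = OpenAI()
SMALL_MODEL = "gpt-4o-mini"
LARGE_MODEL = "gpt-4o"

def condense(document: str, question: str) -> str:
    # Step 1: the cheap model extracts only the passages relevant to the question.
    response = client.chat.completions.create(
        model=SMALL_MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Extract only the sentences from the document that are relevant "
                f"to this question: {question}\n\nDocument:\n{document}"
            ),
        }],
    )
    return response.choices[0].message.content

def answer_from_document(document: str, question: str) -> str:
    condensed = condense(document, question)
    # Step 2: the expensive model sees the condensed text, not the full document.
    response = client.chat.completions.create(
        model=LARGE_MODEL,
        messages=[{
            "role": "user",
            "content": f"Context:\n{condensed}\n\nQuestion: {question}",
        }],
    )
    return response.choices[0].message.content
```

The savings depend on the compression ratio: if the condensed context is a tenth the size of the original document, the expensive model's input cost for that call drops by roughly the same factor.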
Final Words
By implementing these real-world strategies, such as model cascades and routers and pre-summarizing input data, developers and AI startups can effectively manage and reduce LLM costs while maintaining high performance and user satisfaction. These techniques illustrate that, with careful planning and optimization, it is possible to leverage the power of large language models without incurring prohibitive expenses.
Whether you’re developing an AI-driven customer service bot, a content generation tool, or any other application involving LLMs, consider how these strategies might be adapted to your specific needs. The key to sustainable AI development lies in continuous experimentation, monitoring, and optimization.