Optimizing LLM Apps
by Stephen M. Walker II, (Co-Founder / CEO)
Optimizing Generative Performance and User Retention
Large Language Models (LLMs) have revolutionized the field of natural language processing, enabling a broad range of applications, from sentiment analysis to customer support chatbots.
However, extracting the full potential of these powerful models requires a deep understanding of their inner workings and the application of optimization techniques.
In this guide, we will explore the world of optimizing LLM app features, covering methods such as prompt engineering, Retrieval Augmented Generation (RAG), fine-tuning, and iterative refinement to help you unleash the true capabilities of LLMs.

Key Takeaways
When optimizing LLM apps, there are four primary techniques, each playing a unique role in optimization:
- Prompt Engineering – Effectively aligning LLM responses to your expected output with prompts.
- Retrieval-Augmented Generation (RAG) – Retrieving on-demand, unique data or knowledge new to the LLM.
- Fine-tuning – Increase LLM's ability to follow instructions specific to your app or features.
- All of the above – The best AI Teams combine all of these techniques, iterating to increase outcomes.
Prompt engineering, including uncertainty disclosure, is the easiest and often most effective technique. Teams should exhaust all efforts with prompt engineering before electing to fine-tune a model. It is key to continually leverage user feedback and continual evalution of changes to make evidence-based iterations.
Prompt Engineering: Crafting Effective Prompts
Prompt engineering is employed to direct LLMs' conduct and segregate user input query from instructions in natural language processing. Creating precise prompts is a key aspect of enhancing large language models and boosting their utility and effectiveness.

Several techniques have been suggested, including chain-of-thought prompting, generated knowledge prompting, and uncertainty disclosure, which ensure responses that are accurate and tailored to individual requirements within the same context.
The upcoming sections will delve further into chain-of-thought prompting and the encouragement of uncertainty disclosure, emphasizing how these techniques can improve LLM performance by steering their behavior and producing more precise responses.
Chain-of-Thought Prompting
Chain-of-thought prompting allows LLMs to utilize multi-step reasoning processes for problem-solving, addressing user queries more effectively. Given a question, a chain-of-thought (CoT) prompt provides the answer through a sequence of intermediate natural language reasoning steps, guiding the generative AI model to break down multi-step problems into more manageable components.
let's think step-by-step through this:
How much money does Janet make each day by selling her ducks' eggs at the farmers' market, given that her ducks lay 16 eggs per day, she eats three for breakfast, uses four to bake muffins, and sells the remaining eggs at $2 each?
To find the final answer, a CoT prompt guides the generative AI model through a single path of thought, break down the problem step-by-step and playing to the strengths of the model's desire to generate the most probabilistic next token.
To calculate how much money Janet makes each day by selling her ducks' eggs at the farmers' market, we can follow these steps:
1. Determine the total number of eggs laid by the ducks each day.
2. Subtract the number of eggs Janet uses for herself (for breakfast and baking) from the total number of eggs.
3. Calculate the number of eggs available for sale after personal use.
4. Multiply the number of eggs available for sale by the selling price per egg to find the total daily earnings from selling the eggs.
Let's apply these steps to Janet's situation:
1. Janet's ducks lay a total of **16 eggs per day**.
2. Janet eats **3 eggs for breakfast** and uses **4 eggs to bake muffins**. That's 3 + 4 = **7 eggs used for personal consumption**.
3. Subtract the 7 personal eggs from the total laid eggs: 16 - 7 = **9 eggs available for sale**.
4. She sells the remaining eggs at **$2 each**, so her daily earnings from selling eggs are 9 eggs * $2/egg = **$18 per day**.
Therefore, Janet makes $18 each day by selling her ducks' eggs at the farmers' market.
This enhances the model's reasoning capability and improves its performance on tasks like arithmetic or commonsense reasoning.
Encouraging Uncertainty Disclosure
Disclosing uncertainty in LLMs' outputs assists users in comprehending the model's trust in its responses and recognizing potential inaccuracies, providing relevant information. Uncertainty disclosure enhances user interaction with language models by providing transparency and clarity, allowing users to make more informed decisions and take appropriate actions based on the model's output.
However, the implementation of uncertainty disclosure in language models is associated with several difficulties, such as users' incorrect utilization of uncertainty information, insufficient technical knowledge of uncertainty concepts, and biases in the model's training data.
Tackling these challenges is key to making sure that uncertainty disclosure effectively helps users understand the limitations and potential errors in the responses generated.
Retrieval Augmented Generation (RAG) for Contextual Responses
Retrieval Augmented Generation (RAG) is a method that combines the retrieval of pertinent documents with generative language models to generate contextually suitable responses.
RAG not only bolsters the performance of LLM-based applications by integrating external context but also builds user trust by grounding LLMs in verifiable information.
The upcoming sections will cover the implementation of RAG in LLM apps and explain how this strategy helps foster trust with users by providing verifiable facts, all while improving the model's performance.
Implementing RAG in LLM Apps
Incorporating RAG into LLM applications enables models to:
- Draw on external knowledge beyond their training data
- Enhance their capacity to supply precise and up-to-date information
- Improve user experiences and information accuracy
- Retrieve pertinent information from a knowledge base
- Generate more precise and pertinent responses
The implementation of RAG in LLM applications can yield a significant improvement in these areas.
The best way to implement RAG in LLM apps is to:
-
Integrate external knowledge sources, such as a knowledge base or a database of facts, into the model.
-
This allows the model to access external information and utilize it to generate more precise and up-to-date responses.
-
Ultimately, this improves the overall performance of LLM-based applications.
Building Trust with Verifiable Facts
Employing RAG in LLM applications facilitates trust through providing verifiable information, diminishing ambiguity, and minimizing the likelihood of incorrect predictions. Its implementation is relatively straightforward and cost-effective compared to retraining a model with supplementary datasets.
Within the RAG process, using external knowledge sources relevant to the specific use case or industry where the LLM will be used is vital. Confirming the accuracy and timeliness of these sources ensures the LLM's reliability, aiding in the establishment of user trust and enhancement of their experience with the model.

RAG evaluations can be efficiently performed using the RAGAS library. This library provides a comprehensive set of tools for assessing the performance of RAG implementations in LLM applications.
Fine-Tuning for Enhanced LLM Performance
Fine-tuning involves utilizing a pre-existing model and further training it on specific datasets or tasks to adjust its broad general knowledge to more specific purposes. It is important to consider specific needs, available computational resources, and targeted outcomes when determining a fine-tuning strategy.
The upcoming sections will cover the selection of the appropriate dataset and the tracking of fine-tuning progress, aiming to boost the performance of LLMs while maintaining their cost-effectiveness.
Selecting the Right Dataset
Selecting the right dataset for fine-tuning is essential for achieving optimal large language model (LLM) performance and ensuring the model meets specific operational needs. The effect of dataset size on the fine-tuning process of a language model is considerable, with larger datasets generally leading to better performance and higher accuracy.
However, it is essential to judiciously select a dataset that is both large and pertinent to the particular task. Ensuring the quality and relevance of the data is crucial for the fine-tuning process, as simply augmenting the size of the dataset does not always guarantee improved performance.
Monitoring Fine-Tuning Progress
Monitoring fine-tuning progress is vital for identifying potential issues and optimizing the model's performance on the target task or domain. Metrics to consider while optimizing a language model include:
- Perplexity
- Accuracy
- Precision
- Recall
- F1-score
- Any domain-specific metrics
These metrics help track the progress and performance of the embedding model during fine-tuning.
To identify potential issues while monitoring fine-tuning progress, it is recommended to:
- Establish clear performance metrics
- Periodically assess the behavior of the fine-tuned model
- Monitor training session parameters
- Monitor validation loss
- Monitor key indicators
Regular assessment and monitoring help fine-tune the model's performance and maximize its overall effectiveness.
Iterative Refinement: Continuous LLM Optimization
Iterative refinement in LLM optimization is a continuous process that involves assessing initial outputs, gathering user feedback, and adjusting parameters to enhance model performance. By utilizing iterative refinement, the model can learn from its own errors and make adjustments to generate better quality, more pertinent, and precise results, ultimately improving its performance on specific tasks.
The upcoming sections will cover the evaluation of initial outputs and the collection of user feedback, which are both vital elements of the iterative refinement process for continuous LLM optimization.
Assessing Initial Outputs
Evaluating initial LLM outputs is essential for recognizing areas that need to be improved and any potential biases that may be present in the model's responses. Metrics such as Word Error Rate (WER), Perplexity, and Truthfulness can be utilized to assess the initial outputs of a language model.
Baseline evaluation, which entails assessing the initial outputs of the LLM in terms of relevance, accuracy, and any potential inconsistencies or ambiguities, allows for the identification of areas where the model may require refinement or adjustment.
Gathering User Feedback
User feedback, including user query analysis, is essential for machine learning models, as it supplies valuable data that can be utilized to refine the performance and effectiveness of the models. Collecting and analyzing user feedback, including sentiment analysis, enables continuous improvement and enhancement of the model's outputs, resulting in improved predictions and outcomes.
Strategies for collecting user feedback on LLM performance include:
- Integrating feedback components into the AI app
- Tracking user interactions and engagement with LLM features
- Allowing users to rate the usefulness of responses
- Administering surveys, interviews, and user testing
- Creating a questionnaire with targeted questions
LLM Architectures: Understanding the Fundamentals
Understanding LLM architectures is essential for optimizing inference time and enhancing performance through precision and clarity. It is crucial to familiarize oneself with the fundamentals of LLMs and their architectures, as this knowledge enables developers to harness their full potential and optimize their applications.
The upcoming sections will cover optimizing inference time and boosting performance through precision and clarity, both of which are key to understanding LLM architectures and maximizing their potential.
Optimizing Inference Time
Reducing inference time is vital for efficient LLM app performance, as it decreases the computational resources needed for real-world prompts. Memory utilization in NLP models can significantly influence inference time, with efficient memory utilization allowing for more effective utilization of hardware resources, such as CPU and GPU, resulting in improved performance and decreased latency during inference.
Additionally, increasing the batch size can result in quicker inference time as multiple samples are processed in parallel, although there is a limitation to the batch size based on elements such as sequence length and available memory. It is essential to identify the ideal batch size that balances efficiency and resource restrictions.
Enhancing Performance Through Precision and Clarity
Focusing on precision and clarity in LLM outputs ensures effective communication between users and models, providing accurate and contextually appropriate information. Measures such as utilizing multiple language model instances to propose and debate responses or deploying multiple AI models to communicate and refine reasoning can be employed to increase the accuracy of LLM outputs.
Clarity in language model outputs is paramount for successful communication and comprehension. Clear outputs enable users to:
- Comprehend the information provided by the model accurately
- Eliminate any potential ambiguity and confusion
- Enhance the user experience
- Guarantee that the intended message is effectively conveyed
From OpenAI DevDay: A Survey of Techniques for Maximizing LLM Performance
At OpenAI's inaugural developer conference, a palpable sense of excitement filled the air as John Allard, an engineering lead at OpenAI, and Colin Jarvis, EMEA Solutions Architect, took the stage. Amidst the applause and the hum of anticipation, he shared a wealth of knowledge on enhancing large language model (LLM) performance through fine-tuning. This session promised to unravel the complexities and offer practical insights for developers eager to solve the problems that matter most to them.
In this breakout session from OpenAI DevDay, a comprehensive survey of techniques designed to unlock the full potential of Large Language Models (LLMs) was presented. The session, which garnered over 72,000 views on YouTube since its broadcast on November 13, 2023, explored strategies such as fine-tuning, Retrieval-Augmented Generation (RAG), and prompt engineering to maximize LLM performance. These techniques are crucial for developers and researchers aiming to optimize the performance of their LLMs and deliver exceptional results. The advice mirrors exactly what we advise our customers: starting with prompt engineering, moving to retrieval for better context, and bringing in fine-tuning to improve instruction following.
Here are the key takeaways from the session:
A Framework For Success
Colin's message was clear: there is no one-size-fits-all solution for optimization. Instead, he aimed to provide a framework to diagnose issues and select the appropriate tools for resolution. He emphasized the difficulty of optimization, citing the challenges in identifying and measuring problems and in choosing the correct approach to address them.

Optimization Challenges Optimization is difficult due to separating signal from noise, measuring performance abstractly, and choosing the right approach to solve identified problems.
Maximizing Performance The goal is to provide a mental model for optimization options, an appreciation of when to use specific techniques, and confidence in continuing optimization.
Evaluation and Baseline Establishing consistent evaluation methods to determine baselines is crucial before moving on to other optimization techniques.

Context vs. Action Problems Determining whether a problem requires more context (RAG) or consistent instruction following (fine-tuning) helps in choosing the right optimization path.
The Journey of Optimization
Non-linear Optimization Optimizing LLMs isn't linear; it involves context optimization (what the model needs to know) and LLM optimization (how the model needs to act). The typical optimization journey involves starting with prompt engineering, adding few-shot examples, connecting to a knowledge base with retreieval (RAG), fine-tuning, and iterating the process.
Prompt Engineering The starting point for optimization. It's quick for testing and learning but doesn't scale well. Prompt engineering is best for early testing and learning, and setting baselines, but not for introducing new information, replicating complex styles, or minimizing token usage.
Retrieval-Augmented Generation Retrieval-augmented generation (RAG) is introduced to provide more context when prompt engineering is insufficient.

Fine-tuning Fine-tuning is used when consistent instruction following is needed, even after RAG has been implemented.

OpenAI's Commitment To Their Customers
As the session concluded, the audience was left with a profound understanding of the typical optimization journey. From prompt engineering to incorporating few-shot examples, connecting to a knowledge base, fine-tuning, and iterative refinement, developers were equipped with a roadmap to enhance LLM performance.
With the insights and experiences shared by John and Colin, OpenAI's developer conference was not just a gathering but a beacon for those looking to push the boundaries of artificial intelligence. The knowledge imparted here promised to lead the way for more sophisticated and efficient AI solutions, unlocking the full potential of LLMs in the process.
Bringing this all together
Optimizing Large Language Models is a multifaceted process that involves prompt engineering, retrieval augmented generation, fine-tuning, and iterative refinement. By understanding LLM architectures and their fundamentals, developers can maximize LLM performance, enhance user experiences, and unlock the full potential of these powerful models. As the field of natural language processing continues to evolve, staying informed and adapting to new optimization techniques will be vital for ensuring that LLMs remain at the forefront of innovation and continue to deliver exceptional results.
Frequently Asked Questions
How can I improve my LLM performance?
To improve your LLM performance, utilize prompt engineering, implement RAG, and consider fine-tuning to maximize parameter efficiency. Adopt a structured optimization framework to identify and tackle issues, and iterate through prompt engineering, few-shot learning, retrieval, and fine-tuning based on consistent evaluation metrics for ongoing improvement.
How can I make my LLMs faster?
You can try three things for your app: try fewer input tokens, pick a faster model, or fine-tune the model for your use case. Generally, to make LLMs faster, try model pruning, quantification, model distillation, parallel processing, subword tokenization, optimized libraries, batch inference workloads, and adapters. These steps can help to reduce the size of the model, precision of numerical values, and optimize performance.
How much GPU memory (VRAM) is needed for LLM?
The amount of memory (VRAM) needed for LLMs can vary greatly depending on the size of the model and the complexity of the tasks it is performing. As a general rule, larger models and more complex tasks will require more VRAM. However, there are techniques to optimize VRAM usage, such as model quantization and pruning, which can help to reduce the VRAM requirements of your LLM.
What is the difference between fine-tuning and prompt engineering?
Fine-tuning is the process of adapting a pre-trained language model to a specific task or domain, whereas prompt engineering is the craft of crafting effective prompts to align the output of a language model with user objectives. Over time, you can fine-tune a model for the optimal performance of your best prompts and responses.
What is the purpose of iterative refinement in LLM optimization?
Iterative refinement in LLM optimization is a continual process that assesses initial outputs, gathers user feedback and adjusts parameters to refine the model's performance. It enables LLMs to learn from their mistakes and improve their results to be more precise and pertinent.