OpenAI DevDay 2024 | Balancing accuracy, latency, and cost at scale
Estimated read time: 1:20
Summary
The OpenAI DevDay 2024 session delved into the complexities of scaling AI applications efficiently while maintaining a balance between accuracy, latency, and cost. As the user base for an application grows, the strategies that work at a smaller scale often falter. OpenAI's AI Solutions Team leader Colin Jarvis and API platform team product lead Jeff Harris discussed the necessity of optimizing models for accuracy first, followed by cost and latency. The session highlighted methods for evaluating model performance, setting accuracy targets that align with business goals, and employing prompt engineering, retrieval-augmented generation (RAG), and fine-tuning to achieve desired outcomes. Additionally, the discussion covered techniques to enhance latency and reduce costs, offering insights on network optimization and the impact of choosing appropriate model sizes. The presentation emphasized that there is no one-size-fits-all solution, and encouraged developers to experiment with different strategies.
Highlights
Scaling sustainably requires critical decisions on LLMs, cost prediction, and response time management. 🚀
OpenAI's enhancements have drastically reduced token costs, opening new possibilities. 🔍
Different models bring unique advantages—use them wisely for accuracy and speed. 🤹‍♂️
Prompt engineering and RAG can significantly impact model performance and efficiency. ⚙️
Evaluate and iterate: Regular evaluations and accuracy targets ensure effective deployment. 🔄
BatchAPI provides a cost-effective way to manage non-urgent requests asynchronously. 📈
Key Takeaways
Scaling AI is an art: Balancing accuracy, latency, and cost is crucial for efficient AI applications. 🎨
Accurate first: Optimize for accuracy before tackling latency and cost. 🧠
Prompt and tune: Use prompt engineering, RAG, and fine-tuning to hit accuracy targets. 🔧
Mind the speed: Output latency often consumes the most time—think small models and concise outputs. 🏃‍♂️
Cost efficiency: Utilize prompt caching and BatchAPI for smarter budget use. 💰
Stay flexible: No single playbook exists—adapt strategies to your specific needs. 📚
Overview
The OpenAI DevDay discussion emphasized the vital need to scale AI applications thoughtfully by juggling three essential factors: accuracy, latency, and cost. As a developer sees their app's user base surge, what once worked can quickly unravel unless attention is paid to optimizing these areas. Leading the session, Colin Jarvis and Jeff Harris shared their insights into the often complex art of maintaining balance and achieving precision without breaking the bank or slowing down user interactions.
Focusing first on accuracy, the presentation advised using intelligent models to meet performance goals before shifting attention to latency and cost efficiency. Techniques like prompt engineering and retrieval-augmented generation were encouraged to push models toward business-specific accuracy targets. The speakers also walked through strategies developers can employ to overcome common obstacles, such as setting an ROI-tied accuracy target to help decide when things are 'good enough' to release.
Harris closed with insights on reducing costs and latency, revealing just how much output latency could affect overall performance. He underscored the advantage of employing strategies like prompt caching and OpenAI's BatchAPI, which can halve costs for certain tasks by processing them asynchronously. Furthermore, the talk highlighted how model size and context choice can influence latency, framing these optimizations as essential to creating faster, more affordable, and accurate AI experiences.
Chapters
00:00 - 00:30: Introduction and Scalability Challenges The chapter discusses the trajectory of an application's growth and the scalability challenges that come with a rapidly expanding user base. When an app's user base doubles or grows even more, the initial solutions that worked for a smaller user group may no longer suffice. This necessitates making pivotal decisions, particularly concerning the choice of large language models (LLMs), to maintain sustainable scaling and continued success.
00:30 - 18:00: Optimizing for Accuracy The chapter titled 'Optimizing for Accuracy' mainly discusses strategies for enhancing the accuracy of AI applications while managing costs and maintaining fast response times. Colin Jarvis, who leads the AI Solutions Team at OpenAI, alongside Jeff Harris, a product lead on the API platform team, explore various common pitfalls and techniques when scaling applications on their platform. They emphasize that there isn't a one-size-fits-all approach to optimization, but rather multiple techniques and trade-offs to consider.
18:00 - 32:00: Improving Latency and Cost The chapter "Improving Latency and Cost" focuses on providing approaches and best practices for optimization that can greatly benefit what you are building. It emphasizes the importance of optimizing applications as a core aspect of continuous improvement. They highlight the development and release of more intelligent and faster models, such as GPT-4o, which is twice as fast as its predecessor, 4 Turbo. The relentless push to reduce costs is also underscored. A chart illustrating progress since the release of text-davinci-003 is mentioned as an example of these efforts.
32:00 - 35:00: Conclusion and Final Thoughts The chapter discusses the significant advancements in AI language models, particularly the cost reductions and efficiency improvements. It compares the 2022-era text-davinci-003 model with the newer GPT-4o mini, highlighting a roughly 99% decrease in cost per token, and contrasts the $120 it once cost to generate a million tokens with the 32k version of GPT-4 against about 60 cents with GPT-4o mini, a roughly 200x improvement in about a year and a half.
OpenAI DevDay 2024 | Balancing accuracy, latency, and cost at scale Transcription
00:00 - 00:30 [ Applause ] -You built an app and its user base is expanding quickly. Doubling or more. And the growth isn't slowing down. The success is exciting, but the challenge is scaling sustainably. What worked for 1,000 users often doesn't extend to one million. You'll need to make critical decisions about which LLMs

00:30 - 01:00 to use, how to predict and manage costs, and how to keep response times fast. I'm Colin Jarvis, I lead the AI Solutions Team here at OpenAI. -And I'm Jeff Harris, I'm one of the product leads on the API platform team. Today we're going to cover common pitfalls and techniques for scaling your apps on our platform. And to start, I wish I could tell you there was one playbook that would just work and give you the perfect way to optimize your app. Of course there isn't. There are lots of techniques and many tradeoffs to be had.
01:00 - 01:30 But we're hoping that this talk gives you a set of approaches and best practices to consider, and that you walk away with some optimization ideas that end up working really well for what you're building. And the first thing I'll say, before we talk about how you should optimize, is that we think optimizing your apps is central to what we do. So we push out models that are more intelligent, we make models that are faster, GPT-4o is twice as fast as 4 Turbo, and we push relentlessly on cost. So, I love this chart. Since text-davinci-003,

01:30 - 02:00 which came out in 2022 and was a less capable model than 4o mini, our cost per token has decreased by about 99%. So that's a really good trend to be on the back of. Another example: the 32k version of GPT-4. If you wanted a million tokens from that model, that cost $120. Quite expensive. Today, if you do that with GPT-4o mini, that's 60 cents. So, a 200x improvement in just about a year and a half.
02:00 - 02:30 Now, these lower costs, combined with fine-tuned, smaller models like 4o mini, have unlocked a lot of new use cases. And we see this in our data. Since we released 4o mini in July, token consumption on our platform has more than doubled. So, a ton of new ways of using our API. But more models bring more questions. When to use 4o mini? When to use 4o? When is reasoning required? We've worked with lots of builders like you, across many different sizes and use cases.

02:30 - 03:00 And we've developed a pretty good mental model to help you work through those kinds of decisions. So Colin is going to talk about how to improve accuracy, the quality of the outputs of the model, and then I'm going to be back in just a few minutes to talk about how to optimize latency and cost. -Cool. Scaling AI apps involves balancing accuracy, latency, and cost in a way that maintains accuracy at the lowest cost and the fastest speed. It's important to keep that in mind as we go through, because clearly none of this is going to be a silver bullet,
03:00 - 03:30 we're just trying to share some of the tradeoffs that we've seen work well in this area. To accomplish this, we've worked out an approach that generalizes pretty well across most of our customers. That is: start by optimizing for accuracy. Use the most intelligent model you have until you hit your accuracy target. An accuracy target is typically something that has meaning in business terms, like "90% of our customer service tickets are going to be routed correctly on the first attempt." Once you've hit your accuracy target, you then optimize for latency and cost,

03:30 - 04:00 with the aim being to maintain that accuracy with the cheapest, fastest model possible. I'm going to begin with how to ensure your solution is going to be accurate in production. So, optimizing for accuracy is all about starting with an accuracy target that means something. First we need to build the right evaluations, so that we can measure whether or not our model is performing as expected. I know you folks know that, so we're going to quickly recap it and then share some best practices that we've seen from folks building in the real world.
04:00 - 04:30 Then we want to establish a minimum accuracy target to deliver ROI. This is an area that a lot of folks skip. What we find is that a lot of folks start arguing about what accuracy is good enough for production. It's an area where a lot of folks get stuck and can't actually take the final leap, and we want to share some ways we've seen customers communicate that to get past it. And lastly, once you have that accuracy target, you get to the actual optimization bit: choosing your prompt engineering, RAG, and fine-tuning techniques to actually reach that accuracy target.

04:30 - 05:00 So, the first step is to develop your baseline evals. It's been almost two years since ChatGPT came out, but a lot of people still skip this step. So I want to start here just to recap and make sure we're on the same page. I know all of you know what evals are, so I'll just recap the basics quickly. We encourage our customers to embrace eval-driven development. With an LLM, you can never consider a component built until you've got an eval to actually test whether it runs end-to-end and produces the results that you're intending.
05:00 - 05:30 Evals come in a lot of flavors, but the two ways we typically frame them are component evals and end-to-end evals. Component evals are simple evals that function like a unit test: usually deterministic, usually testing a single component to make sure it's working. End-to-end evals are more like your black-box tests, where you put in the input at the start of the process and then measure what comes out at the very end, and it might be a multi-step network or something like that working through the eval. It probably sounds like a lot of work, but we've found some effective ways of scaling these

05:30 - 06:00 that I want to share with you, which we've been working through with some of our customers. I want to go through an example for customer service, because this is a pretty common use case that we run into with our customers. It's reasonably complex: you've usually got a network of multiple assistants with instructions and tools that are trying to do things. So, a couple of examples of simple component evals: did this question get routed to the right intent? Simple true/false check. Did it call the right tools for this intent? And then, did the customer achieve their objective? They came in looking to get a return for $23. Did they actually get that?
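As a rough illustration of what those component evals can look like in practice, here's a minimal sketch in Python. The router, the intent names, and the test cases are hypothetical placeholders rather than anything from the talk; the point is just that each check is a small, deterministic pass/fail assertion you can rerun on every change.

```python
# Minimal sketch of a component eval for intent routing (all names are hypothetical).
# Each case is a deterministic true/false check, like a unit test.

EVAL_CASES = [
    {"message": "I'd like to return these shoes for $23", "expected_intent": "process_return"},
    {"message": "Upgrade me to plan B please", "expected_intent": "upgrade_plan"},
    {"message": "Why was I charged twice this month?", "expected_intent": "billing_question"},
]

def route_intent(message: str) -> str:
    # Placeholder heuristic; in a real system this would be your LLM-based router.
    text = message.lower()
    if "return" in text:
        return "process_return"
    if "plan" in text or "upgrade" in text:
        return "upgrade_plan"
    return "billing_question"

def run_component_evals() -> float:
    passed = 0
    for case in EVAL_CASES:
        predicted = route_intent(case["message"])
        if predicted == case["expected_intent"]:
            passed += 1
        else:
            print(f"FAIL: {case['message']!r} -> {predicted} (expected {case['expected_intent']})")
    accuracy = passed / len(EVAL_CASES)
    print(f"Routing accuracy: {accuracy:.0%}")
    return accuracy

if __name__ == "__main__":
    run_component_evals()
```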
06:00 - 06:30 Now, the best practice that we've found here is customers using LLMs to pretend to be customers and red-team their solution. Effectively, what we'll do is spin up a bunch of customer LLMs and then run those through our network, and those act as an automated test that tracks divergence: if we change a prompt over here, we want to check that everything is still being routed and completed correctly.
06:30 - 07:00 So, I'll just quickly show a network here. If this works... cool. I want to talk through this network and share how it works. There are a lot of details here, but the main thing to take away is that this is a fairly typical customer service network: you've got a bunch of intents, and each of those intents routes to an instruction and some tools. The way that we're approaching this with all customer service customers now is, we'll start off by mining historic conversations.

07:00 - 07:30 And for every one, we're going to set the objective the customer was trying to achieve. They were trying to get a refund for $12, they were trying to book a return, whatever it might be. Then we give those to an LLM, get it to pretend to be that customer and run through the network, and then we mark whether all these evals pass. So, in this case, this customer tried to purchase plan B, and they went through, got the upgrade-plan intent, and successfully did that. Then we run through another customer, and they fail to get routed correctly, so they fail.
07:30 - 08:00 And then we run through another one. They get routed correctly, but then they fail to follow the instructions. So, this won't be 100% accurate, but we've seen customers use this to scale their customer service networks, because they're able to test it: it almost acts like, every time you raise a PR, you rerun these automated tests and figure out whether your customer service network has regressed. The reason I wanted to share this, and it seems pretty basic, is that a lot of the time the biggest problem with customer service networks is divergence.

08:00 - 08:30 We change something way over here in the network, how does it affect the whole network? And the reason I'm sharing this particular approach: we did this with a customer where they started off with 50 routines being covered. We eventually scaled to over 400 using this method, and the roll-out of the new routines got faster and faster because this was our base. We ended up with over 1,000 evals that we ran each time we made a material change to the network.
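Here's a rough sketch, with hypothetical objectives and helper functions, of what this LLM-as-customer red-teaming loop can look like: an LLM plays the customer against your assistant network, and a simple check marks whether the objective was met. The model name, message structure, and `run_support_network` placeholder are illustrative only.

```python
# Sketch of red-teaming a support network with an LLM playing the customer.
# Objectives would be mined from historic conversations; here they're hard-coded examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

OBJECTIVES = [
    {"goal": "Get a refund of $12 for order #1001", "expected_intent": "refund"},
    {"goal": "Book a return for a pair of shoes", "expected_intent": "process_return"},
]

def run_support_network(message: str) -> dict:
    """Placeholder: call your real assistant network and return what it did."""
    return {"intent": "refund", "resolved": True}

def simulate_customer(goal: str) -> str:
    """Have an LLM write the opening message a customer with this goal would send."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are role-playing a customer contacting support. Write only the customer's first message."},
            {"role": "user", "content": f"Your goal: {goal}"},
        ],
    )
    return resp.choices[0].message.content

def red_team() -> None:
    for case in OBJECTIVES:
        message = simulate_customer(case["goal"])
        outcome = run_support_network(message)
        routed_ok = outcome["intent"] == case["expected_intent"]
        print(f"{case['goal']}: routed_ok={routed_ok}, resolved={outcome['resolved']}")

if __name__ == "__main__":
    red_team()
```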
08:30 - 09:00 So, once you have your evals, the next step is probably the biggest area of friction for customers deploying to production, which is deciding when it's good enough to go to production. How accurate is good enough? You're never going to get 100% with LLMs, so how much is good enough? I want to take an example from one of our customers. We had done two pilots for their customer service application, and accuracy was sitting at between 80 and 85%. The business wasn't comfortable shipping, so we needed to agree on a target to aim for. What we did was build a cost model to help them model out the different scenarios and decide where to set that accuracy target.
09:00 - 09:30 So, we took 100,000 cases and set a bunch of assumptions. The first one was that we save $20 for every case the AI successfully triages on the first attempt. For the ones that get escalated, we lose $40, because each of those needs a human to work on it for a number of minutes. And then, of those that get escalated, 5% of customers are going to get so annoyed that they churn, so we lose $1,000 for each of those customers. What we ended up with was a break-even point of about

09:30 - 10:00 81.5% accuracy to basically break even on the solution. Then we agreed with management: right, let's go for 90%, and that's a level we're actually comfortable shipping at. Now, the interesting addendum to this story is that people often have a higher expectation of accuracy for LLMs than they do for people, which is probably not a surprise to folks in the room. In this case, when we set this 90% accuracy marker and met it, they also wanted to do an A/B test against humans. So what we did was a parallel check of human agents.

10:00 - 10:30 For the fully human tickets, we took a few of these and tested how they performed, and they ended up at 66% accuracy. So, in the end, it was a fairly simple decision to ship at 90%.
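To make the arithmetic behind that break-even concrete, here's a small sketch using the assumptions stated in the talk ($20 saved per successful triage, $40 of human handling per escalation, 5% of escalated customers churning at $1,000 each). With these inputs the break-even lands at roughly 81-82%, close to the ~81.5% quoted on stage; the talk's full model may include assumptions not spelled out.

```python
# Sketch of the ROI model described in the talk (figures from the example).
CASES = 100_000
SAVE_PER_SUCCESS = 20.0        # $ saved when the AI triages correctly on the first attempt
COST_PER_ESCALATION = 40.0     # $ of human handling per escalated case
CHURN_RATE = 0.05              # fraction of escalated customers who churn
COST_PER_CHURN = 1_000.0       # $ lost per churned customer

def net_value(accuracy: float) -> float:
    successes = CASES * accuracy
    escalations = CASES * (1 - accuracy)
    return (successes * SAVE_PER_SUCCESS
            - escalations * COST_PER_ESCALATION
            - escalations * CHURN_RATE * COST_PER_CHURN)

# Break-even accuracy: solve 20*a = (40 + 0.05 * 1000) * (1 - a)
break_even = (COST_PER_ESCALATION + CHURN_RATE * COST_PER_CHURN) / (
    SAVE_PER_SUCCESS + COST_PER_ESCALATION + CHURN_RATE * COST_PER_CHURN
)

print(f"Break-even accuracy: {break_even:.1%}")          # ~81.8%
print(f"Net value at 90% accuracy: ${net_value(0.90):,.0f}")
```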
So, I think the key thing here is: once you've got your evals and your accuracy target, then you know how much to optimize. At this stage we've got the business on board, we've got evals that measure it, and we know that 90% is the target. So then we need to hill-climb against that 90% and actually get there. So here I want to revisit our optimization techniques.
10:30 - 11:00 This four-box model is a pretty common asset that we use within OpenAI. Typically you start in the bottom-left corner: prompt engineering, the best place to start. Begin with a prompt of explicit instructions, make an eval, figure out where your evals are failing, and then decide where to optimize from there. If your model is failing because it needs new context, information that it hasn't been trained on, then you use retrieval-augmented generation, or RAG.

11:00 - 11:30 If it's failing because it's following instructions inconsistently, or needs more examples of the task or style you're asking for, you need fine-tuning. We have a lot of other resources that go into more detail on this, so for now I just want to call out a couple of key principles that we've seen over the last, I'd say, six months, where the meta is changing slightly for each of these techniques. First one: prompt engineering. We get asked a lot, does long context replace RAG? And the answer is, it doesn't for now,
11:30 - 12:00 but I think it allows you to scale prompt engineering much more effectively. We see long context as a really good way of figuring out how far you can push the context window. You might not need as sophisticated a RAG application, for example, if you can stuff the prompt with 60k tokens rather than 5k tokens and still maintain performance against your evals. The other thing that we're seeing is automating prompt optimization. There's a concept called meta prompting, where you effectively make a wrapping prompt

12:00 - 12:30 that looks at your application. You give it the prompt and the eval results, and then it tries to iterate on your prompt for you. You basically do this almost-grid-search where you're running evals, it's iterating your prompt, then you're feeding the results back into this wrapping prompt and it's improving your prompt again. Everything we've talked about so far involves a lot of manual iteration, and I think the real difference with meta prompting is that people are starting to lean on models to actually accelerate that for you. One area where we're seeing it have a lot of effect is with o1.

12:30 - 13:00 One of the use cases that o1 is best at is actually meta prompting. We did it with a customer to generate customer service routines and then optimize them, and it turned probably a couple of weeks of work into about a day or two. So it's definitely something I encourage you to try, and we have some cookbooks coming as well to show you how to do that.
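A minimal sketch of that meta-prompting loop is below, assuming a hypothetical `run_evals` function that scores a candidate prompt and returns its failures; a wrapping prompt asks a stronger model to rewrite the task prompt based on those failures, and the best-scoring version is kept. The model choice and prompt wording are illustrative.

```python
# Sketch of a meta-prompting loop: a "wrapping" prompt iterates on your task prompt
# using eval results. `run_evals` is a placeholder for your own eval harness.
from openai import OpenAI

client = OpenAI()

def run_evals(prompt: str) -> tuple[float, list[str]]:
    """Placeholder: score `prompt` against your eval set, return (accuracy, failure notes)."""
    return 0.0, ["example failure: ticket about refunds routed to 'billing'"]

def improve_prompt(prompt: str, failures: list[str]) -> str:
    resp = client.chat.completions.create(
        model="o1-preview",  # illustrative: a stronger model doing the rewriting
        messages=[{
            "role": "user",
            "content": (
                "You are optimizing a prompt for a customer service router.\n"
                f"Current prompt:\n{prompt}\n\n"
                "Eval failures:\n- " + "\n- ".join(failures) +
                "\n\nRewrite the prompt to fix these failures. Return only the new prompt."
            ),
        }],
    )
    return resp.choices[0].message.content

def optimize(prompt: str, rounds: int = 3) -> str:
    best_prompt = prompt
    best_score, failures = run_evals(best_prompt)
    for _ in range(rounds):
        candidate = improve_prompt(best_prompt, failures)
        score, cand_failures = run_evals(candidate)
        if score > best_score:  # keep only improvements, measured by your evals
            best_prompt, best_score, failures = candidate, score, cand_failures
    return best_prompt
```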
So, once you move past prompt engineering, there's RAG. I know everybody in the room knows about RAG, so I just want to call out a couple of things which are probably most key,

13:00 - 13:30 that I've seen over the last few months, to making RAG applications work. The first is that not every problem needs a semantic search. Probably fairly obvious, but one of the most obvious and simple optimizations that people do is to put a classification step at the start that asks: based on what this customer asked, does this require a semantic search? Does it require a keyword search? Or does it require an analytical query to answer the question? And we're seeing a lot more folks choose databases where you can have vectors as well as keyword search, as well as write SQL, to answer certain questions.

13:30 - 14:00 It's a pretty simple optimization, but it's one you get a ton of mileage out of, and one that I'd suggest people try.
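As a rough illustration of that classification step, here's a sketch of a small router that decides between keyword, semantic, and analytical (SQL-style) retrieval before anything is fetched. The category names and the downstream search functions are hypothetical placeholders for your own backends.

```python
# Sketch: classify the query first, then pick the retrieval strategy.
# The three search functions are placeholders for your own retrieval backends.
from openai import OpenAI

client = OpenAI()

def keyword_search(question: str) -> str: ...     # e.g. BM25 / full-text index (placeholder)
def semantic_search(question: str) -> str: ...    # e.g. vector similarity search (placeholder)
def run_sql_tool(question: str) -> str: ...       # e.g. text-to-SQL over structured data (placeholder)

def classify_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the user's question for retrieval. "
                "Answer with exactly one word: keyword, semantic, or analytical."
            )},
            {"role": "user", "content": question},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in {"keyword", "semantic", "analytical"} else "semantic"

def answer(question: str) -> str:
    route = classify_query(question)
    if route == "keyword":
        return keyword_search(question)
    if route == "analytical":
        return run_sql_tool(question)
    return semantic_search(question)
```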
The other one is extending your evals to cover retrieval. Once you add retrieval you have a new axis on your evals, and RAG evaluation frameworks just formalize this. But the important thing is extending every eval example to show the context that was retrieved, and then highlighting: are we retrieving the right context? Could the model actually answer with the content that it was given?

14:00 - 14:30 Again, fairly straightforward, but again something that a lot of folks don't do when they're working with RAG applications. So, there's a couple of things there.
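A minimal sketch of what extending an eval record to cover retrieval might look like: each example stores the retrieved chunks alongside the question, and two separate checks grade retrieval and groundedness. The field names and the grading prompt are illustrative, not from the talk.

```python
# Sketch: extend each eval example with the retrieved context, then grade two things
# separately: (1) did we retrieve the right context, (2) is the answer grounded in it.
from openai import OpenAI

client = OpenAI()

example = {
    "question": "What is the refund window for annual plans?",
    "expected_answer": "30 days",
    "retrieved_chunks": ["Annual plans can be refunded within 30 days of purchase."],
}

def retrieval_hit(ex: dict) -> bool:
    # Crude check: does any retrieved chunk contain the expected answer?
    return any(ex["expected_answer"].lower() in c.lower() for c in ex["retrieved_chunks"])

def answer_grounded(ex: dict, model_answer: str) -> bool:
    # LLM-as-judge: could this answer be supported by the retrieved context alone?
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Context:\n" + "\n".join(ex["retrieved_chunks"]) +
                f"\n\nAnswer: {model_answer}\n\n"
                "Is the answer fully supported by the context? Reply yes or no."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```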
The last one is fine-tuning, and I don't want to labor fine-tuning, because I know you folks have heard a lot about distillation today. The main things with fine-tuning: start small and build up. You'd be surprised how few examples fine-tuning needs to perform well; typically 50-plus examples is enough to start with. Make a feedback loop, so you have a way to log positive and negative responses and then feed those back in.

14:30 - 15:00 Again, fairly straightforward, but a useful aspect here. And the last thing is considering distillation. Just like meta prompting, it's about having a way to scale up and let the system learn, feeding those positive and negative examples back in to retrain whatever model you've got. That is a key aspect of fine-tuning.
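As a rough sketch of that feedback loop, here's one way to turn logged, positively rated responses into a small fine-tuning dataset in the chat-format JSONL that OpenAI fine-tuning expects. The logging source, field names, and file name are hypothetical; the point is that a modest batch of curated examples (50+) is a reasonable place to start.

```python
# Sketch: turn logged positive interactions into a fine-tuning dataset (chat-format JSONL).
# `load_logged_interactions` is a placeholder for wherever you store user feedback.
import json

def load_logged_interactions():
    """Placeholder: return dicts like {"prompt": ..., "response": ..., "thumbs_up": bool}."""
    return []

def build_dataset(path: str = "finetune_train.jsonl", minimum: int = 50) -> int:
    examples = [
        {
            "messages": [
                {"role": "system", "content": "You are a customer support assistant."},
                {"role": "user", "content": item["prompt"]},
                {"role": "assistant", "content": item["response"]},
            ]
        }
        for item in load_logged_interactions()
        if item.get("thumbs_up")  # keep only positively rated responses
    ]
    if len(examples) < minimum:
        print(f"Only {len(examples)} examples; aim for at least {minimum} before fine-tuning.")
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return len(examples)
```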
So, to bring this to life and show how we've seen it actually work in real life, I want to share a customer example here.

15:00 - 15:30 We had a customer with two different domains that they were using in a RAG pipeline, and they were fairly nuanced questions that they were trying to answer. The baseline we were working from was 45%, which is pretty bad; that was a standard retrieval with cosine similarity. They also worked in a regulated industry. So, what we did was get our baseline evals: 45%, not great. Then we set our accuracy target. For that, we decided we would rather have false negatives than false positives, so we'd be happy if the model says it doesn't know rather

15:30 - 16:00 than hallucinating, and we used that to set our tolerance for this one. Then we targeted 95% accuracy on an eval set and had at it. The route that we took to optimization is here. The reason I included the ticks and crosses with each of these is to show that we tried a ton of stuff and not everything worked. But the key things that improved performance, first of all, were chunking and embedding: doing a grid search across different chunking strategies and figuring out the right place to set our tolerance.
16:00 - 16:30 To get to 85% we then added that classification step I talked about. So, if somebody types in one word, you want to do a keyword search rather than a semantic search. Pretty straightforward, but again a 20-percentage-point boost. As well as a reranking step: taking all the context and then training a domain-specific reranker, so depending on which domain the question was for, we had a specific reranker which would rerank the content of the chunks. And then the last thing that got us from 85 to 95: a lot of the time people would ask analytical questions of this RAG system,

16:30 - 17:00 and often RAG systems will just fetch the top 10 documents and give you the wrong answer for a lot of these questions. So, we added some tools so that it could write SQL over the content, and we also added query expansion, where we fine-tuned a query expansion model to basically infer what the customer wanted. And that was what we eventually shipped with. So, just bringing together all of this that we talked about for optimizing for accuracy: start with building evals to understand how the app is performing. You then set an accuracy target that makes an ROI

17:00 - 17:30 and keeps your application safe. And then you optimize using prompt engineering, RAG, and fine-tuning until you hit your target. At that point you have an accurate app, but the problem is it might be slow and expensive. And to solve those problems, I'll pass it over to Jeff. So thank you all very much.
-Fantastic. So, first you focus on accuracy; you need to build a product that works. But once you've achieved that desired accuracy, it's time to improve latency and cost. And that's both for the obvious reason of saving your users time and yourselves money,

17:30 - 18:00 but also, more profoundly, because if you can reduce the latency and cost of each of your requests, then you can do more inference with the same dollar and time budget, which is just another way of applying more intelligence to the same request, and another pretty central way of improving accuracy. So, let's start by talking about techniques to improve latency, and then we'll get to cost. The first thing to understand about latency is that an LLM is not a database.
18:00 - 18:30 You should not be fixating on total request latency; that's not the right way to think about latency for an LLM. Instead you want to break it down into the three subcomponents that make up total request latency. Those are: network latency, which is how much time it takes for the request, once it's entered our system, to land on the GPUs and then get back to you; input latency, commonly called time to first token, which is the amount of time it takes to process the prompt; and then output latency: for every token that's generated,

18:30 - 19:00 you pay a pretty much fixed latency cost per generated token. You combine those things together and that's total request latency. What we see for many customers is that the vast majority of their time is spent on output latency, i.e. generating tokens. Usually 90%-plus of the time. So that's probably the first thing you'd think to optimize, but there are cases, like if you're building a classifier on long documents, where input token speed is actually going to be the thing that dominates. So it really depends on your use case.
19:00 - 19:30 So we broke down total request latency, and I'll say first that when you think about total request latency, I'm telling you not to fixate on it, but most customers will be paying attention to how long it takes for the LLM to complete. Unless you're doing a streaming chat application, the thing they're going to care about is: okay, the answer is done. But even with that frame, even knowing that's the final thing customers care about, as developers you want to be a little bit more focused on the details.

19:30 - 20:00 So we talked about network latency; TTFT, that's the prompt latency, or time to first token; and then the output latency, which we call TBT, or time between tokens. You take the time between tokens, multiply by the number of tokens that you generate, and that's what composes the output latency. So that's the basic formula you should always have in the back of your mind when you're thinking about why your requests are taking so long and how you can make them faster. Now let's break this up one at a time and talk about each component, how it works, and what you can do.
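In other words, a rough back-of-the-envelope model of total request latency looks something like the sketch below. The specific speeds plugged in are illustrative placeholders rather than measured benchmarks.

```python
# Back-of-the-envelope latency model from the talk:
#   total ≈ network latency + time to first token (TTFT) + time between tokens (TBT) × output tokens
# The numbers below are illustrative, not official figures.

def estimated_latency_seconds(network_s: float, ttft_s: float,
                              tbt_s: float, output_tokens: int) -> float:
    return network_s + ttft_s + tbt_s * output_tokens

# Example: ~200 ms of network routing, ~140 ms TTFT, ~45 tokens/sec output, 500 output tokens.
print(estimated_latency_seconds(0.200, 0.140, 1 / 45, 500))  # ≈ 11.5 seconds
```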
20:00 - 20:30 So first, network latency. Unfortunately, network latency is on us: it's the time from when the request enters our system to when it lands on GPUs, and, once the request finishes, to when it gets back to you. It'll add about 200 milliseconds to your total request time just to be routed through our services. One of the really nice pieces of news is that this is an area we've been focusing on for the last six months or so. And I'll say, historically our system has been pretty dumb, where all of our requests were routed through a few central choke points in the US.

20:30 - 21:00 Then they land on GPUs, then they come back to the US, and then they get back to you. Which means that for a standard request you're often hopping across the ocean multiple times. Not ideal for really low network latency. But we're actually rolling out regionalization right now, so if you pay attention to your metrics, you should be seeing network latency go down over these last few weeks. And first I'll say, I'm not allowed to tell you where our actual data centers are, so this is a little illustrative, but you can see that instead of having

21:00 - 21:30 just centralized choke points, we now take requests, find the data centers that are closest to those requests, and try to process them completely locally. So that's one really, really meaningful way of reducing network latency. Alright. So, on to stuff that you can affect really easily: time to first token. This is the latency it takes to process prompts, and there are a few important techniques that really help optimize here. The first, the obvious one, is just to use shorter prompts.
21:30 - 22:00 And I know Colin just said you want to put more context, more examples, in your prompts. I wish I could tell you that that didn't come at a latency cost; it does come at a minor latency cost. The longer your prompts are, the longer it will take to process them. Usually prompts are about 20 to 50 times faster per token than output latency, depending on the model. So that is the tradeoff you have to make between the amount of data in your prompt and how fast you want your prompt to be processed. One of the really nice things that we just released today is prompt generation in our Playground,

22:00 - 22:30 where you can tell a model that we've tuned for prompt engineering, "I want the prompt to be able to do these things, and I want it to be concise and brief," and the model will actually help form a prompt that meets those criteria and strikes the right balance between verbosity and brevity. So, that's one technique. The second technique for improving prompt latency, and we'll talk about this a lot, is choosing a model of the right size. Depending on which model you choose, the time to first token varies quite meaningfully. So, o1 and GPT-4o

22:30 - 23:00 both have a time to first token of about 200 milliseconds, but o1 mini is about 50 milliseconds, so super, super fast. And then GPT-4o mini is somewhere in between, at about 140 milliseconds; that's the standard time to first token that we see across many, many requests in our system. And then the third way to improve prompt latency, I'll just dangle it here, is what we call prompt caching. We're just releasing this today, but it's a way to speed up your requests if you have repeated prompts, and we'll talk about it
in the cost section in just a couple minutes. So that's time
to first token prompt latency. Then the final component, the component is probably where
you're spending the most time, is time between token
or output latency. So, time between tokens
takes most of the time and that's true for our classic
models, like GPT-4o and 4o mini. But it's really true
for our raising models, like o1 preview and o1 mini. Where each of those train
of thought tokens is an output token. So, each token that the model is thinking
inside its head
23:30 - 24:00 is actually adding
a fixed cost to the latency. And there could be a lot
of tokens there. So what can you do? Well, the first thing to understand about time
between tokens is, one of the biggest determiners
of this speed is just simply
supply and demand. How many tokens
our system is processing, versus how much capacity the
system is provisioned against. And what you can see here
is a pretty typical week for us. Where the weekends tend
to be the fastest time where had the least demands, the tokens are being spent
out the fastest.
24:00 - 24:30 Then during weekdays, typically the mornings specifically, is when we have the most demand and the models will be at their slowest over the course of the week. The way that we optimize this internally is that we have latency targets that we set on a per-model basis. The units here are tokens per second, so higher numbers mean faster. And what these numbers mean is: this is the slowest we ever want the model to be. So at 8:00 a.m. on Monday, which is typically one of our slowest times,

24:30 - 25:00 we want GPT-4o to be at least 22 tokens per second, if not faster. And you can see here, GPT-4o and o1 generate tokens at about the same speed, 4o mini is meaningfully faster, and then o1 mini is hugely faster, an extremely fast model. But it's also generating a lot of chain-of-thought tokens. So you can pick smaller models if that's possible. You can't really move all of your traffic to weekends, although if you can, that's a very straightforward way
25:00 - 25:30 to make things faster. But one of the other things you can think about to improve time-between-tokens latency is reducing the number of output tokens. This can make a huge difference in total request time. If you have one request that's generating 100 output tokens and another request that's generating 1,000 output tokens, that second request is literally going to take about ten times as long; it's going to be substantially slower. So you really want to be thinking about how to prompt the model to be concise in its output and just give you the bare minimum amount of information

25:30 - 26:00 that you need to build the experiences you want.
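One simple illustration of that idea: instruct the model to be terse and cap the generation length. The prompt wording and the cap value below are arbitrary examples; `max_tokens` is the standard Chat Completions parameter for limiting generated tokens.

```python
# Sketch: keep outputs short by asking for concision and capping generation length.
# Fewer output tokens => lower time-between-tokens latency and lower cost.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer in at most two short sentences. No preamble."},
        {"role": "user", "content": "Summarize why output tokens dominate LLM latency."},
    ],
    max_tokens=80,  # hard cap on generated tokens (value is illustrative)
)
print(resp.choices[0].message.content)
```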
One thing I should also mention that's a little bit subtle in time-between-token latency is that the length of the prompt actually makes a difference too. If you have one request with 1,000 prompt tokens in it and another request with 100,000 prompt tokens in it, that second request is going to be a little bit slower for each token it generates, because for each token it has to be paying attention

26:00 - 26:30 to the whole 100,000-token context. So shortening prompts does have a small effect on generation speed. And then the final way to improve time-between-token latency is to choose smaller models. If you can get away with 4o mini, that's a very straightforward way to make your application faster. Alright. So we've broken down latency: we talked about network latency, prompt latency, and then output latency. Let's close out by talking about cost, the different ways you can make more requests with less money.
26:30 - 27:00 So, the first thing to know is that many of the techniques that we've already touched on improve both latency and cost. A lot of the ways that you make requests faster will also just save you money. As a very straightforward example, we bill by token, and tokens also cost a fixed amount of time, so if you use fewer tokens it's obviously going to go meaningfully faster. But there are some optimizations that apply just to cost. The first thing I wanted to plug is that we have great usage limits in our developer console,

27:00 - 27:30 where you can see how much you're spending on a per-project basis and get alerts when your spend is more than what you're expecting. This is a really straightforward way to manage costs: to make sure that if there are different efforts happening inside your company, you're really aware of how much each of them is spending, and you're not getting surprised by a sudden surge in usage over the weekend. So it's always good to use those tools, set up projects in a granular way to control your costs, and keep that good visibility. One of the things you can do

27:30 - 28:00 to actually reduce costs is prompt caching. This is a feature that we're just launching today. The way it works is, if a prompt lands on an engine that's already seen that prompt and has it in cache, it won't recompute the activations for that prompt, which are the internal model states associated with it; instead it can go straight to the generation step. So this is a way to both speed up the prompt processing time and save money. One of the key things that you should think about with prompt caching
28:00 - 28:30 is that the way it works is a prefix match. As you can see in this example, you have the first request on the left, and then if you send the exact same request plus a couple of things at the end, it's all a cache hit. But if you change just one character at the very beginning of the prompt, it's a complete miss; none of the other activations are going to help your prompt speed at all. So what that means, if you're building applications, is you really want to put the static stuff at the beginning of the prompt. That's things like your instructions for how your agent should work,

28:30 - 29:00 your one-shot examples, your function calls: all of that stuff belongs at the beginning of the prompt. Then at the end of the prompt you want to put the things that are more variable, like information specific to the user, or what's been said previously in the conversation. Typically our system will keep prompt caches alive for about five to ten minutes. This all just happens automatically, without you needing to be involved at all. But a second way that you can improve your prompt-cache hit rate is to just keep a steady cadence of requests.
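A minimal sketch of that ordering, assuming a hypothetical support agent: the stable system instructions, tool definitions, and few-shot examples come first so they form a cacheable prefix, and the per-user, per-conversation content goes last.

```python
# Sketch: order messages so the static prefix (instructions, tools, examples) is cacheable
# and only the variable, user-specific content changes between requests.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = (
    "You are a customer support agent for Acme (hypothetical). "
    "Follow the routing rules below...\n"
    "Example: 'I want a refund' -> intent=refund\n"
)

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",  # hypothetical tool
        "description": "Look up an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def ask(user_profile: str, history: list[dict], question: str):
    messages = (
        [{"role": "system", "content": STATIC_SYSTEM_PROMPT}]  # static prefix first
        + history                                              # variable content last
        + [{"role": "user", "content": f"{user_profile}\n\n{question}"}]
    )
    return client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
```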
29:00 - 29:30 We will always clear the prompt cache within an hour, but as long as you keep hitting it with the same prompt prefix within about five to ten minutes, you'll keep those prompts active the whole time. And how does that manifest in terms of money saved? Well, we're just announcing this today: for all of our most recent models, you can immediately start saving 50% on any cached token. One of the really nice things about this implementation is that there's no extra work required. Hopefully your bills are just going down as of today

29:30 - 30:00 if you have cached prompts. You don't need to pay extra to use this feature; you just save from making your traffic more efficient. The last thing that I wanted to cover in terms of saving cost is our BatchAPI, and I think our BatchAPI is a little bit of a sleeper hit. It is 50% off both prompt and output tokens, by running requests asynchronously. So instead of hitting the model and the model replying as quickly as possible, you create a batch file, which is a sequence of really a lot of requests,

30:00 - 30:30 huge numbers of requests. You submit it to the BatchAPI and it will complete the job within 24 hours; it's often much faster if you're submitting the job not at peak time. One of the really nice things about this service is that anything you pass to the BatchAPI doesn't count against your regular rate limits. So it's a completely separate pool of capacity that you can use and pay half price on. What we have generally found is that most developers have at least a few cases where they really benefit from Batch.
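For reference, here is a rough sketch of that flow using the OpenAI Python SDK's Batch API: write requests to a JSONL file, upload it, and create a batch with a 24-hour completion window. The request contents and file name are illustrative.

```python
# Sketch: submit non-urgent requests through the Batch API at half price.
# Each JSONL line is one request; results come back within the completion window.
import json
from openai import OpenAI

client = OpenAI()

requests = [
    {
        "custom_id": f"call-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarize call transcript #{i}"}],
        },
    }
    for i in range(3)  # illustrative; batches can hold huge numbers of requests
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```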
30:30 - 31:00 So that could be content generation, where you're making a bunch of sites; it could be running evals; ingesting and categorizing a really big backlog of data; indexing data for embedding-based retrieval; doing big translation jobs. Those are all things that don't need to be run where every token is generated within 40 milliseconds, so they're all really good use cases for Batch. And to the extent you can offload work there, you're both saving half of the spend and also, of course, not using any of your other rate limits, so you can scale more for the services that need to be more synchronous.

31:00 - 31:30 Just to give you one example of how people use this: one of our partners, Echo AI, categorizes customer calls. They summarize the customer call after they transcribe it, they classify it, they pull out takeaways and follow-ups, and the BatchAPI saves them 50% of the cost because they don't need to be purely synchronous. And what that lets them do, by reducing their cost, is actually pass on lower prices to customers. The way that they built this is that they've designed a near-real-time system that processes every call that comes in.
31:30 - 32:00 It creates batch jobs, sometimes really small batch jobs with just a few requests, and then it tracks all of the batch jobs that are in the system, so as soon as they complete, they can notify the customer. This isn't going to be less-than-one-minute response times necessarily, but it is way cheaper than running requests synchronously, and it lets them scale a lot more and just ask more questions about their calls. We just released the BatchAPI in the spring, and they've been using it for a few months.

32:00 - 32:30 So far it's saved them tens of thousands of dollars, which is pretty good for early prototyping. Today about 16% of their traffic has moved over to the BatchAPI, and they're expecting, for their particular use case, to get to 75% of their traffic. They're aiming to save about $1 million over the next six months. So it's a really substantial amount that you can save if you can bifurcate your workloads into the ones that need to be synchronous and the ones that don't. Alright. So, we have hit you with many, many different ideas.
32:30 - 33:00 We've shared a lot of things. The good news is that a lot of the techniques that we have for optimizing cost and latency are highly overlapping: you optimize one and you tend to optimize the other. And if you can apply a good smattering of these best practices, you're going to build experiences that are state of the art in terms of balancing intelligence, speed, and affordability. I'll also say that I'm sure there are many techniques that we didn't mention here, and there are probably lots that we don't know. So, if there's something we missed that you found effective,

33:00 - 33:30 we'd love to hear about it after the talk. And to close, I'll just say once more that I wish there were one playbook for balancing accuracy, latency, and cost. There is not. This is actually the central art in building LLM applications: making the right tradeoffs between these three key dimensions. So, I hope that we gave you some ideas to think about, and on behalf of myself, Colin, and the team, we're very, very excited to see what you build. Thank you.