Summary
In this in-depth discussion, the hosts are joined by Nikolai Savinov, a staff research scientist at Google DeepMind, to explore the intricacies of long context in AI models. They delve into the concept of tokens, the importance of context windows, and the interplay between in-context and in-weight memory. Nikolai shares insights into retrieval-augmented generation (RAG) systems and how they complement long context capabilities. The conversation also covers the future of AI as long context advances, particularly in coding applications, highlighting its potential to transform how large datasets are processed and understood.
Highlights
Nikolai Savinov highlights how important tokens are in AI, explaining that a token is often less than a word and can include punctuation. 🔍
The discussion reveals how models view the world differently due to tokenization, impacting how information is processed and retrieved. 🔄
Insights into the relationship between in-weight and in-context memory, showcasing the flexibility and limitations of each. 🧩
Exploration of Retrieval-Augmented Generation (RAG) sheds light on its engineering process for handling vast information corpuses. 💡
The future potential of long context is shown, specifically in coding applications, with predictions on its transformative impact. 🌐
Key Takeaways
Long context enhances AI's ability to handle more data, making it pivotal for coding tasks and knowledge retrieval. 🧠
Tokens are crucial for breaking down input data efficiently, allowing AI to process information faster than character-level analysis. ⚡
In-weight memory and in-context memory serve different purposes; understanding them helps optimize AI capabilities. 📚
Retrieval-augmented generation (RAG) and long context should work hand-in-hand, offering broader recall and precision. 🤝
Innovation in long context is vital for scaling its application beyond current limits, like 1 to 2 million tokens. 🚀
Overview
This episode explores the dynamics of long context in AI models, as discussed by Nikolai Savinov from Google DeepMind. He joins the conversation to break down concepts such as tokens, which are integral to AI processing, converting text into a more manageable form for analysis. Nikolai explains the differences between in-weight and in-context memory, providing clarity on how AI manages and recalls data through different mechanisms.
The discussion then moves into the realm of retrieval-augmented generation, or RAG, which Nikolai explains as a technique for bringing relevant data into the model's context without overloading its processing limits. The synergy between RAG and long context is crucial, as it allows AI to handle extensive datasets more effectively. This, in turn, can lead to breakthroughs in various sectors, especially in coding, where such capabilities can greatly improve efficiency.
Looking towards the future, the dialogue projects an optimistic view on the trajectory of long context applications. As AI continues to evolve, the integration of extended context windows could lead to game-changing advancements in how we interact with technology. The discussion wraps up with insights into the continuous work being done to enhance long context capacities and the anticipation of more revolutionary applications soon, particularly in complex data analysis and coding environments.
Chapters
00:00 - 01:30: Introduction and Context of Discussion The chapter titled "Introduction and Context of Discussion" focuses on the impressive work conducted by the inference team. It discusses questions related to rapid-answer generation using context caching for efficiency in both speed and cost. The discussion includes the limitations of scaling context beyond 1 to 2 million, with an optimistic view of its potential, especially in coding applications. The chapter hints at future discussions and developments in handling long context efficiently.
01:30 - 05:30: Understanding Tokens and Tokenization In this chapter, the focus is on a discussion with Nikolai Savinov, a staff research scientist at Google DeepMind and one of the co-leads for the long context project. The chapter delves into the concepts surrounding tokens and tokenization, providing insights from a leading expert in the field as part of the Release Notes series.
05:30 - 08:30: Context Windows in LLMs The chapter begins with an introduction by Nikolai, who is greeted and invited to discuss foundational concepts. The focus shifts to the concept of a 'token' in the context of pre-training in language models. Nikolai explains that a token can be understood as being slightly less than one word. It might be a complete word, a part of a word, or another unit depending on the context. This foundational understanding is critical for further discussions on pre-training and context windows in language models.
08:30 - 11:30: Exploration of RAG Systems In the chapter titled 'Exploration of RAG Systems', the discussion revolves around the fundamental concept of tokens in AI and LLMs, particularly in handling textual data. It highlights the differentiation between characters familiar to humans and tokens used by AI systems, exploring the necessity of tokens in processing and interpreting textual inputs. The conversation briefly touches upon the handling of images and audio in AI systems, indicating that while there is a commonality in concept, each data type requires specific approaches.
11:30 - 16:00: Scaling Long Context Capabilities This chapter explores the challenges and methodologies associated with scaling long context capabilities in AI models. It highlights the debate among researchers about moving from token-based to character-level generation, noting that while this transition may offer certain benefits, it also introduces significant drawbacks, primarily the slower generation speed due to processing at a character level.
16:00 - 22:00: Advancements in Long Context Quality The chapter discusses the inefficiencies of generating content one character at a time, noting that generating a word in one go is faster. Despite efforts to move away from tokens, token-based systems remain in use. The chapter references Andrej Karpathy's videos and tweets, which highlight the intricacies and challenges posed by tokenizers in large language models (LLMs).
22:00 - 32:00: Evaluations and Benchmarks for Long Context The chapter discusses the challenges of evaluating and benchmarking models with long context inputs. A key issue highlighted is the model's focus on token level analysis rather than character level, leading to difficulties in processing certain tasks, like counting specific characters in a word. This is due to tokenizers breaking words into parts, which complicates character recognition.
32:00 - 41:00: Future Prospects of Long Context in AI The chapter explores the fundamental differences in perception between AI models and humans, particularly focusing on the challenges posed by tokenization. It highlights how AI models process and understand concepts in a manner distinct from human cognition, using the example of how such models perceive a 'strawberry.' The discussion delves into how these models might condense complex ideas into single tokens, representing a vastly different and simplified view of reality compared to human observation.
41:00 - 49:40: Developer Best Practices for Long Context The chapter discusses the complexity of acquiring specific knowledge from pre-training language models, particularly focusing on the challenge of associating distinct elements such as counting letters within tokens. An example provided is the difficulty in associating the token for the letter 'R' encountered during pre-training with a single token word like 'strawberry'. The explanation highlights the mental load and intricacies involved in such processes when dealing with language models.
49:40 - 59:00: Long-term Predictions and Innovations in Long Context The chapter discusses long-term predictions and the role innovations might play in understanding and utilizing long context. It raises questions about the limitations of current models, including weaknesses observed when seemingly simple tasks, such as counting the letters in a word, can't be performed by models marketed as approaching Artificial General Intelligence (AGI). The chapter reflects on how tasks that are easy for a child can be perplexing for advanced models, mentioning insights from Andrej Karpathy's videos that highlight issues such as whitespace handling. The discussion underscores the challenges of advancing AI capabilities and the hurdles in achieving true Artificial General Intelligence.
59:00 - 69:00: Interplay with Agents and User Accessibility The chapter explores the complexities that arise when interacting with AI agents, particularly focusing on user accessibility. It highlights how most tokens are prefixed with a whitespace, which can lead to unexpected behaviors during concatenations. This unusual concatenation poses challenges for modeling and understanding these interactions effectively.
69:00 - 72:00: Conclusion and Reflections The chapter focuses on explaining the concept of context windows, particularly for those building with and using AI models like LLMs (Large Language Models). The discussion delves into why understanding context windows is crucial, as they determine which tokens the model can attend to and thus shape its behavior, helping both users and builders of AI comprehend their significance for effective applications.
Deep Dive into Long Context Transcription
00:00 - 00:30 I'm really impressed by the work of our inference team. I've got a bunch of spicy uh rag versus long context questions for you. You can rely on context caching to make it both cheaper and faster to answer. What's the limitation of like continuing to scale up beyond 1 to 2 million? This thing is going to be incredible for coding applications. We will have lots more exciting long context stuff to to share with folks. [Music]
00:30 - 01:00 Welcome back to Release Notes, everyone. How's it going? Today we're joined by Nikolai Savinov, who's a staff research scientist at Google DeepMind and one of the co-leads for long context
01:00 - 01:30 pre-training. Nikolai, how are you? Yeah, hi. Thanks for inviting me. Let's start off at the most foundational level and we'll build up from that. What is a token, and how should folks think about that? So the way you should think about a token, it's basically slightly less than one word in the case of text. A token could be a word, part of a word, or it could be
01:30 - 02:00 things like punctuation: commas, full stops, etc. For images and audio it's slightly different, but for text just think of it as slightly less than one word. Yeah. And why do we need tokens? Humans are generally familiar with characters. Why do AI and LLMs have this special concept of a token? What does it
02:00 - 02:30 actually enable? Well, this is a great question, and actually many researchers have asked it themselves. There have been quite a few papers trying to get rid of tokens and just rely on character-level generation. But while there are some benefits to doing that, there are also drawbacks, and the most important drawback is that generation is going to be slower, because you generate roughly one token at
02:30 - 03:00 a time, and if you are generating a word in one go it's going to be much faster than generating every character separately. So those efforts, I would say, didn't really succeed, and we are still using tokens. Yeah. For folks who haven't spent a bunch of time thinking about tokens, there are a bunch of good Andrej Karpathy videos and tweets about how tokenizers are the root of all the weirdness and complexity in LLMs, like all
03:00 - 03:30 these weird edge cases that you run into. Most of them are rooted in the fact that the model is not looking at things at a character level; it's looking at them at a token level. The pertinent example folks love to go to these days is counting the characters in a single word, like how many Rs are there in strawberry, which is a weird problem to solve. My understanding is that it's because tokenizers break the word into parts; the model isn't actually looking at the word at the individual character level. Is
03:30 - 04:00 that an apt description? Yeah, I think that's a pretty good description of the problem, and one thing you should realize is that, due to tokenization, those models view the world very differently from how humans view the world. When you see a strawberry, you see a sequence of letters. But what the model sees could be even one token, and then you ask
04:00 - 04:30 it, hey, count the number of R letters in this token. But it's pretty hard to get this knowledge from pre-training, because you would need to associate the R-letter token that you encountered somewhere on the web with the word strawberry, which is also one token. So if you think about the mental load of doing that, it's not such a
04:30 - 05:00 trivial task, I would say. Although obviously when the model can't do it, we start complaining: hey, if it's AGI, how come it can't count the number of R letters in strawberry? A child could do that. Yeah, it is super weird. And actually another interesting thing is that if you watch some of the Karpathy videos, there are a lot of problems with whitespace. This is an interesting point, because normally
05:00 - 05:30 most tokens are prefixed with a whitespace, and then some really weird effects might happen, because you might encounter problems on the boundaries: you think you are concatenating something, but this concatenation is very unusual for the model to see. Mm, interesting. That is super interesting.
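To make the tokenization and whitespace points concrete, here is a minimal sketch. It assumes the Hugging Face transformers package and uses the GPT-2 tokenizer purely as a stand-in; Gemini's tokenizer is different, so the exact splits are illustrative only.

```python
# Minimal tokenization illustration. The GPT-2 BPE tokenizer is used here only
# as a stand-in; Gemini's tokenizer differs, so exact splits will vary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["strawberry", " strawberry", "How many Rs are in strawberry?"]:
    print(f"{text!r:40} -> {tokenizer.tokenize(text)}")

# Typical observations (exact output depends on the tokenizer):
# - a word maps to one token or a few sub-word pieces, not to characters,
#   which is why character-counting questions are awkward for the model;
# - a leading space usually changes the split, which is why concatenating
#   strings at token boundaries can look very unusual to the model.
```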
05:30 - 06:00 I think this actually takes me to just generally talking about context windows. Obviously we're talking about long context, which sort of assumes you know what a context window is, but can you give the lay of the land of how folks should think about what a context window actually is? Why do I, as a user of an LLM or somebody who's building with AI models, need to care about the context window? So the context window is basically exactly these context tokens that
06:00 - 06:30 we are feeding into the LLM, and it could be the current prompt or the previous interactions with the user. It could be the files that the user uploaded, like videos or PDFs. And when you supply context to the model, the model actually has knowledge from two sources. One source is what I would call in-weight, or pre-training, memory. This is knowledge from the fact that
06:30 - 07:00 the LLM was trained on a slice of the internet and learned something from there. It doesn't need additional knowledge to be supplied into context to remember some of those facts. So even without context, there is some kind of memory present in the model. But another kind of memory is this explicit in-context memory that you are supplying to the model. And it's pretty important to understand the distinction between those
07:00 - 07:30 two, because in-context memory is much, much easier to modify and update than in-weight memory. For some kinds of knowledge, in-weight memory might be just fine: if you need to memorize some simple facts, like that objects fall down and not up, these are very basic, common facts. It's fine if this
07:30 - 08:00 knowledge comes from pre-training. But there are some facts which are true at the time of pre-training but become obsolete at the time of inference, and you would need to update those facts somehow; the context provides you a mechanism to do this update. And it's not only about up-to-date knowledge. There are also different kinds of knowledge, like private information: the network doesn't know anything about you personally and
08:00 - 08:30 it can't read your mind. So if you want it to be really helpful for you, you should be able to supply your private information into context, and then it will be able to personalize. Without this personalization, it's going to give you the generic answers it would give to any human instead of answers tailored to you. And the final category of knowledge which needs to be inserted in context is rare facts. So basically some
08:30 - 09:00 knowledge which was encountered very sparingly on the internet. And I must say I suspect this category of knowledge might go extinct with time. Maybe future models will just learn the whole slice of the internet by heart and we will not need to worry about those. But the reality at this point is that if something is mentioned once or twice on the whole
09:00 - 09:30 internet, the models are actually unlikely to remember those facts and they are going to hallucinate the answers. So you might want to insert those explicitly into context. And the kind of trade-off we are dealing with is: for short-context models you have limited ability to provide additional context. Basically, you would have a competition between knowledge sources. And if the
09:30 - 10:00 context is really large, then you can be less picky about what you insert and you can have higher recall and coverage of relevant knowledge. And if you have higher coverage in context, that means you're going to alleviate all those problems with in-weight memory. Yeah, I think there are so many angles to push on. That was a great description. One of the
10:00 - 10:30 follow-ups from this: we talked about in-weight memory, we talked about in-context memory, or just in-context in general. The sort of third class is around how to bring context in through RAG systems, retrieval-augmented generation. Can you give a high-level description of RAG? And then I've got a bunch of spicy RAG versus long context questions for you. Yeah, sure. So, what RAG does is, well, it's a simple engineering
10:30 - 11:00 technique. It's an additional step before you pack the information into the LLM context. Imagine you have a knowledge corpus, and you chunk this knowledge corpus into small textual chunks, and then you use some special embedding model to turn every chunk into a real-valued vector. Then, based on those real-valued
11:00 - 11:30 vectors, if you get a query at test time, you can embed the query as well. And then you can compare this real-valued vector for the query to those chunks from the corpus. And for the chunks which are close to the query, you're going to say, hey, I found something relevant, so I'm going to pack those chunks into context, and now I'm running the LLM on this. So that's how RAG works.
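As a rough sketch of the retrieval step described here (chunk, embed, compare, pack), the snippet below uses plain NumPy; embed() is a placeholder standing in for whatever embedding model you actually call, and the fixed-size chunking is deliberately naive.

```python
# Sketch of the RAG retrieval step: chunk a corpus, embed chunks and query,
# rank chunks by similarity, and pack the top hits into the LLM prompt.
# embed() is a placeholder standing in for a real embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: deterministic random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def chunk(corpus: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems split on structure (files, headings).
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    chunk_vecs = np.stack([embed(c) for c in chunks])
    scores = chunk_vecs @ embed(query)       # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]  # most similar chunks first
    return [chunks[i] for i in best]

# With a long-context model you can afford a larger top_k (higher recall);
# the prompt is then: retrieved chunks first, the user's question last.
```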
11:30 - 12:00 And why, and this is maybe a silly question: my sense has always been that RAG lets you work around the very hard limits on context that you can pass to the model. We have 1 million, we have 2 million, that's awesome. But actually, if you look at internet scale, you know, Wikipedia has many billions of tokens, whatever it is. Why is RAG, as this notion of bringing the right context to the model, not just baked into the model itself? Is it just that, to the
12:00 - 12:30 point of the conversation, the model just doesn't work well for that, or it's just the wrong research direction to go in? Why don't we build that mechanism in? Because my face-value perspective is that it seems like that would kind of be useful: if the model could just do RAG, and if I could pass a billion tokens and then let the model figure out, heuristically or through whatever mechanism, what the right tokens are. Or is that just a problem
12:30 - 13:00 somewhere else in the stack that should be solved, and the model shouldn't have to think about it? Well, one thing I want to say is that after we released the 1.5 Pro model, there were a lot of debates on social media about whether RAG is becoming obsolete, and from my perspective, not really: say, enterprise knowledge bases constitute billions of tokens, not millions. And so for this use case, for
13:00 - 13:30 this scale, you still need RAG. What I think is going to happen in practice is that it's not like RAG is going to be eliminated right now; rather, long context and RAG are going to work together. And the benefit of long context for RAG is that you will be able to retrieve more relevant needles into the context by using RAG, and by doing that you're going to increase the recall
13:30 - 14:00 of the useful information. So if previously you were setting some rather conservative threshold and cutting out many potentially relevant chunks, now you're going to say: hey, I have a long context, so I'm going to be more generous and pull in more facts. So I think there's a pretty good synergy between those, and the real limitation is the latency
14:00 - 14:30 requirements of your application. Mhm. So if you need real-time interactions, then you'll have to use shorter context, but if you can afford to wait a little bit more, then you're going to use long context, just because you can increase the recall by doing that. Why 1 million: is that just a marketing number, or is there something intrinsic after a million or 2 million? Is there actually something technically happening around the million-token mark, from a
14:30 - 15:00 long context perspective or is it literally just we found a a number that sounds good and then made the technology work from a um from a research perspective? Well, when I started working on long context, uh the competition at the time, I think it was about 128k or maybe 200k tokens at most. So I was thinking how to set the goals for long context project and well it was at the
15:00 - 15:30 time it was a small small part of Gemini and I originally thought well I mean just matching competitors don't doesn't sound very exciting. So I thought let's uh set an ambitious bar. So I thought well 1 million is uh an ambitious enough step forward. It was like compared to 200k that's like 5x and very soon after we released 1 million we also actually
15:30 - 16:00 shipped 2 million which which was about 10x larger and I guess one order of magnitude uh larger than the previous uh state-of-the-art. That's that's a good goal. That's what makes it uh exciting for people to work on. Yeah, I love that. And how my my my followup spicier question from that is like we shipped 1 million, we shipped 2 million rapidly after that. Like what's the limitation of like continuing to scale up beyond 1
16:00 - 16:30 to 2 million? Is it like uh from like a serving perspective, it's too costly or too expensive? Or is it just like the architecture that makes 1 to 2 million work like just like fundamentally breaks down when you go larger than that? Or like how how come we haven't seen the frontier for long context continue to push? Yeah. So when we released uh 1.5 pro model, we actually ran some inference tests at 10 million and we got some quality numbers
16:30 - 17:00 as well, and for, say, single-needle retrieval it was almost perfect across the whole 10 million context. We could have shipped this model, but it's pretty expensive to run this inference, so I guess we weren't sure if people were ready to pay a lot of money for it. So we started with something more reasonable in terms of the price.
17:00 - 17:30 But in terms of quality, that's also a good question, because it was so expensive to run that we didn't run many tests, and just bringing up this server again is also quite costly, unless we want to ship it to a lot of customers right now; we don't have the chips to do that. Yeah. Do you think that will continue to hold? Like,
17:30 - 18:00 I don't know if it's an exponential increase in capacity that's needed as we do more long context stuff, but do you have an intuition about that? Do we need fundamental breakthroughs from a research perspective for that to change, to make it so that we can actually keep scaling up beyond that? Or is 1 to 2 million going to be what we stick with, and if you want more than that, do RAG and be really smart about bringing context in and out of the context window from the model's perspective? So my feeling is that we actually need more
18:00 - 18:30 innovations. It's not just a matter of brute-force scaling to actually get close-to-perfect 10 million context; we need more innovations. But then, in terms of RAG and which paradigm will be more powerful going into the future: I think the cost of those models is going to decrease over time, and we're going to try to pack more and more context
18:30 - 19:00 retrieved with RAG into those models, and because the quality is also going to increase, it's going to be more and more beneficial to do that. Yeah, that makes sense. Can you take us back to when we originally landed long context? My understanding of the story is that for 1.5 Pro it wasn't like it had been built for long context to begin with; I think you had tried to kick off that workstream with others inside of
19:00 - 19:30 DeepMind um and it ended up just being that like the pace of research progress was like super fast and like we I think my my loose understanding of the story is like we had the breakthroughs we realized it worked and then it was like shortly thereafter ended up actually landing in the model side or like what was the like timeline from the the effort starting to like actually landing it into a model that was available externally to the world. Oh, I think that was uh that was indeed pretty quick. And just to clarify, we didn't
19:30 - 20:00 really like we were wishing to to go long and achieve say 1 million or 2 million context, but we kind of didn't expect uh ourselves to get there that fast. And when it actually happened then we thought like hey like this is this is really impressive like uh we actually made some strides on this task. So now we need to ship it and
20:00 - 20:30 then we actually managed to assemble a pretty awesome team very quickly, and the team worked really hard. To be honest, in my life I've never seen people working this hard. I was really impressed. I love that. That's awesome. And that was for the original 1.5 Pro series. We landed it for 1.5 Flash as well; we now have it for 2.0 Flash, and we have 2.5 Pro. Can
20:30 - 21:00 you give us the lay of the land of what's been happening from a long context perspective, from that original launch, when we knew long context was possible and released the technical report for 1.5 Pro, which showed the needle-in-a-haystack results and a bunch of stuff like that, to today, where I think a lot of what's actually making this 2.5 Pro model blow people's minds is how strong it is at long context, which has been awesome for coding use cases and stuff like
21:00 - 21:30 that. So what's happened in the long context world from the original launch to today? Yeah. I think the biggest improvement was actually the quality, and we made strides both at, say, 128k context and at 1 million context. If we look at the benchmark results for the 2.5 Pro model, we observe that it's better
21:30 - 22:00 compared to many strong baselines, like GPT-4.5, Claude 3.7, o3-mini-high, and some of the DeepSeek models. To actually compare to those models we had to run the evals at 128k so that they're all comparable, and we saw quite a big improvement for 2.5 Pro. And in terms of 1
22:00 - 22:30 million context, we compared it with 1.5 Pro and we also saw significant advantages. This is maybe a weird question, but does the quality ebb and flow at different context sizes? Do you see almost linear quality on, say, a 100,000-token input versus 128,000, or 50,000 versus 100,000? Is it pretty consistent across
22:30 - 23:00 or is there something weird? I'm trying to imagine: maybe it all generalizes when you make it into the final model and there's no difference, but is there any nuance in that perspective? Have we done evals that show anything like that? Yeah, internally we looked at some of those evals. I guess maybe your question goes to these effects that people observed in the past; a very popular one was the lost-in-the-middle effect. And to answer your question:
23:00 - 23:30 the lost-in-the-middle effect, where you have a needle in the middle of the context, we don't really observe with our models. But what we do observe is that if it's a hard task, not a single needle but some task with hard distractors, then the quality slightly decreases with increasing context, and that's something we want to improve. Yeah. And just for my own mental model,
23:30 - 24:00 when I think about putting 100,000 tokens into the context window of the model, should I, from a developer perspective or as a user who's actually using the long context functionality, assume that the model is actually attending to all of the different context? I know it can definitely do the one needle, it can pull that out, but is it actually reasoning over all those tokens in the context window? I have just a bad mental model
24:00 - 24:30 of what's happening behind the scenes when you have that much context in the context window of the model. Yeah, I think that's a good question. One thing you need to keep in mind is that attention, in principle, has a bit of a drawback, because there's a competition happening between tokens: if one token gets more attention, then other tokens will get less attention.
24:30 - 25:00 The thing is, if you have hard distractors, then one of the distractors might look very similar to the information that you're looking for, and it might attract a lot of attention; and now the piece of information that you are actually looking for is going to receive less attention. And the more tokens you have, the harder the competition becomes. So it depends on the hardness of the distractors and also
25:00 - 25:30 on the context size. Yeah, this is another silly follow-up question, but is the amount of attention always fixed? Is it possible to have more attention, or is it just, whatever, a value of one that's spread across all of the tokens in the context window, so the more tokens you have, literally the less attention there is, and there's no way for that to change? Normally that's the case: the whole pool of attention is limited.
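A tiny numerical illustration of this fixed pool: attention weights are a softmax over scores, so they always sum to one, and adding more tokens, especially distractors whose scores are close to the target's, necessarily shrinks the weight on the token you care about. The scores below are invented purely to show the effect.

```python
# Attention weights are a softmax over scores, so they always sum to 1.
# More tokens, and especially near-duplicate distractors, dilute the weight
# that lands on the one token you actually want. Scores here are made up.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

target_score = 5.0
for n_distractors, distractor_score in [(10, 1.0), (1000, 1.0), (1000, 4.5)]:
    scores = np.array([target_score] + [distractor_score] * n_distractors)
    weight_on_target = softmax(scores)[0]
    print(f"{n_distractors:5d} distractors with score {distractor_score}: "
          f"attention on target = {weight_on_target:.4f}")

# A few easy distractors: the target keeps most of the attention.
# Many hard (similar-looking) distractors: the target's share collapses.
```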
25:30 - 26:00 Yeah. From that example you gave about hard distractors causing the model to do a lot more work and sort of split the attention: has your team, or other teams on the applied side, explored pre-filtering mechanisms, any of that type of stuff? Say you want long context to work really well in production; it sounds like the best outcome is that you have very dissimilar data that's in
26:00 - 26:30 the context window. If there's a lot of similar data and you're asking a question that could be relevant to all of it, you'd expect the performance to be worse in general in that use case. So is that just something that developers or the world needs to figure out, or do you have any suggestions for how folks should approach that problem? For me as a researcher, I think it would kind of be a move in the wrong direction. I think we should work more on improving the quality and robustness instead of coming up with hacks for
26:30 - 27:00 filtering. One practical recommendation, though, is of course to try not to include totally irrelevant context. If you know that something is not useful, then what's the goal of including it in the context? At the very minimum it's going to be more expensive. So why would you do it? Yeah, it's interesting, because I feel like in some sense that goes against the core way that people use long
27:00 - 27:30 context. I think in the examples I see online, it's just people saying, oh, I'll just take all this random data and throw it in the context window of the model and have it figure out what's useful for me. So, given how important it sounds like it is to remove some of that stuff, you'd almost expect the model to do the pre-filtering itself, to include only what's relevant. Because, not that humans are lazy, but I feel like that's been one of the selling points: I don't need to think about what data I'm putting into the context window. So do you think there's
27:30 - 28:00 a world where, you know, it's a multi-part system or something like that, where the model is actually doing some of that eliminating of the extraneous data based on what the user's query is, and then making sure that when the context actually goes to the model it's a little bit easier? Over time, as the models get better quality and they get cheaper, you just will not need to think about this anymore. I'm just talking about the current realities: if you want to make
28:00 - 28:30 good use of it right now, then let's be realistic, just don't put in irrelevant context. But also I agree with your point that if you spend too much time manually filtering it, or handcrafting which things to put into context, that's annoying. So I guess there should be a good balance between those. Yeah, I think the point of context is to simplify your life and make it more
28:30 - 29:00 automatic, not to make it more time consuming or make you spend time handcrafting things. Yeah, I've got to follow up on this around evals, and the evals that you're thinking about from a long context quality perspective. Needle-in-a-haystack obviously was the original one that we put into the 1.5 technical report, and for folks who aren't familiar, needle-in-a-haystack is just asking the model to find one piece of context in 1 million, 2 million, 10 million tokens of context; the
29:00 - 29:30 models are extremely good at this. How do you think about the other set of long context evals? Is there a set of standard benchmarks, beyond needle-in-a-haystack, which I feel gets talked about a fair bit, that you're thinking about from a long context perspective? So, let's see. I think evaluation is pretty much the cornerstone of LLM research, and especially if you have a
29:30 - 30:00 large team, evaluation provides a way for the whole team to align and push in a common direction. The same applies to long context: if you want to make progress, you need to have great evaluations. Now, single needle-in-a-haystack is a solved problem, especially with easy distractors. So if it's, say, Paul Graham's essays and you put in the phrase 'the magic number for
30:00 - 30:30 the city of Barcelona is 37', and ask 'give me the magic number for the city of Barcelona': this is really a solved problem. But now the frontier of capabilities is handling hard distractors: if you, for example, packed your whole context with 'the magic number for city X is Y', filling, say, the whole million context with these key-
30:30 - 31:00 value pairs. But that's a much harder task, because then the distractors actually look very similar to what you want to retrieve. Another thing which is hard for LLMs is retrieving multiple needles. So I feel like these two things, the hardness of the distractors and multiple needles, are the frontier.
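For concreteness, here is a small sketch of how such a hard-distractor, multi-needle haystack might be generated: a long list of key-value lines that all look alike, with a few designated needles to ask about afterwards. This is illustrative only, not any official benchmark.

```python
# Sketch of a hard-distractor haystack eval: fill the context with key-value
# lines that all look alike, pick a few as needles, and check whether the
# model returns their values exactly. Illustrative only, not a real benchmark.
import random

def build_haystack(n_pairs: int, n_needles: int, seed: int = 0):
    rng = random.Random(seed)
    pairs = {f"city-{i:06d}": rng.randint(0, 99) for i in range(n_pairs)}
    lines = [f"The magic number for {k} is {v}." for k, v in pairs.items()]
    rng.shuffle(lines)
    needles = rng.sample(list(pairs), n_needles)
    question = ("Based on the information above, what are the magic numbers for "
                + ", ".join(needles) + "?")
    expected = {k: pairs[k] for k in needles}
    return "\n".join(lines), question, expected

context, question, expected = build_haystack(n_pairs=50_000, n_needles=5)
# Send context + "\n\n" + question to the model and score the reply against
# `expected`; every distractor line looks exactly like the needle lines.
```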
31:00 - 31:30 But there are also additional considerations for the evals. One consideration you might have is: those needle-in-a-haystack evals, even with hard distractors, are pretty artificial, so maybe I want something more realistic. This is a valid argument, but the thing you need to keep in mind is that once you increase the realism of the eval, you might actually lose the ability to measure the core long context capability. For example, if you are asking a question of a very large codebase, and the question can basically be answered by just one file in this codebase, and then the task is to
31:30 - 32:00 actually implement something complicated, then you're not really going to be exercising the long context capability. Instead you are going to be exercising the coding capability, and then it will give you a wrong signal for hill climbing: you will basically hill-climb on coding instead of long context. Yeah. So that's one consideration. Another consideration is something which people call retrieval versus synthesis evals. So theoretically, if you need to just
32:00 - 32:30 retrieve one needle from the haystack, that can be solved by RAG as well. But the tasks that we should really be interested in are the tasks which integrate information over the whole context. For example, summarization is one such task, and RAG would have a hard time dealing with it. But these tasks,
32:30 - 33:00 while they sound nice and like the right direction to go, are actually not so easy to use for automatic evaluation. For example, the metrics for summarization, like ROUGE, etc., we know they are imperfect. Mhm. And if you're doing hill climbing, then you're better off using, how do I say, less gameable metrics. And just a quick follow-up:
33:00 - 33:30 what makes them less useful, with summarization as an example? Is it just that it's more subjective, what a good summary is versus what isn't, and it doesn't have a ground-truth source of truth? What makes that use case hard? Yeah, those evals are going to be pretty noisy, because there will be relatively low agreement even between the human raters. Of course, this is not to give the impression that we shouldn't work on summarization or measure summarization.
33:30 - 34:00 These are important tasks. I'm just saying that my personal preference as a researcher is to hill-climb on something which has a very strong signal. Yeah, that makes sense. How do you see this: long context, especially for Gemini, is a core part of the capability story that we're telling the world, a core differentiator for Gemini. And yet at the same time it feels like long context has always been an independent workstream, like
34:00 - 34:30 not everything is long context. Do you think there's a world where, you know, there are a ton of other teams hill climbing on a bunch of other stuff, factuality, reasoning, etc.? Do you think the direction, from a research perspective, from a modeling perspective, is that long context just gets fused into every other workstream? Or do you think it still needs to be an independent workstream, because it's just fundamentally different in how you get the model to do useful stuff with
34:30 - 35:00 long context versus, you know, reasoning as a corollary example, perhaps? So I guess my answer will be twofold. First of all, I find it helpful to have an owner for every important capability. But second, I think it's important for the workstream to also provide tools for people outside of the workstream to contribute. Yeah, that makes a ton of sense. I have
35:00 - 35:30 another follow-up around reasoning, and I'm curious about the interplay between reasoning and long context. We had Jack Rae on, and we were both at dinner with Jack last night talking about random reasoning stuff. Have you been surprised by how much, and you can correct me if this is wrong, the reasoning capability actually makes long context much more useful? Is that just a normal, expected outcome, just
35:30 - 36:00 because the model's spending more time thinking, or is there some inherent, deep connection between reasoning capabilities and long context that makes it much more effective? I would say there's a deeper connection, and the connection is that if the next-token prediction task improves with increasing context length, then you can interpret this in two ways. One way is to say, hey, I'm going to load more context into the input, and the
36:00 - 36:30 predictions for my short answer are going to improve as well. But another way to look at this is to say, well, the output tokens are very similar to input tokens. So if you allow the model to feed its output into its own input, then it kind of becomes like input. So theoretically, if you have a very strong long context capability, it should also help you with
36:30 - 37:00 reasoning. Another argument is that long context is pretty important for reasoning because, even if you could just make a decision by generating one token, say the answer is binary and it's totally fine to generate just one token, it might still be preferable to first generate a thinking trace. And the reason is simply architectural: if you need to make many logical jumps through the
37:00 - 37:30 context when making a prediction, then you are limited by the network depth, because that's roughly the number of attention layers. That's what is going to limit you in terms of the jumps through the context. So you're limited. But now, if you imagine that you are feeding the output into the input, then you are not limited anymore. Basically, you can write into your own memory, and you can perform much
37:30 - 38:00 harder tasks than you could by just uh utilizing the the network depths. That that's super interesting. You you and I have also both like related to this reasoning plus long context story. You and I have both been pushing for a long time to try to get uh long output uh landed into the models. And I think developers want this. I I see pings all the time. I'm going to start sending them to you now so that you have to answer this question. But lots of people saying, "Hey, we want longer than 8,000 output tokens. We sort of have this to a
38:00 - 38:30 certain extent now with reasoning token or with the reasoning models. they have 65,000 output tokens with the caveat that a large portion of those output tokens is actually for the model to do the thinking itself versus generating some like final response to the user. How connected are like the long long context input versus like long context output capabilities like is there any interplay between those two things because I feel like for a lot of the like core use case I think that people want is like you know dump in a million
38:30 - 39:00 tokens and then like refactor that million tokens. Um, do you think we'll get to a world where like those two things are actually like the same cap capability? Do you look at them as the same capability or is it like two like completely fundamentally different things from a research perspective? No, I don't think they are fundamentally different. I think uh the important thing to understand is that straight out of pre-training there isn't really any limitation from the model side to generate a lot of tokens. You can just
39:00 - 39:30 put in, say, half a million tokens and tell it, I don't know, copy these half a million tokens, and it will actually do it; we actually tried it and it works. But this capability requires very careful handling in post-training, and the reason it requires careful handling is because in post-training you also have this special end-of-
39:30 - 40:00 sequence token, and if your SFT data is short, then what's going to happen is the model is going to see this end-of-sequence token pretty early in a sequence, and then it's just going to learn: hey, you're always showing me this token within context X, so I'm going to generate this token within context X and stop
40:00 - 40:30 generation, because that's what you are teaching me. This is actually an alignment problem. But one point I want to make is that I feel like reasoning is just one kind of long-output task, and, for example, translation is another kind. Reasoning has a very special format: it packs the reasoning trace into some delimiters, and the model actually knows that we are asking it to do the reasoning in
40:30 - 41:00 there. But for translation, the whole output, not just the reasoning trace, is going to be long. And this is another kind of capability that we want to encourage the model to produce. So it's just a matter of properly aligning the model, and we are actually working on long output. I'm excited, people want it very badly. I think that gets to a broader point around how
41:00 - 41:30 developers should be thinking about best practices for long context, and also for RAG potentially as well. Do you have a general sense of this? I know you gave a bunch of feedback on our long context developer documentation, so we have some of this stuff documented already, but what's your general sense of the suggestions for developers as they're thinking about how to most effectively use long context? So I think suggestion number one is: try to rely heavily on context caching. Let me
41:30 - 42:00 explain the concept of context caching. The first time you supply a long context to the model and ask a question, it's going to take longer and it's going to cost more. But if you're asking a second question after the first one on the same context, then you can rely on context caching to make it both cheaper and faster to answer. That's one of the features we are currently providing for some of the models. So yeah, try
42:00 - 42:30 to rely heavily on this thing: try to cache the files that the user uploaded into context, because it's not only faster to process, it's also going to cost you, on average, four times less on the input token price.
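The snippet below is not the Gemini API; it is a local toy, under the assumption that a provider charges mainly for tokens not covered by a cached prefix, to show why caching rewards a stable prefix and (as discussed a bit later) why the question should come after the context. Check the SDK documentation for the real caching calls.

```python
# Toy model of prefix-based context caching: the "server" only has to process
# the tokens that are not covered by the cached prefix. Not the Gemini API,
# just a local stand-in to show why a stable prefix (and a question placed
# after the context) is what makes the cache useful.

def uncached_tokens(prompt: list[str], cached_prefix: list[str]) -> int:
    """Count tokens the server would still process, given a cached prefix."""
    shared = 0
    for a, b in zip(prompt, cached_prefix):
        if a != b:
            break
        shared += 1
    return len(prompt) - shared

document = [f"doc_token_{i}" for i in range(100_000)]  # long, stable context
cached_prefix = document                                # cached once, reused later
question = ["where", "is", "the", "budget", "table", "?"]

# Question AFTER the context: the whole document matches the cached prefix.
print(uncached_tokens(document + question, cached_prefix))   # -> 6

# Question BEFORE the context: the first token already differs, cache is useless.
print(uncached_tokens(question + document, cached_prefix))   # -> 100006
```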
42:30 - 43:00 And just to give an example of this, and correct me if this is wrong or not the same mental model that you have: the most common application where this ends up being really useful is the chat-with-my-docs, chat-with-PDF, or chat-with-my-data type of application, where the actual original input context, to your point, is the same. And that's one of the requirements, again correct me if my mental model is wrong, of using context caching: the original context you supply has to be the same. If for some reason that input context were changing on a request-by-request basis, context caching doesn't actually end up being that effective, because you're paying to store some set of original input context that
43:00 - 43:30 has to persist from user request to user request. Yeah, I guess the answer is yes to both. It's important for cases where you want to chat with a collection of your documents, or some large video you want to ask questions about, or a codebase. And you are correct to mention that this knowledge shouldn't change, or if it
43:30 - 44:00 changes, then the best way for it to change is at the very end, because then what we're going to do under the hood is find the prefix which matches the cached prefix, and we're just going to throw away the rest. And sometimes developers ask a question like: where should we put the question, before the context or after the context? Well, this is the answer: you want to put it after the context,
44:00 - 44:30 because if you want to rely on caching and profit from the cost savings, that's the place to put it. If you put it at the beginning, and you are intending to put all your questions at the beginning, then your caching is going to start from scratch. Yeah, that's awesome. That's helpful. Other tips, anything else besides context caching that folks should be thinking about from a developer perspective? One thing we already touched on is the
44:30 - 45:00 combination with RAG. If you need to go into billions of tokens of context, then you need to combine with RAG. But also, in some applications where you need to retrieve multiple needles, it might still be beneficial to combine with RAG even if you need much shorter contexts. Another thing which we already discussed is: don't pack the context with irrelevant stuff; it's
45:00 - 45:30 going to affect this multi-needle retrieval. Another interesting thing: we touched on the interaction between in-weight and in-context memory. One thing I must mention is that if you want to update your in-weight knowledge using in-context memory, then the network will necessarily have two kinds of knowledge to rely on. So there might be a contradiction between those two, and I
45:30 - 46:00 think it's beneficial to resolve this contradiction explicitly by careful prompting. For example, you might start your question by saying 'based on the information above', etc. And when you say this 'based on the information above', you give a hint to the model that it actually has to rely on in-context memory instead of in-weight memory. So it resolves this ambiguity for the model.
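A minimal example of this prompting pattern, assuming you assemble the prompt yourself; the wording is just one way to phrase the hint.

```python
# Resolve the in-weight vs. in-context conflict explicitly: point the model
# at the supplied context rather than its pre-training memory. Example wording.
context_chunks = [
    "2025 price list: Plan A costs $49/month.",
    "2025 price list: Plan B costs $99/month.",
]

prompt = "\n\n".join(context_chunks) + (
    "\n\nBased on the information above, and not on any prior knowledge, "
    "what does Plan B cost per month?"
)
# Send `prompt` to the model: context first, question last.
```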
46:00 - 46:30 I love that, that's a great suggestion. And your comment about this tension between in-weight versus not, and again we talked a little bit about this, but how do you think about the fine-tuning angle of this from a developer perspective? The only thing that's maybe more controversial than 'is long context going to kill RAG' is 'should people be fine-tuning at all', and Simon Willison has a bunch of threads about this: does anyone actually fine-tune models, does it end up helping them? How do you think about this from
46:30 - 47:00 the perspective of: would it be useful to do fine-tuning and long context for a similar corpus of knowledge, or does the fine-tuning piece potentially lead to better general outcomes? How do you think about that interplay? Yeah. So, let me maybe elaborate on how fine-tuning could actually be used on a knowledge corpus. What people sometimes do is, well, they get additional knowledge. Let's say you have a big
47:00 - 47:30 enterprise knowledge corpus, say a billion tokens, and you could continue training the network just like we do in pre-training. So you could apply the language modeling loss and ask the model to learn to predict the next token on this knowledge corpus. But you should keep in mind that this way of integrating information, it actually works, but it has
47:30 - 48:00 limitations, and one limitation is that you're actually going to train the network instead of just supplying the context. You should be prepared for various problems: you will need to tune hyperparameters, you will need to know when to stop the training, and you'll have to deal with overfitting.
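For concreteness, continued pre-training on a private corpus usually looks something like the sketch below, here with the Hugging Face transformers Trainer; the base model, file path, and hyperparameters are placeholders, and tuning them, deciding when to stop, and watching for overfitting are exactly the problems just mentioned.

```python
# Sketch of continued pre-training (causal language-modelling loss) on a
# private corpus with Hugging Face transformers. Base model, data path and
# hyperparameters are placeholders; they still need careful tuning.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "gpt2"  # placeholder; in practice a much larger base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

corpus = load_dataset("text", data_files={"train": "enterprise_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="continued_pretraining",
                           num_train_epochs=1,
                           per_device_train_batch_size=2,
                           learning_rate=2e-5),
    train_dataset=tokenized["train"],
    # mlm=False means plain next-token prediction, the same loss as pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()  # after this, the new knowledge lives "in the weights"
```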
48:00 - 48:30 Some people who actually tried to do that reported increased hallucinations from using this process, and they hinted that maybe it's not the best way to supply knowledge into the network. But obviously this technique also has advantages. In particular, it's going to be pretty cheap and fast at inference time, because the knowledge is in the weights, so you're just sampling. But there are also some privacy
48:30 - 49:00 implications, because now the knowledge is cemented into the weights of the network, and if you actually want to update this knowledge, then you are back to the original problem: this knowledge is not easy to update, it's in the weights. So how are you going to do it? You will have to again supply this knowledge through the context. Yeah, I think it's such an interesting trade-off problem from a developer perspective, about how rapidly you want to be able to update the information. I think the cost
49:00 - 49:30 piece of it isn't trivial either: it's not cheap to just keep retraining, whereas I feel like RAG is actually pretty reasonable; you're paying for a vector database, for which there are a lot of offerings, and that's reasonably efficient to do at scale. But continuously fine-tuning new models is oftentimes not cheap. Yeah, a lot of interesting dimensions to take into account. I'm curious about the long-term direction, maybe not from a fine-tuning but from a long context perspective: what
49:30 - 50:00 can folks look forward to in the next three years for long context, maybe from an experience perspective? Will we even talk about long context in three years, or will it just be that the model does this thing and I don't need to care about it, it just works? How are you thinking about this? So I'll make a few predictions about what I think is going to happen. First, the quality of the current one or two million context is going to increase
50:00 - 50:30 dramatically, and we're going to max out pretty much all the retrieval-like tasks quite soon. And the reason I think that's going to be the first step is because you could say, hey, why don't we extend the context? Why stop at 1 million or 2 million? But the point is that the current million context is
50:30 - 51:00 not close to perfect yet. And while it's not close to perfect, there's a question: why do we want to extend it? Because what I think is going to happen is, when we achieve close-to-perfect million context, it's going to unlock totally incredible applications, something we could never imagine would happen; the ability to process information and
51:00 - 51:30 connect the dots will increase dramatically. This thing can already simultaneously take in more information than a human can: I don't know, go watch a one-hour video and then immediately after that answer some particular question about that video, like at what second someone drops a piece of paper. You can't really do that very precisely as a human. So what I think is
51:30 - 52:00 going to happen is these superhuman abilities are going to become more pervasive: the better long context we have, the more capabilities that we could never imagine are going to be unlocked. So that's going to be step number one: the quality is going to increase and we're going to get nearly perfect retrieval. After that, what's going to happen is the cost of long context is
52:00 - 52:30 going to decrease. I think it will take maybe a little bit more time, but it's going to happen, and as the cost decreases, longer context also gets unlocked. So I think reasonably soon we will see a 10 million context window become like a commodity: it will basically be normal for providers to give a 10 million
52:30 - 53:00 context window, which is currently not the case. When this happens, that's going to be a game changer for some applications like coding, because I think with one or two million you can only fit somewhere between a small and a medium-sized codebase in the context, but 10 million actually unlocks large coding projects being included in the context
53:00 - 53:30 completely, and by that point we'll have the innovations which enable near-perfect recall over the entire context. This thing is going to be incredible for coding applications, because the way humans code, well, you need to hold as much as possible in memory to be effective as a coder, and you need to jump between the files
53:30 - 54:00 all the time, and you always have this narrow attention span. But LLMs are going to circumvent this problem completely. They're going to hold all this information in their memory at once, and they're going to be able to reproduce any part of this information precisely. Not only that, they will also be able to really connect the dots: they will find the connections between the files, and so
54:00 - 54:30 they will be very effective coders. I imagine we will very soon get superhuman coding AI assistants; they will be totally unrivaled, and they will basically become the new tool for every coder in the world. So when this 10 million happens, that's the second step. Now, going to, say, 100 million: well, it's more debatable. I think it's going to happen, I
54:30 - 55:00 don't know how soon it's going to come and I also think we will probably need more deep learning innovations to achieve this. Yeah, I love that. What one one sort of quick followup across all three of those dimensions is like how much from your mind is this like hardware story or like the infrastructure story relative to like the model story? Like there's obviously a lot of work that has to happen to like actually serve long
55:00 - 55:30 context at scale which is why it costs more money to do long context etc etc. Um, do you think about this from a research perspective or is it like, hey, the hardware is sort of going to take care of itself. The TPUs will do their job and like I can just focus on on the research side of things. Oh, well, yeah. I mean, just having the the chips is is not enough. Uh, you also need very talented uh inference engineers. And I'm actually I'm I'm really impressed by the work of our
55:30 - 56:00 inference team. What they pulled off with the million context that was uh incredible and without such strong uh inference engineers I don't think we would have delivered one or two million context uh to customers. So this is uh it's a it's a pretty big uh inference engineering investment as well and no I don't think it's going to
56:00 - 56:30 resolve itself. Yeah our our inference engineers are always working hard because there's always uh we always want long contexts on these models and it's uh yeah it's not easy to make it happen. How do you think about the sort of interplay of a bunch of these agentic use cases with long context? Do you think is it like a fundamental enabler of of different agent experiences than than you could have before or like what what's the interplay between those two those two dynamics? Well, this is an interesting
56:30 - 57:00 question. I think agents can be considered both consumers and suppliers of long context. Let me explain this. For agents to operate effectively, they need to keep track of the past state, like the previous actions that they took, the observations that they made, etc., and of course the current state as well. So to keep all these previous interactions in memory,
57:00 - 57:30 you need longer context. That's where longer context is helping agents; that's where the agents are the consumers of long context. But there's also another, orthogonal perspective: agents are actually suppliers of long context as well. And this is because packing long context by hand is incredibly tedious, like if you have to upload all the documents that you
57:30 - 58:00 want by hand every time, or upload a video, or, I don't know, copy-paste some content from somewhere on the web. This is really tedious. You don't want to do that. You want the model to do it automatically. And one way to achieve this is through agentic tool calls. So if the model can decide on its own, hey, at this point I'm going to fetch some more information, and then
58:00 - 58:30 it's going to pack the context on its own, and so, yeah, in that sense agents are the suppliers of long context. Yeah, that's such a great example. My two cents, and I've had many conversations with folks about this: I think this is actually one of the main limitations of how people interact with AI systems. Your example of it being tedious: it's so tedious. The worst part about doing anything with AI is that I have to go and find
58:30 - 59:00 all the context that might be relevant for the model and personally bring that context in, and in many cases the context is already on my screen or on my computer, or I have the context somewhere, but I have to do all the heavy lifting. So I'm excited; we should build some long context agent system that just goes and gets your context from everywhere. I think that would be super interesting, and I feel like it solves a very fundamental problem, not
59:00 - 59:30 only for developers but from an end user of AI systems' perspective: I wish the models could just go and fetch my context so I didn't have to do it all. Yeah, MCP for the win. I love that. Nikolai, this was an awesome conversation, and thank you for taking the time. I'm glad we got to do this in person, and I appreciate all the hard work from you and the long context teams. Hopefully we'll have lots more exciting long context stuff to share with folks in the future. Yeah, thanks for inviting me. It was fun to
59:30 - 60:00 have this conversation. Yeah, I love it.