Inside the Making of GPT-4.5: A Journey of Challenges and Triumphs
Pre-Training GPT-4.5
Estimated read time: 1:20
Summary
In a unique conversation, the masterminds behind OpenAI's GPT-4.5 delve into the complexities and unexpected successes of creating this powerful AI model. Initially, the team faced daunting challenges, from scaling system resources to fine-tuning data algorithms. Despite these hurdles, they achieved a model perceived as 10x smarter than its predecessor, GPT-4. Throughout the development, the team balanced solving unforeseen issues with maximizing the new computing power available. They explored the dynamics between machine learning innovations and system architecture, questioning if future models could surpass human intelligence in data efficiency.
Highlights
The GPT-4.5 project involved pushing computational boundaries with over 100,000 GPUs 💻.
The team reportedly reduced system issues over time, improving training efficiency and model performance ⚙️.
Data compression and algorithmic refinement were crucial to achieving the desired improvements in AI intelligence 📊.
Challenges faced included hardware failures, data bottlenecks, and algorithmic unpredictability 🤯.
Pre-training models like GPT-4.5 continue to show scalable growth in terms of efficiency and capability 📈.
The ideal AI system remains an aspirational goal, with many hurdles yet to overcome in system design and execution 🎯.
Key Takeaways
Developing GPT-4.5 involved massive computational challenges and necessitated advanced system architecture 😅.
The team set out to make GPT-4.5 ten times better than its predecessor, and their success exceeded expectations 🎯.
Complex algorithms and architecture changes were integral in achieving a model that surpassed GPT-4's capabilities 💡.
Continuing advancements in AI face bottlenecks in data efficiency rather than computational power 🚀.
The project underscored the importance of data compression and its correlation to intelligence 🔑.
GPT-4.5 demonstrated improved nuances in user interactions, reflecting its smarter, refined abilities 🤖.
Overview
At OpenAI, the journey of creating GPT-4.5 was marked by intense challenges and innovative breakthroughs. During a discussion with the team, they recounted the intricate process of managing vast computational resources—over 100,000 GPUs—while addressing unpredictable hardware failures and optimizing data usage. The scale required collaboration across multiple domains, balancing ML advancements with system architectural changes to push the frontiers of AI capability.
Central to their success was a focus on maintaining efficiency while handling unforeseen issues and pushing the development of data algorithms. The team aimed to improve upon GPT-4 by tenfold, creating a model that not only surpassed expectations but also revealed new depths of AI intelligence. Key insights included the potential for optimizing data compression and efficiency, emphasizing that computational power is not the sole bottleneck in AI advancement.
Looking to the future, the team debated the potential to achieve human-level data efficiency, recognizing that while algorithms are improving, there is significant ground to cover. The drive to improve data efficiency marks a new phase in AI research, one that will shift focus away from purely computational constraints to more nuanced understandings of data. As OpenAI strides forward, the journey of GPT-4.5 highlights both the achievements and the ongoing challenges facing AI developers.
Chapters
00:00 - 01:00: Introduction and Overview The chapter introduces a different approach to the usual discussions about new product launches. Instead of focusing solely on introducing a product, it delves into the research behind the development of GPT-4.5. Despite initial modest expectations, the product exceeded these, with users expressing high levels of satisfaction, often noting significant improvements over GPT-4, both in obvious and subtle ways. The chapter sets the stage for an in-depth exploration of what went into making GPT-4.5.
01:00 - 03:00: Team Introductions and Background In this chapter, the team responsible for creating GPT-4.5 is introduced. The discussion begins with the challenges and requirements of creating a large-scale model like GPT-4.5, emphasizing the need for substantial human resources, time, and computational power. Alex, one of the key team members, introduces himself and highlights his role leading pre-training machine learning for the model.
03:00 - 04:00: Project Inception and Planning The chapter 'Project Inception and Planning' begins with Amin Chian, the chief system architect, introducing himself and mentioning his role in overseeing systems and networking at OpenAI. He is joined by Dan, who focuses on data efficiency and algorithms. They discuss the origins and initial planning stages of a project that began approximately two years ago. The team anticipated the arrival of a significant new cluster, marking the project's inception and setting the stage for further development.
04:00 - 06:00: Launch Challenges and Execution In this chapter, the focus is on the challenges and execution strategies involved in launching a major initiative. The narrative reveals that significant effort went into planning and risk management before the main event. The team was proactive in identifying key features and conducting extensive derisking runs. They developed a comprehensive plan encompassing all aspects, from systems and machine learning to other integral components, emphasizing a meticulous approach to preparation. The chapter highlights that successful execution involves not only the actual launch but extensive pre-launch activities to ensure readiness and mitigate potential risks.
06:00 - 09:00: Scaling the Model The chapter titled 'Scaling the Model' delves into the collaborative efforts between the machine learning (ML) side and the system side. It highlights the initiation phase all the way to the decision on which specific model to train. The chapter notes the challenges associated with maintaining pace due to the introduction of new computational resources, making priority planning less straightforward.
09:00 - 12:00: Model Training Challenges Despite thorough projections on both machine learning and system sides, launching often begins with unresolved issues. Progress requires increasing compute capacity and resolving unforeseen problems to close the gap between predictions and reality.
12:00 - 17:00: Algorithmic Innovations and Future Needs The chapter titled "Algorithmic Innovations and Future Needs" discusses the broad and intensive process involved in executing algorithmic systems, emphasizing the significant amount of people, energy, and momentum needed for a prolonged training period. The narrator reflects on the usual distance between expectations and actual outcomes, particularly in the early stages of system implementation.
17:00 - 22:00: System Limitations and Improvements "System Limitations and Improvements" discusses the ongoing dilemmas in project launches, specifically the decision between delaying launch for issue resolution versus launching early and solving problems as they arise. It highlights the unpredictability of issues during the initial project phases and the essential balance needed to avoid unreasonable delays while anticipating and navigating unforeseen challenges.
22:00 - 27:00: Model Performance and Insights The chapter titled 'Model Performance and Insights' delves into the approach of managing known variables and preparing for unknowns in a project focused on achieving enhanced model performance. The central goal of the project, GPT-4.5, was to create a model that is significantly more advanced, labeled as 10 times smarter than its predecessor, GPT-4. The discussion highlights the importance of strategizing for both predictable and unpredictable elements that influence the outcome and duration of model runs.
27:00 - 33:00: Debugging and Bug Fixes The chapter titled 'Debugging and Bug Fixes' discusses the journey and challenges of delivering a model significantly smarter than GPT-4, a goal set two years earlier. The journey involved weighing numerous potential improvements and setbacks. Despite the complexity of the process, the final result was a model ten times smarter than GPT-4, measured in terms of the effective compute put into it during the execution phase.
33:00 - 40:00: Team Dynamics and Collaboration The chapter titled 'Team Dynamics and Collaboration' discusses the challenges faced when scaling hardware resources, specifically GPUs, from 10,000 to 100,000 units. Initial time estimates were surpassed, indicating the increased difficulty in problem-solving at larger scales. Observational skills play a crucial role, as issues that manifest at larger scales are often detectable with careful attention at smaller scales.
40:00 - 47:00: Unsupervised Learning and Compression The chapter discusses the challenges and phenomena associated with scaling up systems, particularly in unsupervised learning and data compression. It highlights how events or issues that are rare on a small scale can become catastrophic when they occur at a large scale. This becomes especially problematic if not anticipated. An example given is the failure rates and types of failures observed in infrastructure systems.
47:00 - 50:30: Scaling Laws and Model Improvements The chapter discusses the benefits of using a large pool of resources for observing failures and the statistical distribution in model execution. It highlights the advantages of having access to a broad range of samples, which allows for a more comprehensive understanding of both the types and frequency of failures. The importance of the network fabric and individual accelerators in this process is also emphasized. This extensive observation leads to improved insights and potentially enhanced model performance.
50:30 - 50:36: Conclusion The conclusion emphasizes the complexity and difficulty of working at the edge of scale, where everything must function almost perfectly for success. It highlights the challenges faced when pushing the boundaries, noting that while advancing to the next frontier is tough, tasks that were once frontier challenges become easier over time. The text specifically mentions the significant effort by OpenAI in developing GPT-4.5, which required contributions from hundreds of people.
Pre-Training GPT-4.5 Transcription
00:00 - 00:30 how many parameters is it still or do we not care i think we should just Okay so usually when we do these it's to talk about a new product that we're about to launch um but we're gonna do something a little bit different today which is to talk about the research that went into our product when we launched GPT-4.5 we thought people were going to like it we were very proud of the model but people liked it much more than we thought people would said all kinds of things like "I never thought I was going to have this experience talking to a model it's so different than GPT-4 it's way better in these ways that are either obvious or hard to explain or this or that." But uh there was like a lot of interest about what went into making
00:30 - 01:00 GPT-4.5 so today we have some of the key team that made GPT-4.5 and we're going to talk about it uh we're going to talk about sort of like what went into it what we learned and just like what it takes to make a giant model like this um actually maybe we start with that what does it take to make a giant model like this you go first uh a lot of people and a lot of time and a lot of compute oh maybe you guys should introduce yourselves alex you want to start sure uh yeah hi I'm Alex uh I work a lot on pre-training data i also led pre-training ML for GPT-4.5
01:00 - 01:30 uh I'm Amin Chian I'm OpenAI's chief system architect i oversee systems and networking broadly at OpenAI i'm Dan i work on data efficiency and algorithms yeah okay so what what goes into it um yeah so I think we started this project basically two years ago or or so um and uh we kind of knew that we had a big new cluster coming online uh and we we kind
01:30 - 02:00 of saw this on the horizon and we started doing a bunch of work to kind of convince ourselves uh of the features that we wanted to include in the run doing a lot of large de-risking runs building out a very long plan for this um and kind of across the full stack from systems ML everything um and yeah it was it was a long story of execution for uh de-risking it uh and kind of preparing for the run um before the run itself which you know itself was a very large endeavor yeah I think it's a process that starts
02:00 - 02:30 at inception with a collaboration between ML side and the system side and goes all the way to the time that we know what model precisely we want to train and then it's starting the uh run process in itself with the pace that we are working at and especially trying to make use of our most recent compute that is made available to us it becomes something that is difficult to priority plan perfectly mhm so we almost
02:30 - 03:00 always go into a launch with a lot of unresolved issues um and try to make forward progress throughout the run despite all the challenges basically add more compute resolve all the issues that we probably might not have anticipated despite all the projections that we had both on the ML side and the system side and try to basically close the gap between what we predicted should happen and what is happening uh I think that is at a very
03:00 - 03:30 high level uh and broadest stroke is the entirety of the process and the tail end of it is the execution which takes a lot of people a lot of energy and momentum for a prolonged period of time to go through the training process how how close do you feel like we were to what we expected to happen to what actually happened um on the usually at the beginning uh I'm talking about the system side of it u we are usually far
03:30 - 04:00 away from where we expect it to be and there is always a choice to delay the launch and basically defer until more and more issues are resolved or launch early and try to basically figure it out as we go it's always a balance of figuring out not to delay the process unreasonably uh but almost always there are issues that we don't necessarily know at the inception that we are going to run into
04:00 - 04:30 and the entirety of the process is try to uh handle the known to the extent that we can and have a plan for what how the run should go and as we make progress just deal with the unknowns which is the variability on let's say if a run is successful how long it would take and so how far off have they been well yeah i think I guess one at the highest level I guess with this project we set out to do GPT-4.5 which means like 10x smarter than
04:30 - 05:00 GPT-4 so that was sort of I think the the initial goal starting like two years ago that we had set our sights on um and then there's a lot that kind of you know kind of happened along the way of like oh we think you know can we do better or worse um I think uh it was a very complicated road but in the end we got to a model that we feel hit this mark of 10x smarter than GPT-4 um in terms of kind of the effective compute that we we put into it yeah on this execution side of
05:00 - 05:30 it of course initially it was far far away from how uh long we thought that is it did take longer than we thought uh yes but I think the process is to try to shorten it to basically match what we would two two part question about that why do why does going from you know making up numbers here 10,000 GPUs to 100,000 GPUs why does that make the problem much harder um a lot of issues it's I I do believe that issues that you observe at the scale if you have a very keen eye you would observe them at a
05:30 - 06:00 smaller scale it's not that they only manifest at larger scale but something that is a rare occurrence becomes something that is catastrophic at a scale uh especially if you haven't anticipated it being what are some of the kinds of things that have become catastrophic I mean among those things is that I think is quite well known is uh issues with uh the infrastructure uh the failure rates that you observe the the variety of failures that you observe uh
06:00 - 06:30 in both in terms of the types of failures and also the the count itself so we get to observe something that I'm sure the vendor hasn't observed because this is a large pool of samples and we get to observe the entirety of the statistical distribution of uh a large pool of resources that that we are executing on the fabric the network fabric is always part of it the individual accelerators part of it but at the end of the day this is the beauty
06:30 - 07:00 of it at at the same time that almost everything needs to work as expected for the result to hold and the job is to basically minimize that variance second part of the question obviously it's really hard this is for all of you really hard to do things at the edge of scale um so you know even going as we go off and do the next training run even kind of crazier um but I've also noticed it gets much easier to go do things that are now no longer frontier so it took like hundreds of people almost all of OpenAI's effort to do GPT-4.5
07:00 - 07:30 if you guys could go pick whoever you wanted what is the smallest team from OpenAI that could go retrain GPT-4 from scratch today with everything we know and have and all the systems work i think to get to a GPT-4 level model it's probably on the order of maybe five to 10 people yeah we did it with that type of that number of people with GPT-4 4.5 was different in the sense that a lot of the work was a lot more people coming
07:30 - 08:00 together and it was a very different effort than before I would say but now that we've done that work I think like the stack we've improved a lot and like if you were to retrain like I mean we kind of did this a little bit in the process of training GPT-4.5 we trained GPT-4o which was a GPT-4 caliber model that we retrained using a lot of the same stuff coming out of the GPT-4.5 research program um and I think doing that run itself actually took a a smaller number of people right what about from your perspective Dan
08:00 - 08:30 or just sort of like why is why is training big models hard i think doing anything new is hard i think uh even just finding out that someone else did something it becomes immensely easier because the hard part is having the conviction to do something in the first place i feel like just the fact that something is possible is a huge cheat code that just makes it Yeah yeah i mean we're always like we're scaling 10x beyond what we did before with these GPT pre-training runs and it's there's
08:30 - 09:00 always new things that you find that are interesting that you couldn't have anticipated necessarily what do we need for the next 10x or 100x in pre-training scale data efficiency what does that mean easy answer obviously I know but what does it mean so the the transformer the GPT is spectacular at making productive use of of data it it absorbs information and it it it compresses and and generalizes to some degree but it's its defining character its signature is
09:00 - 09:30 absorbing the information very efficiently with compute but there's a somewhat of a ceiling to how deep of an insight it can gain from the data and so at some point as the compute just keeps growing and growing and growing and the data grows much less quickly the data becomes the bottleneck of this standard paradigm and it requires some algorithmic innovations to be able to spend more compute to
09:30 - 10:00 learn more from the same amount of data what do you guys think we need to keep scaling in addition to that i I think this answer is system side i think even between different GPTs that we have trained GPT-4.5 was the sheer uh volume of work that was required basically the changes that were required for us to make
10:00 - 10:30 uh was a byproduct of the model specification the same we wouldn't have been able to train uh GPT-4.5 on the precise same stack as we did GPT-4 so let's say state management our approach with state management changed we had to scale to more compute and that compute was not available as part of one cluster we had to go to multicluster training uh and imagine it's many many different work
10:30 - 11:00 streams like that that have to come together in a short period of time for us to be able to do this and for making another 10x jump of course and other issues that we previously knew do exist it's just that it's a choice for expediting execution that we skip for this one for the next one we have to do it there's no way around it and it's always those choices that basically make the timeline because building the perfect system would take much longer uh so we are always compromising
11:00 - 11:30 on what is the fastest way to get to this result the systems is not an end in its own the product the thing that it produces is so for the next 10x for me it would be uh of course fault tolerance and a form of fault tolerance that we can co-design with the workload such that we don't have to worry the operational burden of keeping uh such a massive run going is not like
11:30 - 12:00 our prior systems so I would argue that with our prior stack 4.5 was at the edge of what we could keep up with do you know what percent of steps failed in the 4.5 run due to some component somewhere i actually don't have a number off the top of my head but it is usually the way things work this is a fascinating thing uh there are issues that are early on uh
12:00 - 12:30 in the lifetime of a new generation of hardware is not necessarily well understood or well studied we start the process and we want to make forward progress in presence of such issues of course the failure rates are quite significant earlier in the run they're not necessarily um it could very well be that once we find the root cause and eliminate it the
12:30 - 13:00 total number of failures significantly drops and this is often the case it's just that uh we learn more the infra that is some would call it clean up of the infra or understanding uh fundamental issues about the infra uh and the state improves significantly but that earlier phase of execution is almost always quite painful because we are learning about what are the new failure modes in the new uh uh and the new infrastructure while making forward
13:00 - 13:30 progress of course later on um the failure rates drop significantly the uptime improves overall and so on but it's just that a priori it is hard to predict what this early phase of a generation of infrastructure's lifetime failure risk would look like and designing for the steady state might result in very poor um availability earlier on in the process.
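As an aside on what fault tolerance buys you in practice: the baseline way long runs survive the failure rates described above is to checkpoint periodically and resume from the last good state after a crash. The sketch below is a generic illustration with made-up names and intervals, not OpenAI's training stack; the workload-level, co-designed fault tolerance Amin describes is aimed at reducing exactly this kind of operational burden.

```python
# Minimal checkpoint-and-resume loop; everything here (paths, intervals,
# the state dict) is hypothetical and purely illustrative.
import os
import pickle

CKPT = "checkpoint.pkl"     # hypothetical checkpoint path
CHECKPOINT_EVERY = 100      # steps between checkpoints (illustrative)

def save(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def load_or_init():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "weights": None}  # stand-in for real model state

state = load_or_init()
while state["step"] < 1_000:
    # train_one_step(state) would go here; any hardware fault before the
    # next checkpoint only loses the steps since the last save.
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save(state)
```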
13:30 - 14:00 obviously reasoning models are a huge part of our future and you know but but if we were going to put this aside for a second and think about just how far we could go on classical pre-trained models assuming we had unlimited GPUs and you know unlimited networking and unlimited power but we were stuck with all of our current problems the stuff still broke we didn't figure out fault-tolerant training we only had the data we have whatever else how and we kind of use the convention of each major number of GPT is 100x increment um how far could we go like GPT-what could we train
14:00 - 14:30 with what we know today i think on that we get to like a 5.5 on the ML on the algorithm side I don't I don't think there's like a clear limit that we we found i think um yeah I think we're just kind of scratching the surface of more data efficient algorithms uh and I think better ways to leverage the data that we have um it's very interesting because I think up until this rough point in time like if you look even through GPT-4 we were largely just in a compute constrained environment um so that was kind of where all the research
14:30 - 15:00 was going into but now we're you know in a very different kind of regime um starting with 4.5 for some aspects of the data where we are much more data bound um so there's now a lot more excitement about this research it is a crazy update that I don't think the world has really grokked yet i should pick a different word that the world hasn't understood yet uh that we're no longer compute constrained on the best models we can produce that's just like that was so the world we lived in for so long what was the most interesting ML thing we learned during the 4.5 training
15:00 - 15:30 that you want to share oh gosh um or either of you i don't know what I could share in general terms I think it is being off of the prediction and trying to figure out why we're off of the slope that we predicted to be on yeah I think that most surprising maybe we can yeah I think one of the more surprising things that we found um was just uh kind of how different aspects of
15:30 - 16:00 what we were working on on the ML side and what we put into the run scaled and and some things that did or didn't scale well um that we we kind of discovered through the process of training this model um it was uh it it's yeah we've learned a lot through this i'd say the the two defining characteristics of the GPT paradigm have been that you can predict the test loss and it scales predictably and magically lower test loss means greater intelligence in all these intangible amazing mysterious ways.
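As a rough illustration of what "predictable test loss" means in practice, here is a minimal sketch of fitting a power-law scaling curve to a handful of small runs and extrapolating it to a much larger one; the functional form, the numbers, and the 1e23-FLOP target are illustrative assumptions, not OpenAI's actual methodology or data.

```python
import numpy as np

# Hypothetical (compute, test loss) pairs from small de-risking runs.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (made up)
loss = np.array([3.10, 2.72, 2.41, 2.18])      # held-out test loss (made up)

# A simple power law L(C) = a * C^b is a straight line in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"fit: L(C) ~ {a:.2f} * C^({b:.4f})")

# If the fit holds, the same line predicts the loss of a far larger run
# before you commit the compute to it.
print("predicted loss at 1e23 FLOPs:", round(float(a * 1e23 ** b), 2))
```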
16:00 - 16:30 are you a maximalist on that do you like fully believe that well I was going to say that one of the interesting things we found from 4.5 we tested it again and the model has all of these incredibly nuanced abilities that were not in in anyone's bingo card specifically we just like the conviction was it's going to be more intelligent in ways that are really hard to characterize ahead of time and then you see in the deployment you see in the the
16:30 - 17:00 user satisfaction that it's smarter in all these very subtle ways it has more common sense knowledge it understands nuance and context and that's that's the magic that came out of the few extra bits of test loss and I think that that part of the scaling held up really well what was the most like positive moment of the whole training run what was like what's like the favorite memory there's obviously lots of pain but hopefully that's somewhat fit i I do have one but yeah um yeah I can go uh
17:00 - 17:30 yeah so one one that comes to mind for me um we uh kind of we've we worked a lot on the ML of the run during the run itself as well and I think some of the the changes that we made during the run had a quite quite good impact i think uh maybe perhaps better than anticipated and this was a pretty exciting moment for us yeah I I think for me this was probably the biggest effort in terms of um I'd say time
17:30 - 18:00 during the run while we are building I mean I mean running things we are building things in parallel um of course to get there faster we are parallelizing work aggressively um and there is conviction that it will pay off we will bypass this performance cliff that makes the model essentially untrainable in terms of the time it would take to to train um and there is a
18:00 - 18:30 plan everybody's executing on it but it's taking long it is a hard it's hard work definitely harder than I imagined it would be my projections were off uh by how long it would take to basically get uh those issues resolved and seeing I think seeing the moment that the whole team once a few of those issues got resolved we got a big performance boost after I remember that everybody got I mean you could sense that the energy changed it's
18:30 - 19:00 just that everybody feels excited and now more motivated to push uh everything through the end it's just it's a fascinating part of it seeing the ETA on our status tracker like yeah has constantly shifted from let's say two years to uh something uh tangible uh then the effect it has on the morale of the team and everything else is I think that was the beauty of it the other part I would uh
19:00 - 19:30 call out is the ML side of it didn't stop the ML co-design part of this didn't stop at the launch of uh at the launch time and people tagged along issues that were left to say we'll figure it out later people were actively working on it and shipping things that helped um with execution time uh and I think that spirit of teamwork and basically not having team boundaries of I did my work I pass it on to you is something very powerful i was going to say there's
19:30 - 20:00 been a lot of focus on how the run itself was challenging and predictions were off but this is despite a tremendous amount of sophisticated planning oh this is of course Do you want to say more about it by far the most planning yeah I think I mean like I said we we started basically working on this project like a year before we um even started training the run itself um and through this we had a number of very large de-risking runs we we took a lot of care to kind of
20:00 - 20:30 very carefully sequence all the changes that we wanted to put into it starting from sort of like very high confidence known good config you can think like GPT-4 you know we really understand this this setup um from an ML perspective and just layering things in and being very careful to very carefully study the the scaling uh of all of the changes we're making so it's not just good enough that we see some amount of win we also want any any win from any new feature to be like persistent across scale and not to be tapering off lots of things look good
20:30 - 21:00 at small scale don't look good at large scale uh so we had to be very very paranoid through this process um and really I think we continue to iterate on our scaling laws methodology we learned a lot on that front uh through this de-risking process uh that's continuing to guide us for future GPTs i I remember something I missed about another very fun moment um so this is the torch.sum one uh but imagine it is very unlikely that we
21:00 - 21:30 launch a run and it doesn't have bugs i mean all sorts of bugs it is just a given that uh but we need to make forward progress and we need to be sure that okay are we actually sure that this is on track and these bugs are not super negatively affecting the health of the run while we are absolutely sure we were initially very sure that there are bugs that are consequential we do we have built a lot of systems around giving us
21:30 - 22:00 visibility and ability to distinguish between is it a hardware fault what type of hardware fault it is is it some form of corruption or is it some ML potentially an ML bug or something like races in our code um what happened was that we we of course had a few open threads about all sorts of issues with different symptoms uh all correctness related issues um and at the end of the
22:00 - 22:30 day uh so so of course we have found bugs fixed them and so on we arrived at this point that we have multiple open threads and there's a lot of uh thought around what is causing all these are they different bugs or it's one bug and causing all of these uh people went I mean uh around the table and say everybody vote which one do you think is the most probable cause of this bug and the one that turned out to be the bug got the least votes it's just that it is
22:30 - 23:00 a torch.sum bug a simple summation in uh upstream PyTorch and no what was the bug so the the bug is for the this is funny because uh for this specific code path we were triggering it and for the for context remember that we are mostly using Triton kernels it's just that for some corner cases that let's say the ops don't matter much we basically fall back to I mean it is okay to run torch ops uh
23:00 - 23:30 and the specific code path our data triggered uh in the torch.sum function uh basically had a bug um and it was only occurring very infrequently it's just that it's a it's data distribution dependent and it was triggering in some cases illegal memory accesses because it was computing some offsets into some bit of memory the fun revelation at the end is that once
23:30 - 24:00 uh somebody fixed the bug I mean our engineers figured out oh I found the bug it is this line it is torch.sum let's ship a fix and see if it fixes everything uh it fixed all the outstanding bugs that had seemingly distinct uh symptoms and it was quite fun and everybody I think we were renaming the slack channels from multiple bug theory to single bug theory uh and um that was quite a lot of fun to
24:00 - 24:30 basically okay when did that happen I can't remember now it was live from very early days of the run up until I I think a good portion of the run I would say um 40% in do you guys remember how someone found it uh I I do remember that i think it was in the list of so imagine there there's a sequence of uh kernel invocations and the um the second one was the one that was triggering illegal
24:30 - 25:00 memory accesses and that was some very complicated uh kernel that we wrote um and of course our team would suspect that there is a bug there uh obviously there must be a bug there and I mean several very clever people just line by line everybody is looking at it a bug was found we rolled it out some of those issues disappeared but not everything and at some point
25:00 - 25:30 somebody was I mean in the list torch.sum was the one feeding inputs to this kernel one of the many inputs of this kernel and somebody started uh one of our engineers started looking through the code and different code paths and said oh this very unlikely code path that probably most people don't hit we do hit uh and he said okay this line is buggy let's I mean of course the only verification and validation we have for
25:30 - 26:00 it is ship the change observe if all the crashes disappear and it did disappear I think that was the validation we needed so but uh I mean this is the thing it's just that it's a crash at a very very I mean a slow rate it's one every hundred steps one every thousand steps and it's something very easy to dismiss but it's just that we should not have that in the run as a discipline that we do have and it's just not giving up on it is the story.
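For a concrete picture of the pattern described in this story, here is a minimal, purely hypothetical sketch: a custom kernel serves the hot path while rare corner cases fall back to stock torch ops such as torch.sum. The function names and the trigger condition are invented for illustration; this is not OpenAI's code and not the actual upstream PyTorch bug.

```python
import torch

def fused_row_sum(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a hand-written Triton kernel on the hot path.
    return x.float().sum(dim=-1).to(x.dtype)

def reduce_rows(x: torch.Tensor) -> torch.Tensor:
    # Common, performance-critical case goes through the custom kernel.
    if x.is_contiguous() and x.dim() == 2:
        return fused_row_sum(x)
    # Rare, data/layout-dependent corner case falls back to torch.sum.
    # A bug hiding on a path like this only fires for the inputs that
    # happen to reach it, which is how a crash can show up once every
    # few hundred or few thousand steps and be easy to dismiss.
    return torch.sum(x, dim=-1)
```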
26:00 - 26:30 alex I understand what your life is like i think most people can imagine it like leading up to like pushing go on the run but after that happens like what is your day-to-day like are you just like sitting there watching loss curves like how does it go definitely a lot of watching loss curves we all did a lot of that um no I think uh past that there's a variety of things still trying to work with with systems uh to uh kind of you know get out improvements that we didn't
26:30 - 27:00 get in um on the co-design front uh before launch time um there's a lot of things that we try to continuously monitor of the run to see if anything is kind of trending like we're not uh expecting so this is loss curves but there's a bunch of other statistics that we can look at um and then yeah kind of uh working a bit towards like any other kinds of improvements we could we could do to the run as well from an ML perspective u so on the data front it becomes less busy immediately once you click go but uh otherwise there's still plenty to be done i I think what we lean
27:00 - 27:30 on for ML is a lot of correctness imagine there's a lot of noisy signal and you are at times reading tea leaves it's just is this healthy or not of course if you wait long enough you would know it was it was it healthy or not it's just that the responsibility and how often are there false alarms there like how often are you like "Oh this is looking really bad." And then it's fine pretty frequently I think probably about half the time maybe because we're a paranoid bunch so I think yeah if it wasn't half the time we wouldn't be looking closely enough i have a short
27:30 - 28:00 lightning round of questions sure if you could have any one ML question answered before the next big run what would you most like to know i think uh what we should what what algorithms we should employ with uh for for limited data in certain domains is the main thing it's a kind of vague question but big answer and if you could have any change to current hardware you could have like a new kind of network invented or like a totally new chip architecture what is the most what is the limiter on systems at this point not
28:00 - 28:30 you don't get to like say like oh I want yeah so this is a transport level a networking transport level change it's just that where there are faults that uh could be worked around uh at a different level than the application level i would rather uh the transport the network transport do its job and uh keep running and give
28:30 - 29:00 me the available bandwidth without me worrying about it is there anything promising on that front uh yes we can talk about Okay that's good at least um two-part one for you Dan uh so on the data efficiency question humans for whatever other flaws we have about learning things we seem unbelievably data efficient yeah how far away is our very best algorithm currently from human level data efficiency really hard to measure apples to apples i think just like vibes wise in language astronomically far away
29:00 - 29:30 100,000x something in that in that range uh it depends on whether you count every bit of pixel information on the optical nerve but but we don't know algorithmically how to leverage that to be human level at text so I think algorithmically we're yeah quite quite quite far away apples to apples and then part two is do you think with our
29:30 - 30:00 current our like the direction of our current approach we will get to human level data efficiency or is that just not going to happen and doesn't matter well I think for for decades deep learning has been about compute efficiency and what's what what's magical besides the data and compute growth is that the the algorithmic changes stack so well you've got different people different parts of the world finding this little trick that makes it 10% better and then 20% better and they just keep stacking there just
30:00 - 30:30 hasn't yet been that kind of mobilization around data efficiency because it hasn't been worth it because when the data is there and your compute limited it's just not worth it and so now we're entering a a new stage of AI research where we we'll be stacking data efficiency wins 10% here 20% there and I think it would be a little foolish to make predictions about it hitting walls
30:30 - 31:00 that we have no reason to predict a wall but but it's there the brain certainly operates on different algorithmic principles than anything that's a small tweak around what we're doing so we have to hedge a little bit there but I think there's a lot of reason for optimism this next one is the same question for all three of you uh yes or no or you can add you can add you can add explanation too will humanity ever do a 10 million GPU or greater synchronous pre-training run
31:00 - 31:30 i don't know if it'll exactly be a pre-training run but I think there'll probably be some kind of training run that there will be 10 million GPU training yeah yeah yeah i don't know if it'll look it'll probably look totally different than what we're doing today but there will be something that is kind of in the spirit of unsupervised learning that is like at that scale i think I think so i would call it semi-synchronous uh and the scale of it I hope so i think it sounds very interesting you would call it semi-synchronous you said yes i think not fully synchronous but it's just laws of nature can't bend it
31:30 - 32:00 completely i think it's likely to be more decentralized than that there'll definitely be 10 million GPUs working together on an AI system that is learning and doing things but it might not all the parts of the brain won't necessarily all be talking to each other that makes sense um can you say anything that we've learned or observed about how uh smarter and bigger pre-trained models
32:00 - 32:30 correlate with how good a model is at learning reasoning um yeah so I think what we've kind of observed is uh it is better pre-training and unsupervised learning just tends to lift kind of broad-based intelligence of the model and aid a lot in generalization uh which we found to be like really complementary I think with reasoning which can tend to be a little bit spikier lumpier in terms of like where it's increasing intelligence um so yeah I think they're they found them to be
32:30 - 33:00 good complements to go off on a little bit of a tangent do any of you guys have like an intuition of is it weird or is there anything to take away from the fact that pre-training seems to be so general across everything and when we teach a model reasoning we can get it so good only at one category of things yeah I don't I don't know if it's the most Yeah I think I think it's interesting i think it's kind of not surprising to see this out of pre-training when you're just look at what you're what you're training with um you're when you construct like a training data set for
33:00 - 33:30 pre-training it's inherently very broad we're we're targeting breadth and diversity um and I think it's it's kind of difficult to always get the same breadth when you talk about doing RL and having like environments that you can kind of cleanly get uh good reward signal out of and good good environments out of i I I agree but I think there's another factor that pre-training is essentially it's compressing the data and compressing the data is about seeing connections between different things
33:30 - 34:00 it's about analogies it's about uh abstractions and reasoning is on a particular problem it there's like there is a skill and a craft to thinking carefully and thinking carefully can unlock many different kinds of problem solving in different domains but there's something uh more learning at a more abstract level when you're compressing across domains in the way pre-training does
34:00 - 34:30 that that makes Oh that I'm going to change my question for you in a second i just thought of something else um what's going to limit us in terms of systems progress chips processors memory network or power like what what will be the bottleneck of continuing to scale this up um it is this is the beauty of systems uh in that if you do co-design the workload is adaptable to the infrastructure that you built um there is I think let's say there is no statement that broadly network is a
34:30 - 35:00 bottleneck or memory bandwidth is a bottleneck or computer is a bottleneck we have the option to basically shift and even for the same specification of the model we have the option of shifting resource demands to basically create a more balanced system however having said that I think let's say the pre-training answer is different than the inference answer and so on but it it never hurts to have more memory bandwidth uh I would uh yes I think this is a hard question
35:00 - 35:30 to answer and without qualification speaking of that earlier point how much do your teams work together on like the spec of the model as we get ready for the 4.5 run like how much how much is it you're just like this yeah very closely i mean down to like the shapes of the matmuls that we want to be doing um making sure those are optimized well um but for this project it was a much deeper collaboration going back um to like I guess six or nine months before the launch of the run um trying to do for some of the the functionality
35:30 - 36:00 we wanted to put into the run and like aspects of what we needed to do to make 4.5 happen we took on like a a specific like very large de-risking run uh that was specifically focused on co-design with systems uh to make kind of the the ML and the systems work together well at scale um so there we I think it's the first time we put like such a large emphasis just on the co-design aspect was really key yeah I think that was the first big um scale effort that I I
36:00 - 36:30 remember that it is not just about fine-tuning one aspect of it is that fundamentally you want a property to hold on the system side and that property doesn't emerge out of nowhere you really need to steer the system to give you that property uh so that co-design effort is something uh that uh formed the architecture and architectural elements that go into the model and somehow ties the system and ML side together it's probably a property we
36:30 - 37:00 would not prefer to have yeah I would ideally want everything to be decoupled to basically give maximal room to each other but at times things get tied together where you are really catering to the requirements of your infrastructure or how things should be i mean oftentimes it is you really want to have a balanced system uh and balanced communication and a very symmetrical type of system and the best knob we have at our disposal is
37:00 - 37:30 co-design how close are we to the idealized system do you mean being like fully happy we have all the hardware we want it works for what we know about ML we are nowhere near that you think it is we are nowhere near but it is fun it is the the practice of building systems is always about that that you have an idealized view of how things should work and it's just about reconciling the differences of that with what you have i think we are not doing theory for the
37:30 - 38:00 sake of let's say just talking about what we want it to be we just want to make it happen and approximate that ideal to the degree that we can so to be honest I think this uh is probably as exciting as it gets for systems you get to really come up with hypotheses of what is a good system design uh and take it from there and basically see the results in action very quickly things that previously one would
38:00 - 38:30 say this is an elegant system design and uh we have to just history will tell us if this is the right thing to do or the wrong thing to do we have a lot of compute we have a problem we we we know the target and just we go and see if our choices were correct or not how much is your team thinking about the sort of constraints of system design when they're trying to decide what what's going to go into the run yeah I think it's it's a huge consideration for just doing pre-training runs at large um I
38:30 - 39:00 think like since 4.5 a lot of the work on the architecture side there there's also kind of ongoing threads around further co-design further places that we can design for future hardware uh build together i think there's been a lot of promising uh work already since then okay my changed question for you why does unsupervised learning work compression so the ideal intelligence is called Solomonoff induction
39:00 - 39:30 basically it's it's uncertain about what universe it's in and it considers all possible universes with simple ones more likely than less simple ones and it's it's fully Bayesian in its head and updates its views as it progresses and you can approximate this by uh finding the the shortest program that computes everything you've seen so far.
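For reference, the textbook way to write down the prior Dan is describing (standard notation, not something quoted in the conversation): every program p whose output on a universal machine U begins with the observed string x (written U(p) = x*) contributes weight 2^(-|p|), so shorter programs dominate, and prediction comes from the ratio of the resulting measures:

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\lvert p \rvert},
\qquad
\Pr(y \mid x) \;\approx\; \frac{M(xy)}{M(x)}
```

The single shortest such program has length K(x), the Kolmogorov complexity, which is why "find the shortest program that computes everything you've seen so far" is the natural approximation.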
39:30 - 40:00 and what we're doing with pre-training or one way to think about what pre-training is doing is it is compressing it is trying to find the shortest program that explains all of the data that humans have produced so far as a way of approximating that and why does looking at next token prediction do that it's actually a subtle question there there was a paradox or somewhat of a paradox in statistics for a long time why do deep networks generalize when they don't seem
40:00 - 40:30 to compress normally in statistics you have lots of data and you got small models the models predict the data therefore you can the models must have compressed and must have learned something in pre-training generally the models are pretty gigantic and they scale roughly with the with the data uh so it was always a question are they actually compressing are they generalizing and of course there are critics who have said well it's just memorizing and interpolating and superficial um but there's a way of
40:30 - 41:00 viewing pre-training such that you do see that it is a compressor in a in a different unintuitive way and basically the idea is called prequential compression and the idea is that the fact that it learns quickly during training means you can turn it into a great compressor so even though the weights are big the binary doesn't need to store the weights the binary can pre-train from scratch to decompress and
41:00 - 41:30 so the fact that it's learning really really quickly means that most of that data you can encode with very very few bits so basically for a subtle reason it really is quite a good compressor and I think that is a a pretty satisfying explanation for why it really does lead to intelligence you guys have anything to add no it's great thank you.
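To make the prequential idea concrete, here is a minimal toy sketch: the code length of a stream is the sum of -log2 of the probability the model assigns to each symbol just before it updates on that symbol, so a model that learns quickly pays few bits for most of the data. The Laplace-smoothed bigram predictor below is a stand-in for a network being pre-trained; everything here is illustrative, not OpenAI's evaluation code.

```python
import math
from collections import defaultdict

def prequential_bits(data: bytes) -> float:
    counts = defaultdict(lambda: defaultdict(int))  # counts[prev][next]
    totals = defaultdict(int)
    bits, prev = 0.0, 0
    for symbol in data:
        # Code length of this symbol under the model learned so far
        # (add-one smoothing over the 256 possible byte values).
        p = (counts[prev][symbol] + 1) / (totals[prev] + 256)
        bits += -math.log2(p)
        # Only afterwards does the model get to update on the symbol.
        counts[prev][symbol] += 1
        totals[prev] += 1
        prev = symbol
    return bits

text = b"the faster the model learns, the fewer bits it pays per symbol. " * 50
print(prequential_bits(text) / len(text), "bits/byte")  # well below the raw 8
```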
41:30 - 42:00 uh one one thing somewhat relatedly that hasn't come up yet is the discipline of metrics it's like the the thing the thing you get when you do these scaling laws and you you do all the ML science is very dependent on the metric that you choose what do you mean what do you want to say oh yeah just talking about the what test set you're you're evaluating your perplexity on so what you're referring to even even looking primarily at perplexity oh yes there's
42:00 - 42:30 already some viewers might think we're looking at college tests or something oh yeah um so yeah I mean do you want to explain perplexity then i think it's worth it yeah so it's very it's very tempting to try to evaluate your model for intelligence on things that are legible to humans as tests uh but if if you if you do this probably you're going to be favoring changes that make
42:30 - 43:00 memorization easier at the cost of making systems actually smarter because almost every test that we have out there's something similar online like if you can actually train on the entire internet tests become somewhat uh degenerate compared to tests for humans where they can't do that and so the the main approach in the field is to look at how much it compresses some held out data that's thought to be good data and
43:00 - 43:30 e even then if you're not careful and that held out data is too similar to your training data changes to your training algorithm that make it memorize better will seem to make it smarter because it'll have already it'll already know your test set and we don't want to just be measuring memorization we're not after memorization we're after generalization and out of distribution generalization so yeah this is why I guess yeah maybe you're alluding to so like the the key test sets that we're looking at we we care a lot about them not being present in any
43:30 - 44:00 degree to the slightest degree uh in our training set because that kind of throws off all the way that we do our scaling laws um so yeah this is a very key point what's our best thing for that uh so our our internal codebase uh which we know is not out there um yeah it's a very good held-out set has that held as our best thing across like many it's still the best thing yeah it's remarkable i mean we joke that a model is its monorepo loss like everything else there's some incredibly meta
44:00 - 44:30 recursive thing there oh yeah somehow you you you pre-train the model it has a monorepo loss and somehow this tells you so much about how it's going to behave downstream it tells you a lot about how a philosophy grad student is going to find the nuances of its responses but it's incredible it is incredible.
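Since perplexity keeps coming up, here is a minimal sketch of what "evaluating perplexity on a held-out set" actually computes: the exponential of the average per-token negative log-likelihood the model assigns to text it never trained on (such as the internal codebase mentioned above). The probabilities below are made up; in practice they come from the model's own next-token predictions.

```python
import math

def perplexity(token_probs):
    # token_probs: the probability the model assigned to each actual next
    # token in the held-out text, in order.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

held_out_probs = [0.21, 0.05, 0.62, 0.33, 0.11, 0.48]  # illustrative values
# Lower perplexity means the model compresses the held-out data better.
print(perplexity(held_out_probs))
```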
44:30 - 45:00 related to that and sort of the last question in some sense this whole effort which was hugely expensive in terms of people and time and dollars and everything else was an experiment to further validate that the scaling laws keep going and why and it turns out they do and they probably keep going for a long time um I accept scaling laws like I accept quantum mechanics or something but I still don't know why like why should that be a property of the universe so why are scaling laws a
45:00 - 45:30 property of the universe you want I can I can take a stab well the the fact that more compression will lead to more intelligence that has this very strong philosophical grounding so the question is why does training bigger models for longer give you more compression and there are a lot of theories here there's the one I like is that the the relevant concepts are sort of uh sparse in the in the the data of the world and
45:30 - 46:00 in particular it is a power law so that like the hundredth uh most important concept appears in one out of a hundred of the documents or whatever so there are long tails does that mean that if we make a perfect data set and figure out very data efficient algorithms we can go home it means that there's potentially exponential compute wins on the table to be very sophisticated about your choice of data.
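A quick back-of-the-envelope version of that long-tail picture (a standard calculation, not something quoted from the conversation): if the k-th most important concept appears in roughly a fraction c/k of documents, the expected number of passively collected documents needed before seeing it even once is

```latex
\mathbb{E}[\text{documents until concept } k \text{ first appears}]
\;=\; \frac{1}{p_k} \;=\; \frac{k}{c}
```

so reaching concepts ten times deeper into the tail costs roughly ten times as much data, which is the 10x-ing described next.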
46:00 - 46:30 but basically when you just scoop up data passively you're going to require 10x-ing your compute and your data to get the next constant number of things in that tail and that tail keeps going it's long you can keep uh mining it although as you alluded to you can probably do a lot better i think that's a good place to leave it thank you guys very much that was fun yeah thank you