Inside the World's Largest AI Supercluster: xAI Colossus
Summary
The video delves into the largest AI cluster in the world, built by xAI, comprising over 100,000 GPUs and sophisticated networking to power Grok, an AI meant to be far more than just a chatbot. The supercomputer was constructed in just 122 days, an impressive engineering feat distinguished by its liquid cooling systems. Each compute hall houses about 25,000 GPUs, with rack designs built for easy servicing and advanced Ethernet networking rather than the typical supercomputer interconnects. Energy efficiency is a key feature, highlighted by the liquid cooling and Tesla Megapacks for stable power delivery. The video is sponsored by Supermicro, with filming approved by xAI and Elon Musk's teams.
Highlights
xAI's AI supercluster is powered by over 100,000 GPUs built in just 122 days! 🌟
Innovative liquid cooling and Tesla Megapacks ensure efficient operations. 💧🔋
Cutting-edge networking connects these massive GPU clusters seamlessly. 🌐🔥
Advanced, serviceable Nvidia and Supermicro setups boost scalability. 🔄🔧
With massive network-delivered storage, the AI cluster is ready for anything! 📦💫
Key Takeaways
xAI built the world's largest AI cluster with over 100,000 GPUs in just 122 days, revolutionizing AI with Grok! 🚀
The supercomputer uses innovative liquid cooling systems and Tesla Megapacks for efficient power management. 🌊🔋
Advanced Nvidia and Supermicro equipment makes this AI system highly scalable and serviceable. 💪🔧
The cluster uses advanced Ethernet networking, ensuring high-speed data handling and communication. 🌐💾
Storage is network-delivered, highlighting the massive scale and shared resource approach of the AI cluster. 📦🔗
Overview
Hold onto your hats, tech fans! You've got an express pass into xAI's revolutionary supercluster, a game-changing AI behemoth. It's shattering records and expectations alike with over 100,000 GPUs and advanced features. Built faster than a pizza delivery, in just 122 days, this supergiant is all set to deliver Grok, AI technology that's way more than just chatbot banter!
Venture inside to witness amazing engineering feats like liquid-cooled racks from Supermicro, optimized for massive scalability and efficiency. Tesla Megapacks smooth out power quirks, ensuring this titan of tech runs like clockwork. Forget the usual air conditioning; this beauty thrives on its liquid-cooling prowess, making it an efficiency champ!
Networking fans, rejoice! With top-tier Ethernet technology whisking data at supersonic speed across its massive setup, the data exchange is as seamless as it gets. Add network-delivered colossal storage, and you've got a supercluster ready to tackle the most demanding tasks, making AI history along the way. Fancy joining the party? They're hiring!
Inside the World's Largest AI Supercluster: xAI Colossus Transcription
00:00 - 00:30 This is the largest AI cluster in the world. xAI is building a massive AI supercomputer that encompasses over 100,000 GPUs, exabytes of storage, and super-fast networking. This place is absolutely amazing, and this entire supercomputer is built to power Grok. Now, xAI is building something with Grok that is far more than just a simple chatbot like we've seen before, and that is exactly why there is a giant cluster here.
00:30 - 01:00 Today we're going to go back inside these data halls and show you a whole bunch of what makes this work, what makes it special, and all the really cool engineering that went into it. This is actually my second time at the facility, and let me just tell you, the speed at which this thing was built is absolutely amazing. This entire facility, with over 100,000 GPUs, was built in only 122 days.
01:00 - 01:30 Just to give you some frame of reference, the largest-scale supercomputers only have a fraction, maybe half to a quarter, as many GPUs as xAI has here, and yet those deployments generally take many years from start to finish. The engineering accomplishment here is absolutely amazing, and there is still work being done. But I thought, let's take a look inside one of these data halls, let's also show some of the facilities, and let's see what's going on and how something like this gets built. We need to say that this video is sponsored by Supermicro, since it is, but I also just want to say thank you to the X and xAI teams for giving us permission to come and film this.
01:30 - 02:00 Of course, thank you as well to Elon and his teams for approving this and making it possible. With that, let's get inside a data hall to see how this all works. Inside the data hall, xAI is using a pretty common design: this is a raised-floor data hall. Above we have our power, and down below we have all of the pipes for the liquid cooling, so that heat can be exchanged with the facility chillers.
02:00 - 02:30 Each one of these compute halls has about 25,000 GPUs, plus all the storage and the fiber-optic high-speed networking; it's all built into the hall, and then the halls are connected together. The connections to each data hall are basically the fiber-optic cables, the liquid cooling and plumbing for the water, and then a bunch of power delivery, which is super cool. Inside the compute hall we have these clusters.
02:30 - 03:00 Each of these clusters is made up of eight liquid-cooled racks by Supermicro. These are Nvidia H100 racks, and inside each of these eight racks there are eight Nvidia HGX H100 platforms. We also have all of the liquid cooling for these, as well as the networking and everything else needed to make each one of these a roughly 512-GPU Nvidia mini-cluster. These Supermicro Nvidia liquid-cooled AI racks are probably the most advanced AI racks deployed at this scale, and I'm going to show you exactly why right here.
03:00 - 03:30 What you're going to see is that each of these racks has a total of eight Nvidia HGX H100 systems, so we have a total of 64 GPUs per rack. In the top section we have the Supermicro Nvidia HGX H100. Each one of these HGX H100s carries a lot of the components that are really important in these systems, like the eight Nvidia H100 "Hopper" GPUs, plus the Nvidia NVLink switches, and all of that sits on the baseboard.
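To make those numbers concrete, here is a quick back-of-envelope sketch in Python of how the per-rack and per-pod GPU counts quoted in the video add up. The figures are the video's approximate numbers (8 GPUs per HGX board, 8 boards per rack, 8 racks per mini-cluster, roughly 25,000 GPUs per hall), not an official xAI specification.

```python
# Rough cluster-topology arithmetic using the approximate figures quoted in the video.
gpus_per_hgx = 8        # Nvidia H100 GPUs on one HGX H100 baseboard
hgx_per_rack = 8        # 4U Supermicro systems per liquid-cooled rack
racks_per_pod = 8       # racks grouped into one "mini-cluster"

gpus_per_rack = gpus_per_hgx * hgx_per_rack      # 64 GPUs per rack
gpus_per_pod = gpus_per_rack * racks_per_pod     # 512 GPUs per mini-cluster

gpus_per_hall = 25_000  # "about 25,000 GPUs" per compute hall
total_gpus = 100_000    # "over 100,000 GPUs" in the full cluster

pods_per_hall = gpus_per_hall // gpus_per_pod    # roughly 48 mini-clusters per hall
halls_needed = -(-total_gpus // gpus_per_hall)   # ceiling division: about 4 halls

print(gpus_per_rack, gpus_per_pod, pods_per_hall, halls_needed)  # 64 512 48 4
```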
03:30 - 04:00 Now, one of the really defining things of the Supermicro platform versus some of the others in the market is that you can actually go in and pull this top section out; you can see the little levers here. They would probably get really mad if I did this, but we have other videos where I've done that, even on these liquid-cooled systems. That leaves the bottom tray, and the bottom tray has things like our CPUs, which are fast x86 CPUs, as well as our large PCIe switches. Just to give you some frame of reference on how advanced these are, all of that fits in only 4U of rack space, and it's all serviceable right on its trays.
04:00 - 04:30 There are other options from Supermicro and others in the industry that are 6U or 8U for a similar style of system, and there are options in the market that simply don't have this kind of accessibility and serviceability, which is why these things, even though they're very compact, are also extremely advanced and easily serviceable. On the front, you're definitely going to see all these little tubes, and they go through this little bar. This little bar is what is called a manifold, so we have a 1U manifold for each of these systems.
04:30 - 05:00 That 1U manifold is how we connect our liquid cooling. All of these little tubes are in pairs; we have both a blue and a red tube coming out of each of these, and inside there are two different liquid cooling blocks. As you might imagine, the cooler liquid goes into the server on the blue side and out of the server on the red side, and that is brought to the manifold here, which connects back to the overall rack manifolds in the back of the rack. This design means that you can actually slide these systems out, service the HGX H100 boards, service the CPUs, memory, and all that kind of stuff, and just pop these things out.
05:00 - 05:30 I'm not going to do that, since this is actually training a real model right now, but it takes a matter of seconds at most. There are a total of eight of these servers in this rack, so that's a total of 64 GPUs plus 16 CPUs, a bunch of memory, and all kinds of other stuff. On the bottom is a big part of what makes this a really scalable solution: these are Supermicro CDUs, or cooling distribution units.
05:30 - 06:00 There are a couple of super cool things about the CDU. First, you have the management unit: each of these CDUs has its own management, so you can monitor things like flow rate, temperature, and all the other metrics you need to ensure that you're bringing the right amount of liquid through all the servers in your rack. That, of course, ties into the central management interface, so if something goes wrong you can see it remotely.
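The video doesn't show the CDU management software itself, so the following is only a hypothetical sketch of the kind of per-rack health check that this telemetry could feed; the field names and thresholds are invented for illustration and are not Supermicro's actual interface.

```python
# Hypothetical CDU health check; field names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class CduTelemetry:
    flow_lpm: float        # coolant flow in liters per minute
    supply_temp_c: float   # cooler "blue side" supply temperature
    return_temp_c: float   # warmer "red side" return temperature
    pumps_online: int      # the CDU carries two pumps for redundancy

def check_cdu(t: CduTelemetry) -> list:
    """Return a list of alert strings; an empty list means the rack looks healthy."""
    alerts = []
    if t.flow_lpm < 30.0:                          # illustrative minimum flow
        alerts.append("low coolant flow")
    if t.return_temp_c - t.supply_temp_c > 15.0:   # illustrative maximum delta-T
        alerts.append("delta-T too high, check cold plates and manifold")
    if t.pumps_online < 2:
        alerts.append("running on one pump, service the failed pump")
    return alerts

print(check_cdu(CduTelemetry(flow_lpm=45.0, supply_temp_c=30.0,
                             return_temp_c=41.0, pumps_online=2)))  # []
```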
06:00 - 06:30 The other cool thing is that there are two pumps down here, and you can actually service these pumps because they're there for redundancy: if a pump were to go out, you just pull it out and put a new one back in. I'm not going to do that on a live system, but we've done it previously on STH. Okay, so it's time to get to the other side of the rack, just to give you an idea of how the back of these racks looks. On the side, you'll see that we have our overall rack manifold, and you're again going to see our red and blue liquid cooling rails. Behind that, you'll see a bunch of three-phase power strips, since these things use a lot of power in each rack.
06:30 - 07:00 On the other side, you're going to see all of these servers. These 4U Supermicro servers have a total of eight Nvidia BlueField-3 SuperNICs, and those are really there for the AI network. There's also a ConnectX-7, and that's for all the other things the server might need, like on the CPU side. Now, you might wonder why there are fans in a system that's liquid-cooled. The real reason is that these fans are needed to cool all of the little components, the memory DIMMs and all that kind of stuff, in the system.
07:00 - 07:30 Even though there are fans here, it's definitely not as loud as if this were all air-cooled. Another really important thing is that it isn't too hot back here; I'm just standing here and it doesn't feel like I'm being heated, as it would if I were behind an air-cooled system blasting all that hot air towards me. So there is a big difference between a liquid-cooled and an air-cooled system. On the back of this rack there is a rear-door heat exchanger. The way a rear-door heat exchanger works is that the heat from the servers is transferred to the liquid flowing through the radiator, while air is pulled through the heat exchanger by these large fans.
07:30 - 08:00 That's how all the extra heat in the racks ends up getting removed. The special part about this design is that each rack ends up being room-neutral to the overall cooling of the data center; you don't see, as you walk around here, the giant air conditioners or air handling units that you would see in a lot of data centers over the years. This is a really cool feature, and it really helps each of these racks be a self-contained unit. Another super fun fact: you'll see that the back of these is lit up in blue, and you might think that's a branding thing, or maybe they did it just because STH is blue. Absolutely not.
08:00 - 08:30 Instead, this is actually a status light, so as you're walking through here, if you see a bunch of blue, that's cool; if you see something like red, that's not good. When you're walking down the data center hall and you see a red one, you know that one needs to get serviced and the rest are okay. A couple of weeks ago I got to see these things get fired up, and it was super awesome to see all of them come up at the same time. Now, of course, with any large-scale cluster you need CPU compute as well, for those tasks that GPUs are just not really good at, and that's exactly what these are next to me.
08:30 - 09:00 These are 42 1U servers per rack that provide all the CPU compute for when you need to do things like data prep and all the other work that runs well on CPUs. In any large cluster you're always going to see a set of CPU compute nodes along with your GPU compute nodes.
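The video doesn't reveal xAI's software stack, so purely as an illustration of the kind of CPU-bound data-prep work these 1U nodes are good for, here is a minimal sketch that fans simple text preprocessing out across the cores of one node; the dataset path is hypothetical.

```python
# Illustrative CPU-side data prep: parallel text preprocessing with the standard library.
from multiprocessing import Pool
from pathlib import Path

def preprocess_shard(path: Path):
    """Lowercase and whitespace-tokenize one raw text shard, return its token count."""
    tokens = path.read_text(encoding="utf-8", errors="ignore").lower().split()
    # A real pipeline would write the tokenized output back to shared storage here.
    return path.name, len(tokens)

if __name__ == "__main__":
    shards = sorted(Path("/shared/dataset/raw").glob("*.txt"))  # hypothetical mount
    with Pool() as pool:  # one worker process per CPU core by default
        for name, count in pool.imap_unordered(preprocess_shard, shards):
            print(f"{name}: {count} tokens")
```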
09:00 - 09:30 Now, this entire cluster runs on Ethernet, which is the same basic technology you would find networking your laptop, your PC, or a bunch of other devices. Each of the servers uses an Nvidia BlueField-3 SuperNIC DPU. We have covered the Nvidia BlueField-3 DPUs and previous generations on STH for many years, and if you've seen that, you probably know it means there's a lot going on here, more than just basic Ethernet. Each of these Nvidia BlueField-3 cards provides 400-gigabit networking all the way up into the AI infrastructure, which is kind of similar to how your PC or laptop goes and accesses the internet just to watch this video.
09:30 - 10:00 Now, those who are steeped in the supercomputer realm will definitely say, hey, the way a lot of folks build clusters is with technologies like InfiniBand or other exotic interconnects. While those fabrics often work for the world's supercomputers, the world's gargantuan networks run on Ethernet, and that's one of the reasons they're using it here: they don't need to just scale to the size of a supercomputer, they need to scale to a massive AI cluster.
10:00 - 10:30 Of course, this is not the same Ethernet that you have in your PC or notebook or something like that; it's much faster, probably something like 400 times faster, and Nvidia has some other processing going on alongside it. Behind me we have the Nvidia SN5600, which is a 64-port 800GbE switch, and that means each one of these can be split to run 128 400GbE links. These Nvidia Spectrum-X switches, along with the BlueField-3 DPUs, can do amazing things.
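The port split works out with a little arithmetic; a tiny sketch, based only on the port counts and speeds mentioned here:

```python
# Port and bandwidth arithmetic for a 64-port 800GbE switch like the one described.
ports_800g = 64              # physical 800GbE ports on the switch
links_per_port = 2           # each 800GbE port can be split into two 400GbE links
nic_speed_gbps = 400         # matches the BlueField-3 SuperNIC speed per server

links_400g = ports_800g * links_per_port       # 128 x 400GbE links
aggregate_tbps = ports_800g * 800 / 1000       # 51.2 Tb/s of total switching capacity

print(links_400g, "links at", nic_speed_gbps, "Gb/s;", aggregate_tbps, "Tb/s aggregate")
```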
10:30 - 11:00 These switches have a host of features and processing capabilities that allow the Nvidia GPUs, and the entire cluster, to run at their maximum performance levels. The Nvidia solution can do things like offload various security protocols, and it has advanced flow management to help ensure that you don't have a congested network. It can also maintain the flow of data and packets throughout the entire cluster and help make sure that things get to the right place at the right time. Here it can be used not just for the RDMA network for the GPUs, but also for things like providing storage.
11:00 - 11:30 As you can probably tell, as I blend in with the single-mode fiber in my yellow shirt, there is an absolute ton of optics and fiber running throughout this entire building to make sure that communication happens efficiently and fast. These over here are the north-south switches. In a modern AI cluster the east-west traffic pattern is usually dominant, but these fancy high-end switches can still handle a ton of 400GbE connections for north-south traffic, just like the other switches we looked at for east-west. These are not being used for the RDMA network, which is the fast network that the GPUs require.
11:30 - 12:00 Instead, these are being used for all of the other work, all the other supercomputer tasks in the cluster. These switches are another 64-port 800GbE design, which is just really cool; it's a really high-end system, and this is definitely one of the first deployments of this type of switch in the world. Now, with a large-scale AI cluster like this, storage is delivered a little differently than you would be used to in something like your desktop, notebook, or tablet, where you have local storage.
12:00 - 12:30 Instead, the vast majority of storage is delivered over the network. The reason for this is that this type of AI training needs tons of storage, so you can't really fit it all in each of those GPU servers, and also all of the GPU and CPU servers need access to all of that storage. That's why there's a giant storage cluster here.
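In practice, "network-delivered storage" means that every GPU and CPU node sees the same shared filesystem or object store instead of relying on its own local drives. A minimal sketch of that idea, with a hypothetical mount point and file layout, might look like this:

```python
# Minimal sketch of shared, network-delivered storage: every node reads from the
# same mount, and each worker takes a disjoint slice of the dataset shards.
from pathlib import Path

SHARED_DATA = Path("/mnt/shared-storage/datasets")   # hypothetical network mount

def shard_paths_for_rank(rank: int, world_size: int):
    """Return the subset of shared shards that worker `rank` should stream."""
    all_shards = sorted(SHARED_DATA.glob("shard-*.bin"))
    return all_shards[rank::world_size]

# Example: worker 3 of 512 lists the shards it would read over the network.
for shard in shard_paths_for_rank(rank=3, world_size=512):
    print(shard)
```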
12:30 - 13:00 Now, with any liquid-cooled data center, a big part of it is of course the liquid cooling itself, and if you look around me right now, what you're going to see are these absolutely giant pipes; I mean, these things are huge. What these pipes do is take liquid, water from the outside that's generally cooler, and bring it inside the facility, where it gets distributed into the different data halls. From there it goes into the CDUs, which you saw, where all the racks have their GPUs and everything else. All the heat from all those GPU servers goes into the CDUs, gets exchanged there, and the water comes back out warm.
13:00 - 13:30 At that point it can go outside to a chiller. These chillers are not made for things like making ice cubes or anything like that; they just lower the temperature of the water by a couple of degrees, and then that water can be recirculated at a cooler temperature so the whole process can cycle over and over again. That's how data centers like this are able to reuse water. And by the way, these pipes are absolutely huge, and I can certainly feel the water flowing through them right now.
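For a rough sense of what "a couple of degrees" of water temperature buys you, the standard relation is Q = m_dot * c_p * delta_T (heat removed equals mass flow times specific heat times temperature rise). The numbers below are illustrative only, not measurements from the facility:

```python
# Back-of-envelope warm-water cooling estimate: Q = m_dot * c_p * delta_T.
# All numbers are illustrative, not measurements from Colossus.
C_P_WATER = 4186.0    # J/(kg*K), specific heat of water
KG_PER_LITER = 1.0    # close enough for an estimate

def flow_needed_lps(heat_kw: float, delta_t_c: float) -> float:
    """Liters per second of water needed to carry away heat_kw at a given delta-T."""
    return heat_kw * 1000.0 / (C_P_WATER * delta_t_c * KG_PER_LITER)

# Example: a hypothetical ~100 kW rack with the loop running a 5 C temperature
# difference needs on the order of 5 liters of water per second.
print(round(flow_needed_lps(100.0, 5.0), 1), "L/s")   # ~4.8 L/s
```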
13:30 - 14:00 One other amazing innovation here is what's next to me: the Tesla Megapacks that actually power the training jobs in this facility. What they found was that there are these little millisecond variations in power when all of these GPUs start training something, and that was causing all kinds of problems with the power infrastructure. So the answer was to have all of the input power, from generators and whatever else, go into the batteries, and have the batteries discharge to power the training jobs.
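To see why routing power through batteries helps, here is a toy illustration, not a model of the real facility: the GPUs present a spiky load, the battery absorbs the fast swings, and the upstream source only has to deliver a smooth average. All numbers are invented.

```python
# Toy illustration of battery buffering: spiky GPU demand, steady upstream supply,
# with the battery charging or discharging to cover the difference each interval.
gpu_load_mw = [20, 80, 25, 85, 22, 90, 24, 88]        # invented per-interval demand
source_mw = sum(gpu_load_mw) / len(gpu_load_mw)       # steady supply into the packs

battery_balance = 0.0   # net energy the battery has absorbed, in MW-intervals
for step, load in enumerate(gpu_load_mw):
    battery_balance += source_mw - load               # positive: charging, negative: discharging
    print(f"step {step}: load {load:5.1f} MW, source {source_mw:5.1f} MW, "
          f"battery net {battery_balance:+6.1f}")
```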
14:00 - 14:30 Of course, that's the kind of engineering challenge you have to solve when you're building something of this scale. What you're seeing here today is really like phase one of this entire cluster; it is already the largest AI training cluster in the world, and they are still building, which is absolutely amazing. Now, a project like this takes a ton of people, and I just want to say thank you, of course, to our team, but also to the Supermicro team, the xAI team, and everybody else who has been involved in making this happen.
14:30 - 15:00 If you like this video and all of this cool AI infrastructure, and you just want to look for a job, there is a careers page that you can definitely check out to see if something there piques your interest, because this seems like a project that would be awesome to work on. And hey, if you did like this video of the Colossus supercomputer powered by Supermicro servers, why don't you share it with your friends and colleagues, but also give this video a like, click subscribe, and turn on those notifications so you can see whenever we come out with great new videos. As always, thanks for watching, and have an awesome day.