The Future of Accelerated Computing

Dynamic AI-RAN Orchestration for NVIDIA Accelerated Computing Infrastructure


    Summary

    In a recent webinar hosted by Aarna Networks, co-founder and CTO Sriram discussed AI-RAN orchestration for NVIDIA Accelerated Computing Infrastructure. The session explored the disaggregated RAN concept, focusing on using NVIDIA GPUs to run accelerated cloud RAN and machine learning workloads on the same infrastructure. Key points included optimizing GPU resources, orchestrating AI and RAN workloads dynamically, and keeping resources flexible for future expansion. The session highlighted how orchestration can improve infrastructure efficiency and financial viability.

      Highlights

      • Aarna Networks dives into AI-RAN orchestration using NVIDIA GPUs for optimal resource use 🚀.
      • Sriram explains disaggregated RAN's potential in cloud ecosystems 📊.
      • Gain insights into the cost-efficiency of dynamic workload management on GPUs 💵.
      • Discover the orchestration prowess of Aarna Networks in harmonizing AI and RAN workloads 🎛️.
      • Explore NVIDIA's technological infrastructure supporting scalable and future-proof solutions 🔧.

      Key Takeaways

      • Discover how dynamic AI-RAN orchestration can boost NVIDIA's accelerated computing ecosystem 🚀.
      • Learn why disaggregated RAN is essential for cloud-based operations and flexibility 📊.
      • Understand the financial benefits of optimizing GPU usage with AI and RAN workloads 💵.
      • Explore the unique capabilities of Aarna's orchestration in handling diverse workloads simultaneously 🎛️.
      • See how NVIDIA's architecture supports the transition to efficient and scalable computing infrastructure 🔧.

      Overview

      In this Aarna Networks session, Sriram walked through AI-RAN orchestration on NVIDIA accelerated computing. Focusing on disaggregated RAN, the session covered how RAN and AI workloads can share the same cloud infrastructure.

        At the heart of the webinar was accelerated cloud RAN, which uses NVIDIA GPUs to accelerate the L1 and L2 RAN functions. Running these functions on a GPU cloud allows machine learning tasks to be handled alongside RAN operations, which is where dynamic orchestration comes in.

          Attendees also learned how underutilized GPU resources at RAN sites can be turned into a profit center: during low-traffic periods, Aarna's orchestrator dynamically reassigns spare GPU capacity to AI workloads, improving ROI.
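
The dynamic switching described above can be sketched as a small closed-loop policy. The following Python is a minimal illustration under stated assumptions, not Aarna's implementation: the names `Policy`, `next_ran_partitions` and `simulate`, the thresholds, and the one-partition-per-step rule are all hypothetical. It assumes a GPU divided into seven partitions, as in the webinar demo, and a monitored RAN load value between 0 and 1.

```python
from dataclasses import dataclass

TOTAL_PARTITIONS = 7  # assumption: one GPU split into seven partitions, as in the demo


@dataclass(frozen=True)
class Policy:
    """Hypothetical closed-loop policy; the thresholds are illustrative only."""
    scale_out_above: float = 0.80  # RAN load above this reclaims a partition for RAN
    scale_in_below: float = 0.30   # RAN load below this releases a partition to AI
    min_ran: int = 1               # never shrink RAN below this many partitions


def next_ran_partitions(ran_now: int, ran_load: float, policy: Policy) -> int:
    """Return how many partitions RAN should hold after one control step.

    ran_load is the monitored utilization (0.0 to 1.0) of the partitions
    currently assigned to RAN. The gap between the two thresholds acts as
    hysteresis so the split does not flap on small load changes.
    """
    if ran_load > policy.scale_out_above and ran_now < TOTAL_PARTITIONS:
        return ran_now + 1  # peak traffic: scale out RAN, shrink AI
    if ran_load < policy.scale_in_below and ran_now > policy.min_ran:
        return ran_now - 1  # quiet period: scale in RAN, grow AI
    return ran_now          # inside the hysteresis band: leave the split alone


def simulate(loads, ran_start=3, policy=Policy()):
    """Apply the policy to a sequence of load samples; return (ran, ai) splits."""
    ran = ran_start
    splits = []
    for load in loads:
        ran = next_ran_partitions(ran, load, policy)
        splits.append((ran, TOTAL_PARTITIONS - ran))
    return splits


if __name__ == "__main__":
    for ran, ai in simulate([0.9, 0.5, 0.2, 0.1, 0.85]):
        print(f"RAN partitions: {ran}, AI partitions: {ai}")
```

A real orchestrator would also have to drain workloads and re-establish isolation before moving a partition from one workload to the other; this sketch leaves those steps out and only shows the decision logic.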

            Chapters

            • 00:00 - 01:30: Introduction and Webinar Overview The webinar opens with a welcome to the attendees. The speaker is introduced as Sriram, who will discuss dynamic AI-RAN orchestration for NVIDIA accelerated computing infrastructure. Sriram is a co-founder and CTO at Aarna Networks, where he heads the engineering team. Prior to Aarna, he was head of India engineering at the data center business of Western Digital.
            • 01:30 - 06:00: Dynamic AI and RAN Orchestration Concepts The chapter continues the introduction of Sriram, who before Western Digital co-founded Aarohi Communications, which was later acquired by Emulex, and was a Vice President of Technology at AMX. It sets the stage for the webinar led by Sriram and invites participants to post questions in the chat window for discussion afterward.
            • 06:00 - 10:00: Running Workloads on GPU Cloud The chapter introduces the topic of running workloads on a GPU cloud, which is a key focus area of the webinar. Sriram, the presenter, takes over from Mil to lead the session. The discussion begins with an emphasis on dynamic AI-RAN orchestration, although the chapter transcript only begins to touch on the webinar proceedings. Due to the incomplete transcript, further details are unavailable.
            • 10:00 - 15:00: Aarna Networks' Orchestration Capabilities This chapter introduces the concept of using the same cloud infrastructure for both RAN (Radio Access Network) workloads and machine learning operations. The discussion emphasizes the term "cloud RAN," which refers to standards-driven disaggregation, and highlights the orchestration capabilities of Aarna Networks for running both kinds of workload.
            • 15:00 - 21:00: Demonstration Overview This chapter discusses the concept of disaggregation in cloud systems, focusing on disaggregated RAN. It explains that a fully integrated solution cannot be run on the cloud, which is why disaggregation is crucial. The chapter also introduces the idea of accelerating cloud RAN.
            • 21:00 - 30:00: Demo: Running RAN and AI on GPU This chapter notes that RAN workloads are commonly run on CPUs in disaggregated systems, then introduces the accelerated cloud RAN approach, which uses GPUs, particularly NVIDIA GPUs, to accelerate the L1 and L2 RAN functions. It highlights NVIDIA's Aerial SDK, onto which the RAN components are ported so that they can use GPU acceleration.
            • 30:00 - 45:00: Demo: Using NVIDIA NVCF The chapter discusses GPU acceleration for running RAN workloads. It introduces the use of GPUs with the Aerial SDK and the use of a GPU cloud to run these accelerated RAN functions.
            • 45:00 - 50:00: Deployment Architecture This chapter discusses optimizing capital expenditure (capex) by using a shared GPU cloud for both RAN (Radio Access Network) and AI workloads. The focus is on AI and RAN sharing the same GPU resources, leading to efficient use of hardware and cost savings.
            • 50:00 - 60:00: Full Demo Process The chapter discusses running different workloads on the same GPU cloud, specifically the NVIDIA GPU cloud. It describes a scenario in which the cloud runs only RAN workloads, which could be either virtual RAN or standards-compliant O-RAN. The description builds on concepts introduced in a previous section.
            • 60:00 - 75:00: Q&A Session The chapter discusses scenarios involving heavy 5G traffic, where the entire cloud is running the RAN workloads. It also notes that the traffic could in future be 6G, and describes how traffic may be lighter during non-peak hours.

            Dynamic AI-RAN Orchestration for NVIDIA Accelerated Computing Infrastructure Transcription

            • 00:00 - 00:30 Hi everyone, thanks for joining this webinar. Let me introduce today's speaker. The topic of today's webinar is Dynamic AI-RAN Orchestration for NVIDIA Accelerated Computing Infrastructure. Sriram, who is going to present this webinar, is a co-founder and CTO at Aarna Networks, and he heads the engineering team at Aarna. Prior to Aarna, Sriram was head of India
            • 00:30 - 01:00 engineering at the data center business of Western Digital. Prior to Western Digital, Sriram was also a co-founder of a startup called Aarohi Communications, which was later acquired by Emulex, and he was a vice president of technology at AMX. So I invite Sriram to start this webinar. Please feel free to post your questions anytime in the chat window, but after the webinar is
            • 01:00 - 01:30 done, Sriram will take the questions. Thank you again for joining this webinar, and I will hand it over to Sriram to start the webinar. Thank you very much. Thanks, Mil. Hello everyone. We'll go through today's topic, which is dynamic AI-RAN orchestration. Mil, if you can go to the next slide. Yeah, so essentially
            • 01:30 - 02:00 this gives an introduction to the big picture. There are three terms here: the RAN, the cloud, and AI. The idea is that the same cloud can be used for running both the RAN workloads and the machine learning or AI workloads. Cloud RAN is a term that's used, which is a standards-driven
            • 02:00 - 02:30 disaggregation, and the idea is that in this case what is running on the cloud is a disaggregated RAN, because obviously if you have a fully integrated solution which is not disaggregated, you can't possibly be running it on the cloud. So that's one of the concepts here. The second concept is accelerating the cloud RAN. So
            • 02:30 - 03:00 people are aware of running the RAN workloads on CPUs, such as the CUs, which are all disaggregated and so on, but what we're talking about here is the accelerated cloud RAN. In that case they make use of GPUs, in particular the NVIDIA GPUs, for accelerating the L1 and L2 RAN functions. This essentially uses the SDK from NVIDIA called Aerial, where the RAN components are ported on the SDK,
            • 03:00 - 03:30 on the Aerial SDK, and they can make use of the GPU acceleration for running the RAN workloads. That's essentially what we are talking about today in this webinar. The third concept is the RAN in the GPU cloud: using these accelerated functions of the GPU, you're going to run it on the GPU cloud. So now the
            • 03:30 - 04:00 question is, how do you optimize the capex on this? Since you're going to be using the GPU for running the RAN workloads, can you also use the same GPU cloud for running other workloads such as the AI workloads? That essentially leads to the fourth concept, which is AI and RAN running on the same GPU cloud. So what we're going to talk about today is
            • 04:00 - 04:30 essentially how you can run both these workloads on the same GPU cloud. If you can go to the next slide. So this essentially builds on what I mentioned in the previous slide. On the left-hand side you see the picture where the cloud, the NVIDIA GPU cloud in particular, is running only the RAN workloads. They could be either virtual RAN or standards-compliant O-RAN, and
            • 04:30 - 05:00 in case of heavy 5G traffic, essentially the entire cloud is running the RAN workloads. Then you see the arrow that's pointing to the other scenario, where the 5G traffic, or the RAN traffic (we mentioned 5G, but in future it could be 6G as well), is lighter, maybe in the non-peak hours.
            • 05:00 - 05:30 So the idea is that in that case you could be running both the RAN and the AI/ML workloads. Some of the rationale for this is that typically RAN is the most expensive part of a 5G network, and I think most people in the telco world can attest to that. Also, it's a fact that most of the RAN sites are very often underutilized. In fact, there's an
            • 05:30 - 06:00 estimate that typically only 20 to 30% is utilized. So now the question is, what do you do with these GPU resources for the remaining 70 to 80%? The idea is that you use the same underlying cloud for machine learning during these underutilization periods. By doing this, you
            • 06:00 - 06:30 can convert your RAN site to a profit center. The conservative estimate is that the ROI can go up to 9 to 12 by doing this. Aarna currently is the only orchestrator that can go across these domains, because essentially you're talking about the RAN workloads and the machine learning workloads, so you need an orchestrator that can span
            • 06:30 - 07:00 both of these domains, understand both of them, switch the workloads, and so on. That's essentially what we're going to demonstrate in the webinar today. So this picture shows how this process works. Starting from the bottom, what you have is the Aarna GPU Cloud Management Software for full isolation. What we mean by that is that when you're running
            • 07:00 - 07:30 these different types of workloads, obviously you want to maintain some level of isolation, and you want it to be as strict as possible, because you are typically running your RAN workloads in this GPU cloud and you could also be running machine learning workloads which may or may not belong to your organization. We'll show you some of those possibilities. In
            • 07:30 - 08:00 that case you obviously want it to be well isolated, all the way from CPUs and GPUs to the networking, and the InfiniBand if there is InfiniBand. If it is a fractional GPU, you again want to use a technology like MIG to segment the GPUs and use them for both of these workloads. Thirdly, there is also a service offered by NVIDIA where you
            • 08:00 - 08:30 can register your spare capacity with NVCF, and we're going to show that in the demonstration today. The top layer that we're seeing there is the dynamic switching. What we mean by that is that you can switch from one workload to the other and do that in a dynamic manner. Obviously you can do it statically, but nobody is
            • 08:30 - 09:00 really going to do that statically. Nobody's going to monitor how much RAN is running and at what capacity your GPUs are being utilized, and then manually spin up some machine learning workloads or manually go register your spare capacity with NVCF. So what you want is a solution that can do that dynamically by monitoring the
            • 09:00 - 09:30 resources, and these resources need to be monitored based on the requirements of the users, essentially the GPU owners. Some of the telco owners may want to monitor the call traffic, some of them may want to monitor something else, maybe RAN utilization. So the orchestrator needs to be flexible enough to monitor the
            • 09:30 - 10:00 type of resource that the GPU owner wants to monitor, and that's essentially what we enable in Aarna's orchestration, where the type of resource that you want to monitor can be programmed. So this shows the entire stack, the AI-RAN stack.
            • 10:00 - 10:30 This is the NVIDIA stack, and we show where our orchestrator sits in this stack. I'm not going to go into all the details of this; you can find it on the NVIDIA website. Essentially it shows how, for example, Grace Hopper or GB, which is the latest Grace Blackwell platform, can be utilized for running these types of
            • 10:30 - 11:00 workloads. On the right-hand side we're showing the RAN workloads, where there are the disaggregated CUs, the UPF, the RIC, and they'll all be using, as I mentioned, the Aerial SDK to make use of the GPU acceleration. On the left-hand side you'll see the machine learning, which in this case could be NVIDIA NIMs, and at the top, the NVIDIA Cloud
            • 11:00 - 11:30 Functions, NVCF. The orchestrator that we are talking about is something which can understand both of these workloads and can do the dynamic switching. If you can go to the next slide. So this shows some of the functionality of AMCOP, or Aarna's orchestrator, which is also referred to as CMS, the Cloud Management
            • 11:30 - 12:00 Software. You see three functions there. The first one is infra orchestration, which essentially means starting all the way from the bare metal, which could be HGX or DGX servers or MGX servers with Grace Hopper, where we can orchestrate that: bring up the operating system on it using any of the
            • 12:00 - 12:30 provisioning tools, bring up the Kubernetes layers, because most of these cloud-related functions are based on Kubernetes, and also configure all the underlying switches: the fronthaul switch, the backhaul switch. There could be a grandmaster present in the configuration, so configuring that. So essentially taking care of all the underlying infrastructure. That's the first part of the orchestration. The second part is
            • 12:30 - 13:00 the network service orchestration, which could be the SMO. The SMO functions include the O1 and O2 layers. Or we could also work with other SMOs: if there is a third-party SMO or a vendor-specific SMO, we could just be layered on top of that. In that case we call it an xSMO, because we are not doing the direct configuration
            • 13:00 - 13:30 management or performance management of the RAN functions; we could be using the vendor SMO functions and just be layered on top. Or we could directly be the SMO: some vendors are completely disaggregated, where we can work with those functions directly. So both models are possible, the SMO and
            • 13:30 - 14:00 the xSMO. Then there are other functions like slicing, the non-real-time RIC, and configuring the radio units. All of them can be performed by AMCOP, or CMS, and again, some of these are optional: either you use the AMCOP SMO directly or you use the vendor SMO functions. And then the last one, the most important one, is the application orchestration. This is where we orchestrate the
            • 14:00 - 14:30 machine learning functions, which could be some RAG models that they want to run, or some training, fine-tuning, or inferencing. We can orchestrate them using their Helm charts or as Docker containers, and bring them up on these AI clouds. And more importantly, the closed-loop functions:
            • 14:30 - 15:00 we can monitor both the RAN and the machine learning workloads, and the monitoring could be based on some standard interfaces. For example, some of the RAN functions may expose O-RAN functionality; in that case there is standard FCAPS that we can monitor using the standard O1 functions. Or if they expose a Prometheus-type interface, we can just look at their endpoint and monitor these functions. So
            • 15:00 - 15:30 both options are possible. Based on that you can create a closed-loop function, where you create a policy that says: based on these criteria, make this change. For example, you decide to scale in your RAN functions and add machine learning workloads, and when you again reach a peak scenario where there's a lot of
            • 15:30 - 16:00 RAN workload, you do the opposite: you start scaling out your RAN functions and maybe spin down your machine learning. All of this can be done using the closed-loop operations. What we show at the top is our single pane of glass, the AMCOP single pane of glass, which is a unified UI for running both of these workloads. So now we jump into the demo. So
            • 16:00 - 16:30 before the actual demo we'll go through some of the concepts here and what we're going to show. What we're going to show is essentially RAN and AI workloads running on the same GPU, and in fact we're going to show it at an even finer granularity, running on different MIG partitions. We take a single GPU and divide it into multiple
            • 16:30 - 17:00 partitions, let's say seven partitions on a GPU (this is a Grace Hopper with a single GPU), and then we monitor the AI and the RAN workloads. In the demo we show the dashboards where we monitor both the AI and RAN workloads. Then, independently, we show how we can register the spare capacity on NVCF, and then we also do the scheduling,
            • 17:00 - 17:30 NVCF scheduling, of a particular model; we happened to pick a DeepSeek R1 model, which we schedule through NVCF. So it's actually showing two different things. In the first case we are directly deploying a RAG model and we monitor it; that's what we see in the first two steps. In the third and fourth steps we show similar functionality
            • 17:30 - 18:00 but using NVCF. In that case we register the spare capacity on NVCF, and through the NVCF portal we submit a job. This job happens to be an inference job using the DeepSeek R1 model, and then we show some queries running on that. That's essentially what we show in the demo. So now we are showing
            • 18:00 - 18:30 a typical deployment architecture. This is not the demo that we're going to show; the demo is a much smaller configuration, but in a typical deployment this is how it will eventually look. Let me spend a minute to explain this. On the left-hand side you see a configuration; it's actually a three-tier switch configuration, and we have the UFM, which is to manage InfiniBand,
            • 18:30 - 19:00 there's an InfiniBand switch, there is an out-of-band network, and then what you see at the bottom are the nodes. In this case we're showing HGX nodes, but it could be other types of nodes: MGX Grace Hopper nodes, DGX nodes, it can be anything. You also see storage; there could be storage that is needed for both RAN and the machine learning. Now what we have done in this case is that we have
            • 19:00 - 19:30 isolated each of these workloads. You see the blue boxes, which are running the RAN workloads, and you also see the green boxes, which are the spare capacity. In this case the operator wants to provide all of the spare capacity and monetize it, so they could advertise that with NVCF, and then the NVCF service could schedule workloads on that. Now look at the
            • 19:30 - 20:00 picture on the right-hand side. Here we're showing the same thing, but the concept is that the green boxes are elastic, because the spare capacity typically is not static. Sometimes your cloud may be fully utilized: you may have enough customers or enough RAN workload already running, so you don't want to advertise it as spare
            • 20:00 - 20:30 capacity. But during the non-peak hours you have some nodes freeing up, so in that case you want to push them into this NVCF pool, where you have more spare capacity that is advertised and that could potentially be monetized, because that capacity may be utilized by NVCF to run any workloads. Now the key here is that this process has to be dynamic, and it also
            • 20:30 - 21:00 has to be done in a fully isolated manner, because you can't randomly pull out your RAN workloads and push them into the machine learning bucket. You have to isolate them: make sure that all their networking is isolated, all their InfiniBand is isolated, or, if you're using a MIG partition within a single node, make sure that your other partitions are isolated from running these workloads. And
            • 21:00 - 21:30 then once you advertise it, the NVCF service can take care of scheduling jobs on that. Alternatively, instead of using NVCF, you could be running your own workloads. For example, if you have your own training jobs or inferencing jobs, which could be internal to your organization, then you can run them directly on this spare capacity. In that case the isolation may not be that
            • 21:30 - 22:00 strict, but conceptually it's the same thing: you still want to make sure that one of the workloads doesn't take up any resources from the other workloads, so you still want to maintain some level of isolation. That's essentially the typical deployment architecture that you can eventually realize, but obviously in the demo we're not going to show this kind of large configuration. What we're going to
            • 22:00 - 22:30 show in the demo is a much smaller configuration, and you can see it in the next slide. If you can go to the next slide. So this is what we'll show in the demo today. We're going to show it on a single NVIDIA Grace Hopper 200 server. The boxes that you see in the middle are the different partitions; they're MIG partitions. What you're doing here is
            • 22:30 - 23:00 using three of the MIG partitions for running RAN workloads and four of them to run the machine learning. In this case you're not going to use the NVCF service; you directly use these partitions to run a RAG model. That's what we're going to show in the demo. Then you can run some queries against the RAG model, and on the AMCOP single pane of glass you can see the
            • 23:00 - 23:30 monitoring, where, as you run the RAN workloads and start querying the RAG model, you can see that the utilization keeps going up. You can also see the dashboards, the RAN dashboards as well as the machine learning dashboards, and corresponding to that traffic you can see the utilization going up on the dashboard. That's essentially what we demonstrate
            • 23:30 - 24:00 today. That's the first part of the demo. In the second part we'll show similar functionality but on a slightly larger configuration, with multiple nodes in a cluster, registering with NVCF. Again, conceptually it's the same thing: we run the DeepSeek model using NVCF. So these are the two distinct demos that we show
            • 24:00 - 24:30 today. So with that, let me switch to the demo. It's a recorded demo, so I'll run through it, and wherever required I'll pause and explain what I've just talked about. Okay, so this is the AMCOP. In this case the system is
            • 24:30 - 25:00 already prepared: we have already created the MIG partitions and deployed both the RAN workloads and the RAG model, which queries some internal database. So both of them are already deployed, and now you'll see that you are actually starting to run some simulated traffic on the RAN and also running some
            • 25:00 - 25:30 queries. This is a simulation test, so obviously we're not running the real RAN; you're running a simulated RAN. Through the simulation window you can see that the RAN traffic is going up, and on the first dashboard you can see the GPU utilization. There are two dotted lines that you see: one corresponds to the 5G RAN traffic and the other one
            • 25:30 - 26:00 corresponds to the AI, which is really the RAG model that you're running. You can see that it's roughly 40/60: you're running three of the MIG partitions for the RAN and four of the MIG partitions for the machine learning. In this case this is a static configuration, we're not actually changing it, but conceptually you can imagine that the division between the partitions can be changed
            • 26:00 - 26:30 dynamically. So you can see that as the traffic goes up and the generated tokens start going up, the utilization in the first window keeps going up. So you can see both the RAN and the
            • 26:30 - 27:00 tokens going up, and corresponding to that you can see that it's hitting close to 80% of the GPU. Keep in mind that there are actually seven partitions: three partitions running the RAN workload and four partitions running the machine learning workloads. So you can see all the corresponding
            • 27:00 - 27:30 dashboards related to the RAN and also the machine learning traffic. In a single pane of glass the operator can now monitor both of them, the 5G traffic as well as the RAG traffic which is being run. Let me quickly forward this. So you can see
            • 27:30 - 28:00 the active cells, the connected wireless subscribers, the completed AI requests, the throughput, the tokens. These dashboards are fully configurable, so you can group them in a different way: you can group all the RAN dashboards on one side and the machine learning and AI workloads on a different one, or enable or disable some of these
            • 28:00 - 28:30 dashboards as required. So you can see that now the utilization keeps changing: you're running more of the machine learning and less of the RAN. The RAN has come to zero.
            • 28:30 - 29:00 So now we switch to the second part of the demo, which is NVCF. In the first part of the demo we deployed these RAG models directly using their Helm charts; we did not use the NVCF service to deploy them. What we're going to show now is how you can accomplish the same functionality through NVCF. In this case what
            • 29:00 - 29:30 we're going to do is create a cluster and register that cluster with NVCF, and you can see how the cluster can be dynamically changed: you can go from two GPUs to three GPUs, or three to four, or whatever. Then you can see that reflected on the NVCF portal, and through NVCF you can submit a job. In this case the job happens to be a DeepSeek model, so you will see how that
            • 29:30 - 30:00 works. First you log into the CMS portal. You can see that right now nothing has been onboarded. There are two tenants created; conceptually, the tenants are the RAN and the AI, and the AI can also be NVCF. So these are the two tenants, NVCF and
            • 30:00 - 30:30 RAN. Now what we're going to do is allocate some resources, essentially a cluster. In this case you're going to create a cluster, which is a Kubernetes cluster. In the demo we show it with a kind cluster, which is easier to set up, but this can be any cluster, based on upstream Kubernetes, or OpenShift in case the customer has an OpenShift license, or it can be a kind
            • 30:30 - 31:00 cluster. For the demos we use a kind cluster. Then you allocate the nodes. In this case we allocate one node; this is actually a kind cluster, so the worker nodes can be any number, and we set it up with two worker nodes. You see that now the cluster is running, and there are two worker
            • 31:00 - 31:30 notes the NV kind one and NV kind 2 so you can go back and see on the Coes cluster so now what you're doing is U registering this cluster with nvcm right so this is a onetime operation so this is typically a day Zero operation so you don't have to keep registering it so you just register once and then then whenever your cluster changes right
            • 31:30 - 32:00 whenever you add more nodes into the cluster remove nodes into the cluster that will automatically get reflected on your on the portal on the nvcf portal so you don't have to keep registering it so here you're going through the registration process the onetime process so this is an on-prem cluster so you can see that it's an on-prem cluster um that you're registering
            • 32:00 - 32:30 Once you register, the NVCF portal gives you a Helm command to run on your target cluster. Again, this is a one-time process; it sets up the NVCF agent on your target cluster. So you run the Helm command,
            • 32:30 - 33:00 and this essentially sets up the NVCF operator, the NVCF agent, on your cluster. You can see that the NVCF operator is starting up, and you can see the NVCF agent coming up on your cluster.
            • 33:00 - 33:30 Now if you go back to the NVCF portal, you will see that your cluster is registered. You can see your cluster, NVCF demo cluster, which is what we registered, and it's showing up with two GPUs. Both of them are unused; it's zero out of two. This number two can keep
            • 33:30 - 34:00 changing: as you register more GPUs this number keeps incrementing, and as you reduce your cluster size it keeps decrementing. So you don't have to do anything else once your cluster is registered. You can dynamically keep adding nodes to it or removing nodes from it, and that process can be either manual or automated. Typically nobody's going to do that manually; nobody's going to
            • 34:00 - 34:30 add and remove nodes by looking at some KPIs and metrics. Instead, what you would do is set up some kind of a policy: when your utilization goes down to a certain level you add another node into your NVCF cluster, and when your utilization goes up you take it out of the cluster. So you set that up automatically. But in the demo, of course, we show that
            • 34:30 - 35:00 manually. Now you see that there are two nodes, so it's showing two GPUs. Now what you have done is add one more node, so if you go back to NVCF you'll see that there are three GPUs. Here you can see that there are now three GPUs because you've added one more node to your cluster. And like I said, that addition and removal can be automated; it doesn't need to be a manual
            • 35:00 - 35:30 process. So now your cluster is registered; you have three nodes, three GPUs, registered with NVCF. And again you go back to two, so you can go back and forth. Now you switch to the job submission part of the demo. This is again the NVCF user portal, where users can
            • 35:30 - 36:00 register and then submit jobs. They can add their functions, either Helm charts or Docker containers, and then deploy them through their registry. This is essentially through NVIDIA DGX Cloud. In this demo we're going to show it as a custom container.
            • 36:00 - 36:30 As I mentioned, we're going to pick a DeepSeek model, which has a containerized version, and use that to deploy through NVCF. Again, this is all through the NVCF portal, and there are of course APIs to do the same thing, so this can all be done through APIs as well. And this is all NVIDIA's NVCF code, so
            • 36:30 - 37:00 this has nothing to do with AMCOP. Now you've deployed it. When you're deploying it, it shows you what the options are. In this case, in the demo, we pick a00 because that's what we've registered with NVCF, and that will show one of the on-prem clusters that we have registered.
            • 37:00 - 37:30 So you pick your on-prem cluster, which is the NVCF demo cluster that we just registered. There are various other filters: you can specify the geography where you want to run, what your scaling factor should be, and what the minimum and maximum instances should be. So there are various knobs that you can use to
            • 37:30 - 38:00 control how many resources you are going to consume on these clusters. Finally you deploy your version, so that's what you're going to do now. Now your function is getting deployed. Remember that your function is getting deployed on your on-prem cluster, which we have just registered, and through the NVCF portal you
            • 38:00 - 38:30 can see how it is getting utilized. Now we can go back to the cluster that we have registered, and you can see the logs. Here we have also used a simple command-line curl call to query the model. Since the DeepSeek
            • 38:30 - 39:00 container is running, we are querying it using a curl command, and you can see that the model responds. You asked a simple question, the capital of France, and the DeepSeek model responded with its answer. So that's essentially how
            • 39:00 - 39:30 you can query these models. You can also create your own chat-type window to do the same thing and query these models with chat. We have also integrated the chat interface into AMCOP, so through AMCOP itself you can query these models.
            • 39:30 - 40:00 We'll just show a glimpse of that. Here you can see the query interface that we have created; through one of the windows in AMCOP you can actually query the model. So that concludes the demo, essentially what we wanted to
            • 40:00 - 40:30 show. Thank you, Sriram, thank you very much. I think there was a lot of content. Let's see, I can see some questions, so let me put them across to you. And to all the attendees, please feel free to post your questions, and we'll use the rest of the time to go through and explain those.
            • 40:30 - 41:00 Sriram, I can see one question that asks whether the Aarna SMO supports both vRAN and O-RAN for any vendors. Yeah, let me answer that. O-RAN, obviously yes, because our SMO is fully standards compliant, so if there is any RAN vendor who is O-RAN compliant, then we should work out of the box,
            • 41:00 - 41:30 and we have done the integration with a few of the vendors. Now, vRAN is a little more tricky. vRAN means that they have virtualized it, which means that we can deploy them, but we may not be able to do all the FCAPS, because their interface may not be standard. Having said that, we can manage them if they have a standard REST interface or they expose any of those; we can build a plug-in to manage those vRAN
            • 41:30 - 42:00 components, and we have done similar things with other vendors. So the short answer is yes, but to varying degrees: with complete support for O-RAN, and for vRAN depending on how much we can integrate with it. Okay, thanks, Sriram. I can see one more question here. The
            • 42:00 - 42:30 question is: many telcos already have RAN orchestrators or SMOs, so how would the Aarna solution work, and will it be able to manage both RAN and AI workloads, especially given that telcos might have an existing RAN solution but nothing for AI workloads, while both need to be managed in the way that was just
            • 42:30 - 43:00 demonstrated? Yeah, I think that's a good question, and we typically get asked that question. The approach that we took is that we are a standards-compliant SMO, so if some RAN vendor is fully disaggregated and gives access to their RAN components, their CU/DU components, we can directly manage it. But if the RAN vendor has their own SMO, we still want to make sure that we
            • 43:00 - 43:30 can work with it. That's the reason we started calling it xmo: we can be layered on top of their SMO. In that case we are not going to be managing all the FCAPS of the RAN directly; that's done by their SMO. We will only orchestrate. Let's say we could do some day-zero functions of the RAN, we could of course do the machine learning
            • 43:30 - 44:00 orchestration, and we could do the NVCF that we just showed. All of that we can still do. We could also do all the isolation functions that we talked about, where we create dedicated resources for your RAN components. So we can do some of the day-zero preparation for you and then let your SMO take over. That's a mode that we can work with. Okay, thanks, Sriram. There's one more
            • 44:00 - 44:30 question: what level of automation is built into the orchestration platform? Can it self-optimize GPU resource allocation based on network conditions? Yeah, I think I briefly touched upon this during the course of the presentation. There are multiple levels of automation. First of all, creating isolated resources: that is fully
            • 44:30 - 45:00 automated. You can create your day-zero configurations with different tenants; one of the tenants can be the RAN tenant, the other can be your machine learning workloads, and you create them in a fully isolated manner. Now, what happens during day N? During the course of operation, you determine that your RAN workloads are not fully utilized, so they're not taking up all the resources. In that case you can set up
            • 45:00 - 45:30 a policy in the product, in AMCOP, in the orchestrator, such that when you hit this condition you scale in your RAN and scale out your machine learning, and you can do just the opposite. That part can obviously be done manually, but more often than not you want to do it in an automated way. The way you do that is by creating this policy and
            • 45:30 - 46:00 a workflow, and that workflow essentially does this operation, where it takes resources out of one group or one tenant and adds them to another tenant, and it can do the reverse. Okay, thanks. There's one question I see: how difficult is it for operators to implement this solution, and
            • 46:00 - 46:30 what are the typical deployment timelines and the required infrastructure upgrades? I think typically most of them start with a PoC, a lab PoC, and that's a fairly simple process. If they have the NVIDIA resources, which presumably they have, we can get started in a fairly short period of time. All we need to do is deploy our orchestrator. It
            • 46:30 - 47:00 can even run on the Grace Hopper. For example, if we have just one Grace Hopper, we could run on the Grace Hopper, on the Arm processor, and then perform the functions that we just talked about in the webinar today. And then, of course, going to production is a slightly more complex operation. In that case the time it takes will depend on
            • 47:00 - 47:30 their resources: whether they have one scalable unit or multiple SUs, what type of resources they have, whether it is MGX, HGX, or DGX, and the other resources, the switches. So it all depends on that. But starting up with a lab PoC with a few Grace Hopper servers should be a fairly simple process, probably a matter of
            • 47:30 - 48:00 days. Thanks. I can see one more question: what are the key differentiators of your orchestration platform compared to other GPU resource management solutions in the market? One of them is clearly the one that we talked about, which is the ability to orchestrate across domains.
            • 48:00 - 48:30 We just showed that we can seamlessly orchestrate between the 5G or RAN workloads and the machine learning, AI, or GenAI-type workloads, so that's clearly one area. For more non-RAN-type workloads, there are differentiators such as doing hard isolation. The approach that we take is doing the
            • 48:30 - 49:00 hard isolation of all your resources, because typically for these types of workloads you don't want to mix up your resources across different domains, or, in the case of multiple tenants using the same infrastructure, you don't want to mix up your resources across the tenants. But in this context of AI-RAN, it's the multi-domain orchestration
            • 49:00 - 49:30 part which is really the key differentiator for us. Okay, thank you. I think there are no more questions. Thanks, everyone, for attending this webinar. We will have this recording published on our website. I would also like to thank Sriram for taking this
            • 49:30 - 50:00 webinar. Thank you. Thank you, everyone. Bye. Great, thanks all, have a good day.
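
            Editor's sketch of the scale-in/scale-out policy described in the Q&A (around 44:30 - 46:00). The speaker describes a policy that moves resources from the RAN tenant to the machine learning tenant when RAN utilization is low, and back again when it rises. The sketch below is only an illustration of that idea under stated assumptions: the names Allocation and rebalance, the thresholds 0.3 and 0.8, and the one-GPU-at-a-time step are all hypothetical and are not taken from the orchestrator's actual API or configuration.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """GPUs currently assigned to each tenant (hypothetical model)."""
    ran_gpus: int
    ai_gpus: int


def rebalance(alloc: Allocation, ran_utilization: float,
              low: float = 0.3, high: float = 0.8,
              min_ran: int = 0) -> Allocation:
    """Return the next allocation for one policy evaluation.

    ran_utilization is a fraction between 0.0 and 1.0. The thresholds
    `low` and `high` are illustrative assumptions, not product defaults.
    """
    # RAN is underused: scale in RAN, scale out machine learning.
    if ran_utilization < low and alloc.ran_gpus > min_ran:
        return Allocation(alloc.ran_gpus - 1, alloc.ai_gpus + 1)
    # RAN is busy: do the opposite, if the AI tenant has a GPU to give back.
    if ran_utilization > high and alloc.ai_gpus > 0:
        return Allocation(alloc.ran_gpus + 1, alloc.ai_gpus - 1)
    # Utilization is inside the band: leave the tenants unchanged.
    return alloc


if __name__ == "__main__":
    current = Allocation(ran_gpus=2, ai_gpus=2)
    for utilization in (0.1, 0.1, 0.5, 0.9):
        current = rebalance(current, utilization)
        print(utilization, current)
```

            Using two separate thresholds rather than one leaves a band in which nothing moves, so a utilization value hovering near a single cut-off would not shuttle a GPU back and forth on every evaluation. In a real deployment the workflow behind such a policy would also have to drain workloads before reassigning a node, which this sketch does not model.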