The Evolution of Java at Netflix

How Netflix Uses Java - 2025 Edition

Estimated read time: 1:20


    Summary

    Netflix's journey with Java reveals a dynamic technological evolution. Once reliant on older frameworks and Java 8, Netflix has modernized by adopting newer Java versions, employing Spring Boot for its services, and using GraphQL for data management. Java continues to be a cornerstone of the backend architecture, powering high-demand applications and systems, from streaming services to enterprise applications. Key innovations like virtual threads and generational garbage collectors have enhanced Netflix's operational efficiency, illustrating Java's pivotal role in maintaining Netflix's robust and scalable infrastructure.

      Highlights

      • Netflix evolves Java architecture for increased efficiency πŸ› οΈ
      • Keynote discusses licensing, transitioning to newer JDKs β˜•
      • GraphQL outshines REST, offering flexibility for UIs πŸ•ΈοΈ
      • Spring Boot underpins Netflix's robust frameworks πŸ‘·β€β™‚οΈ
      • Virtual threads touted as game-changer in concurrency 🌐

      Key Takeaways

      • Java sees major role at Netflix, with updates for improved efficiency 🎬
      • Transition from old frameworks to Java 17-21 achieved great performance boosts πŸš€
      • Virtual threads and generational GC improve concurrency and reduce errors πŸ›
      • GraphQL offers flexible data querying; REST is discouraged πŸ’‘
      • Close collaboration with Spring team enables robust integration πŸ”„

      Overview

      Over recent years, Netflix has shifted its Java strategy to stay ahead of the tech curve. Originally anchored in the Java 8 architecture, the streaming giant has progressively adopted newer versions, currently operating extensively on Java 17 to 23. This move, coupled with a full embrace of Spring Boot, reflects Netflix's commitment to leveraging cutting-edge technology for its varied array of services and applications.

        Adopting GraphQL has proven essential for Netflix, offering a much-needed alternative to REST. GraphQL's flexible schema capabilities enable more efficient data handling across Netflix's diverse platforms, tailored for everything from user interfaces to high-volume back-end services. The company emphasizes this industry-wide shift towards more modern, efficient data management solutions.

          The collaboration between Netflix and the Spring development teams has been instrumental in smoothing the transition process and optimizing Java for their high-demand infrastructure. By focusing on innovations like Spring Boot integration and virtual threads, Netflix continues to refine its technology stack, ensuring scalability and enhanced performance for its vast global user base.

            Chapters

            • 00:00 - 03:00: Introduction and Overview of Java at Netflix The chapter discusses the dynamic and evolving use of Java at Netflix. The speaker notes that while they have previously given talks on this topic, each presentation varies due to ongoing changes in Netflix's architecture and technology. This reflects Netflix's continuous learning and adaptation, where certain technologies are phased out while others are reintegrated.
            • 03:00 - 09:00: Misconceptions about Java and Netflix's Backend Architecture This chapter discusses misconceptions about Java and Netflix's backend architecture. It begins by noting that even if listeners have heard a similar talk in the past, the upcoming discussion will likely differ from previous ones, highlighting the evolution and updates in the topic. The speaker references a recent keynote appearance that gained significant attention on social media, particularly Twitter, showcasing a picture posted by someone named Bruno. The chapter hints at an engaging and interactive dialogue, potentially sparking curiosity or questions related to the subject matter, though specific details on Java or Netflix's backend are not yet outlined.
            • 09:00 - 15:00: Java Platform at Netflix and High-level Architecture The chapter discusses the financial aspect of using Java at Netflix. It highlights that Netflix pays no licensing fees to Oracle for Java usage, as they use OpenJDK. Although there is a contract with Azul, it does not involve payments to Oracle. The chapter also touches on programming language choices, specifically the question of why Netflix doesn't use Rust, which triggered debate and discussion.
            • 15:00 - 24:00: Challenges with Legacy Systems and Java Upgrade This chapter delves into the challenges faced by organizations when dealing with legacy systems and the decision-making process involved in upgrading Java. There is a significant amount of criticism surrounding Java, with some individuals expressing dissatisfaction to the extent of avoiding services like Netflix, which share an association with Java. A discussion emerges on alternatives like Kotlin and Rust, although these are greeted with skepticism and debate. The sentiment portrays a mixed reception towards Java, highlighting the complexities and opinions that organizations must navigate when deciding whether to upgrade or switch technologies.
            • 24:00 - 30:00: Garbage Collection Improvements and Performance Gains In this chapter, Paul introduces himself as a Java Champion and part of the Java Platform team at Netflix. He discusses their responsibility for building a Java platform that supports Netflix's infrastructure, including the JDK and build tools. Paul is also highlighted as one of the original authors of the DGS framework, a GraphQL framework used within the company.
            • 30:00 - 39:00: Virtual Threads and Debugging Challenges The chapter titled "Virtual Threads and Debugging Challenges" starts with a high-level overview of the architecture related to the Java platform. It discusses where and how Java is used, including aspects like the JDK, Java frameworks, and build tooling. The introduction sets the stage by mentioning commonly known applications, such as the Netflix app, to contextualize where Java fits into larger technology stacks. The summary hints at more detailed exploration of virtual threads and debugging challenges that may follow in subsequent sections.
            • 39:00 - 54:00: Spring Boot Framework at Netflix The chapter 'Spring Boot Framework at Netflix' discusses the common misconception that Netflix uses Java for its user interfaces across all devices. It clarifies that while Netflix's backend services may be Java-oriented, the front-end UIs are developed using whatever language is most suitable for each specific device. This includes different technologies for TVs and mobile devices, with the exception of Android, which uses a form of Java. Therefore, the chapter primarily focuses on technical details related to Netflix's backend systems, presumably highlighting the role of the Spring Boot framework in that area.
            • 54:00 - 72:00: GraphQL Integration and Usage This chapter discusses the integration and usage of GraphQL in Netflix's architecture. It highlights the specific requirements and challenges of Netflix streaming, including handling extremely high requests per second due to the massive number of users. The chapter also touches on the scalability of Netflix's backend systems, which operate across multiple Amazon Web Services regions to manage traffic efficiently.
            • 72:00 - 80:00: Comparison of IPC Mechanisms and Conclusions The chapter discusses the importance of having low latency for backend services by using multiple Amazon regions. It emphasizes that if users in Europe had to connect to a US data center, it would result in higher latency and a slower experience, which is undesirable. To combat this, services are distributed across four different Amazon regions. However, this setup introduces complexities, as connecting between regions is expensive and relatively slow, though these terms are relative.

            How Netflix Uses Java - 2025 Edition Transcription

            • 00:00 - 00:30 All right, we can get started. I'm going to talk about how we use Java at Netflix. You might have seen this talk before in a slightly different variation, because over the last few years I've done similar talks, basically just iterating on how we use Java at Netflix. However, every time I give this talk it's different, because our architecture keeps changing. We keep changing our technology, we keep learning; some things go away, some things come back in. So
            • 00:30 - 01:00 even if you've seen this talk before, it's probably going to be different from what you might have seen two years ago. Anyway, before we jump into that: yesterday, if you were at the keynote, I had my three minutes of fame on the keynote stage, and Bruno posted this picture, and then for the rest of the day my Twitter, or everyone's favorite social network, was on fire. Just some of my favorites. The first question was actually a really good one: I wonder
            • 01:00 - 01:30 how much in licensing fees we pay to Oracle if we build everything in Java. Hopefully everyone here knows the answer: zero, because we use OpenJDK. We do actually have a contract with Azul, but that's completely beside the point. We don't pay Oracle at all. And then: why not Rust? And this was not the only person. Then the next person asked, well, why Rust? That is actually a pretty good answer to that. So there was a little back and forth. A lot of people apparently
            • 01:30 - 02:00 are not so happy with Java. Some people never want to watch Netflix again, because apparently it's tainted by Java. So that was interesting. And then there was this one person: well, Java sucks, it's heavyweight and slow, etc., you should do Kotlin. You're like, I don't think that's how it works, but okay. And the "why not Rust" discussion was just kind of fun. Anyway, that is how people look at Java, I guess. So let's get into some more
            • 02:00 - 02:30 serious topics. My name is Paul. I'm a Java Champion, and I'm on the Java Platform team at Netflix. On Java Platform, we are responsible for the Java framework that we build our stuff on, for the JDK, and for the build tooling and all these things. We basically build the Java platform that everyone else at Netflix builds their services on. I'm also one of the original authors of the DGS framework, which is the GraphQL framework that we use. We'll get into it a little
            • 02:30 - 03:00 bit; that's also part of Java Platform. Okay, so we're going to start with a more high-level overview of the architecture. That is going to be about where we use Java, and then we bubble up the stack: talk about the JDK, the Java frameworks we use, some build tooling, and get a whole picture of where and how we use Java. You're probably familiar with this screen. This is the Netflix app
            • 03:00 - 03:30 where you choose the next show you might want to watch. There was actually some confusion in this thread that I just showed: people were assuming that "everything is Java" also means that all the UIs are in Java. The UIs are in whatever language is most fit for the device you're on. If you're on a TV, that is different from a mobile device, but none of it is Java, except maybe Android, because that's kind of Java, but not really. Anyway, everything I'm saying is about backends. If we think about the Netflix streaming
            • 03:30 - 04:00 app, that is only one aspect of the type of applications we build at Netflix. But there are some very specific things about Netflix streaming. It is extremely high RPS, many requests per second, just because of the number of users we have. We have many millions of users, and that means there's a lot of traffic that we have to deal with in our backends. We're multi-region, meaning we run in four different Amazon regions, and
            • 04:00 - 04:30 that is so that wherever you are in the world, you have pretty low latency to our backend services. If you're in Europe and you had to connect to a US data center, that just adds a lot of latency and makes things slower. That's not a good experience. So we are in four different Amazon regions. But of course that has all sorts of implications for our backends, because now we have to deal with these different regions, and connecting from one region to another region is really expensive and relatively slow. All in relative terms, of course, because it
            • 04:30 - 05:00 just adds milliseconds, but that is a problem by itself. Typically when a request comes in, that is just one request coming in from a device to our backends, and then we have a huge fan-out to all the different microservices that we have; we'll get into that a little further. But we have to deal with that fan-out. If you think about failure, when it comes to Netflix streaming we can usually just do a retry if a call to one of our backend services, for whatever reason, times out.
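The retry-on-failure idea described here can be sketched in plain Java. This is an illustrative pattern, not Netflix's actual IPC client; the names (`callWithRetry`, the fallback value) are made up for the example:

```java
import java.util.function.Supplier;

// Sketch of the retry-on-failure pattern described above: call a
// backend, and on a timeout or transient error retry, hoping to land
// on a healthier instance. Illustrative only.
public class RetryExample {

    // Try the call, retry up to maxRetries times, and fall back to a
    // default value (e.g. an empty recommendation) if every attempt fails.
    static <T> T callWithRetry(Supplier<T> call, int maxRetries, T fallback) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                // A timeout or transient error: loop and retry.
            }
        }
        // Partial data is acceptable on the streaming side: return a
        // fallback instead of failing the whole request.
        return fallback;
    }

    public static void main(String[] args) {
        // Simulated flaky backend: fails the first call, succeeds on retry.
        int[] calls = {0};
        String result = callWithRetry(() -> {
            if (calls[0]++ == 0) throw new RuntimeException("timeout");
            return "recommendations";
        }, 1, "empty");
        System.out.println(result); // prints "recommendations"
    }
}
```

Because a retry usually lands on a different, healthier instance, one retry is often enough; and if all attempts fail, returning a fallback keeps the overall response usable, which is exactly the failure model described for streaming.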
            • 05:00 - 05:30 We also set very aggressive timeouts, so that latency overall stays really low. If one of the requests to a microservice times out or fails for another reason, we can usually just retry, and that will usually fix the problem, because we land on another instance, and things might be better there; the one instance we hit the first time might be in a garbage collection pause or something like that. But we can get
            • 05:30 - 06:00 away with just retrying on failure for the most part. And if that doesn't work, we can often just return a response to the device with maybe some data missing, because the Netflix app as a whole will still be fine. You might miss a specific recommendation or something like that, but for the most part, you as a user won't even notice. So we can be okay with some failure, and that is important because we have this huge fan-out to many different backend
            • 06:00 - 06:30 services. The other thing is that we typically don't use relational data stores in the streaming path. Again, that is because of being multi-region; that doesn't work so well for relational databases for the most part. Also, the type of data we store doesn't fit as well in a relational model. More often we use things like in-memory distributed data stores for caching and things like that; that is more common in this space. Now, that's Netflix streaming. On
            • 06:30 - 07:00 the other hand, we also have many more traditional enterprise apps. Netflix is also one of the largest film studios in the world, and that means we have built a lot of software around movie production: to manage people and equipment and stages and whatnot, everything that comes with actually filming. A lot of software was just built in-house, and these are typically apps that are more traditional. So
            • 07:00 - 07:30 these are probably the kinds of things that you very likely work on. More traditional: you have a UI, you have a backend, the UI stores data in a database or something like that. Super critical apps for the business, but very different if you look at the traffic patterns. Compared to Netflix streaming, these apps are usually very low RPS; there are not that many concurrent users compared to Netflix itself. We can typically get away
            • 07:30 - 08:00 with a single region; we don't have to be in every data center in the world for low latency. The data is often a good fit for a relational database. That doesn't mean we always use a relational database, but it's often a good option. On the other hand, failure is not an option. If you are in movie planning, for example, and you have to save some data from your UI, it's not acceptable that the data just disappears; we actually want it to end up in a database. So that whole retry-on-
            • 08:00 - 08:30 failure mechanism that we can rely on on the streaming side doesn't apply as well for these enterprise apps, because the acceptance for failure is very different. So: much lower traffic, much easier to scale, very different failure model. Now, if you look at the architecture that we use for both of those, interestingly, it is kind of the same. This is Netflix streaming: from your TV or
            • 08:30 - 09:00 whatever device you run Netflix on, a GraphQL request goes to our API gateway. That's the first place you get to, and it means it's one HTTP request containing a GraphQL query. Now, the GraphQL query is actually a federated GraphQL query. So although from the perspective of the device there's one giant GraphQL schema, in fact that schema is
            • 09:00 - 09:30 implemented by many different backend services. So for this one small query, a little GraphQL query asking for the title and artwork URLs for shows, which is what we need to build this screen, we might have to hit three different backend services, and the federated GraphQL gateway basically takes care of that. The federated GraphQL spec defines how a backend service can register
            • 09:30 - 10:00 its schema with the schema registry, and based on that the gateway knows: oh, if I have to fetch titles, I have to go to the movie catalog service. We call these services DGSs, domain graph services. That's just a term we came up with, but basically this is a GraphQL service built with the DGS framework and with Spring Boot, and that means it's all Java. Now, very often from there, there's even further fan-out, and for that fan-out we use gRPC. So most of the time,
            • 10:00 - 10:30 when we go from a device or UI to the backend, that's all GraphQL, because GraphQL fits really well for that kind of model; we don't do any REST anymore. gRPC doesn't really work in that case, because devices usually don't work well with gRPC; with a device you are usually better off with something HTTP-based. But when we go from Java service to Java service, we often use gRPC, because that is an extremely fast mechanism.
            • 10:30 - 11:00 It's a binary protocol, so it's very efficient, and we can model the services more like "I'm calling a method on another service," which for service-to-service communication is the model you want to think in. And of course there are all sorts of different data stores. It can be an in-memory distributed data store like EVCache, or things like Kafka and Cassandra, all these different data
            • 11:00 - 11:30 stores. We have many different data stores that we use, all with their own pros and cons; depending on the use case, you might use one over the other. Now, if you look at the same picture for these more traditional studio and enterprise apps, it's actually the same graph-based architecture. It's the same thing. We really moved everything over to GraphQL, because that just works really well to get flexible schemas. So in this case, again, we
            • 11:30 - 12:00 do a GraphQL query, from someone's laptop this time. It again ends up at a federated GraphQL gateway. The gateway knows: oh, if I have to get the title for movies, again I go to the movie DGS. In this case that movie service might just be running a Postgres database, because we're running at a very different scale here. We're not talking about Netflix streaming; we're talking about these enterprise apps, and Postgres works really well for that. So we have many of these apps that are
            • 12:00 - 12:30 deployed this way. A somewhat simpler fan-out model, a little bit smaller, much less traffic, but the architecture in the end is exactly the same. Now, of course, this is only a small part of what Netflix is doing. The architecture that we just talked about for Netflix streaming is really just discovery. Discovery meaning this is the Netflix app that you as a
            • 12:30 - 13:00 consumer use to browse titles and figure out what you want to watch. As soon as you click play, other things start to happen, and those things all happen in what we call Open Connect. What we actually have is appliances, servers in server racks at internet providers all over the world, so that the actual movie bits that stream to your TV once you click play are coming from somewhere very close to
            • 13:00 - 13:30 where you live, basically. All the popular titles are on giant boxes at the internet providers, so that they can stream them without it actually costing network traffic on their side. That's cheaper for them, cheaper for us, and it's a better experience for you, because you're getting the data faster. Open Connect, and all the management software around it, is also all Java based. So there's Java there as well. And then of course we have things like encoding pipelines for the actual media
            • 13:30 - 14:00 encoding; that's also all Java based. We have all sorts of stream processing, and some of the data stores are written in Java. Of course there are other languages and other things happening as well: we have some low-level platform stuff in Go, for example, and there are some machine learning things in Python. So there are definitely other languages, but for the most part it really is all Java in the backend.
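As a concrete sketch, the small federated query described earlier (titles and artwork URLs for shows) would look roughly like this; the field names are hypothetical, since the actual Netflix schema isn't shown in the talk:

```graphql
# Hypothetical query shape. The gateway splits this single query across
# backend DGSs based on the federated schema registry: e.g. title from
# the movie catalog service, artworkUrl possibly from another service.
query {
  shows {
    title
    artworkUrl
  }
}
```

The device sends this as one HTTP request; the fan-out to the individual DGSs happens behind the gateway.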
            • 14:00 - 14:30 Okay, so we have now seen where we use Java, and next we're going to talk about how we use Java. First we're going to talk about the JDK, and then we're going to bubble up the stack. Just like many other companies, until a few years ago we were, very sadly, still on JDK 8, and that was a little bit embarrassing, because as a tech
            • 14:30 - 15:00 company we were still on Java 8. That's not a great story, and it definitely didn't feel great either. We had worked ourselves into a bit of a hole. We had an old, outdated application framework that was developed many years ago, all built in-house. We were using a lot of old libraries that we had never updated, because we didn't want to break anyone's apps, and these libraries were now incompatible with any Java version newer than JDK 8. So
            • 15:00 - 15:30 service owners couldn't easily upgrade. It was just not a great story. On the other hand, we had JDK 11 available as soon as it came out, but there just wasn't a lot of incentive for developers to do the upgrade, because there are not a lot of new language features going from 8 to 11. So most of our developers, even though it was available, were like: yeah, whatever, I'm not going to put any effort into upgrading. So now we had this big gap from 8 to 17, basically,
            • 15:30 - 16:00 and we really needed to break that cycle. So what we did was a few things. The first thing we did was patch all the unmaintained libraries for JDK compatibility. So if an old service needed to upgrade to JDK 17, even though it was on this ancient, outdated application framework with all sorts of weird old libraries, it could just upgrade to 17, because we patched the libraries. We didn't force anyone to upgrade anything; we just patched things so
            • 16:00 - 16:30 that services could at least upgrade to the new JDK. And that sounds really complicated, like: okay, now we're forking this weird open source library that no one maintains anymore, and that seems like a really bad idea. But in the end, when we really looked at it, it was a handful of libraries that we needed to patch. It wasn't that much work; it was really fine. So we just got it done, and that way we could unblock everyone. I would also recommend, if you're in the situation where you're stuck on JDK 8 because this weird
            • 16:30 - 17:00 library that you use can't upgrade: just fix it. It's really not that hard. It might look hard; it actually is not. The other thing we did, kind of unrelated to the JDK upgrade: we also wanted to get rid of this old application framework, because it was just a bad experience for all our developers. We needed something more modern. So about two to three years ago we decided to migrate all our Java services to Spring Boot, and that
            • 17:00 - 17:30 means going from one application framework to a completely different one. That was a lot of work. We built a lot of tooling to make it easier for our teams, like automated code transformations and things like that, but it was still a lot of work. Surprisingly, maybe, we got it all done. We migrated about 3,000 applications to Spring Boot. The good news is that now all services are on Spring Boot. We have maybe a handful of services left that
            • 17:30 - 18:00 are still on a legacy stack; those are basically the services that will remain for old device compatibility until those devices go away. All services are running on JDK 17 or newer, and most of our high-RPS, most important services are on JDK 21 or 23, so that we can use the new garbage collectors. Okay. So, I actually talked about this a little bit
            • 18:00 - 18:30 in the keynote yesterday as well. When we moved to JDK 17, what we saw is that the G1 garbage collector just got a lot better. On Java 8 we were using G1; that's probably the garbage collector most of you are using. On 17 we were still using G1; it just got a lot better, because there had been a lot of Java releases in which work was done on the performance of the JVM, mostly on the garbage collectors. What we saw is that we got about 20%
            • 18:30 - 19:00 less CPU time spent on garbage collection on a lot of these high-RPS services. That is a lot of performance we get basically for free, just by upgrading to the new JDK. It is really hard to get 20% more out of your machines by actually performance tuning, if you've never tried that. So just getting that for free was a big win, and that is definitely a reason by itself to upgrade.
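If you want to measure the same thing on your own services, one hedged approach is the JVM's unified logging (available since JDK 9). The flag names below come from the standard OpenJDK launcher; the jar name is a placeholder:

```shell
# Record GC pauses and heap statistics for a service running G1
# (the default collector on JDK 17). The resulting gc.log can be
# compared before and after a JDK upgrade to see the GC improvement.
java -XX:+UseG1GC -Xlog:gc*:file=gc.log -jar my-service.jar
```

Comparing pause frequency and GC CPU time in the logs before and after an upgrade gives a rough version of the 20% number quoted above.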
            • 19:00 - 19:30 Then with JDK 21 there was the introduction of generational ZGC. The ZGC garbage collector had been around for a few Java releases already, but it wasn't generational. ZGC is designed to be a low pause-time garbage collector, meaning it doesn't do any stop-the-world garbage collection events. While that sounds really great in theory,
            • 19:30 - 20:00 it wasn't generational, and a lot of our services have pretty long-lived data: they create objects at startup time, and then these objects just stay around until the service shuts down, basically. If you have large heap sizes and a garbage collector is not generational, it has to go over all that heap space every time it does a garbage collect, and that becomes really slow. So ZGC didn't work for us before it became generational. In 21 it
            • 20:00 - 20:30 did become generational, and what we saw there is that this was just a better garbage collector all around. We kind of expected it would be good for certain use cases, for certain traffic patterns, but it turned out to be a really good general-purpose garbage collector for most of our workloads. And if you look at the difference, these are some metrics about the maximum GC pauses. These are
            • 20:30 - 21:00 metrics from a cluster where we were running JDK 21 with the G1 garbage collector; that is everything up to the red box, and these green spikes are the stop-the-world garbage collection events. You see that we have stop-the-world events of about a second to a second and a half. That basically means that when a garbage collection event happens, the service just rejects traffic for more than a second. Now, more than a
            • 21:00 - 21:30 second doesn't sound that significant, but what happens as a result, because we have really aggressive timeouts on our services, on our IPC calls, is that all the IPC calls going into that service during that one second will time out, and now they have to retry, and that means there's additional load on your cluster. Now, when we switched to ZGC, which is everything after the red box, you see that the graph just drops,
            • 21:30 - 22:00 and that doesn't mean it stopped measuring; it is still measuring, but it's running ZGC, and you see there are just no pause times anymore. So that's really impressive. We went from pause times of more than a second to zero. (Audience question: before the switch, is it the regular ZGC or is it G1?) G1. So before it's G1, and then it switched to generational ZGC. But that's a good question. So as a result of these pause times being gone, if you look at the error
            • 22:00 - 22:30 rates of these services, so the same graph, the same time window, looking at the same clusters, but this is the error rates, you see that the error rates, which are the purple in this graph, also dropped. And that is exactly for the reason I just explained: these garbage collection events would previously cause timeouts, and when these garbage collection events don't happen, the timeouts also don't happen. So that just means significantly fewer errors on our
            • 22:30 - 23:00 IPC calls, and fewer errors are obviously better. It reduces all this retry behavior. It makes the services run a little more consistently: easier to operate, easier to understand where errors are coming from. As an effect, we can also run our services a little more hot, so we can run at a higher CPU load and still be okay before things start to fall over in weird ways. So we can basically just squeeze a lot more performance out of our machines, and that is of course a
            • 23:00 - 23:30 very good thing. In this case we did have to switch the garbage collector, but that is basically just a setting: I want to use ZGC instead of G1. That's all we had to do. So again, that's a lot of benefits from mostly just upgrading the JDK. Then the other thing that we got out of JDK 21 and beyond is virtual threads. And I've been super excited about virtual threads for many years now.
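Virtual threads, which the next section walks through in a DGS context, can be illustrated with a standalone plain-Java sketch (JDK 21+). This is illustrative code, not the DGS framework's internals; the method names and URL are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the per-show fan-out described in the talk: a slow per-show
// lookup run once per show, in parallel on virtual threads.
public class VirtualThreadFanOut {

    // Stand-in for the slow ~200 ms per-show lookup (database call,
    // remote service, etc.).
    static String artworkUrl(String show) {
        try { Thread.sleep(200); } catch (InterruptedException e) { }
        return "https://example.com/art/" + show; // hypothetical URL
    }

    static List<String> resolveAll(List<String> shows) {
        // One virtual thread per task (JDK 21+): the lookups overlap,
        // so five 200 ms calls take ~200 ms total instead of ~1 s.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String show : shows) {
                futures.add(pool.submit(() -> artworkUrl(show)));
            }
            List<String> urls = new ArrayList<>();
            for (Future<String> f : futures) {
                try { urls.add(f.get()); } catch (Exception e) { throw new RuntimeException(e); }
            }
            return urls;
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> urls = resolveAll(List.of("s1", "s2", "s3", "s4", "s5"));
        long ms = (System.nanoTime() - start) / 1_000_000;
        // Wall time stays close to one lookup's latency, not five.
        System.out.println(urls.size() + " urls in ~" + ms + " ms");
    }
}
```

With five simulated 200 ms lookups, total wall time stays close to 200 ms rather than one second, which is exactly the serial-versus-parallel difference explained below.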
            • 23:30 - 24:00 maybe a bit too excited, because it took a few years before we actually got there. But that is something we started experimenting with, and the first thing we did was add virtual thread support in our frameworks, that is, in our Spring Boot-based framework and also our GraphQL framework, the DGS framework. The idea was that if we automatically use virtual threads, our developers don't even have to change the way they write their code. They will benefit from virtual threads without even knowing it, and we can just
            • 24:00 - 24:30 again get better performance out of what we're doing. So this is a DGS, or GraphQL, example. You don't really have to understand how this works; it's just to illustrate the difference in behavior. Without virtual threads, if I do a query for shows that asks for an artwork URL, the first thing that happens is that we have this DGS query method that executes and gives me a list of shows. Let's say we return five
            • 24:30 - 25:00 shows, and now I have to call a second method, the one with the @DgsData annotation, which will resolve the artwork URL field. That method will be called for each of the shows returned by the first method, so if I return five shows, this method will be called five times. If that method is relatively slow, because I have to do a database lookup or call another service, and
            • 25:00 - 25:30 let's say this has to happen for every show (this is a simplified example, of course), and let's say it takes 200 milliseconds to run this method, this would actually happen serially: 200 milliseconds plus 200 milliseconds, and so on. So in this case we would have a response time of a second. Not great. With virtual threads, we switched the out-of-the-box behavior of basically the same code, and now this method just runs in parallel on
            • 25:30 - 26:00 virtual threads. The effect is that we still have this slow 200-millisecond method, but it's running five times in parallel, and given that we have enough processors, we now only have 200 milliseconds of total processing time. Now you might be asking: why do you need virtual threads for this? Couldn't you just run this method without virtual threads on a thread pool, on an executor? And the answer is, well, yes, we could, and
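The serial-versus-parallel behavior just described can be sketched in plain Java. This is not actual DGS code: the class and method names are hypothetical stand-ins, the 200 ms sleep simulates the slow resolver, and it assumes JDK 21+ for virtual threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutDemo {

    // Stand-in for a slow per-show field resolver (e.g. a DB lookup).
    static String artworkUrl(String show) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "https://img.example.com/" + show + ".png";
    }

    // Without virtual threads: the resolver runs once per show, serially,
    // so five shows cost roughly 5 x 200 ms.
    static List<String> resolveSerially(List<String> shows) {
        List<String> out = new ArrayList<>();
        for (String show : shows) {
            out.add(artworkUrl(show));
        }
        return out;
    }

    // With virtual threads: one cheap thread per show, so five shows cost
    // roughly 200 ms total (given enough carrier threads).
    static List<String> resolveOnVirtualThreads(List<String> shows) {
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String show : shows) {
                futures.add(vts.submit(() -> artworkUrl(show)));
            }
            List<String> out = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    out.add(f.get());
                } catch (InterruptedException | ExecutionException e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        List<String> shows = List.of("s1", "s2", "s3", "s4", "s5");
        long t0 = System.nanoTime();
        resolveSerially(shows);
        System.out.println("serial:   " + (System.nanoTime() - t0) / 1_000_000 + " ms");
        t0 = System.nanoTime();
        resolveOnVirtualThreads(shows);
        System.out.println("parallel: " + (System.nanoTime() - t0) / 1_000_000 + " ms");
    }
}
```

The serial version takes roughly five times 200 ms; the virtual-thread version takes roughly 200 ms, which is exactly the difference the talk describes.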
            • 26:00 - 26:30 we actually can if we write slightly different code, but we couldn't make it the default, because in many cases this method will not take 200 milliseconds. These methods often take microseconds, they're almost not measurable, and then the overhead of putting the work on a real thread would be bigger than the benefit we get. So we would only want this if we need the parallel behavior, and in many cases we don't. So we always had to make this trade-off: well, okay,
            • 26:30 - 27:00 sometimes you want this behavior, but not all the time, so it can't be a default. With virtual threads, that extra scheduling overhead is basically gone, because virtual threads are essentially free, so we can just make it the default. It's a better developer experience, because developers don't have to think about "this method should run in parallel, I should schedule it on a thread, I need to use CompletableFutures" and all that. With virtual threads, you get the right behavior by default, out of the box, because there's
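Why can this be the default? Because, as the speaker notes, virtual threads are cheap enough that scheduling even trivial work on them carries negligible overhead. A small sketch (JDK 21+; the task count is arbitrary):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class CheapThreadsDemo {

    // Runs n tiny tasks, one virtual thread each. With platform threads the
    // per-task scheduling cost would dominate work this small; with virtual
    // threads it is negligible, which is what makes parallel-by-default a
    // safe framework choice.
    static int runTasks(int n) {
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService vts = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                vts.submit(done::incrementAndGet);
            }
        } // close() waits for all submitted tasks to finish
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(10_000)); // prints 10000
    }
}
```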
            • 27:00 - 27:30 no cost to it. As I said, I've been pretty excited about virtual threads for a long time now, and my hot take is still that virtual threads combined with structured concurrency are going to completely replace reactive programming. That is kind of a big statement to make, maybe, coming from someone at Netflix, because, I don't know if you're familiar with the RxJava library, which is one of the earliest reactive
            • 27:30 - 28:00 programming libraries, and it was actually developed at Netflix for the most part. So we were knee-deep in reactive programming at Netflix. We were big believers in it, and we pushed the technology quite a bit. Everything used to be Rx at Netflix, literally every API. And then we found that it's actually kind of hard, because yes, it has a lot of benefits when it comes to concurrency and things like that, but it adds a lot of complexity to your code and also a lot of complexity to your
            • 28:00 - 28:30 debugging. And we found that in most cases that trade-off is just not a good one, so we backed out of using reactive programming for the most part, and basically no one wants to touch it anymore. That is kind of the extent of it. We're still using some of it, though. For example, if you have an HTTP client that needs to do multiple HTTP calls for a fan-out, your only option today, without virtual threads and structured concurrency, is something like WebClient,
            • 28:30 - 29:00 which is, again, reactive. But that always comes with problems, because, one, it's complicated, but also, you now have two different threading models: you have a thread-per-request model, and within that model you start getting into a reactive model, which is a completely different threading mechanism, and that gets you into all sorts of hairy situations. So now that structured concurrency is also there, hopefully in its last preview in the
            • 29:00 - 29:30 current JDK, JDK 24, it will basically get rid of that last need for reactive programming, and then we can finally live our lives happily. However, virtual threads on JDK 23 weren't exactly perfect yet. We started rolling out this functionality in the framework, so everyone was starting to use virtual threads, and then our clusters started to completely deadlock, as in, we would have instances that would just be completely
            • 29:30 - 30:00 dead. There's a blog post written by a few of my co-workers, and that blog post is really good. It goes into the details of this problem, but more importantly, they describe really well all the debugging steps, how they got to understanding this problem, and that gives a lot of insight into how you can look at thread dumps and logs and understand what's going on, because that's step one. But anyway, what we
            • 30:00 - 30:30 found is that some of the libraries we were using were using the synchronized keyword, and prior to JDK 24, if you used synchronized in a virtual thread, that virtual thread would be pinned to a platform thread. So we had some of those in libraries. Now, other libraries were using things like ReentrantLocks. Those are also locks, but not based on synchronized. And the weird scenario we got into is that we
            • 30:30 - 31:00 would have all our platform threads in use by virtual threads that were pinned because of the synchronized keyword. However, all those virtual threads were waiting for a lock. And guess who owns the lock? It's owned by another virtual thread, but that virtual thread will never ever be able to run anything, because there are no platform threads
            • 31:00 - 31:30 available. So that's the deadlock situation we ran into, where no one is able to do work, because there are no real threads to schedule work on, and the lock is owned by someone who is a virtual thread. So we had to back out slowly a little bit: okay, maybe we shouldn't push too hard on virtual threads, because this is a very uncommon but very real scenario. But the good news is that in JDK 24, which was released yesterday, we have JEP 491, and they
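The shape of that deadlock can be sketched as follows. This only illustrates the two kinds of lock involved, not the deadlock itself: the demo deliberately runs uncontended so it terminates, and the names are hypothetical (JDK 21+).

```java
import java.util.concurrent.locks.ReentrantLock;

public class PinningShape {
    static final Object MONITOR = new Object();
    static final ReentrantLock LOCK = new ReentrantLock();

    // Before JDK 24 (JEP 491), a virtual thread that blocked while inside a
    // synchronized block stayed pinned to its carrier platform thread.
    static void withMonitor(Runnable body) {
        synchronized (MONITOR) {
            body.run();
        }
    }

    // A ReentrantLock never pinned: a virtual thread blocking on it unmounts
    // and frees its carrier thread for other work.
    static void withLock(Runnable body) {
        LOCK.lock();
        try {
            body.run();
        } finally {
            LOCK.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // The failure mode described in the talk: every carrier thread is
        // occupied by a virtual thread pinned inside withMonitor(), each
        // waiting on LOCK -- while LOCK's owner is an unpinned virtual
        // thread that can never be scheduled. Here the locks are
        // uncontended, so the demo completes.
        Thread t = Thread.ofVirtual().start(() -> withMonitor(() -> withLock(() -> {})));
        t.join();
        System.out.println("done");
    }
}
```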
            • 31:30 - 32:00 basically reimplemented the way synchronized works with virtual threads, and this whole thread-pinning issue is just completely gone, because they reimplemented the whole mechanism. So it means that with JDK 24 available, we will once again start pushing on virtual threads, and the expectation is definitely that it will go a lot better. We did some early experiments with preview builds, and those looked really good. So we're pretty confident about
            • 32:00 - 32:30 this. Okay, let's move up a little bit in the stack and start talking about our application framework. All our applications are based on what we call Spring Boot Netflix, and Spring Boot Netflix is really just open-source Spring Boot, and then on top of that we have a whole bunch of modules that make Spring Boot work with our infrastructure and the ecosystem of things around it. From a developer perspective, it is just Spring Boot. It
            • 32:30 - 33:00 is the same programming model. We don't add anything to the programming model; we use all the same annotations and things like that. It is, and looks like, just plain Spring Boot, and that's a good thing, because that's what most people understand. And we try to stay really close to open-source releases as well. Of course, the upgrade to Spring Boot 3 was a bit of a bigger story; I will talk about that specifically. But any time a minor release comes out, going from 3.3 to 3.4 and so on, within a few
            • 33:00 - 33:30 days we have the whole fleet upgraded, and it mostly all rolls out automatically. The things we add in Spring Boot Netflix are things like security integration, so integration with our authentication and authorization systems. These security systems are all Netflix-specific, but the way they're exposed in Spring Boot is just through Spring Security. So it's the @Secured and @PreAuthorize annotations, things
            • 33:30 - 34:00 like that, probably the same stuff that all of you are using if you're using Spring, and we just integrate with our systems under the hood. For all our incoming and outgoing traffic in a service, we use a service mesh based on the Envoy proxy, so we integrate the framework with that. The service mesh takes care of things like service discovery and TLS and so on. Then we have a mechanism that is
            • 34:00 - 34:30 actually a programming model for gRPC clients and servers. So we have an annotation-based programming model that maybe looks a little bit like you're just writing a REST controller, but you're actually building a gRPC service. Something very similar is being worked on in open source now as well, with the Spring team; it's not exactly what we have built, but it looks pretty much the same. So this makes it easy to implement a gRPC server, or to call a gRPC server as a client. Then we have
            • 34:30 - 35:00 observability, so that's our distributed logging, tracing, metrics, and those things. Again, this mostly just works through the Spring Boot-provided APIs, so it's mostly just using Micrometer, but it then uses our in-house-built systems to actually store all the data, and that's all in-house built because basically nothing else scales to what we need. Then there are fast properties, which is dynamic configuration. Most of our
            • 35:00 - 35:30 configuration can be changed without restarting a service. That is super important when there are incidents going on, so that we can disable feature flags and things like that. And then we have our IPC clients: the gRPC client to call gRPC services, and things like WebClient. WebClient comes from Spring, of course, but we extend it with all sorts of resiliency behavior, for retries and things like that. That all works out of the
            • 35:30 - 36:00 box. Now, the way we implement these things is the same way Spring itself and all the frameworks around Spring Boot are built. If you look at, for example, Spring Data, we basically use the same mechanisms to extend Spring Boot. So it's a lot of auto-configuration to provide and override extra beans, we have default configuration properties, that's all environment post-processors, and then we try to provide test slices for any
            • 36:00 - 36:30 components we build, basically, so that your tests also run in a very fast way. Now you might be wondering: why Spring Boot? Because there seem to be a lot of other frameworks that are interesting; some frameworks seem kind of more modern. And all these frameworks actually do look really interesting. However, I don't think they are necessarily better; they're just also good. But Spring Boot has
            • 36:30 - 37:00 proven to be a really reliable framework in the long term. Actually, one of my first projects just coming out of university, which is a long time ago, I think it was 2006, was when I started using Spring. Spring Framework is still based on the same concepts, and they have done a really, really good job of iterating and making the framework better, and of using new language features from Java and so on to
            • 37:00 - 37:30 make the developer experience better. We're obviously not using XML configuration anymore, but at the same time it's based on the same concepts, and it's quite impressive how they've managed to evolve the framework. They're obviously still innovating: with every new Java release, they come out with new features leveraging those new Java features; virtual threads is one example. And a lot of developers just have Spring experience, which is a big plus when we onboard new
            • 37:30 - 38:00 folks. It's extensible, otherwise we wouldn't be able to make it work in our environment. And maybe one of the more important things is that the Spring team is just a really good partner for us. We collaborate quite a bit with them; we give them a lot of feedback about ideas they have and directions they want to take, and we work on a lot of stuff together with them, and it works really well. If you think about deployment of these Spring Boot applications, we
            • 38:00 - 38:30 basically have two different deployment options. For a developer, it almost doesn't matter which route you choose; all the tooling kind of makes it transparent. We either deploy directly on AWS instances, or on Titus, which is our container platform. It's basically Kubernetes, but we built it in-house many years ago, and it's actually starting to use more and more Kubernetes components. So, either containerized or directly on AWS instances, and then we deploy as
            • 38:30 - 39:00 exploded JAR files with embedded Tomcat. So that's the deployment model. We are not using anything native-image. We have definitely experimented with that, because of course we want faster startup time; everyone wants faster startup time. However, it doesn't quite work well enough for us yet. It is too hard to get it right. It is pretty easily breakable, and
            • 39:00 - 39:30 although, yes, with a native image you get faster startup time at deployment, it makes the whole development experience a lot worse, because now you add a lot of time to your build times. And how do you actually start your application during development? You definitely do not want to build a native image every time you run your app during development. So it's just not a great story if you look at the whole picture. We are now looking more at AOT in general
            • 39:30 - 40:00 and at what Project Leyden is doing. Again, the Spring team is also jumping on that ship; they're working with the Leyden team on this, and that is what we're betting on to improve startup time in the future. I already mentioned that we're staying away from anything reactive for the most part. We are not doing anything WebFlux. We sometimes get asked by our developers at Netflix, "hey, can you guys not support WebFlux?", and the answer has always been no, because we
            • 40:00 - 40:30 just don't want to get back into that world. And WebFlux really only works well if you can guarantee a reactive API all the way from the front to the very back, to your database connections, and that is a really hard thing to pull off, especially if you have a lot of existing libraries. So the benefits just don't outweigh the negatives there. So we completely standardized on Web MVC, and again, with virtual threads and structured concurrency, I don't see
            • 40:30 - 41:00 any reason to go back down the reactive route. Okay. So, if you're using Spring Boot, you are probably struggling with the upgrade to Spring Boot 3. Spring Boot 2 is not maintained anymore in open source, so it's definitely time to take care of this upgrade. There are two big topics in this new release. First of all, they baselined on JDK 17. That's fine; we were already on 17 anyway. I think it's a great thing: finally, the Java
            • 41:00 - 41:30 community as a whole can move forward a little bit. I'm very thankful to the Spring team for doing this. The second topic is completely uninteresting but very impactful, and that is the use of the Jakarta EE namespace instead of the javax namespace. Now, why is this important? It's important for libraries. If you're just building an application, this upgrade is trivial: you can literally just do a find-and-replace in your source code, change everything
            • 41:30 - 42:00 javax to jakarta, and you're happy. Upgrade completed. However, if you have libraries, these libraries might provide, say, a servlet filter, in this example a javax.servlet.Filter. That filter works fine on Spring Boot 2. It does not work on Spring Boot 3, because Spring Boot 3 expects jakarta.servlet. The way we mitigated this issue is by using Gradle transforms. We built a Gradle plugin that works at artifact
            • 42:00 - 42:30 resolution time. Basically, when a JAR file for a dependency is downloaded, we run a transform, which is just a standard Gradle feature, and we do a bytecode rewrite from javax to jakarta. Now, the good news is that the Jakarta APIs for this version are completely unchanged: it's literally just a package namespace change; nothing else changed. So you can safely replace everything javax with jakarta and you will be happy. And we do
            • 42:30 - 43:00 this with this bytecode rewrite. That's all open-sourced, so if you're using Gradle, you can use the whole thing as part of the Nebula ecosystem, which is the set of open-source Gradle plugins we provide from Netflix. It's based on another open-source tool that does the actual bytecode manipulation, so even if you're not using Gradle, that would be a good starting point. And this is how we can, at the moment, have libraries that are built against Spring Boot 2 but will also work with Spring
            • 43:00 - 43:30 Boot 3. And once we get rid of all the Spring Boot 2 apps, and we're almost done with that now, almost everything is on Spring Boot 3, we can start changing the libraries and eventually get rid of this transform. I'm going to skip this. I talked a lot about GraphQL already, and it's all built on top of the DGS framework, the framework that we open-sourced in 2020. That was in the early days,
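The namespace rewrite described above works precisely because the Jakarta EE 9 rename changed only package names. A heavily simplified, string-level sketch of the idea (real transforms rewrite class-file constant pools at artifact resolution time, not source strings):

```java
public class JakartaRename {

    // String-level model of the javax -> jakarta rename for the servlet API.
    // Because only the package namespace changed, references can be
    // rewritten mechanically and safely.
    static String rename(String name) {
        return name
                .replace("javax.servlet.", "jakarta.servlet.")   // source/binary names
                .replace("javax/servlet/", "jakarta/servlet/");  // internal (class-file) names
    }

    public static void main(String[] args) {
        System.out.println(rename("javax.servlet.Filter"));  // jakarta.servlet.Filter
        System.out.println(rename("javax/servlet/Filter"));  // jakarta/servlet/Filter
    }
}
```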
            • 43:30 - 44:00 when Netflix started developing on top of GraphQL. We needed a framework to make that easy. There wasn't really anything available in the Java world that looked reasonable, except for graphql-java, but that's a very low-level library, and this is all built on top of graphql-java. So that's kind of how it works: DGS provides the Spring Boot integration on top of it, and you basically get an annotation-based programming model to write your
            • 44:00 - 44:30 resolvers for GraphQL queries, and then also a testing framework that makes it really easy to run GraphQL queries against your service without actually having to start a web server and things like that. So that is a pretty important part of it. Then, a few years after we did that, the Spring team also started to realize the importance of GraphQL, and they started working on GraphQL support directly in Spring Boot. Now we were heading towards two almost
            • 44:30 - 45:00 competing frameworks in the community, and that wouldn't be a good thing. So we worked a lot with the Spring team on shaping what Spring for GraphQL should be; that's the one that comes out of the Spring team, and we integrated the frameworks completely. So if you use DGS, under the hood it is in fact using a lot of the Spring for GraphQL components, and you can use both
            • 45:00 - 45:30 programming models and all the features together, basically, in whatever way you want. Last slide that I will get into. If you ask what kind of IPC mechanism you should use, this is how I look at it. You have two good options: GraphQL or gRPC. If you think about a UI talking to a back end, you want a flexible schema, or a
            • 45:30 - 46:00 flexible API, basically, that works for all the different clients you need to deal with, and GraphQL gives you that really flexible way of querying data. And, very importantly, you have a schema; that is how you collaborate between UI developers and backend developers. And when working with GraphQL, you think in data, not in methods. If you're talking about server-to-server communication, you often want
            • 46:00 - 46:30 to think a little more along the lines of: now I'm actually just calling a method, it just happens to run on another server. That's the mental model you're in, and that is what gRPC is really good at. gRPC is extremely performant because it's a binary protocol. It still has a schema, because it's Protobuf; it's just a different type of schema than GraphQL's. But it is really good for server-to-server communication. Now, that doesn't mean you will never use GraphQL for server-to-server communication; we have some of that as
            • 46:30 - 47:00 well, and that's fine, but these are the big buckets. And what does it mean for REST? Well, I don't think you should use REST at all, really. Yes, REST is easier than GraphQL, because you can basically just do a data dump and tell your UI developer: I have a lot of data, good luck. But it is not a very good way to build these UIs. It's just not a good experience if you don't have a schema, and it's not a flexible API: you basically always get all the data that is available, which is probably way
            • 47:00 - 47:30 more data than the UI actually needs. It's just not a good model. And yes, of course, you can use something like OpenAPI to add a schema on top of REST, but that's not really the same; it's more of an after-the-fact thing. So, yeah, my opinion: don't use REST. Of course, if you just want to do something quick and dirty and easy, it's fine. I'm not saying you're a bad person if you ever use REST. You're just not as good a
            • 47:30 - 48:00 person. Okay, I'm going to skip over this. I'm out of time. Thank you very much. Do I have time for questions? Not really. I'm sorry. Okay, then find me in the hallways and ask me questions.