Understanding Vulnerabilities in Memory Systems

RowHammer, RowPress & Beyond - Jean-Claude Laprie Award on Dependable Computing - Talk at DSN'24

    Summary

    Onur Mutlu examines memory vulnerabilities, focusing on the RowHammer and RowPress effects, in his award talk at DSN'24. Despite advances in technology, these vulnerabilities persist in essentially all recent commodity DRAM chips. Mutlu highlights how the predictable induction of bit flips poses both a security and a reliability threat, citing a hacker's analogy that compares it to breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after. He emphasizes that while various solutions have been proposed, the problem worsens with technology scaling, and he argues for cross-layer solutions built around more intelligent memory controllers.

      Highlights

      • Onur Mutlu explores the history and current state of the RowHammer vulnerability in DRAM chips. 🌟
      • The RowHammer effect can induce bit flips, compromising system security and reliability. 🔄
      • Various mitigation strategies exist, but none fully eradicate the problem, as revealed by recent studies by Mutlu and others. 📚
      • Industry giants like Google and Microsoft show heightened interest in addressing RowHammer, pressing for more robust solutions. 🏢
      • Future innovations may need to revolve around smarter memory controllers to combat these hardware vulnerabilities effectively. 💡

      Key Takeaways

      • Memory vulnerabilities such as RowHammer and RowPress continue to pose significant challenges in computing. 🔧
      • RowHammer allows for predictable induction of bit flips, a critical security threat identified in DRAM chips. 💥
      • Despite solutions like increased memory refresh rates, the issue hasn't been fully resolved, highlighting a need for innovative approaches. 🚀
      • Follow-up work by Google and Microsoft has drawn attention to the threat and pushed manufacturers to act, but it also reveals how hard effective solutions are to find. 🤝
      • The path forward may require more intelligent, more programmable memory controllers. 🛠️
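
To make the refresh-rate takeaway concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a 64 ms refresh window and a 50 ns row cycle time (typical DDR-class values that are not stated in the talk) and uses the figure of 4,800 activations that Mutlu quotes for newer chips. The function names are hypothetical, and the model ignores refresh overhead and multi-row access patterns.

```python
import math

# Assumed timing values, not taken from the talk.
REFRESH_WINDOW_NS = 64_000_000  # 64 ms between refreshes of the same row
ROW_CYCLE_NS = 50               # minimum time between two activations in a bank

# Figure quoted in the talk for newer chips.
HAMMER_THRESHOLD = 4_800        # activations that can already induce a bit flip


def max_activations(window_ns: int, row_cycle_ns: int) -> int:
    """Upper bound on activations an attacker can issue within one refresh window."""
    return window_ns // row_cycle_ns


def refresh_multiplier_needed(threshold: int, window_ns: int, row_cycle_ns: int) -> int:
    """Smallest integer factor by which the refresh window must shrink so that
    no more than `threshold` activations fit between two refreshes."""
    return math.ceil(max_activations(window_ns, row_cycle_ns) / threshold)


if __name__ == "__main__":
    base = max_activations(REFRESH_WINDOW_NS, ROW_CYCLE_NS)
    doubled = max_activations(REFRESH_WINDOW_NS // 2, ROW_CYCLE_NS)
    factor = refresh_multiplier_needed(HAMMER_THRESHOLD, REFRESH_WINDOW_NS, ROW_CYCLE_NS)
    print(f"activations per window at 1x refresh: {base}")
    print(f"activations per window at 2x refresh: {doubled}")
    print(f"refresh multiplier needed for threshold {HAMMER_THRESHOLD}: {factor}")
```

Under these assumptions, doubling the refresh rate still leaves room for far more activations than the quoted threshold, which is consistent with the talk's point that raising refresh rates alone does not scale.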

      Overview

      Onur Mutlu's talk at DSN'24, honored with the Jean-Claude Laprie Award on Dependable Computing, covers the persistent challenges posed by the RowHammer and RowPress vulnerabilities. He walks attendees through the history and implications of these memory issues. Years after the original paper, the problem remains unsolved, and Mutlu argues that manufacturers still rely on security by obscurity rather than disclosing how their mitigations work.

        During the talk, Mutlu recounts the long effort to address bit flips in DRAM chips and quotes a hacker's analogy: RowHammer is like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after. The vulnerability lets attackers predictably induce errors in memory they never access directly, undermining system reliability and security at a fundamental level. Various solutions have been attempted, including higher refresh rates, yet none has eliminated the vulnerability.

          Mutlu calls for rethinking solutions around more intelligent memory controllers and for greater openness from manufacturers. He points to work by Google and Microsoft, which he credits with pushing manufacturers toward better solutions. As DRAM technology keeps scaling, keeping memory systems robust remains essential to safe and dependable infrastructure.
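
As one illustration of a "more intelligent" controller, the talk describes PARA (Probabilistic Adjacent Row Activation): when the memory controller closes a row, it refreshes neighboring rows with a small probability p. The following Python sketch is a toy model of that idea, not the authors' implementation. The class name, the disturbance counters, and the choice to refresh both neighbors at once are simplifying assumptions.

```python
import random


class ParaController:
    """Toy model of a PARA-style mitigation (hypothetical names, simplified behavior)."""

    def __init__(self, p: float, num_rows: int, seed: int = 0):
        self.p = p                        # probability of refreshing neighbors on a row close
        self.num_rows = num_rows
        self.rng = random.Random(seed)    # seeded for repeatable experiments
        self.disturb = [0] * num_rows     # activations each row has suffered since its last refresh

    def neighbors(self, row: int) -> list[int]:
        """Physically adjacent rows, assuming logical and physical order match."""
        return [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]

    def activate_and_close(self, row: int) -> None:
        """One hammer: the aggressor row disturbs its neighbors, then PARA may refresh them."""
        for victim in self.neighbors(row):
            self.disturb[victim] += 1
        if self.rng.random() < self.p:
            for victim in self.neighbors(row):
                self.disturb[victim] = 0  # a refresh restores the victim's charge


def no_refresh_probability(p: float, activations: int) -> float:
    """Chance that `activations` consecutive hammers all go by without a neighbor refresh."""
    return (1.0 - p) ** activations


if __name__ == "__main__":
    ctrl = ParaController(p=0.001, num_rows=8)
    worst = 0
    for _ in range(100_000):
        ctrl.activate_and_close(3)
        worst = max(worst, ctrl.disturb[2], ctrl.disturb[4])
    print(f"worst disturbance seen by a neighbor: {worst}")
    print(f"P(no refresh in 4,800 hammers, p=0.001): {no_refresh_probability(0.001, 4_800):.4f}")
```

The design appeal is that the controller keeps no per-row state: the protection comes from the fact that a long run of activations with no neighbor refresh becomes exponentially unlikely as the run grows, and p can be raised when a stronger guarantee is wanted.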

            Chapters

            • 00:00 - 00:30: Introduction and Award Acknowledgement The chapter titled 'Introduction and Award Acknowledgement' begins with the speaker expressing gratitude. They acknowledge the flexibility of their speech in terms of time, inviting the audience to stop them if necessary. The speaker thanks the organizers for the award, expressing pleasure at being present. They emphasize the significance of receiving an award named after a pioneer in the field, whose influential papers from the 1970s and 80s are well-known.
            • 00:30 - 02:00: Paper History and RowHammer Introduction The chapter discusses the history and current state of the RowHammer problem as detailed in an influential paper. Despite initial expectations that the issue would be quickly resolved, it persists even now, confounding expectations. The chapter's context involves the speaker sharing insights and expressing surprise that the issue still exists, as it initially seemed straightforward to address.
            • 02:00 - 04:00: Technical Explanation and Studies on DRAM The chapter delves into technical explanations and studies related to DRAM. The paper's authors are mostly Mutlu's PhD, Masters, and Bachelor students at Carnegie Mellon, along with two co-authors who were at Intel at the time, Chris Wilkerson and Konrad Lai. It introduces the RowHammer problem in DRAM systems. Using pictures of infrastructure as an analogy, the speaker conveys that systems that work well when memory is correct break down when bit flips occur, stressing the importance of robustness as DRAM technology scales.
            • 04:00 - 06:00: Security Implications of RowHammer The chapter "Security Implications of RowHammer" discusses the concept of bit flips, focusing on the RowHammer vulnerability. It explains how this vulnerability allows for predictable bit flips in commodity DRAM chips, highlighting that almost all recent DRAM chips are affected. This represents a significant concern as it is an example of how a fundamental hardware failure mechanism can lead to widespread system security vulnerabilities. The chapter suggests that this issue is unprecedented but is open to correction if other similar instances are known.
            • 06:00 - 08:00: Industry Adoption and Mitigation Solutions The chapter discusses the history and process involved in addressing problems related to memory in computing systems. It highlights the importance of understanding issues that arise as technology scales, particularly in memory density. The chapter reflects on the early recognition of these challenges in a position paper that predates the RowHammer paper, emphasizing the need to scale the system and the memory together.
            • 08:00 - 10:00: Current Industry Practices and Future Directions The chapter begins with a discussion on current memory technologies, notably DRAM, which was invented in the 1960s. It describes the components of DRAM such as the capacitor, access transistor, and access circuitry that must all function reliably for memory technology to work effectively. The chapter also highlights the challenges faced as the physical size of these components is reduced to increase the capacity of DRAM chips.
            • 25:00 - 30:00: Q&A and Final Thoughts The chapter 'Q&A and Final Thoughts' discusses challenges related to memory technology, specifically in DRAM. The speaker highlights the failures and noise that arise as DRAM components shrink, and describes RowHammer as one effect of this. Using a 2019 image of a Samsung DRAM chip as an example, the speaker explains how different the logical picture of DRAM is from the physical one and how manufacturers etch capacitors in three dimensions to increase capacity.

            RowHammer, RowPress & Beyond - Jean-Claude Laprie Award on Dependable Computing - Talk at DSN'24 Transcription

            • 00:00 - 00:30 Yeah, the good news is I can take as long or as short as needed, so please stop me. It's a topic that is fascinating. And thank you for the award. It's a great pleasure to be here, especially receiving an award named after a pioneer in the field whose papers we read from the 1970s and 80s, etc.
            • 00:30 - 01:00 I'm going to give you a history of the paper, but also a little bit of what's going on right now, because I think it's both fascinating and disturbing at the same time, the fact that this problem is not solved yet. After we wrote the paper, I told my students this will be solved immediately, but right now we're in a very different regime. So this is the paper that we're talking about. Unfortunately none of the authors are here. Most of the authors are my PhD,
            • 01:00 - 01:30 Masters and Bachelor students at Carnegie Mellon; two of them were at Intel at the time, Chris Wilkerson and Konrad Lai. I will not name everyone over here. But we talk about RowHammer right now, and I use this analogy: this infrastructure, this is beautiful; if you get a bit flip, not so beautiful anymore. Another infrastructure: beautiful; if you get a bit flip, not so beautiful anymore. So essentially, another infrastructure that we're building today,
            • 01:30 - 02:00 this is Mr. Bean's infrastructure, and he's very happy, but if he gets a bit flip he's not going to be happy anymore. So essentially we're talking about bit flips, and RowHammer is the fact that you can predictably induce bit flips in commodity DRAM chips, and essentially all recent DRAM chips are fundamentally vulnerable. What's interesting to me is that this is the first example of how a simple hardware failure mechanism can create a widespread system security vulnerability. If you know of something else, please let me know; we will correct this claim. And people are writing articles that look like this; this was in
            • 02:00 - 02:30 2016. Let me give you the history of how we stumbled on this issue. We were working on memory a lot; memory is clearly an important thing in computing systems. We were working on many issues in memory, and we were looking at technology scaling issues: as memory becomes denser, what kind of problems will you see? We did some work, and this is an early position paper before we published the RowHammer paper. We always argued that these scaling issues are going to become much worse and we should really think about scaling the system together with memory
            • 02:30 - 03:00 as opposed to just scaling the memory technology alone. I'll get back to this. So main memory consists of DRAM. Essentially, DRAM is a beautiful technology invented in the 1960s by Bob Dennard. You have a capacitor, you have an access transistor, and you have essentially the access circuitry, and for this to work all of these components should work reliably. Now the problem is, as you reduce the size of the circuitry, meaning that you want to have more capacity in your DRAM chips,
            • 03:00 - 03:30 all of these things start failing, and there's a lot of noise that happens, and RowHammer is an effect of this. Just to give you an idea, this is 2019, Samsung; this is what a DRAM looks like internally. The logical picture is very different from the physical picture, and DRAM manufacturers are actually doing a lot to etch these capacitors in a three-dimensional manner so that you can actually get a lot of capacity. Okay, we were doing a lot of studies. This is a study that we did with Facebook, before RowHammer also; this was published in DSN in 2015, after some time,
            • 03:30 - 04:00 and we found out that essentially denser DRAM chips lead to a lot more errors in real data centers. We really wanted to understand these things, so we built FPGA-based infrastructures to understand these issues. You're seeing the earliest one over here; I'm going to tell you why we really built it a little bit later. This is where a lot of the RowHammer experiments from the paper were performed. More recently we've open sourced it, other people are using it, and there are newer versions, like the one called DRAM Bender, that
            • 04:00 - 04:30 are being used. So the real reason why we built this infrastructure was to understand data retention issues in memory. That was one of the scaling challenges that we believed was really important. I still believe it's important, but RowHammer is actually getting equally or more important, and these actually have a tension with each other in DRAM. So what is data retention? Basically, a DRAM cell cannot retain data for long; you need to refresh it, and this is a problem in terms of performance, reliability, and efficiency. We wanted to get rid of these refreshes without eliminating robustness. So for that you
            • 04:30 - 05:00 really need to understand how long a DRAM cell can retain data, and this is very much dependent on process manufacturing variation. If you really want to understand this, you need to build an infrastructure to test the retention time of the cells. Again, I'm not going to go into this; this is the reason we built this infrastructure. We wrote papers about it and we are still writing papers about it. This is a tough problem; it's a beautiful problem, though. At the same time, we were doing a lot of studies on flash memory SSDs, so we built this infrastructure, an FPGA-based infrastructure to test flash memory to
            • 05:00 - 05:30 understand its reliability and robustness issues, and we were seeing a lot of read disturbance there. This is a paper from DATE in 2012 where we demonstrated some read disturbance errors, and we later wrote another paper for the Intel Technology Journal, and we kept writing papers. I'm not going to talk about these, but essentially these two things enabled us to think: what happens in DRAM? Is there some other error phenomenon that could be sneaking into the DRAM chips? So several of my students were spending a summer
            • 05:30 - 06:00 at Intel in 2012, and even before that we were collaborating with Intel on these scaling issues. During the summer we really intensified our efforts on the infrastructure and started doing some of these read disturbance studies in DRAM, and essentially we found that existing DRAM chips are vulnerable. I'll give you the story a bit. Basically, what's the problem? The problem is very simple: it's read disturbance. If you know memory, there's nothing new here, in the sense that as memory scales down, cells affect each other. Now this
            • 06:00 - 06:30 should not be exposed to users; that's the key problem. Basically, in DRAM, if you activate one row to access a cell, which applies a high voltage to that row, and then you keep doing this repeatedly (activate, precharge, activate, precharge, activate, precharge), it turns out you flip bits in physically adjacent rows. Now this should not be happening, because you're just accessing that row over there and you're not accessing anything else; you're not actually writing to anything, right? Clearly
            • 06:30 - 07:00 this breaks memory isolation, physical isolation, and those rows that get bit flips, the victim rows as we call them, can belong to some other application, can belong to the operating system, can belong to some important keys, etc., important security infrastructure. This is actually what makes it a real security problem in the end. It's a robustness problem in the end, which affects security, reliability, and safety, but security folks actually ran away with the bit flips because they wanted those bit flips. If we get to it at the end, I'm going to mention some papers later on. Okay, so we tested
            • 07:00 - 07:30 essentially a lot of modules, 129 modules from three major DRAM manufacturers, and there are only three major DRAM manufacturers that occupy more than 95% of the DRAM market today. We'll name them later; we called them A, B, C in the original paper. We found out that more than 80% of the modules we tested are vulnerable to this problem, and we argued that it's a scaling problem. Basically, on the chips from 2008 or so that we tested, we could not induce errors; those were chips where the cells were larger and far away from
            • 07:30 - 08:00 each other. On all of the modules that were manufactured between 2012 and 2013, we essentially saw errors. The reason is basically that as cells become closer to each other and as cells become smaller, this noise, this electrical interference, becomes larger, and the number of activations or hammers needed to induce a bit flip reduces over time. So in older technologies you could actually induce these bit flips, but
            • 08:00 - 08:30 the cells would get refreshed before you could actually induce a bit flip. Now, in newer technologies, the number of activations needed to induce a bit flip is smaller than the number of activations you can induce within a refresh interval, so before the cells get refreshed the bit flip happens. We're going to talk about that in the future if we get to it, if we have time. So please stop me if I'm running over, because it's an exciting topic to me. Okay, I already said all of this. Why is this happening? There's electrical interference; there are very
            • 08:30 - 09:00 interesting device-level studies. This phenomenon was not actually well understood before our work; later works investigated the device-level mechanisms as well. This is fascinating to me also; people are still investigating the device-level mechanisms. Okay, so if such a bit flip happens in a real memory system, you have a huge implication on the higher levels of the software, because the data structures are immediately corrupted, right? And this could lead to a lot of issues.
            • 09:00 - 09:30 When we wrote the paper, we also released source code, which many works improved on later. This source code is very simple; it's a user-level program. What it does is basically select addresses X and Y, and hopefully they're in different banks. We did not do a good job in selecting the banks; there are a lot of later works that did a much, much better job than us. Basically it bypasses the caches in the CPU, bypasses the row buffer in DRAM, and ping-pongs activations or hammers to X and Y, and if the chip is vulnerable you get bit flips. You could still try out
            • 09:30 - 10:00 the source code. I'd recommend trying out Google's or, more recently, COMSEC's source code, Blacksmith, etc. Okay, so we showed that on real Intel and AMD systems you would get errors by running this user-level program. So that's interesting, right? Clearly we argue that this is a robustness issue. Let me give you the history a little bit going forward as well. The first people who really picked up on our work were testing people. This is PassMark Software, MemTest86; you may have used it. These folks actually included a version
            • 10:00 - 10:30 of our program in their tests, and it's Test 13, the hammer test. Immediately they started getting inquiries from people saying, oh, we're getting errors in DRAM, why is this happening? And they had to write on their website that the errors detected during this Test 13, albeit exposed only in extreme memory access cases, are most certainly real errors. And then they have some other stuff on their website that's interesting. So when we wrote the paper, the first sentence, and I still believe that this first sentence
            • 10:30 - 11:00 we should satisfy, is: memory isolation is a key property of a reliable and secure computing system, and access to one memory address should not have unintended side effects on data stored in other addresses. We said that we should maintain this, and I don't think we should go back on this one at any time. We also said that someone can take over your system by taking advantage of these bit flips. We did not demonstrate that, but we demonstrated the proof of concept of bit flips in a real system. The other people that took on our work initially was Google. Google Project Zero basically read
            • 11:00 - 11:30 our paper and replicated the results, and they were able to devise an attack that would take over the Linux kernel using these bit flips. I'm not going to go through this, which is very, very interesting; it's a beautiful Black Hat presentation that they have. This is actually copy and paste from their blog post. They are able to induce bit flips in the page table entry of this user-level program such that they would get access to their own page table, meaning write access to your own
            • 11:30 - 12:00 page table. Once you have write access to your own page table, all bets are off at that point, right? So this became famous as the RowHammer vulnerability, especially after this Google paper, and people started writing things like this. This is a famous hacker saying RowHammer is like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after. I like this one; hackers are actually quite insightful in explaining things. I think there's a whole line of security work that, in my opinion, was waiting
            • 12:00 - 12:30 for these bit flips, meaning they wanted to have these bit flips with real software so that they could show what you can do with them. Clearly, theoretically it was very well known, I think in the 1990s, that if you have a bit flip this is very bad for cryptography, right? You could actually lose all your keys, etc. But RowHammer enabled people to actually test this in real life, let's say. And again, I won't go through these papers; there is a fascinating set of papers in the security community, including DSN, that talk about these
            • 12:30 - 13:00 attacks. I'll just mention them. I will mention this one; this is interesting because RowHammer is an integrity problem, but even if you may not be able to take over a system by compromising integrity, you could access data that you're not supposed to access. So these folks turned it into a confidentiality problem as well, which is very interesting. So RowHammer as a side channel was shown by this paper. More recently, people have shown that you could corrupt neural networks, etc., and Google and Microsoft have been
            • 13:00 - 13:30 doing work on this also; if we have time, I'll get to that as well. And people have been drawing pictures like this; this may be a solution or an attack. Okay, we've been writing a lot of papers; I will not talk about some of these. So what did our paper do? I think there are two major things: understanding and solving the problem. Our paper did both, and I think it's really necessary to do more of this today. That's why we built this infrastructure. We tested a lot of modules, and we showed that by understanding the problem you can devise
            • 13:30 - 14:00 better solutions. Again, I will not go through these; these slides are meant to be reconfigurable, and I'd be happy to share them, etc. By reducing the accesses to DRAM you can reduce the problem: not a good idea, bad performance. By refreshing DRAM more frequently you can get rid of the errors: not a good idea, because we wanted to get rid of refreshes, remember? That's why we built this infrastructure, actually. And in fact our paper showed that if you want to get rid of every single error in every module that we tested, you have to increase the refresh rate by 7x. We're going to get back to the solution. Data
            • 14:00 - 14:30 pattern has a huge effect on the errors that you see, because it affects the charge coupling and the noise, and there are many, many other observations I'm not going to talk about, but I'd be happy to talk about separately. More recently, a lot of people have improved on this. We're still investigating RowHammer; it's getting much worse, it has many dimensions, it has a lot of interesting interactions. A paper was presented yesterday by Ataberk on HBM chips. HBM is actually critical infrastructure today, and RowHammer is as bad, if not worse, in HBM
            • 14:30 - 15:00 chips today. Okay, let's talk about solutions, because solutions are necessary also, right? With this sort of issue discovered in the field, you have to have solutions immediately. Now unfortunately we didn't have a lot of leeway in real systems, because memory controllers were not programmable; that's what we're going to argue soon. Longer-term solutions could be much better. Our paper proposed seven different solutions. One of them I'm going to mention, and then I'll give you a history of what happened to that solution. So there are
            • 15:00 - 15:30 actually many solution approaches, a lot of which we discussed in our paper. Unfortunately there's no easy way to solve this problem today, for the reasons that I will discuss, and all of the solutions are a trade-off between cost, power, performance, complexity, and hopefully not robustness; we don't want to give up robustness in the end. You can combine different solutions in different ways also, and different solutions have upsides and downsides. So what did people do in industry? Essentially, industry increased the refresh rates, because that was the only lever that you could have in a memory controller. And this is Apple. I like Apple
            • 15:30 - 16:00 because they actually referenced our paper when they released their CVE, and they basically said that this issue was mitigated by increasing memory refresh rates. They didn't say how much, probably 2x, and that's not enough to get rid of all of the errors, by the way, that it closed the vulnerability, but we're going to talk about that. So our solution was probabilistic; we called it PARA, Probabilistic Adjacent Row Activation. With very low probability, when a memory controller closes a row, it activates one or more adjacent rows, and this actually gives you a very good reliability guarantee. It's
            • 16:00 - 16:30 and if you're paranoid you can adjust the value of p. We analyzed the solution; it's low cost, it's still the lowest-cost solution. We said that you could implement it in the DRAM chip or the memory controller; it's done today in both. It was actually adopted by Intel very early on. This is a BIOS screenshot where you can see that you could pick your RowHammer solution: you could pick hardware RowHammer protection or 2x refresh. If you pick hardware RowHammer protection, you have a choice of
            • 16:30 - 17:00 probability. How do you reason about that? Good luck. You probably don't want to activate on every other activation, right? Okay, that was not exactly how we imagined it, but it was a step in the right direction in my opinion. Okay, so there are more RowHammer solutions proposed. I think this is an example of an intelligent memory controller, slightly more intelligent, that's trying to fix the problems. We always argued that we have memory controllers in flash, and we have a lot of intelligence over there. Our argument was that in DRAM we should have more intelligent memory controllers also, to
            • 17:00 - 17:30 fix these issues, because these issues are going to happen in the field more and more. Okay, now let me give you the history of what happened a little bit, because I think it's really important to see where we are today. Intel was developing solutions, I think AMD was developing solutions, but at some point the manufacturers said, we solved the problem, and everybody believed that. So we questioned this. I'm going to tell you very quickly what we've done. We analyzed RowHammer six years
            • 17:30 - 18:00 later, the ISCA paper six years later, and we tested even more chips, many more generations, 1500 chips, and we showed that this is actually a very difficult problem to solve, meaning newer chips are more vulnerable because cells are closer to each other. You get bit flips after only 4,800 hammers, and existing mitigation mechanisms are not effective, basically. So we built a lot more infrastructure, and this picture shows that from left to right you're marching towards a smaller number of activations that could induce a bit flip. This is the latest generation of chips that
            • 18:00 - 18:30 we tested, and mitigation mechanisms actually become much worse in terms of performance and energy overheads, as you can see over here. Okay, so basically RowHammer is a technology scaling problem, and finding a good solution to RowHammer is difficult and will become more so. I don't think this problem is easy to solve, actually. You can see interesting numbers of how many activations you need to induce a bit flip; we're marching towards almost one, and at that point DRAM may not be scaling, and maybe we should really do
            • 18:30 - 19:00 something else. Okay, so we also wanted to understand what kind of solutions manufacturers put in, because they were relying on security by obscurity, meaning: oh, we solved the problem, trust us. We basically said, can we reverse engineer what's going on in existing DRAM chips and bypass these mitigations? And that's essentially what this work did. Basically the solution was like this: if you want to induce a bit flip, you just figure out how to induce the bit flip; that's the idea. So internally the
            • 19:00 - 19:30 manufacturers introduced some tables, and we reverse engineered them, and we devised access patterns to overflow the internal tables that would count the activations to different rows, and we were able to induce bit flips in real DRAM chips by using attacks that look like this. So instead of having two rows that you hammer, you hammer four rows; in some chips you need to hammer seven rows, in some chips you need nine rows, but you would overflow the internal protection mechanisms and then you would induce bit flips. And we kind of declared
            • 19:30 - 20:00 victory, showing that you could induce these bit flips on real phones, etc. I would still argue that this is true: basically, we don't want security by obscurity. Going forward, unfortunately, I'm going to show you the solutions that are adopted today, where we're at the same point: manufacturers are still relying on security by obscurity today. Okay, so these works actually caused a lot of churn in industry, and JEDEC finally started taking the problem much more
            • 20:00 - 20:30 seriously, and they published these two papers, and more recently the DDR5 standard was updated; I'm going to mention what that solution is. But basically, if you have an infrastructure like this, you can completely reverse engineer what's going on in DRAM. I'm not going to talk about this; you could actually do much worse. I will mention the industry effort also, because this is really interesting. Google, Microsoft, and some others in industry took a lot of interest in this work. Google actually showed in 2021 that you could do RowHammer far away, meaning
            • 20:30 - 21:00 if you have a far-away aggressor row, you could induce bit flips in a victim that's far away, which is interesting. I'm not going to talk about the implications of it. Microsoft also wrote a paper saying that you could actually do this in real workloads, commodity workloads. Okay, I'm almost getting to the end. So this is interesting, I think, and these folks were perhaps the major drivers in pushing manufacturers to do something better. We'll see if they're
            • 21:00 - 21:30 doing better. So we always argued that you need better memory controllers. There is a lot of work in this area, and I still believe that we need better memory controllers, better programmability in the memory controllers. Okay, let me talk about what industry is doing right now. So this is 2023; this is the first paper from industry, from SK Hynix, which is a major DRAM manufacturer, the second biggest one. This is the first paper that acknowledges RowHammer; it uses the word RowHammer. And they basically introduce an intelligent memory controller in DRAM. I'm
            • 21:30 - 22:00 not suggesting this is the best solution. What they do is, internally inside the DRAM chip, every single row has an activation counter, and these activation counters are incremented when a row gets activated, and they take some mitigating action if the activation counter of a row is greater than some threshold. They don't disclose how exactly this works; they claim they reduce the RowHammer vulnerability, so it's more security by obscurity again, in my opinion. Samsung actually, after seeing the SK Hynix paper, wrote a paper too, but this is a mess, okay, and there's a paper coming up in
            • 22:00 - 22:30 thec that shows that it's kind of a mess. Okay, Microsoft actually proposed a solution like this earlier than SK Hynix, and this solution is actually adopted by JEDEC in the recent DDR5 standard; this is April 2024. They call it per-row activation counters. We have an analysis of this coming up next week, showing that this is actually vulnerable to denial of service. It may actually fix bit flips, depending on the conditions, but it may not be very good for performance. Okay, so basically I think
            • 22:30 - 23:00 this is a good question to ask: are we now RowHammer-free in 2024 and beyond? I don't believe so, and I think there's a lot more that's going to come up soon. Since we don't have a lot of time: we showed last year that you could actually induce bit flips by keeping a single row active for a long time. This is called RowPress, and it turns out this is very different from RowHammer. Basically, this picture shows the example: with RowHammer you keep activating a row many times; with RowPress you keep the row active for a longer time, and you get bit
            • 23:00 - 23:30 flips at much smaller hammer counts, less than 1K today. Okay, again I'll skip some of these over here, but you can see that we're marching towards very small hammer counts, where benign applications are actually almost inducing these bit flips. I believe there's more to come, so I think we need to do more in this area: we need to understand more and we need to solve the problem more. And there are a lot of works going on right now that I will kind of flash at you over here. These are some of our works, but there are other works also. These are actually a collection of the
            • 23:30 - 24:00 works, a small collection of the works that I've been looking at this year in different conferences, as you can see. And there's actually a DRAM security workshop that has a bunch of RowHammer papers; I just looked at it last night. If you're interested, you can take a look at it. So people have been actually doing these attacks on RISC-V, and Google has a paper actually showing that Samsung's solution is not good enough, which is really interesting, as you can see. Okay, so what about the future? I think the future is much more bleak. Basically, we're marching
            • 24:00 - 24:30 towards a much worse regime as technology scales, and we need to think about our systems in a better way. Okay, let me go back to the bridges; I'm going to end. So people have been building bridges for thousands of years, right, and we still have bit flips in bridges. Some of you may remember some of these things. This is close to my heart: I used to be at CMU, and this is a bridge I used to go over, and then it collapsed. So we haven't figured out how to build bridges, but I think we have a better shot at computing infrastructure, hopefully, by having
            • 24:30 - 25:00 intelligent memory controllers. I think this is the place I'll stop. I'd be happy to talk more, but I'd be happy to take questions also. Thank you for a very interesting, inspiring talk. So I have two questions. One: you mentioned that memory
            • 25:00 - 25:30 isolation is a very important property and you don't want to go back on it, but it looks like the more we try to preserve the property the more C wor the C. So I wonder if we should actually rethink the abstraction and start building software-level solutions that can deal with these kinds of memory errors. And you know, in this community there's a lot of work that tries to build other kinds of solutions than just pure hardware ones to deal with these. So what are your thoughts about that? And secondly, have you thought about
            • 25:30 - 26:00 extending these to CPU kinds of faults, especially the SDC-causing errors, Google... Yeah, yeah, thank you carik, both are great questions. I think the first thing is, absolutely, we should really preserve the memory isolation abstraction at any level that we can actually preserve it in a nice way. It doesn't have to be done completely in hardware; in fact, it's going to be very, very difficult to do completely at the hardware level. We need to think about different ways of potentially writing the software. The issue is, of course, it's not easy, right, for the
            • 26:00 - 26:30 software to handle that, and the overheads may be very large, but that's the system-level solution space that we really need to explore, especially if this problem doesn't get solved. The second is, yes, I think SDCs are another example of hardware errors, bit flips. In fact, in one of my slides I generalized to hardware bit flips, and we should really take them seriously. Also, the solutions over there are going to be, in my opinion, different, because I think solutions in memory are different, but in the end some principles are similar. If you
            • 26:30 - 27:00 for example have more flexible hardware that would test itself over time and figure out these issues while it's in the field, as opposed to putting a huge burden on the manufacturing subsystem, where we actually spend a lot of money, etc., that'll be very useful. If you can reconfigure the hardware to fix some of these issues online, I think that can be very useful. So there are some solution approaches that are similar with SDCs and Row
            • 27:00 - 27:30 Hammer. Very entertaining talk. There's a shocking takeaway from all this, which is that the manufacturers were violating the contract, and when it was pointed out, they ignored it. In politics, that's basically called
            • 27:30 - 28:00 sh over, where we now accept that there cont... How can we get back? Now there is: you can just ignore this, people will live with it, and we cannot possibly build secure systems. Yeah, I mean, you have a very good point. I was at your talk, I enjoyed it. I agree with you that they should not have ignored it; they should have taken it more seriously. Can we get back to it,
            • 28:00 - 28:30 meaning, right now you have these bit flips, and can we somehow really get back to a place where we can say, okay, you're responsible for fixing it, fix it? I think it's tough. It's tough once that thing is broken. That's why, in my opinion, we really need to... we haven't figured out the partitioning of responsibilities between memory and the memory controller. Right now memory is manufactured by
            • 28:30 - 29:00 some hardware vendors, memory controllers are manufactured by some other hardware vendors, and the interface is broken, in my opinion, and as a result a lot of the solutions are hampered. So if some of the responsibility was actually given to the memory controller inside the DRAM, and they could have some freedom to do things, in my opinion we could get back to that. So I think that interface is broken. Yes?
            • 29:00 - 29:30 That's a great question also. So if the numbers are less than 1,000 activations per refresh interval, there are a bunch of applications, actually many, many applications, that do that. If the numbers are higher than that, in fact there are some applications, and the Microsoft paper that I mentioned showed that. So for example, if you are able to induce bit flips with 300 activations, many applications do that
            • 29:30 - 30:00 actually today, so it's a very general-purpose problem, let's say, in that sense. But when we were doing these studies, we actually had a lot of interactions with industry as well, and there were some benign applications that induced bit flips even at that time, even when you actually needed to do 100,000 hammers to induce a bit flip. So what is that sort of application? It doesn't do caching, meaning it does a lot of non-cacheable memory accesses, and it hammers on a row. You can imagine that
            • 30:00 - 30:30 if you're trying to get a lock in a row and there are many threads banging on that lock, you could actually be doing that sort of hammering, and one of the applications was like that. Thank you. Okay, so we don't have much more time, this is your... Thank you very much. Thank you.
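
The talk describes two in-DRAM tracking schemes: a small internal table of activation counts, which many-sided access patterns can overflow, and one activation counter per row with a mitigating action past some threshold. The following Python sketch is a toy model of that contrast, not any vendor's mechanism, since the talk notes that the real designs are undisclosed. The class names, the threshold of 4, the table capacity of 2, and the first-in-first-out eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict

THRESHOLD = 4  # hypothetical mitigation threshold; real values are undisclosed


class SmallTableTracker:
    """Toy tracker that can only count `capacity` rows at a time (assumed design)."""

    def __init__(self, capacity, threshold=THRESHOLD):
        self.capacity = capacity
        self.threshold = threshold
        self.table = OrderedDict()  # row -> activation count, in insertion order
        self.mitigations = 0

    def activate(self, row):
        if row not in self.table:
            if len(self.table) >= self.capacity:
                # Assumed FIFO eviction: the oldest entry and its count are lost.
                self.table.popitem(last=False)
            self.table[row] = 0
        self.table[row] += 1
        if self.table[row] >= self.threshold:
            self.mitigations += 1  # stands in for "refresh the neighbouring rows"
            self.table[row] = 0


class PerRowCounters:
    """Toy model of one activation counter for every row."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.counts = {}
        self.mitigations = 0

    def activate(self, row):
        self.counts[row] = self.counts.get(row, 0) + 1
        if self.counts[row] >= self.threshold:
            self.mitigations += 1
            self.counts[row] = 0


def run(tracker, n_aggressors, total):
    """Activate `n_aggressors` rows round-robin, `total` activations overall."""
    for i in range(total):
        tracker.activate(i % n_aggressors)
    return tracker.mitigations


if __name__ == "__main__":
    # Two aggressor rows fit in the table, so the tracker reacts.
    print(run(SmallTableTracker(capacity=2), n_aggressors=2, total=40))  # 10
    # Four aggressor rows keep evicting each other: no count ever reaches 4.
    print(run(SmallTableTracker(capacity=2), n_aggressors=4, total=40))  # 0
    # Per-row counters cannot be overflowed this way.
    print(run(PerRowCounters(), n_aggressors=4, total=40))  # 8
```

In this model the four-row pattern triggers no mitigation at all in the small table, which mirrors the table-overflow idea from the talk. The per-row design still reacts, though the sketch says nothing about the performance or denial-of-service concerns the speaker raises for it.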