Best of Velocity: What, Where and When Is Risk in System Design?

    Summary

    In "Best of Velocity: What, Where and When Is Risk in System Design?", the speaker delves into the nuanced world of risk management within web operations, drawing from his expertise in risk and safety science. He highlights two predominant views on risk: as a product of unreliable system components, which stresses reductionist approaches and reliability maximization, and as a product of non-linear interactions within complex systems, emphasizing the dynamic, interdependent nature of such environments. The talk further explores risk as a path-dependent process and a control challenge, urging organizations to keep the discussion of risk alive by embracing minority opinions and understanding trade-offs. Ultimately, it encourages making underlying values explicit in risk assessments.

      Highlights

      • Unreliable human components are often blamed for system failures, highlighting risk as a function of individual component reliability. 🤷‍♂️
      • The use of redundant barriers and reducing variability are strategies to manage risk in component-centric systems. 🛡️
      • In complex systems, adding barriers can increase interactions and risks instead of reducing them. 🔄
      • Risk management should consider the historical context, or path dependency, of risks. 📜
      • Effective risk management involves keeping the discussion of risk alive and inviting minority opinions. 💬
      • Understanding risk as a control problem, involving constant optimization and trade-off decisions, is crucial. 🔀
      • Risk management in complex systems requires seeing safety and risk as products of the same processes. ⚖️
      • Organizations should make the values underlying their risk assessments explicit for effective discussions. 🎙️

      Key Takeaways

      • Risk can be viewed from two perspectives: as a product of unreliable system components or as a product of complex system interactions. 🤔
      • Managing risk in web operations requires understanding both historical paths and current control challenges. ⏳
      • Organizations should foster open discussions about risk even when everything seems okay, highlighting the importance of diverse opinions. 👥
      • Understanding the trade-offs people make at work is crucial for effective risk and safety management. 🎯
      • Make your values explicit when assessing risk in web operations to ensure a constructive risk management approach. 🌟

      Overview

      The session starts with an intriguing exploration of risk in system design, challenging attendees to rethink their understanding of risk in web operations. The speaker, coming from a background in risk and safety science, shares insights into two primary interpretations of risk: one focusing on the reliability of system components and the other on the complexities of non-linear systems.

        In discussing risk as a product of unreliable system components, the emphasis is placed on identifying high-risk elements and employing strategies like redundancies and enhanced reliability rules. However, the speaker warns that in non-linear system models, an increase in barriers can paradoxically heighten risk through more interactions, urging a shift in metaphor from mechanical to biological system approaches.
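The contrast in the paragraph above can be sketched numerically. This is only an illustration under stated assumptions, not something shown in the talk: the per-barrier failure probabilities are hypothetical, barrier failures are assumed independent (the component view), and the number of distinct component pairs is used as a crude stand-in for "possible interactions" (the complexity view).

```python
from math import prod


def p_all_barriers_fail(failure_probs):
    """Probability that every barrier fails at once.

    Assumes barrier failures are independent, which is the
    component-centric assumption the talk goes on to question.
    """
    return prod(failure_probs)


def pairwise_interactions(n_components):
    """Number of distinct pairs among n components.

    A hypothetical proxy for how many direct interactions are possible.
    """
    return n_components * (n_components - 1) // 2


# Hypothetical failure probabilities for three redundant barriers.
barriers = [0.1, 0.05, 0.2]
print("P(all barriers fail):", p_all_barriers_fail(barriers))

# Adding barriers to a system of 4 components: possible pairs grow quadratically.
for extra in range(0, 4):
    n = 4 + extra
    print(n, "components ->", pairwise_interactions(n), "possible pairwise interactions")
```

Under the independence assumption each added barrier multiplies the failure probability down, while the pair count grows quadratically; which effect dominates in a real system is exactly the question the two views disagree on.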

          The talk culminates in advocating for inclusive, transparent risk conversations within organizations. Emphasizing the importance of historical context and control dynamics in risk assessment, the speaker advises that all risk interpretations inherently carry values, advocating that these should be made explicit to facilitate constructive and meaningful risk management discussions.

            Chapters

            • 00:00 - 01:30: Introduction to Risk in Systems Design The speaker, who comes from risk and safety science, admits to knowing little about web performance and operations and asks the audience to forgive examples that may seem premature. He introduces the topic, 'what, where and when is risk in systems design,' and previews two contrasting views of risk.
            • 01:30 - 03:30: Two Extreme Views on Risk The chapter sets out two extreme perspectives on interpreting risk in web performance and operations: risk as a product of unreliable system components, and risk as a product of non-linear relations and interactions, that is, of complexity.
            • 03:30 - 07:00: Risk as a Product of Unreliable Components The chapter treats the organization as a machine whose functioning can be reduced to that of its components. Postmortems often single out the human as the most unreliable component, risk is calculated as probability times severity, and it is managed with redundant barriers and reduced variability.
            • 07:00 - 11:00: Risk as a Product of Non-Linear Interactions The chapter replaces the machine metaphor with that of a living system. Diverse actors shift between loose and tight coupling, no single actor can describe the whole system, and adding barriers can add interactions and therefore risk.
            • 11:00 - 18:00: Path Dependency and Risk as a Control Problem The chapter argues that history matters for understanding risk, using legacy banking software and the Costa Concordia 'indispensable tradition' as examples. It then presents Jens Rasmussen's model of risk as a control problem, in which pressures toward efficiency and least effort push work toward the boundary of acceptable risk.
            • 18:00 - 22:00: Managing Risk in Complex Systems The chapter describes what organizations that manage risk well tend to do: keep the discussion of risk alive even when everything looks safe, invite minority opinions, debate where the boundaries lie, monitor the gap between work as prescribed and work as performed, and study the trade-offs people make.
            • 22:00 - 26:00: Conclusion and Final Thoughts Drawing on Paul Slovic, the speaker argues that risk is a game played between actors with different values and frames of reference. He closes by urging the audience to make the values behind their accounts of risk explicit.

            Best of Velocity: What, Where and When Is Risk in System Design? Transcription

            • 00:00 - 00:30 it is an honor to be here to give this first keynote of this morning to you as john hinted at i come from the fields of risk and safety science i don't know much about web performance and web operations so please forgive me if my examples might seem a bit premature to you but i i hope that i can use them to make my points across to you this morning so my topic here is what where and when
            • 00:30 - 01:00 is risk in systems design in web operations and i have also published a connected blog post at o'reilly for you to check out if you like and i think that what time will allow me to do here this morning is to outline two pretty extreme views on how to interpret risk in web performance and web operations and the first one the first view of risk is risk as a product of unreliable
            • 01:00 - 01:30 system components and we will see where this will take us and then contrast it with risk as a product of non-linear relations and interactions in saying risk as a product of complexity basically and i've drawn some kind of a pretty ugly looking mind map for us uh to use this morning and taking us in both directions we will see some principles of these two views and we will in both cases end up with some strategies for how to manage risk given the respective perspective
            • 01:30 - 02:00 so let's first look at risk as a product of unreliable system components and see how we could play with that for the field of web operations inherent in this idea of risk is an organizational model of organizing web operations based on a machine metaphor principle basically saying that what you are to do in web operations is to mimic
            • 02:00 - 02:30 as closely as possible the workings of a machine and the machine has some specific principles for it to adhere to in order to organize efficiently and the main principle is that if you want to understand why your organization your machine is working or why it is not working what you do analytically is that you go down and in you go down and
            • 02:30 - 03:00 in to have a look at the individual components how they interact linearly and how well they follow their design rules those are some of the main principles or main targets of what to look at when understanding why the organization is working or why it is not and when it is working we say that it works reliably that's the ultimate aim of the machine of web operations if you
            • 03:00 - 03:30 like and web operations is working reliably when all the individual components stick to the rules which is pretty much a matter of designing the proper rules for it to stick to reliability then becomes something that the machine of web operations has or is it is reliable per design it is reliable and the reductionist principle follows that the functioning of the whole can be reduced to the functioning of the
            • 03:30 - 04:00 constituent components essentially that if the machine of web operations is not working there needs to be an individual component which is not working so which one is the most unreliable component of web operations which one is it i've googled my way around looked at some postmortem accounts and we can look at a few here we say amazon cloud outage
            • 04:00 - 04:30 triggered by human error right that gives a hint another one twitter outage caused by human error all right and final one oh surprise surprise amazon blames human error for christmas eve outage so from these by me highly cherry-picked examples it seems that the most unreliable component of web operations is often constructed to be you guys
            • 04:30 - 05:00 the unreliable individual human actor of the system the programmer or or whatever it might be in your case so this idea of risk as a product of unreliable system components does not only allow us to identify the most unreliable one but it also allows neatly for us to calculate risk calculate risk as a function of the
            • 05:00 - 05:30 severity of the failures that you cause by your unreliability and the unreliability so the probability and severity of the event that you cause that's that's the function we use to calculate risk and we can then illustrate risk as is done here with a risk matrix for instance where you have the two dimensions of probability and severity and then you can plot all the individual
            • 05:30 - 06:00 failure scenarios that you have identified in this matrix all you do as is done in this simple example that i found on a blog you calculate the risk of foursquare going down during a year to to one specific number multiplying probability and severity in this case the probability being three and a half percent during a year and the cost a million dollars and then it's it's an estimated risk of 35 000 that's simple as that so this
            • 06:00 - 06:30 idea of risk as a product of unreliable system components neatly allows us to to construct risk as as a function of severity and probability of failure and it also has some some underlying ideas for how to manage risk and i will show two to you here this this morning the use of redundant barriers and the idea that we can then reduce
            • 06:30 - 07:00 your unreliability or the variability of the system and let's first look at the use of redundant barriers a strategy very physically graspable to us working in the traditional safety sciences where we have worked with industries such as process control and transportation healthcare where we have clearly physical energies harmful energies that we need to keep contained from vulnerable targets like us humans
            • 07:00 - 07:30 and we do so by building multiple layers of defense we call it defenses in depth multiple barriers jim reason explained in the 1990s that also these barriers have a certain degree of unreliability to them and he illustrated that by by making holes in them and that's how the swiss cheese model of accident causation came to be that at one point these holes might line up putting the harmful
            • 07:30 - 08:00 energy in direct contact with the vulnerable target and that's the concept of of having an accident and this way of reasoning we also use when we make quantitative risk analysis you see here an example of an event tree for a fire occurring and you can see all the barriers illustrated by nodes in this event tree and you can see the amount of holes in them illustrated by the probability of them maintaining their function in the event of a fire occurring
            • 08:00 - 08:30 so very neat this idea of risk as a product of of unreliable system components in order to manage risk through the use of redundant barriers the other principle is that if we have constructed risk as um as your unreliability then what we do to manage risk is to reduce that unreliability or to reduce the variability of the system if you like we can do so by simply replacing you
            • 08:30 - 09:00 unreliable humans with some more reliable technology that we see here in in this blog post entitled outage prevention taking humans eye out of the i.t equation don't know how that would look in practice but you might have some ideas we could also control you or replace you by more reliable rules or simply ask you to try harder to be more reliable to appeal to your motivation for instance that's also a quite popular risk
            • 09:00 - 09:30 management strategy to use when constructing risk as a product of unreliable system components so those two strategies the use of redundant barriers and reducing variability are the most typical ones used when constructing risk as a product of unreliable system components so let's look at the second idea then here the idea that we can construct risk not as your unreliability but instead as a
            • 09:30 - 10:00 product of non-linear interactions and relations essentially risk as a product of system complexity to do so we need a completely different set of metaphors the machine metaphor doesn't really work here and the metaphor that we typically use is the metaphor of a living system in order to construct risk as a product of non-linear relations and interactions and the living system works from a completely different set
            • 10:00 - 10:30 of principles than does the machine here we do not work with linearly connected components here we work with a diverse set of actors who shift between being loosely and tightly coupled who shift between being highly interdependent and highly independent constantly and dynamically so for you you could probably come up with many examples of when this is the case for instance services like
            • 10:30 - 11:00 netflix foursquare quora and reddit are to a certain and varying degree dependent on the workings of of amazon for instance sometimes more sometimes less probably very much so on a christmas eve for instance so the functioning on the whole can no longer be reduced to the functioning of the constituent components we need to look at the interactions and relations between actors rather than the reliability
            • 11:00 - 11:30 of each actor this system does no longer allow for a complete description of it to be made partly because it is so constantly dynamically adapting to its environment it is constantly changing but also because no actor in a complex system can grasp the complexity of the entire system that's theoretically impossible for it to happen and also the system doesn't allow for a complete description because depending on which
            • 11:30 - 12:00 actor you ask to describe the system you will get qualitatively different descriptions back so it's a matter of whose perspective do you take and we need to understand those relations and interactions now that's where we focus and indeed it is also so that the use of more barriers might be a problem here because more barriers actually might increase the number of interactions in this system and if we see risk as
            • 12:00 - 12:30 a product of interactions then more barriers might actually increase risk so the use of the use of barriers is no longer non-problematic to use as a risk management strategy these are some of the principles of a complex system of web operations maybe as a complex system and i will spend the rest of this talk outlining a few more ideas connected to this idea of web operations as a complex
            • 12:30 - 13:00 system when discussing risk and i will specifically focus on two topics the first one being the path dependency of risk risk as a path dependent process if you like and the second risk as a control problem and we will see where that might take us let's first look at risk as a path dependent process essentially the idea that history matters in order to understand risk we need to understand the history of
            • 13:00 - 13:30 that risk that we are trying to understand and one example that i found from your world would be this one a bbc coverage written by the technology reporter leo kelion entitled why banks are likely to face more software glitches in 2013 and in this coverage leo interviews lev lesokhin a strategy chief at cast who really appeals to history in his accounts of the current software risks that the banking
            • 13:30 - 14:00 industry run look here he says software is inherently difficult and for developers who are dealing with systems which have been added to cropped and changed over the years it is a struggle to see where faults in a system are most likely to lie definitely appealing to history in order to understand current risks and also he embraces the idea of no single actor being able to grasp the complexity of the whole saying that no single person or even group of people
            • 14:00 - 14:30 can ever fully understand the structure under the key business transactions in an enterprise i think it's nicely formulated another example of the path dependency of risk not taken from your world but rather from the worlds that are typically thought of when we work with safety science and this is from the transportation industry and this is the example of a letter written by the mayor of a little italian island called giglio in august 2011 and this letter was written to a cruise
            • 14:30 - 15:00 shipping captain at a shipping company called costa and in this letter the mayor of the little italian island thanks the cruise shipping captain for an unequaled spectacle that has become an indispensable tradition indispensable tradition really appealing to history here so what is this indispensable tradition you might get it already well it's the unequaled spectacle of making close flybys
            • 15:00 - 15:30 saluting certain islands and cities and we know the scene some five months later outside this very island of giglio when costa concordia grounded right there with another captain on board i should say than the one who received the letter with a thank you for for the unequaled spectacle that has become an indispensable tradition another captain but an event where we mainly construct him now as
            • 15:30 - 16:00 a crazy sex addict idiot completely lacking any sort of seamanship the unequaled spectacle that has become an indispensable tradition tells a different story tells a different story about the path dependency of risk and we might have different notions to work with here in your world you use the term technical debt for instance and i think that might be highly appropriate to discuss the historical trajectory of certain risks
            • 16:00 - 16:30 in safety science we use terms like normalization normalization of deviance based on diane vaughan's writings normalization risk as a normalization as a process of changing what is normal and what is norm in an organization an indispensable tradition normalization of deviance we also use terms like practical drift or drift into failure that you might have heard so that's about
            • 16:30 - 17:00 the path dependency of risk the importance of considering history when discussing risk in web operations maybe and we will also look at risk as a control problem which i think could be interesting to you in in web operations where what you do is that you try to constantly experiment to optimize locally in a highly goal-constrained environment which is what this model by jens rasmussen tries to illustrate
            • 17:00 - 17:30 a goal-constrained environment where you are not to cross you see on the top the boundary of financially acceptable behavior because that's when you get bankrupt where you are not to cross the boundary of unacceptable workload because you're burned out and where you are not to cross the boundary of functionally acceptable behavior or acceptable risk because that's when you have an outage that's when you have an accident and there are a couple of important things to note here the first
            • 17:30 - 18:00 important thing is that the only way that you can get definitive feedback on where any of these boundaries is is by crossing it that's the only way you can know where the boundary of acceptable risk is is by having an accident essentially but also this model says that each of these boundaries creates a pressure away from the boundaries so the boundary of financially acceptable behavior creates a pressure towards efficiency pushing away from the boundary the boundary of
            • 18:00 - 18:30 unacceptable workload creates a pressure towards least effort and together these two forms a gradient towards the boundary of unacceptable risk so how do you push back how do you push back what's the pressure that you can apply to push back from the boundary of unacceptable risk and also how do you know that you get closer how do you get feedback that you get closer also connected to the idea of risk as a control problem in a goal constrained
            • 18:30 - 19:00 environment is the question of whether you want to do minor changes of the code very often or major changes more rarely how do you want to play that how do you want to experiment to optimize locally how do you want to do that this idea this model also embraces the idea that risk measures can be risky risk measures can be risky that's what it is to work in a goal conflicted environment and we can see one example
            • 19:00 - 19:30 from your world i showed you this topic before twitter outage caused by human error and it also says domain briefly yanked and this was the t.co domain that you guys at twitter use for link shortening and the chief research officer at f-secure afterwards tweeted that this event illustrates how short links make the web more fragile and harder to archive so was this only for me to be able to write longer
            • 19:30 - 20:00 urls in short tweets look at look at what twitter says about why introducing this t.co domain which later was constructed as a fragility in the system a risk in the system look at the third point of motivation here having a link shortener protects users from malicious sites that engage in spreading malware phishing attacks and other harmful activity essentially it's a risk management measure it's a safety measure
            • 20:00 - 20:30 the t.co domain a safety measure which later became a risk or a fragility in the system so from this control problem theory of risk we would say that risk and safety are products of the same kind of processes in goal-constrained environments like the one that you are working within risk and safety are products of the same kinds of processes so not only is risk a product of variability which was a threat
            • 20:30 - 21:00 in the idea of risk as a product of unreliable system components but safety is also a product of variability risk and safety products of the same kind of processes so based on these ideas based on the idea of risk as managing complexity where we use ideas of the path dependency of risk and risk as a control problem where do we go to manage risk what do the great
            • 21:00 - 21:30 thinkers here say and they have some ideas what they typically say is that organizations that are really good at this tend to keep the discussion about risk alive even when everything looks safe and they do so by constantly inviting minority opinion inviting doubt inviting i don't feel good about this and taking that seriously they do so by constantly debating the location of the boundaries and the distance to them
            • 21:30 - 22:00 boundary of acceptable risk for instance and remember we said that in a complex system each actor will essentially experience different systems so the debate here is a highly fruitful risk management exercise organizations that are typically good at this also constantly monitor the gap between work as prescribed and work as performed realizing that there will always be such a gap but also questioning whether the gap
            • 22:00 - 22:30 shows tendencies of normalizations of risk for instance and given that risk and safety are products of the same kind of processes organizations that are good at this they focus on understanding how people make the trade-offs that guarantee safety and not seeing people as an inherent risk in the system but as a guarantee of safety in the system this is what erik hollnagel really emphasizes one of the great thinkers in this field
            • 22:30 - 23:00 really emphasizes when he says safety management it's not about avoiding negatives such as incidents accidents and errors safety management is indeed about achieving it is about the constant trade-offs that you make so which one is it should we see risk as a product of your inherent unreliability or should we see risk as a product of
            • 23:00 - 23:30 of the complexity itself of the systems that you work in well as the annoying little academic that i am i will of course argue that we ask the wrong question from the beginning and i will use paul slovic to help me out in making that argument saying that what we should see risk as is as a game we should see risk as a game played between actors representing different values and different frames of reference so how
            • 23:30 - 24:00 do you want to play that game these two extreme views that we've drawn out now during this keynote would represent different frames of reference that you could compete for so to speak in the risk game but also frames of reference that have their implicit values baked into them any account of risk according to paul slovic has inherent
            • 24:00 - 24:30 values implicit or explicit whether we construct google going down as the end of the world we express certain values whether whether we express google glass as a great security threat or a fantastic opportunity and possibility we express certain values whether we see as the un does internet access as a basic human right we express certain values whether we construct snowden as a hero or a traitor
            • 24:30 - 25:00 we do the same we express values so my final appeal to you this morning will be make your values explicit what are the values in any of your accounts of risk what are the values baked in to these accounts what are the values guiding your perception of risk in web performance and web operations make them explicit and you will have a constructive risk
            • 25:00 - 25:30 game played amongst you i am sure that will be my my final appeal to you today i will thank you very much for having listened this morning these slides are already available at jbsafety.se you can click your way back and forth through them and when you come to a point where you say this dude is not making any sense whatsoever you continue the discussion with me on twitter or by writing me an email thank you so much for
            • 25:30 - 26:00 having listened really appreciate to be here today thank you
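
The component-centric calculation described around 05:00 to 06:00 of the transcript (risk as probability times severity, then plotted in a risk matrix) can be written out as a small sketch. The 3.5 percent yearly probability and the one million dollar cost come from the talk's Foursquare example; the matrix bin edges and labels are hypothetical and only there to show how a scenario lands in a cell.

```python
from bisect import bisect_right

# Hypothetical bin edges for a 3x3 risk matrix (not from the talk).
P_EDGES = (0.01, 0.10)           # yearly probability: low < 1%, medium < 10%, else high
S_EDGES = (100_000, 1_000_000)   # cost in dollars: low < 100k, medium < 1M, else high
LABELS = ("low", "medium", "high")


def expected_loss(probability, severity):
    """Risk as the product of probability and severity, as in the talk's example."""
    return probability * severity


def matrix_cell(probability, severity):
    """Return the (probability band, severity band) cell for one failure scenario."""
    p_band = LABELS[bisect_right(P_EDGES, probability)]
    s_band = LABELS[bisect_right(S_EDGES, severity)]
    return p_band, s_band


# The Foursquare example from the talk: 3.5% per year, one million dollars.
risk = expected_loss(0.035, 1_000_000)
print(f"estimated yearly risk: about ${risk:,.0f}")
print("matrix cell:", matrix_cell(0.035, 1_000_000))
```

The single number is easy to compute, which is part of its appeal; the rest of the talk argues that it says nothing about interactions, history, or the values behind the chosen probability and cost.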