The Basics of Software Resilience and Security Chaos Engineering
Kelly Shortridge
The author, Kelly Shortridge, discusses the importance of software resilience and security chaos engineering in adapting to failures. She emphasizes that resilience involves gracefully handling changes and failures rather than just bouncing back to normal. The book provides insights on experimenting with systems to improve their resilience and security through proactive stress testing and collaboration.
Highlights & Annotations
Security Chaos Engineering: Sustaining Resilience in Software and Systems.
Ref. AAF7-A
Failure is never the result of one factor; there are multiple influencing factors working in concert. Acute and chronic stressors are factors, as are computer and human surprises.
Ref. 175E-B
Resilience is the ability for a system to gracefully adapt its functioning in response to changing conditions so it can continue thriving.
Ref. E1B7-C
The five ingredients of the “resilience potion” include understanding a system’s critical functionality; understanding its safety boundaries; observing interactions between its components across space and time; fostering feedback loops and a learning culture; and maintaining flexibility and openness to change.
Ref. B8EB-D
Resilience is a verb. Security, as a subset of resilience, is something a system does, not something a system has.
Ref. E3D3-E
There are many myths about resilience, four of which we covered: resilience is conflated with robustness, the ability to “bounce back” to normal after an attack; the belief that we can and should prevent failure (which is impossible); the myth that the security of each component adds up to the security of the whole system (it does not); and that creating a “security culture” fixes the “human error” problem (it never does).
Ref. A695-F
SCE embraces the idea that failure is inevitable and uses it as a learning opportunity. Rather than preventing failure, we must prioritize handling failure gracefully — which better aligns with organizational goals, too.
Ref. F210-G
If we want to protect complex systems, we can’t think in terms of components. We must infuse systems thinking in our security programs.
Ref. 65FD-H
Attackers take advantage of our incomplete mental models. They search for our “this will always be true” assumptions and hunt for loopholes and alternative explanations (much like lawyers).
Ref. 391C-I
Resilience stress testing is a cross-discipline practice of identifying the confluence of conditions in which failure is possible; financial markets, healthcare, ecology, biology, urban planning, disaster recovery, and many other disciplines recognize its value in achieving better responses to failure (versus risk-based testing). In software, we call resilience stress testing “chaos experimentation.” It involves injecting adverse conditions into a system to observe how the system responds and adapts.
Ref. 48C0-J
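As a minimal sketch (not from the book; all function names here are hypothetical), a chaos experiment can be as small as injecting an adverse condition into a dependency and observing whether the system adapts gracefully:

```python
# Hypothetical sketch of a chaos experiment: inject an adverse condition
# (a failing dependency) and observe how the system responds.
def fetch_profile(user_id, primary, cache):
    # Critical function: always return *some* profile, degrading to a cache
    # when the primary dependency is unavailable.
    try:
        return primary(user_id)
    except ConnectionError:
        return cache.get(user_id, {"name": "unknown"})

def flaky_primary(user_id):
    # Injected adverse condition: the primary store is unreachable.
    raise ConnectionError("injected fault: primary store unreachable")

cache = {42: {"name": "Ada"}}
result = fetch_profile(42, flaky_primary, cache)
assert result == {"name": "Ada"}  # the system degraded gracefully
```

The point of the experiment is the observation: if the assertion fails, we have learned something true about the system's behavior under stress, which is exactly the feedback the practice is designed to generate.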
The E&E Approach is a repeatable, standardized means to incrementally transform toward resilience. It involves two tiers of assessment: evaluation and experimentation. The evaluation tier is a readiness assessment that solidifies the first three resilience potion ingredients: understanding critical functions, mapping system flows to those functions, and identifying failure thresholds. The experimentation tier harnesses learning and flexibility: conducting chaos experiments to expose real system behavior in response to adverse conditions, which informs changes to improve system resilience.
Ref. F1BB-K
The “fail-safe” mindset is anchored to prevention and component-based thinking. The “safe-to-fail” mindset nurtures preparation and systems-based thinking. Fail-safe tries to stop failure from ever happening (impossible) while safe-to-fail proactively learns from failure for continuous improvement. The fail-safe mindset is a driver of the status quo cybersecurity industry’s lack of systems thinking, its fragmentation, and its futile obsession with prediction.
Ref. 93CC-L
Security Chaos Engineering (SCE) helps organizations migrate away from the security theater that abounds in traditional cybersecurity programs. Security theater is performative; it focuses on outputs rather than outcomes. Security theater punishes “bad apples” and stifles the organization’s capacity to learn; it is manual, inefficient, and siloed. Instead, SCE prioritizes measurable success outcomes, nurtures curiosity and adaptability, and supports a decentralized model for security programs.
Ref. 3005-M
Our systems are always “becoming,” an active process of change. What started as a simple system we could model mentally with ease will become complex as it grows and the context around it evolves.
Ref. 9042-N
We — as individuals, teams, and organizations — only possess finite effort and must prioritize how we expend it. The Effort Investment Portfolio concept captures the need to balance our “effort capital” across activities to best achieve our objectives.
Ref. B55C-O
When we allocate our Effort Investment Portfolio during design and architecture, we must consider the local context of the entire sociotechnical system and preserve possibilities for both software and humans within it to adapt and evolve over time.
Ref. FD1A-P
There are four macro failure modes for complex systems that can inform how we allocate effort when architecting and designing systems. We especially want to avoid the Danger Zone quadrant — where tight coupling and interactive complexity combine — because this is where surprising and hard-to-control failures, like cascading failures, manifest.
Ref. A750-Q
Tight coupling is sneaky and may only be revealed during an incident; systems often inadvertently become more tightly coupled as changes are made and we excise perceived “excess.” We can use chaos experiments to expose coupling proactively and refine our design accordingly.
Ref. D673-R
When we build and deliver software, we are implementing intentions described during design, and our mental models almost certainly differ between the two phases. This is also the phase where we possess many opportunities to adapt as our organization, business model, market, or any other pertinent context changes.
Ref. C480-S
There are four key opportunities to support critical functionality when building and delivering software: defining system goals and guidelines (prioritizing with the “airlock” approach); performing thoughtful code reviews; choosing “boring” technology to implement a design; and standardizing “raw materials” in software (like memory safe languages).
Ref. 953B-T
There are four opportunities for us to observe system interactions across spacetime and make them more linear when building and delivering software and systems: adopting Configuration as Code; performing fault injection during development; crafting a thoughtful test strategy (prioritizing integration tests over unit tests to avoid “test theater”); and being especially cautious about the abstractions we create.
Ref. A789-U
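As a hedged illustration of fault injection during development (the names below are hypothetical, not from the book), we can wrap a dependency so it fails a controlled number of times and verify that our handling logic actually copes:

```python
# Hypothetical sketch: a fault-injecting wrapper used during development
# to verify that retry logic genuinely handles transient errors.
def with_faults(func, fail_times):
    state = {"calls": 0}
    def wrapped(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TimeoutError("injected transient fault")
        return func(*args, **kwargs)
    return wrapped

def retry(func, attempts=3):
    for i in range(attempts):
        try:
            return func()
        except TimeoutError:
            if i == attempts - 1:
                raise  # exhausted: surface the failure

# Two injected faults, three attempts: the retry logic should succeed.
flaky = with_faults(lambda: "ok", fail_times=2)
assert retry(flaky, attempts=3) == "ok"
```

Exercising the failure path in development, rather than discovering it in production, is the linearizing effect the highlight describes.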
To foster feedback loops and learning during this phase, we can implement test automation; treat documentation as an imperative (not a nice-to-have), capturing both why and when; implement distributed tracing and logging; and refine how humans interact with our processes during this phase (keeping realistic behavioral constraints in mind).
Ref. 7679-V
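To make the tracing-and-logging point concrete, here is a minimal sketch (hypothetical logger and field names, using only the standard library) of attaching a trace ID to log records so events from one request can be correlated across components:

```python
import logging
import uuid

# Hypothetical sketch: stamp every log record with a trace ID so a single
# request's events can be followed across components.
logging.basicConfig(format="%(levelname)s trace=%(trace_id)s %(message)s")
log = logging.getLogger("checkout")

def handle_request(user_id):
    trace_id = uuid.uuid4().hex[:8]
    # The `extra` dict supplies the trace_id field the format string expects.
    log.warning("payment retry for user %s", user_id, extra={"trace_id": trace_id})
    return trace_id

trace = handle_request(42)
```

Even this minimal correlation turns scattered log lines into a feedback loop: we can reconstruct what a given interaction actually did, rather than what we imagined it did.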
Attackers can directly measure success and immediately receive feedback, giving them an asymmetric advantage. We must strive to replicate this for our goals too.
Ref. 25D8-W
Success is an active process, not a one-time achievement. We must support graceful extensibility: the capability to anticipate bottlenecks and “crunches,” learn about evolving conditions, and adapt responses to stressors and surprises as they change.
Ref. BBC5-X
We want to mimic the interactive, overlapping, and decentralized sensitivities of biological systems in our observability strategy. In particular, we want to observe system interactions across space and time. We must maintain the ability to reflect on three key questions: How well is the system adapted to its environment? What is the system adapted to? What is changing in the system’s environment?
Ref. 3849-Y
Tracking when a system is repeatedly stretching toward its limit (“thresholding”) helps us uncover the system’s boundaries of safe operation. Increasingly “laggy” recovery from disturbances in both the socio and technical parts of the system can indicate erosion of adaptive capacity.
Ref. 129F-Z
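As a hypothetical sketch of “thresholding” (the metric, values, and soft limit below are invented for illustration), we can count how often a signal stretches toward a soft limit within a window; repeated stretching hints at an eroding safety margin:

```python
# Hypothetical sketch: count how often a metric "stretches" past a soft
# limit, signaling that the system is repeatedly nearing its boundary
# of safe operation.
def stretch_events(samples, soft_limit):
    return sum(1 for s in samples if s >= soft_limit)

p99_latency_ms = [120, 380, 410, 240, 430, 450]  # illustrative window
events = stretch_events(p99_latency_ms, soft_limit=400)
if events >= 3:
    print("system repeatedly nearing its boundary of safe operation")
```

The same idea applies to the socio side: tracking how long human recovery from disturbances takes, and whether that lag is growing, surfaces the erosion of adaptive capacity the highlight describes.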
Attack observability refers to collecting information about the interaction between attackers and systems. It involves tracing attacker behavior to reveal how it looks in reality versus our mental models. Deception environments can facilitate attacker tracing, fuel a feedback loop for resilient design, and serve as an experimentation platform.
Ref. C35C-A
Incidents are like a pop quiz. To prepare for them and ensure we can respond with grace, we must practice incident response activities—and can do so through chaos experimentation.
Ref. 2E75-B
Humans often feel an impulse toward action (action bias), which can reduce effectiveness during incident response. Practicing “watchful waiting” can curtail knee-jerk reactions.
Ref. EB24-C
There is no “best practice” for all incidents. The best we can do is practice incident response activities to nurture human responders’ adaptive capabilities.
Ref. 4506-D
Repeated practice of response activities through chaos experimentation can turn incidents from stressful, scary situations into confidence-building, problem-solving scenarios.
Ref. 3A04-E
A blameless culture helps organizations stay in a learning mindset — uncovering problems early and gaining clarity around incidents — rather than play the “blame game.” It encourages people to speak up about issues without fear of being punished for doing so.
Ref. F716-F
Humans at the “sharp end,” who interact directly with the system, are often blamed for incidents by humans at the “blunt end,” who influence the system but interact indirectly (like administrators, policy prescribers, or system designers). The disconnect between the two can be summarized as the delta between “work-as-practiced” and “work-as-imagined.”
Ref. 1A4B-G
The cybersecurity industry often (unproductively) blames users for causing failures, as evidenced by the acronym PEBKAC: problem exists between keyboard and chair. A more useful heuristic is PEBRAMM: problem exists between reality and mental model. An error represents a starting point for investigation; it is a symptom that indicates we should reevaluate design, policy, incentives, constraints, or other system properties.
Ref. 457A-H
There are numerous biases that tempt us to blame human error during incidents, which hinders our capacity to constructively learn from and adapt to failure. With hindsight bias, we allow our present knowledge to taint our perception of past events (the “I knew it all along” effect). With outcome bias, we judge the quality of a decision based on its eventual outcomes. The just-world hypothesis refers to our preference for believing the world is an orderly, just, and consequential place. All of these biases warp our perception of reality.
Ref. 4DF0-I
A platform engineering approach to resilience treats security as a product with end users, as something created through a process that provides benefits to a market (with internal teams as our customers). Platform Engineering teams identify real problems, iterate on a solution, and prioritize usability to promote adoption. Resilience is a natural fit for their purview.
Ref. F1DA-J
Treating resilience — including security — as a product starts with identifying the right user problems to tackle. To accurately define user problems, we must understand their local context. We must understand how our users make tradeoffs under pressure, maintain curiosity about the workarounds they create, and respect the limitations of their brains’ computational capacity (“cognitive load”).
Ref. C477-K
Security solutions become less reliable as their dependence on human behavior increases. The Ice Cream Cone Hierarchy of Safety Solutions helps us prioritize how we design security solutions, from best to least effective. Starting from the top of the cone, we can eliminate hazards by design; substitute less hazardous methods or materials; incorporate safety devices and guards; provide warning and awareness systems; and, last and least effective, apply administrative controls (like guidelines and training).
Ref. FFC2-L
There are two possible paths we can pursue when solving user problems: the control strategy or the resilience strategy. The control strategy designs security programs based on what security humans think other humans should do; it is convenient for the Security team at the expense of others’ convenience. The resilience strategy promotes and designs security based on how humans actually behave; success is when our solutions align with the reality of work-as-done. The control strategy makes users responsible for security while the resilience approach makes those designing security programs and solutions responsible for it.
Ref. A398-M
We should gain consensus about our plans for solving resilience problems — from vision through to implementation of a specific solution — and ensure stakeholders understand the why behind our solutions. Success is solving a real problem in a way that delivers consistent value.
Ref. CD7F-N
To facilitate solution adoption, we must plan for migration and pave the road for our customers to adopt what we’ve created for them (hence the strategy of creating “paved roads”). We should never force solutions on other humans; if that is the only way to drive adoption, then it is a failure of our design, strategy, and communication.
Ref. 2FF6-O
Measuring product success is necessary for our feedback loops, but can be tricky. If we design solutions for use by engineering teams, the SPACE framework offers numerous success criteria we can measure. In general, we should be curious about the factors contributing to success and failure for our internal customers.
Ref. 3B5E-P
Early adopters of security chaos experimentation learned three key lessons: first, it’s fine to start in nonproduction environments because you can still learn a lot; second, use past incidents as inspiration for experiments and to leverage organizational memory; third, make sure to publish and evangelize your experimental findings because expanding adoption will become your hardest challenge (the technical work is comparatively easy).
Ref. A414-Q
To set chaos experiments up for success, especially the first time, we need to socialize the experiment with relevant stakeholders. Investing in the right messaging and framing at the beginning will reduce friction later.
Ref. 087B-R
The next step is designing an experimental hypothesis. Hypotheses typically take the form of: “In the event of the following X condition, we are confident that our system will respond with Y.”
Ref. 529A-S
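The X/Y hypothesis template can be encoded directly as a small harness; the sketch below is hypothetical (the names and toy “system” are invented, not from the book) but shows how a hypothesis becomes a falsifiable check rather than a hunch:

```python
# Hypothetical encoding of the hypothesis template:
# "In the event of condition X, we are confident the system responds with Y."
def run_hypothesis(name, inject_condition, observe, expected):
    inject_condition()
    observed = observe()
    verdict = "confirmed" if observed == expected else "refuted"
    return f"{name}: {verdict} (expected {expected!r}, observed {observed!r})"

# Toy system state, purely for illustration.
state = {"firewall": "open"}
report = run_hypothesis(
    name="port-scan hypothesis",
    inject_condition=lambda: state.update(firewall="scanned"),
    observe=lambda: state["firewall"],
    expected="scanned",
)
```

A refuted hypothesis is not a failed experiment; it is the experiment working, exposing the gap between our mental model and the system's real behavior.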
Documenting a precise, exact experiment design specification (“spec”) is critical. Our goal with the spec is for our organization to gain a luculent understanding of why we’re conducting this experiment, when and where we’re conducting it, what it will involve, and how it will unfold.
Ref. 2636-T
Launching an experiment is not unlike a feature release. Our preparation in socializing the experiment, designing the hypothesis, and defining the experiment specifications makes this one of the easier phases.
Ref. 9B68-U
What evidence we collect when conducting an experiment is defined by the spec; we should already know what we’re monitoring and what evidence we expect.
Ref. 2E97-V
The resilience / SCE transformation is not exclusive to a certain type of organization. Fortune 10, highly regulated organizations can pursue it. So can smaller, scrappy startups. And this speaks to something I really tried to emphasize in the book: the resilience approach isn’t about revolutionizing X, Y, Z overnight and doing them perfectly; it’s about iteratively changing how you do things towards more resilient outcomes. It’s about experimentation at basically all levels, whether experimenting with new modes of collaboration with software engineering teams, conducting resilience stress tests, or trying any of the zillion strategies I described in the book. For some organizations, it’ll be easier to iteratively migrate to memory safe languages; for others, it’ll be easier to migrate to running workloads on immutable infrastructure or implementing integration testing or so many other things that can make a difference outcome-wise.
Ref. F011-W