“Our greatest glory is not in never falling, but in rising every time we fall”, said Confucius, the great philosopher who lived around 500 BCE.
Availability and Reliability are two important quality attributes in the evolution of any software product. But by themselves, availability and reliability are insufficient to ensure the success of a system in production. The system must also be resilient.
Software resilience is the capacity of a system’s software to withstand a failure in a critical component and still recover in a predefined, acceptable manner and within an acceptable duration. Failures in software can arise from intentional activities or attacks, or from unintended faults. Either way, a resilient system should have the capacity to recover from them. Software that does not recover on its own, or that recovers only after a long delay, is not resilient.
Failures in today’s highly complex, interconnected and interdependent systems are not the exception. Rather, they should be assumed to be:
• Not predictable, and
• Not avoidable
Resilience is about embracing failures, not avoiding them.
Every software product reaches a point in its evolution when availability, reliability and resilience considerations are given precedence over functional enhancements.
The success of a Business is often measured by its Systems’ Availability in Production.
The traditional approach is to Maximize Uptime or increase Reliability.
The resilient approach, by contrast, is to Minimize Downtime or increase Resilience.
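The difference between the two approaches can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The sketch below is only an illustration: the function name and all the numbers are assumptions chosen for the example, not measurements from any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: expected uptime divided by total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumed, not measured).
baseline = availability(mtbf_hours=1000, mttr_hours=1.0)

# Traditional approach: make failures rarer (double the time between failures).
more_reliable = availability(mtbf_hours=2000, mttr_hours=1.0)

# Resilient approach: recover faster (halve the time to recover).
more_resilient = availability(mtbf_hours=1000, mttr_hours=0.5)

print(f"baseline:       {baseline:.6f}")
print(f"more reliable:  {more_reliable:.6f}")
print(f"more resilient: {more_resilient:.6f}")
```

With these numbers, halving the recovery time improves availability by exactly as much as doubling the time between failures. When failures cannot be predicted or avoided, recovery time is usually the side of the formula that remains under the team's control.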
To implement resilience, a system must incorporate controls that would help it recover from adverse situations. Such controls can be laid out into a pattern of capabilities which drive the evolution of the product.
A map of the capability patterns for a software product is shown below.
Here is a brief overview of the above capabilities:
The Core patterns include isolation, redundancy, communication paradigm and other supporting patterns. Microservices and Bulkhead are commonly used isolation patterns to divide the system into “functional units”, “failure units” or “migration units”. Redundancy is the basis for many recovery and mitigation patterns.
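As a rough illustration of the Bulkhead idea, here is a minimal sketch that caps the number of concurrent calls into one dependency, so that a slow dependency cannot consume all of the caller's capacity. The names `Bulkhead` and `BulkheadFullError` are hypothetical and chosen for this example. Real systems often isolate at a coarser level, such as separate thread pools, processes or services.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps concurrent calls into one dependency (one "failure unit")."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject at once when the compartment is full,
        # instead of queueing and tying up the caller.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead is full; call rejected")
        try:
            return func(*args, **kwargs)
        finally:
            # Always free the slot, whether the call succeeded or failed.
            self._slots.release()
```

Each dependency would get its own `Bulkhead` instance, so that exhausting one compartment leaves the others untouched.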
Detection patterns operate at two levels. At the node level they include fault handling patterns such as Circuit breaker and Timeout. At the system level they include patterns for Monitoring, Health check and Heartbeat.
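To give a feel for a node-level pattern, here is a minimal Circuit breaker sketch. It assumes a simple policy: open the circuit after a number of consecutive failures, reject calls while open, and allow a single trial call once a reset timeout has passed. The class name, parameter names and default values are assumptions made for this example. It is not thread-safe and is not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the failing dependency.
                raise CircuitOpenError("circuit open; call skipped")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound call, for example `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for any function that may fail.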
Recovery patterns include Retry, Rollback, Roll-forward and Reset. Rollback brings the system to a known (check-pointed) safe state. Roll-forward advances the system execution past the point of error. Reset brings the system to a guaranteed consistent state.
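Retry and Rollback are often used together. The sketch below takes a checkpoint of some in-memory state, runs an operation, and on failure restores the checkpoint before retrying with exponential backoff. The function name, the dictionary-based state and the default values are assumptions made for this example. A real system would checkpoint to durable storage and retry only errors known to be transient.

```python
import copy
import time


def retry_with_rollback(state: dict, operation, attempts=3,
                        base_delay=0.1, sleep=time.sleep):
    """Run operation(state); on failure roll back to the checkpoint and retry."""
    checkpoint = copy.deepcopy(state)  # the known safe state
    for attempt in range(1, attempts + 1):
        try:
            return operation(state)
        except Exception:
            # Rollback: discard any partial changes made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
            if attempt == attempts:
                raise  # out of attempts; state is back at the checkpoint
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

If every attempt fails, the last error is re-raised and the state is left at the checkpoint, so the caller never sees a half-applied change.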
Mitigation patterns include Fallback and Shedding or Sharing load. Fallback helps to execute alternative action when the original action fails. Shedding or Sharing load prevents an unacceptable throughput situation of system resource(s).
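Both mitigation patterns can be sketched in a few lines. The example below assumes that a fallback is any callable producing an acceptable substitute result, and that load is shed by refusing new requests once a fixed number are in flight. The names `with_fallback` and `LoadShedder` are hypothetical and chosen for this example.

```python
import threading


def with_fallback(primary, fallback):
    """Return primary(); if it raises, return fallback() instead."""
    try:
        return primary()
    except Exception:
        return fallback()


class LoadShedder:
    """Refuses new work once the number of in-flight requests hits a limit."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False  # shed: the caller should reject the request
            self._in_flight += 1
            return True

    def done(self) -> None:
        # Call once for every admitted request when it finishes.
        with self._lock:
            self._in_flight -= 1
```

A request handler would call `try_admit()` first and return a "busy" response when it gets `False`. That keeps the requests already admitted within an acceptable response time instead of degrading all of them.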
Treatment enables simpler, smaller, manageable components and deployments. Deployments are carried out more often.
CI/CD, Synthetic Transactions, Accelerated Life Test (ALT), Customer Workflow Testing and Behavior Driven Development (BDD) are patterns that help to make the design sustainable. These patterns enable the detection of new failure modes.
A Note on Mapping Patterns
It is not necessary that all patterns are used for every software product. Most products, depending on their business needs, use only a subset of the above patterns.
The Resilience Build Cycle (a.k.a. RBC)
The key idea behind using the software resilience patterns is to establish a feedback mechanism. A feedback loop helps to identify new (previously undetected) failure modes and thus iteratively build detection, recovery or mitigation into the architecture. The two critical capabilities required for the feedback are:
The Resilience Build Cycle, or RBC, can help identify new failure modes by collecting the results and logs from simulation runs and then using them to design new, updated or extended scenarios and probability density functions. The cycle of “prevention” and “detection” is iteratively used to build software resilience.
Where do we go from here?
The next step would be to look into the details of the RBC. Please read the next chapter: The Resilience Build Cycle.
Subsequent posts will step through each of the capabilities in detail. Code examples will be shared wherever needed.
The following have shaped (and will continue to shape) all my thoughts and work on Software Resilience.
Chapter 1: Patterns of Software Resilience
Chapter 2: Resilience Build Cycle
Chapter 3: Detection — Logging and Monitoring
…And more Chapters on the capabilities to follow.