While the title may seem like an oxymoron, it is actually true. You can reduce downtime with Chaos Engineering, a concept initially introduced by Netflix in 2011 to support its cloud adoption and now followed by major product companies like Google, Amazon and LinkedIn to reduce the downtime of their systems and provide a seamless customer experience.
What is Chaos Engineering?
Canonically, Chaos Engineering is the discipline of experimenting on a software system in order to build confidence in the system's capability to withstand turbulent and unexpected conditions, by deliberately introducing a fault in the system and observing the impact.
In a nutshell, Chaos Engineering is the science of deliberately creating an outage in a controlled environment to ensure systems and applications are resilient to failures.
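The idea of introducing a fault and observing the impact can be illustrated with a small, self-contained sketch. Everything in it is hypothetical and chosen for illustration only: `FlakyDependency` stands in for a downstream service, `fetch_with_fallback` is the system under test, and the hypothesis is that callers never see an error, even when the dependency is fully down. It is not taken from any particular chaos tool.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service with a controllable failure rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def fetch(self, key):
        # Fault injection: fail a configurable fraction of calls.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return f"value-for-{key}"


def fetch_with_fallback(dep, key, retries=2, default="cached-default"):
    """System under test: retry a few times, then serve a fallback value."""
    for _ in range(retries + 1):
        try:
            return dep.fetch(key)
        except ConnectionError:
            continue
    return default


def run_experiment(failure_rate, requests=1000, seed=42):
    """Inject faults at the given rate and count what the caller observes."""
    dep = FlakyDependency(failure_rate, seed=seed)
    errors = fallbacks = 0
    for i in range(requests):
        try:
            if fetch_with_fallback(dep, f"key-{i}") == "cached-default":
                fallbacks += 1
        except Exception:
            errors += 1  # a failure leaked through to the caller
    return {"requests": requests, "errors": errors, "fallbacks": fallbacks}


if __name__ == "__main__":
    # Hypothesis: callers see no errors, even during a total dependency outage.
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_experiment(rate))
```

In this sketch, an `errors` count above zero would disprove the hypothesis, while the `fallbacks` count shows how much of the traffic was served in degraded mode during the injected failure.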
Some of the biggest advantages that Chaos Engineering brings (coupled with other prominent techniques like DevOps and SRE):
- Identification of issues and bugs in the system (bugs which somehow went past the testing stage).
- Identifying single point(s) of failure for mission-critical systems.
- Ensuring system resiliency through effective identification and mitigation of points of failure.
- Reducing MTTD (Mean Time to Detection) by helping teams train in controlled environments by simulating failures and response plans.
- Helping organizations recognize key KPIs and monitoring requirements by identifying key components in the system’s overall architecture.
- Helping organizations recognize scalability requirements and auto-scaling parameters.
Implementing Principles of Chaos Engineering in real time (Chaos Teams and GameDays)
Overall Process
Chaos Team
A chaos team is a 4-5 member team responsible for running Chaos experiments within the organization. It is a good idea to divide your entire Ops team into multiple Chaos teams and let each team run a different set of chaos experiments. A typical Chaos team consists of:
1- The Chaos Leader – The person responsible for running the Chaos drills end to end.
2- The Chaos Engineers (2-3) – The people responsible for analyzing available data, forming hypotheses and running experiments.
3- The Scribe (1-2) – The person who takes notes on expected vs. actual behavior during the experiment, runs the post-experiment analysis and provides recommendations for changes.
Risk Metrics
The chaos risk metrics are used to classify the various components of the system and their possible impact on the business in case of an outage. The following risk metrics can be used for this purpose:
Risk Category and Meaning
Component Risk Category | Meaning | Severity
Category 1 | Complete outage of service | Sev 0
Category 2 | on application performance but not a full outage | Sev 1
Category 3 | Outage of a few modules but other functionalities working fine | Sev 2
Category 4 | No outage on modules but overall application performance impacted or auxiliary functions (like monitoring, back-ups) impacted | Sev 3
Component Risk Mapping – Decision flow
Coforge’s Point of View
With infrastructure becoming more agile and distributed, the biggest challenge any operations team faces today is ensuring that a large, globally distributed application stays up and running and does not depend on a single component, datacenter or region. This is where Chaos Engineering becomes an integral part of an organization’s CloudOps and SRE strategy. Our CloudOps strategy is built on Chaos Engineering principles along with other prominent operations principles like SRE and DevOps, and tools like AIOps and UniITOps, to provide our customers with the lowest possible MTTR (Mean Time to Resolution) and downtime for their applications. Coforge’s CloudOps team uses Chaos Engineering to derive enhancements and acts as an advisor to the customer to make their infrastructure more resilient to failures and outages.
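As a closing illustration, the risk categories described earlier (Category 1 to 4, mapped to Sev 0 to Sev 3) could be encoded as a simple lookup when planning experiments. This is a minimal sketch: the component names are hypothetical, and running experiments on the lowest-risk components first is an assumption made here to keep the early blast radius small, not a rule stated in this article.

```python
# Severity per component risk category, as listed in the risk table.
SEVERITY_BY_CATEGORY = {1: "Sev 0", 2: "Sev 1", 3: "Sev 2", 4: "Sev 3"}


def severity_for(category):
    """Return the severity label for a component risk category (1-4)."""
    return SEVERITY_BY_CATEGORY[category]


def plan_experiments(component_categories):
    """Order components lowest-risk first (assumption: limit early blast radius)."""
    ordered = sorted(
        component_categories.items(),
        key=lambda item: (-item[1], item[0]),  # Category 4 first, then by name
    )
    return [(name, severity_for(category)) for name, category in ordered]


if __name__ == "__main__":
    # Hypothetical components and their assigned risk categories.
    components = {
        "payment-gateway": 1,
        "api-cache": 2,
        "search-index": 3,
        "metrics-pipeline": 4,
    }
    for name, severity in plan_experiments(components):
        print(f"{name}: {severity}")
```

A team could start its GameDays with the components at the top of this list and move toward Sev 0 components only once the earlier experiments have built confidence.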