An application's resiliency is its ability to withstand failures, disruptions, or unexpected events and to recover from them quickly, within a defined timeframe and through a predictable, defined process, without compromising the quality or reliability of its transactions.
These failures include infrastructure faults, failures of dependent services, misconfigurations, transient network issues, and load spikes. Resiliency is a critical component of an organization's business resiliency strategy and is essential for meeting digital sovereignty requirements.
Making a system resilient involves:
- Partnering with the customer with thought leadership
- Involving experts from diverse areas of application, observability, and SRE
- Utilizing deep technology understanding, accelerators, and trained teams with a process-oriented approach
Summary
What: Resilient applications exhibit the following:
- High availability: the percentage of time the application is available for use. Depending on the application's criticality, its availability should meet or exceed the target set for it.
- Disaster recovery or a continuity of operations plan is in place. This includes failover, backup, restore, and recovery mechanisms for infrastructure, applications, storage, and third-party systems.
- Reliable operations whenever the application is available
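As a minimal sketch of the availability measure described above, availability can be computed as the share of an observation period without downtime and then compared with the agreed target. The function names, the 30-day period, the 43 minutes of downtime, and the 99.9% target are illustrative assumptions, not values from any specific system.

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Return the share of the period in which the application was usable, in percent."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period length")
    # Dividing two timedeltas yields a plain float ratio.
    return 100.0 * (1 - downtime / period)


def meets_target(period: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """True when the measured availability is at or above the agreed target."""
    return availability_percent(period, downtime) >= target_percent


# Hypothetical example: a 30-day month with 43 minutes of downtime, 99.9% target.
month = timedelta(days=30)
downtime = timedelta(minutes=43)
print(f"{availability_percent(month, downtime):.3f}%", meets_target(month, downtime, 99.9))
```

In practice the downtime figure would come from the observability stack rather than being entered by hand.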
Criteria: An application is resilient if it exhibits the following characteristics:
- It has performed according to the agreed metrics for SLIs, SLOs, and MTTx, in alignment with SLAs covering both internal and external systems.
- It has defined processes and stakeholders, a checklist, and a recovery guide or runbook, backed by periodic automated mock drills that verify their effectiveness.
- It has an observability dashboard that provides a 360-degree view of the system, covering functional, infrastructure, and service (internal and third-party) components, with a real-time mechanism to measure the agreed metrics.
  (Figure: example monitoring dashboard for an imaginary shopping cart, capturing key parameters such as availability, performance, infrastructure heatmap, resource utilization, and key metrics.)
- It has an alerting mechanism that fires whenever an agreed metric or threshold is breached, without creating unnecessary noise.
- It has a runbook and troubleshooting guide for recurring issues.
- The cost of maintaining a resilient system has been agreed with the sponsors.
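The MTTx metrics named in the criteria above could be derived from incident records along the following lines. This is a sketch under assumptions: the `Incident` fields, the sample timestamps, the 60-minute target, and the choice to measure recovery time from the start of the fault (some teams measure it from detection) are all illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when monitoring or a user flagged it
    resolved: datetime  # when service was restored


def _mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: fault start to detection."""
    return _mean_minutes([i.detected - i.started for i in incidents])


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to recover: fault start to restoration (an assumed convention)."""
    return _mean_minutes([i.resolved - i.started for i in incidents])


# Hypothetical incident history, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 40)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 10), datetime(2024, 1, 9, 14, 50)),
]
MTTR_TARGET_MINUTES = 60  # assumed target, not taken from any real SLA

print(f"MTTD: {mttd_minutes(incidents):.1f} min")
print(f"MTTR: {mttr_minutes(incidents):.1f} min, "
      f"within target: {mttr_minutes(incidents) <= MTTR_TARGET_MINUTES}")
```

Whichever convention is chosen for the start and end points, it should be agreed alongside the metric itself so that the dashboard and the SLA measure the same thing.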
Resiliency Journey
To make an application resilient:
- Assess the gaps in the current implementation
- Recommend improvements and get them acknowledged
- Prioritize, plan, and implement (take cost into consideration)
- Continuous improvement
Assessment
Determining whether a system is resilient includes the following:
- Assess the SRE implementation against:
  - People – Autonomy and Accountability, Collaboration and Culture
  - Process – Incident Management, Change Management, Capacity Planning
  - Tools – Chaos Engineering, Automation, Toil Elimination
- Assess whether the current observability dashboard has real-time monitoring for:
  - Functional, technology, infrastructure, and external system components
  - Key metrics covering MTTx, SLA, SLO, and SLI
- Verify that an automated alerting mechanism is set up to log issues whenever a threshold is breached or a component is unavailable
- Analyze the runbook and troubleshooting guide to validate that they cover all recurring critical issues and their resolutions
- For external integrations, there should be a single point of contact defined (email, phone, group, or individual) to whom the support team can reach out to troubleshoot
- AWS Resilience Hub: Use AWS Resilience Hub, a service that helps assess, manage, and improve the resilience of applications running on AWS. It lets users define resilience goals, evaluate applications against them, and get recommendations for improvement
- AWS Well-Architected Framework: Check whether the system is aligned with the six pillars of the AWS Well-Architected Framework (operational excellence, security, reliability, performance efficiency, cost optimization, and sustainability) to understand the pros and cons of decisions made while building workloads on AWS. The framework includes best practices for building resilient applications
- Analyze application architecture against decoupling, fault tolerance, circuit breaker, timeout, retries, and decentralized communication
- AWS Trusted Advisor: Evaluate the AWS environment using the best-practice checks and recommendations available from Trusted Advisor
- Build a monitoring heatmap to assess whether all services are instrumented, health checks are performed, logs are routed to the observability tool stack, and alerting is in place. This includes both internal and third-party systems
- Identify single points of failure
- Analyze the end-of-life status of services and tools
- Analyze reported issues over a recent window (for example, the last 60 or 90 days) to identify critical recurring issues and noise
- Analyze observability tool logs over a recent window (for example, the last 60 or 90 days) for unreported issues, again to identify critical issues and noise
- Validate that the toolset is in place for automated continuous deployment, testing (functional, regression, and performance), regulatory compliance, DORA metrics, and security vulnerability assessment and recommendations
- Verify that disaster recovery sites are defined and operational, and that they are exercised through periodic mock drills
- Verify that backup, recovery, and restore mechanisms are defined and operational, and that they are exercised through periodic mock drills
- Analyze the scalability configuration and resource utilization of AWS services, including storage
- Analyze whether any of the AWS services are already breaching their performance parameters or running close to thresholds
- AI-driven cloud resilience: Use artificial intelligence capabilities (machine learning, deep learning, and natural language processing) to proactively identify and mitigate potential disruptions within a cloud infrastructure. Predictive analytics and automated responses based on historical data patterns help the infrastructure withstand failures and recover quickly from unexpected events, with the aim of improving uptime, cost, and security. Key aspects of AI-driven cloud resilience:
  - Predictive analytics: AI algorithms analyze large volumes of cloud performance data to predict potential issues such as resource bottlenecks, network congestion, or application crashes before they occur, allowing for preventative measures
  - Anomaly detection: AI can identify unusual patterns in system behavior that might indicate a developing problem, enabling early intervention and preventing escalation
  - Real-time monitoring: AI can continuously monitor cloud environments in real time, providing immediate insight into system health and potential threats
  - Automated response: By integrating AI with cloud management platforms, systems can automatically trigger corrective actions such as auto-scaling resources, rerouting traffic, or initiating failover procedures when anomalies are detected
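The architecture check in the list above looks for patterns such as circuit breakers and retries. As a rough illustration of what an assessor might expect to find around calls to a dependent service, the sketch below pairs a simple circuit breaker with retries using exponential backoff. All class names, function names, thresholds, and delays are hypothetical. A production system would normally use a maintained resilience library and also apply a timeout to each underlying call.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("calls to the dependency are suspended")
            # Reset timeout elapsed: half-open state, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry a failing call with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

A caller would typically combine the two, for example `call_with_retries(lambda: breaker.call(fetch_inventory))`, where `breaker` is a `CircuitBreaker` instance and `fetch_inventory` stands in for any hypothetical call to a dependent service. The retry helper deliberately does not retry once the circuit is open, so a failing dependency is given time to recover instead of being hit with more traffic.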
Recommendation
Build an action report from the assessment findings to meet the defined project goals. It should include prioritization, quantitative and qualitative benefits, and the cost involved. The recommendations can be captured in the following format.
| Reference Ticket Number | Area of Improvement | Issue Detail | Recommendation | Criticality | Cost of Implementation | Owner |
| --- | --- | --- | --- | --- | --- | --- |
| ServiceNow #1 | Disaster Recovery | ..... | ..... | ..... | ..... | ..... |
| ServiceNow #2 | Monitoring | ..... | ..... | ..... | ..... | ..... |
| ServiceNow #3 | Alerting | ..... | ..... | ..... | ..... | ..... |
| ServiceNow #4 | End of Life support | ..... | ..... | ..... | ..... | ..... |
| Details | ..... | ..... | ..... | ..... | ..... | ..... |
Prioritize, Plan, and Implement
Agree on priorities with the business, considering cost and ROI, then plan and implement the recommendations.
Continuous improvement
Repeat the above steps periodically and keep the troubleshooting guide up to date. First reach the initial goal defined by the agreed metrics, then excel toward the six nines (99.9999%) availability goal.
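To put the six nines goal in perspective, the downtime budget implied by an availability target can be worked out directly. The sketch below assumes a 365-day year; the helper name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap, 365-day year


def allowed_downtime_seconds(nines: int, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Downtime budget for a target written as N nines (3 -> 99.9%, 6 -> 99.9999%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # N nines of availability leaves a fraction of 10^-N for downtime.
    return period_seconds * 10.0 ** -nines


for nines in (3, 4, 5, 6):
    budget = allowed_downtime_seconds(nines)
    print(f"{nines} nines: {budget:,.1f} seconds of downtime per year")
```

Under these assumptions, six nines leaves roughly 31.5 seconds of downtime per year, compared with about 8.76 hours for three nines, which is why each additional nine usually has to be justified against its cost.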
Conclusion
Making an application resilient is a continuous transformative journey that includes people, processes, and technology. It paves the way for a future where software maintenance is faster and more reliable by addressing key challenges in speed, quality, cost-efficiency, and customer experience with automation at its center. This shift promises to elevate the software industry, benefiting organizations and professionals.