A Deep-Dive into Downtime. Why Does it Happen?
Successfully handling sales peaks while avoiding downtime should be the goal of any business. We’ll be covering every aspect of downtime in a series of posts, including details of how to build resilience into your cloud architecture – ensuring you minimize your business’ exposure to any outages.
While the dictionary would define downtime as the time during which a machine or service is unavailable, that’s not the whole story. Any significant impact on an end user is downtime, because it results in a missed business opportunity.
For a business, a 404/503 error is much more than just ‘not being available.’ Every minute of downtime adds up, in both actual costs and lost potential revenue. Research by Gartner and Avaya pegs losses somewhere between $5,600 and $9,000 per minute of downtime for larger enterprises, while for smaller corporations it sits between $127 and $427 per minute.
Yes, every minute of downtime. And those numbers are pre-COVID! We haven’t even accounted for other losses yet.
You can calculate a rough estimate of your losses from downtime using a simple formula: multiply the minutes of downtime by your cost per minute.
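As a rough sketch of such an estimate, one common approximation multiplies the length of the outage by the cost you incur per minute, then adds any one-off recovery costs. The function name, its parameters, and the optional recovery cost below are illustrative assumptions, not a formula taken from the research cited above.

```python
def estimate_downtime_cost(
    minutes_down: float,
    cost_per_minute: float,
    recovery_cost: float = 0.0,
) -> float:
    """Rough downtime loss: outage length times per-minute cost, plus fixed costs.

    All three inputs are assumptions you supply for your own business;
    the result is an order-of-magnitude estimate, not an accounting figure.
    """
    if minutes_down < 0 or cost_per_minute < 0 or recovery_cost < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_down * cost_per_minute + recovery_cost


if __name__ == "__main__":
    # A hypothetical 30-minute outage at the low-end enterprise figure of $5,600/minute.
    print(estimate_downtime_cost(30, 5600))
```

Reputation damage and lost search ranking are not captured by a calculation like this, so the result is best treated as a lower bound.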
Even a website that is slow to respond effectively counts as downtime. A survey conducted by Retail Systems Research (RSR) from April to May supports this: it found that ‘90% shoppers will abandon a site if it doesn’t load in a reasonable amount of time.’
Factor in the hit to your reputation and lasting impact to search engine rankings, and even a short outage has big consequences.
“An outage of six hours is enough to drop search rankings by 30 percent, and the damage may last up to 60 days.” – Shopping Cart Elite
COVID only ups the stakes. The need for digitization as well as minimizing downtime has become critical. Data collected by McKinsey & Company shows that 75% of the consumers who tried a new shopping behavior intend to continue it beyond the pandemic.
Causes of Downtime
Before we cover how to prevent and mitigate downtime, it’s critical to understand how and why it happens. Causes of downtime could range from a natural disaster to simple human error.
Human Error
We make mistakes. It happens. Unfortunately, that includes the people who manage infrastructure and configure networks, production environments, and security policies.
Many studies have shown that human error is the number one contributor to downtime. It could range from something as simple as changing rules on a firewall in production to altering columns in a live database. One wrong keystroke can cost millions.
Increasing automation is the best way to reduce downtime caused by human error. Although you can’t eliminate human interaction entirely, you can minimize it enough to become both more efficient and far less error-prone.
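One form this automation can take is a pre-flight check that runs before a change reaches production. The sketch below validates a proposed firewall rule using Python's standard `ipaddress` module. The function name, the rule shape (a source CIDR plus a port), and the `min_prefix_len` threshold are hypothetical choices for illustration, not part of any particular firewall's API.

```python
import ipaddress
from typing import List


def validate_firewall_rule(source_cidr: str, port: int, min_prefix_len: int = 8) -> List[str]:
    """Return a list of problems with a proposed rule; an empty list means it passed.

    The checks are deliberately simple: the CIDR must parse, it must not be
    broader than min_prefix_len (an assumed policy), and the port must be valid.
    """
    try:
        network = ipaddress.ip_network(source_cidr, strict=False)
    except ValueError:
        return [f"invalid CIDR: {source_cidr!r}"]

    problems = []
    if network.prefixlen < min_prefix_len:
        problems.append(f"source range {network} is too broad")
    if not 1 <= port <= 65535:
        problems.append(f"port {port} is out of range")
    return problems


if __name__ == "__main__":
    print(validate_firewall_rule("10.0.0.0/24", 443))
    print(validate_firewall_rule("0.0.0.0/0", 22))
```

Running a check like this in a deployment pipeline means a mistyped rule is rejected by a script rather than discovered in production.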
Bugs and Edge Cases
Bugs and surprise edge cases can trigger unexpected outages. The cause can be a defect in your service’s codebase or a scenario your engineers did not anticipate.
Most commonly it’s due to regression defects – a planned fix causing an issue elsewhere in the application. QA teams might miss a regression defect, allowing it to move to production where it then promptly wreaks havoc with your core systems. The fact that production environments and testing environments aren’t identical also contributes to this issue.
The most reliable way to limit these defects is to keep testing and production environments identical, relying on increased automation when provisioning and configuring them.
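A simple way to keep environments honest is to compare their configuration automatically and flag any drift. The following is a minimal sketch that assumes each environment's settings can be exported as a flat dictionary; the function name and the example keys are hypothetical.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # placeholder shown when a key exists in only one environment


def config_drift(
    test_cfg: Dict[str, Any], prod_cfg: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return every key whose value differs, mapped to (test_value, prod_value)."""
    drift = {}
    for key in sorted(set(test_cfg) | set(prod_cfg)):
        test_value = test_cfg.get(key, MISSING)
        prod_value = prod_cfg.get(key, MISSING)
        if test_value != prod_value:
            drift[key] = (test_value, prod_value)
    return drift


if __name__ == "__main__":
    test_env = {"db_version": "13", "replicas": 1}
    prod_env = {"db_version": "14", "replicas": 1, "tls": True}
    for key, (test_value, prod_value) in config_drift(test_env, prod_env).items():
        print(f"{key}: test={test_value} prod={prod_value}")
```

An empty result means the two environments match on every exported setting; anything else is a candidate explanation for a bug that appears only in production.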
Cloud Provider Outages
Moving to the cloud has significant business and technological advantages, and major public cloud providers can still guarantee better uptime than typical on-premises infrastructure. But no one is perfect: no system is up 100% of the time. There are, however, ways to minimize your risk.
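To see what "less than 100%" means in practice, it helps to convert an availability percentage into the downtime it permits. The sketch below does that arithmetic for a 365-day year; the availability targets in the example are illustrative and not figures quoted from any provider.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(
    availability_percent: float, period_minutes: float = MINUTES_PER_YEAR
) -> float:
    """Minutes of downtime permitted in a period at a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why the gap between 99.9% and 99.99% matters so much at the per-minute costs discussed earlier.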
Things can go wrong anywhere. Whether it’s an entire region going down, or a service outage that impairs your core systems, the results can be catastrophic.
It’s critical to plan for failure and build that assumption into your cloud architecture. Design fault-tolerant systems on the premise that every component will eventually fail, so that a single failure doesn’t take down your business applications.
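One small building block of a fault-tolerant design is retrying a failed call and then failing over to a backup. The sketch below illustrates the idea with plain Python callables standing in for regional endpoints. The function name, the retry counts, and the backoff values are assumptions for illustration rather than recommendations.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    endpoints: Sequence[Callable[[], T]],
    attempts_per_endpoint: int = 2,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each endpoint in order, retrying with exponential backoff before moving on."""
    last_error: Optional[Exception] = None
    for call in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return call()
            except Exception as exc:  # a real system would catch narrower error types
                last_error = exc
                # Back off before retrying the same endpoint, but not before failing over.
                if attempt < attempts_per_endpoint - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all endpoints failed") from last_error


if __name__ == "__main__":
    def primary_region() -> str:
        raise ConnectionError("primary region unavailable")

    def secondary_region() -> str:
        return "served from secondary region"

    print(call_with_failover([primary_region, secondary_region]))
```

The caller never sees the primary's failure here; it only sees an error if every endpoint in the list is exhausted.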
Ad hoc Usage Spikes
The COVID-19 pandemic is a classic example of how ad hoc usage spikes cause outages. Sometimes, when you’re least expecting it, there might be such an inflow of traffic that you find your provisioned infrastructure gasping for air before finally failing and going offline.
Due to the pandemic, e-commerce grew at an unprecedented rate. Unable to handle this influx, several retailers faced unusually high traffic and subsequent outages. The causes ranged from databases being stretched to the limit to applications being starved of bandwidth by the massive inflow of traffic.
Again, the solution lies at the architectural level, where you must build scalability into your systems. One of the cloud’s most significant advantages is its scalability; ignoring that is a mistake.
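To make the idea of built-in scalability concrete, here is a minimal sketch of a proportional scaling rule: grow or shrink the number of instances by the ratio of observed load to target load, within fixed bounds. It is similar in spirit to the rule documented for the Kubernetes Horizontal Pod Autoscaler, but the function name, parameters, and default limits are assumptions for illustration.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_load: float,
    target_load: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Scale the replica count in proportion to load, clamped to [min, max].

    current_load and target_load are per-replica averages in the same unit,
    for example CPU percent or requests per second.
    """
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current_replicas * current_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Four instances averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_load=90, target_load=60))
```

The upper bound is a cost guardrail: without it, a sudden spike could scale the system to whatever size the traffic demands, along with the bill.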
Now that we’ve covered the primary causes of downtime, next up we’ll discuss ways to reduce your downtime and mitigate the issues when any unavoidable outages do occur. Hint: it’s your cloud architecture.