A Deep-Dive into Downtime. Why Does it Happen?

Written by Team Egen

Published on Jan 25, 2021

Reading time: 3 min read

  • Cloud
  • Observability
  • Modernize

Successfully handling sales peaks while avoiding downtime should be the goal of any business. We’ll be covering every aspect of downtime in a series of posts, including details of how to build resilience into your cloud architecture so that you minimize your business’s exposure to outages.

While a dictionary defines downtime as time during which a machine or service is unavailable, that’s not the whole story. Any significant impact on an end user is downtime, because it results in a missed business opportunity.

For a business, a 404/503 error is much more than just ‘not being available.’ Every minute of downtime adds up, in both actual costs and lost potential revenue. Research by Gartner and Avaya pegs losses somewhere between $5,600 and $9,000 per minute of downtime for larger enterprises, while for smaller corporations, it sits between $127 and $427 per minute.

Yes, every minute of downtime. And those numbers are pre-COVID! We haven’t even accounted for other losses yet.

You can calculate a rough estimate of your losses from downtime using a simple formula:

Lost revenue = (revenue for the time period ÷ length of the time period) × downtime

Make sure to use the same unit of measure for both time figures. If calculating lost revenue by the second, enter both the downtime and the overall time period in seconds.
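As a minimal sketch, the estimate can be computed in a few lines of Python. It assumes the calculation is revenue for the period, divided by the length of the period, multiplied by the downtime. The function name, parameter names, and the revenue figure in the example are illustrative, not taken from any real business.

```python
def estimate_lost_revenue(revenue_in_period: float,
                          period_length: float,
                          downtime: float) -> float:
    """Rough revenue lost to downtime.

    period_length and downtime must use the same unit
    (both in seconds, both in minutes, and so on).
    """
    if period_length <= 0:
        raise ValueError("period_length must be positive")
    if not 0 <= downtime <= period_length:
        raise ValueError("downtime must lie between 0 and period_length")
    # Revenue per unit of time, multiplied by the time spent offline.
    return revenue_in_period / period_length * downtime


# Hypothetical example: $1,000,000 of revenue in a 30-day month,
# with 90 minutes of downtime. Both time figures are in minutes.
minutes_in_month = 30 * 24 * 60  # 43,200 minutes
loss = estimate_lost_revenue(1_000_000, minutes_in_month, 90)
print(f"Estimated lost revenue: ${loss:,.2f}")  # Estimated lost revenue: $2,083.33
```

The result covers lost revenue only. It leaves out recovery costs, reputational damage, and the other losses discussed in this post.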

Even websites that are slow to respond count toward downtime by that definition. A survey conducted by Retail Systems Research (RSR) from April to May seems to agree. They found that 90% of shoppers ‘will abandon a site if it doesn’t load in a reasonable amount of time.’

Factor in the hit to your reputation and lasting impact to search engine rankings, and even a short outage has big consequences.

“An outage of six hours is enough to drop search rankings by 30 percent, and the damage may last up to 60 days.” – Shopping Cart Elite

COVID only ups the stakes. Both digitization and minimizing downtime have become critical. Data collected by McKinsey & Company shows that 75% of consumers who tried a new shopping behavior intend to continue it beyond the pandemic.

Causes of Downtime

Before we cover how to prevent and mitigate downtime, it’s critical to understand how and why it happens. Causes of downtime can range from a natural disaster to simple human error.

Human Error

We make mistakes. It happens. Unfortunately, this includes the people who manage infrastructure and configure networks, production environments, and security policies.

Many studies have shown that human error is the number one contributor to downtime. It could range from something as simple as changing rules on a firewall in production to altering columns in a live database. One wrong keystroke can cost millions.

Increasing automation is the best way to reduce downtime caused by human error. Although you can’t eliminate human interaction, you can minimize it enough to become both more efficient and less error-prone.

Service Misconfiguration

Bugs and unexpected edge cases can trigger events that cause an outage. The cause may lie in the service codebase itself or in an edge case your engineers did not anticipate.

Most commonly it’s due to regression defects – a planned fix causing an issue elsewhere in the application. QA teams might miss a regression defect, allowing it to move to production where it then promptly wreaks havoc with your core systems. The fact that production environments and testing environments aren’t identical also contributes to this issue.

The most dependable way to deal with misconfiguration is to keep testing and production environments identical, and to move towards increased automation when provisioning and configuring them.
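One way to automate part of this is a drift check that compares the settings of the two environments and reports every difference. The sketch below assumes each environment's configuration can be exported as a flat dictionary. The function name, the setting names, and the values are hypothetical.

```python
from __future__ import annotations

from typing import Any

MISSING = "<missing>"  # marker for a setting that exists in only one environment


def config_drift(testing: dict[str, Any],
                 production: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {setting: (testing_value, production_value)} for every mismatch."""
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(testing.keys() | production.keys()):
        test_value = testing.get(key, MISSING)
        prod_value = production.get(key, MISSING)
        if test_value != prod_value:
            drift[key] = (test_value, prod_value)
    return drift


# Hypothetical environment settings, for illustration only.
testing_env = {"db_version": "13.4", "replicas": 1, "tls": True}
production_env = {"db_version": "12.9", "replicas": 3, "tls": True, "cdn": "on"}

for setting, (test_value, prod_value) in config_drift(testing_env, production_env).items():
    print(f"{setting}: testing={test_value!r} production={prod_value!r}")
```

Running a check like this in a deployment pipeline, and failing the build on unexpected drift, is one design option. Some differences, such as replica counts, may be intentional and would need an allow-list.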

Cloud Provider Outages

Moving to the cloud has significant business and technological advantages, but no one is perfect: no system is up 100% of the time. Major public cloud providers can still guarantee better uptime than typical on-premises infrastructure, and there are ways to minimize your risk.

Things can go wrong anywhere. Whether it’s an entire region going down, or a service outage that impairs your core systems, the results can be catastrophic.

It’s critical to plan for failure and build that plan into your cloud architecture. Build fault-tolerant systems on the assumption that every component will eventually fail, so that no single failure takes down your business applications.
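A small illustration of designing for failure is a client that retries a call with exponential backoff and then fails over to another region. This is a sketch under stated assumptions, not a production pattern. The region names are made up, and `request` stands in for whatever call your application actually makes.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], T],
                       attempts_per_endpoint: int = 2,
                       base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Try each endpoint in order, retrying with exponential backoff.

    Moves on to the next endpoint once one has used up its attempts.
    Raises RuntimeError if every endpoint fails.
    """
    last_error = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                return request(endpoint)
            except OSError as exc:  # includes ConnectionError and TimeoutError
                last_error = exc
                # Back off before retrying the same endpoint, but do not
                # wait before switching to the next one.
                if attempt < attempts_per_endpoint - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all endpoints failed") from last_error
```

With hypothetical endpoints such as `["us-east", "us-west"]`, a failure in the first region costs a short delay instead of an outage, provided the second region can serve the same request.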

Ad hoc Usage Spikes

The COVID-19 pandemic offers a classic example of outages caused by ad hoc usage spikes. Sometimes, when you’re least expecting it, there might be such an inflow of traffic that you find your provisioned infrastructure gasping for air before finally failing and going offline.

Due to the pandemic, e-commerce grew at an unprecedented rate. Unable to handle this influx, several retailers found themselves facing unusual traffic and subsequent outages. The reasons could range from databases being stretched to the limit to applications being bandwidth-starved by the massive inflow of traffic.

Again, the solution lies at the architectural level, where you must build scalability into your systems. One of the cloud’s most significant advantages is its scalability; ignoring that is a mistake.
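To make scalability concrete, here is a sketch of a proportional scaling rule, similar in spirit to the one used by autoscalers such as the Kubernetes Horizontal Pod Autoscaler. The rule is not taken from this post. The load figures, the limits, and the function name are assumptions chosen for illustration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_load_per_replica: float,
                     target_load_per_replica: float,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Scale the replica count in proportion to how far load is from target."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_load_per_replica <= 0:
        raise ValueError("target_load_per_replica must be positive")
    raw = math.ceil(
        current_replicas * current_load_per_replica / target_load_per_replica
    )
    # Clamp so a traffic spike cannot scale costs without limit,
    # and a quiet period cannot scale the service down to nothing.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical spike: 4 replicas each handling 90 requests/s against a
# target of 60 requests/s per replica.
print(desired_replicas(4, 90.0, 60.0))  # 6
```

The upper limit is a deliberate trade-off. It caps spending during a spike, but if it is set too low the service can still be overwhelmed.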

Now that we’ve covered the primary causes of downtime, next up we’ll discuss ways to reduce your downtime and mitigate the issues when any unavoidable outages do occur. Hint: it’s your cloud architecture.
