A Deep-Dive into Downtime. Why Does it Happen?
Handling sales peaks without downtime should be a goal of every business. We’ll cover every aspect of downtime in a series of posts, including how to build resilience into your cloud architecture so that you minimize your business’s exposure to outages.
The dictionary defines downtime as time during which a machine or service is unavailable, but that’s not the whole story. Any significant impact on an end user counts as downtime, because it results in a missed business opportunity.
For a business, a 404 or 503 error means much more than just ‘not being available.’ Every minute of downtime adds up in actual costs and lost potential revenue. Research by Gartner and Avaya pegs losses at between $5,600 and $9,000 per minute of downtime for larger enterprises, and between $127 and $427 per minute for smaller corporations.
Yes, every minute of downtime. And those numbers are pre-COVID! We haven’t even accounted for other losses yet.
You can calculate a rough estimate of your losses from downtime using a simple formula:
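A minimal sketch of such an estimate, assuming the simplest common form: revenue lost per minute multiplied by minutes of downtime, plus any fixed recovery cost. The function and parameter names are hypothetical, and a fuller estimate may include more terms (lost productivity, SLA penalties, and so on).

```python
def estimate_downtime_cost(minutes_down, revenue_per_minute, recovery_cost=0):
    """Rough downtime loss: revenue lost during the outage plus fixed recovery costs.

    All three inputs are assumptions you supply from your own business data.
    """
    if minutes_down < 0 or revenue_per_minute < 0 or recovery_cost < 0:
        raise ValueError("inputs must be non-negative")
    return minutes_down * revenue_per_minute + recovery_cost


# A 30-minute outage at the low end of the enterprise range quoted above.
print(estimate_downtime_cost(30, 5600))
```

Even this crude version makes the point: at enterprise rates, a half-hour outage runs into six figures before reputational damage is counted.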
Even a website that is merely slow to respond can amount to downtime in practice. A survey conducted by Retail Systems Research (RSR) from April to May supports this. It found that ‘90% shoppers will abandon a site if it doesn’t load in a reasonable amount of time.’
Factor in the hit to your reputation and lasting impact to search engine rankings, and even a short outage has big consequences.
“An outage of six hours is enough to drop search rankings by 30 percent, and the damage may last up to 60 days.” – Shopping Cart Elite
COVID only ups the stakes. Both digitization and minimizing downtime have become critical. Data collected by McKinsey & Company shows that 75% of consumers who tried a new shopping behavior intend to continue it beyond the pandemic.
Causes of Downtime
Before we cover how to prevent and mitigate downtime, it’s critical to understand how and why it happens. Causes of downtime could range from a natural disaster to simple human error.
Human Error
We make mistakes. It happens. Unfortunately, that includes the people who manage infrastructure and configure networks, production environments, and security policies.
Many studies have shown that human error is the number one contributor to downtime. It could range from something as simple as changing rules on a firewall in production to altering columns in a live database. One wrong keystroke can cost millions.
Increasing automation is the best way to reduce downtime caused by human error. You can’t eliminate human interaction, but you can minimize it enough to become both more efficient and less error-prone.
Bugs and Edge Cases
Bugs and surprise edge cases can trigger an unexpected outage. The cause can lie anywhere from the service codebase itself to an edge case your engineers didn’t anticipate.
Most commonly it’s due to regression defects: a planned fix that causes an issue elsewhere in the application. QA teams might miss a regression defect, allowing it to move to production, where it promptly wreaks havoc with your core systems. The fact that production and testing environments aren’t identical also contributes to this issue.
The most dependable way to deal with this kind of misconfiguration is to keep testing and production environments identical, and to rely increasingly on automation when provisioning and configuring them.
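One way to automate that check is to compare environment settings before each release. The sketch below is illustrative only: it assumes each environment's configuration can be loaded as a flat dictionary, and the function name and setting keys are hypothetical.

```python
MISSING = "<missing>"  # placeholder shown when a key exists in only one environment


def config_drift(test_cfg, prod_cfg):
    """Return {key: (test_value, prod_value)} for every setting that differs."""
    drift = {}
    for key in sorted(set(test_cfg) | set(prod_cfg)):
        test_value = test_cfg.get(key, MISSING)
        prod_value = prod_cfg.get(key, MISSING)
        if test_value != prod_value:
            drift[key] = (test_value, prod_value)
    return drift


# Hypothetical settings for two environments.
testing = {"db_pool_size": 10, "tls_enabled": True}
production = {"db_pool_size": 50, "tls_enabled": True, "cache_ttl": 300}

for key, (test_value, prod_value) in config_drift(testing, production).items():
    print(f"{key}: testing={test_value!r} production={prod_value!r}")
```

Running a comparison like this in a deployment pipeline, and failing the build on unexpected drift, turns "keep the environments identical" from a policy into an enforced check.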
Cloud Provider Outages
Moving to the cloud has significant business and technological advantages, and major public cloud providers can still guarantee better uptime than typical on-premises infrastructure. But no one is perfect: no system is up 100% of the time. There are, however, ways to minimize your risk.
Things can go wrong anywhere. Whether it’s an entire region going down, or a service outage that impairs your core systems, the results can be catastrophic.
It’s critical to plan for failure and build that assumption into your cloud architecture. Design fault-tolerant systems on the premise that every component will eventually fail, so that no single failure takes down your business applications.
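To make the "assume everything fails" idea concrete, here is a minimal sketch of failover with retries. It is an illustration under stated assumptions, not a production pattern: each "endpoint" is simply a Python callable standing in for a call to a region or replica, and all names are hypothetical.

```python
import time


def call_with_failover(endpoints, retries_per_endpoint=2, backoff_seconds=0.0):
    """Try each endpoint in order and return the first successful result.

    Each endpoint is retried with exponential backoff before moving on
    to the next one. Raises RuntimeError if every endpoint fails.
    """
    last_error = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return endpoint()
            except Exception as exc:  # broad on purpose: any failure triggers failover
                last_error = exc
                time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("all endpoints failed") from last_error


# Hypothetical stand-ins for calls to two regions.
def primary_region():
    raise ConnectionError("primary region unavailable")


def secondary_region():
    return "served from secondary"


print(call_with_failover([primary_region, secondary_region]))
```

The design choice that matters is that the caller never assumes the primary is healthy. The fallback path runs on every failure, so it is exercised regularly rather than discovered during an outage.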
Ad hoc Usage Spikes
The COVID-19 pandemic produced a classic example of outages caused by ad hoc usage spikes. Sometimes, when you least expect it, traffic floods in so fast that your provisioned infrastructure gasps for air before finally failing and going offline.
Due to the pandemic, e-commerce grew at an unprecedented rate. Unable to handle the influx, several retailers faced unusually high traffic and subsequent outages. The causes ranged from databases being stretched to the limit to applications being starved of bandwidth by the massive inflow of traffic.
Again, the solution lies at the architectural level, where you must build scalability into your systems. One of the cloud’s most significant advantages is its scalability; ignoring that is a mistake.
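As a rough sketch of what building scalability in can mean, the function below computes how many instances a given load needs, clamped between a floor and a ceiling. It is a simplified illustration, not any cloud provider's autoscaling API. The names and the per-instance capacity figure are assumptions.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=2, max_instances=20):
    """Instances needed to serve the current load, within configured bounds."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Never drop below the floor (redundancy) or exceed the ceiling (cost control).
    return max(min_instances, min(max_instances, needed))


# Assume each instance handles about 100 requests per second.
for load in (50, 450, 5000):
    print(load, "rps ->", desired_instances(load, 100), "instances")
```

The floor keeps redundancy in place during quiet periods, while the ceiling caps cost. If a spike hits the ceiling, that is a signal to revisit the limit, not something to absorb silently.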
Interested in discussing ways to reduce your downtime and mitigate the issues when any unavoidable outages do occur? Contact us at firstname.lastname@example.org to schedule a call.