Online outages are serious. Vendors lose money for every minute their users can’t reach their web services, and business productivity tanks when employees can’t access the web applications they rely on to get their jobs done. People can be convinced to forgive the occasional blip, but full-blown outages reinforce the impression that nothing truly critical should be entrusted to the cloud.

A look at some of the outages over the past year reveals a disturbing pattern. The move to cloud-based architectures and applications has reduced complexity in IT infrastructure, but that simplicity has come at the cost of resiliency. IT regularly has to balance redundancy, which improves resiliency, against the complexity it adds, and recent outages show that redundancy keeps losing that trade. Taking the time to assess potential “what if” scenarios and plan for the worst case could have, if not prevented these outages, at least minimized their effects.

“IT needs to plan for redundancy on critical services,” said Nick Kephart, a senior director at network infrastructure monitoring company ThousandEyes.

Department of Redundancy Department

Redundancy is a basic IT tenet. Whether it’s running multiple backend servers for the same web applications or setting up disk drives in RAID arrays, IT regularly ensures availability even in the case of a failure. Yet the massive DDoS attack against DNS (Domain Name System) service provider Dyn showed that many organizations had failed to think about redundancy for their critical infrastructure.

The Dyn incident wasn’t an aberration: cloud-based DNS provider NS1 was hit earlier in the year, and a June attack targeted core pieces of the global DNS infrastructure. “It was a large-scale attack on the most critical part of the internet infrastructure and resulted in roughly three hours of performance issues,” said Archana Kesavan, a manager at ThousandEyes.

For many enterprises, Dyn seemed like the logical way to address redundancy for DNS services because Dyn already provides a distributed architecture. IT teams don’t want multiple DNS providers because doing so adds complexity to the network infrastructure, but DNS outages can and do happen, so IT teams need to double or even triple up on their DNS providers. IT should also lower the time-to-live (TTL) settings on their DNS records so that traffic can be redirected to the backup provider faster in case of an outage at the primary one.
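To make that concrete, here is a minimal sketch, in Python, of the kind of audit an IT team might script against its own zone. It assumes the third-party dnspython package is installed; the domain name and the TTL threshold are placeholders for illustration, not recommendations.

    # Sanity-check DNS redundancy and TTLs (sketch; assumes dnspython: pip install dnspython).
    import dns.resolver

    DOMAIN = "example.com"    # hypothetical zone to audit
    MAX_TTL_SECONDS = 300     # illustrative threshold for "fast enough" failover

    resolver = dns.resolver.Resolver()

    # 1. Are the authoritative name servers spread across more than one provider?
    ns_answer = resolver.resolve(DOMAIN, "NS")
    nameservers = sorted(str(rr.target).rstrip(".") for rr in ns_answer)
    providers = {ns.split(".", 1)[1] for ns in nameservers}  # crude grouping by parent domain
    print("Name servers:", nameservers)
    if len(providers) < 2:
        print("WARNING: all name servers appear to belong to a single provider")

    # 2. Is the record TTL low enough to redirect traffic quickly during an outage?
    a_answer = resolver.resolve(DOMAIN, "A")
    ttl = a_answer.rrset.ttl
    print("A record TTL:", ttl, "seconds")
    if ttl > MAX_TTL_SECONDS:
        print("WARNING: a high TTL will slow failover to a backup provider")

Neither check replaces real monitoring, but either warning is a sign that a Dyn-style outage would hit harder than it needs to.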

The Powerball website couldn’t keep up with the frenzy surrounding the record jackpot. Neither the application nor the network could handle the uptick in traffic, leading to increased packet loss and extended page load times. Powerball avoided a complete meltdown by distributing traffic across Verizon’s Edgecast CDN network, Microsoft’s data center, and the Multi-State Lottery Association data center just before the drawing. “The damage was already done, and user experience to the website was sub-standard,” Kesavan said.

Pokémon Go’s servers experienced similar outages when the combination of network architecture and overloaded target servers prevented users from playing the game. Apple’s servers also struggled during a much-anticipated product launch, with sporadic outages affecting all of its online stores, including the iOS App Store, Mac App Store, Apple TV, and Apple Music.

Benchmarking and capacity planning are critical, especially before software updates and large-scale events. No matter how well the network architecture is designed, sudden spikes in traffic can overwhelm it; CDNs and anycast servers can help absorb the load and preserve the user experience.
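As a rough illustration of what even lightweight benchmarking looks like, the sketch below fires a burst of concurrent requests at an endpoint and reports latency and error counts. The URL, request count, and concurrency level are hypothetical, and a real capacity-planning exercise would rely on purpose-built load-testing tools and production-like traffic.

    # Minimal load-test sketch using only the Python standard library.
    import concurrent.futures
    import statistics
    import time
    import urllib.request

    URL = "https://example.com/"   # hypothetical endpoint under test
    REQUESTS = 200                 # total requests to issue
    CONCURRENCY = 20               # simultaneous workers

    def timed_request(_):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(URL, timeout=10) as resp:
                resp.read()
                ok = 200 <= resp.status < 400
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        results = list(pool.map(timed_request, range(REQUESTS)))

    latencies = sorted(t for ok, t in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    if latencies:
        p95 = latencies[int(len(latencies) * 0.95) - 1]
        print(f"median {statistics.median(latencies):.3f}s  p95 {p95:.3f}s  errors {errors}/{REQUESTS}")
    else:
        print(f"all {REQUESTS} requests failed")

Run against a staging copy of the site before a big event, numbers like these make it obvious whether the application, the network, or both will buckle first.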

Did we say redundancy yet?

Don’t forget about infrastructure redundancy, either. It’s tempting for IT teams to think, “My ISP can handle this, I don’t need to do anything else,” but even upstream providers can have outages, whether because of a configuration mistake, a hardware failure, or a security incident, Kephart said. Networks by nature will have outages and face security threats, so IT needs to design into the network architecture the flexibility to react when something fails. Enterprises generally do a good job of building redundancy within their own data centers, but they overlook doing the same for third-party infrastructure providers.

Don’t rely on a single provider, because that becomes a single point of failure. Distribute dependencies across ISPs, DNS providers, and hosting companies.
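The monitoring side of that advice can be as simple as regularly probing every provider you depend on, not just the primary. The sketch below checks a hypothetical set of HTTP health endpoints spread across providers; the hostnames are illustrative, and the point is the alert condition, which should fire well before the last healthy provider disappears.

    # Sketch of a multi-provider health check; hostnames and paths are hypothetical.
    import urllib.request

    ENDPOINTS = {
        "primary-hosting": "https://app.provider-a.example/health",
        "backup-hosting":  "https://app.provider-b.example/health",
        "cdn-edge":        "https://cdn.provider-c.example/health",
    }

    def is_healthy(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    statuses = {name: is_healthy(url) for name, url in ENDPOINTS.items()}
    for name, healthy in statuses.items():
        print(f"{name}: {'up' if healthy else 'DOWN'}")

    # Alert while there is still time to act, not after the only provider has failed.
    if sum(statuses.values()) <= 1:
        print("ALERT: down to one (or zero) healthy providers; single point of failure")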

It is hard to justify redundancy and security decisions when the only way to tell whether they worked is to be able to say, “Hey, we didn’t get hacked,” or, “We didn’t have an outage,” at the end of the year. Those are great goals, but when there are competing demands, it’s hard to justify the extra expense or added complexity to guard against bad things that might never happen. But that’s the kind of calculus IT needs to be doing every day.