In my organization we manage almost all IT in-house, including the LAN, which is highly redundant. We have 35 Floor Wiring Concentrators around campus, each with around 300 active ports, and the concentrators have dual Gigabit uplinks to network cores in two data centers that are 300 meters apart. The data centers are linked by multiple 10Gb links, and each is connected to the Internet via two trunks that follow different paths.
That should be just about bullet-proof, right?
Well, we recently suffered a four-hour outage. One of the data center distribution switches began generating malformed packets, which propagated to the two backbone switches. These became unstable, the spanning tree algorithm broke down, and all network connectivity for end users was lost. To make matters worse, restarting the two core switches caused two of their modules to fail, a known problem that had been flagged in the vendor updates, which we hadn't read.
Our response to the outage was professional but ad hoc, and the minutes spent trying to resolve the problem slipped into hours. We didn't have a plan for responding to this type of incident and, as luck would have it, our one and only network guru was away on leave. In the end, we needed vendor experts to identify the cause and recover the situation.
From this experience we identified three key risks to network continuity and three corresponding remedies.
Risk 1: The greater the complexity of failover, the greater the risk of failure. The original network design from around ten years ago had been simple, with one core switch in each data center and one interconnection. As the network grew, it became necessary to add multiple distribution layers within each of the data centers, and this additional complexity has made troubleshooting more difficult, especially when the root cause is a subtle problem rather than an outright hardware failure. Basically, the network is more complex than it needs to be.
Remedy 1: Make the network no more complex than it needs to be. This is a key architecture design principle that is often overlooked by zealous network engineers (or zealous vendor salespeople). The philosophical question that also arises is whether it is actually worthwhile to invest in a zero-downtime system, or whether it is better to have simpler and cheaper manual failover mechanisms, which potentially imply more outages but with shorter recovery times. That decision will depend on the organization, but for ours we are seriously considering dumbing down the network in order to keep downtime to a minimum when outages do occur.
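To make that trade-off concrete, here is a rough back-of-the-envelope sketch that treats expected downtime as outage frequency multiplied by recovery time. Apart from the four-hour recovery time taken from our own outage, every figure is a hypothetical assumption chosen purely for illustration, not a measurement from our network, and the function name is made up for this example.

```python
def expected_annual_downtime_hours(outages_per_year: float,
                                   hours_to_recover: float) -> float:
    """Expected downtime per year = outage frequency x time to recover."""
    return outages_per_year * hours_to_recover


# Hypothetical: a complex, highly redundant design that fails rarely
# (assumed once every two years) but takes four hours to recover.
complex_design = expected_annual_downtime_hours(outages_per_year=0.5,
                                                hours_to_recover=4.0)

# Hypothetical: a simpler design with manual failover that fails more
# often (assumed twice a year) but is back up within half an hour.
simple_design = expected_annual_downtime_hours(outages_per_year=2.0,
                                               hours_to_recover=0.5)

print(f"Complex design: {complex_design} hours of downtime per year")
print(f"Simple design:  {simple_design} hours of downtime per year")
```

Under these assumed numbers the simpler design loses less time per year despite failing four times as often. With different assumptions the answer could easily reverse, which is why the decision depends on the organization.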
Risk 2: The greater the reliability, the greater the risk of not having operational procedures in place to respond to a crisis. As the saying goes, success leads to complacency, and a reliable network makes it easy to neglect operational monitoring and business continuity response plans. With a bullet-proof configuration such as ours, the network can't fail, so who would need a plan, right? Even after our outage, some of our technicians were saying it was a one-off and couldn't happen again.