Remedy 2: Plan, document and test. Needless to say, having good, up-to-date documentation of the network configuration is essential, but it should also be kept in hard copy, off the network, so that it remains accessible when the network fails. Having an incident response plan is also crucial for understanding roles and responsibilities and for avoiding a crowd-sourcing approach to problem solving.
Similarly, setting up effective communication channels, both within IT and with end-users, is something that needs to be planned. During our outage, we were left without any way to communicate with end-users beyond walking around the campus informing them. We did identify an option of using the public address system, but since no policy was in place for using it outside of emergency situations, that option couldn't be used.
Risk 3: The greater the reliability, the greater the risk of not having people who can fix a problem. I used to own an old car that required a good deal of routine maintenance: oil, water, and a tire change once every two months. I'm no mechanic, but I became quietly confident that I could fix any minor problems that came up. About four years ago, I bought a brand new car that never fails, and should anything go wrong, a warning light will tell me to take it to the dealer. I'm no longer expected to fix my car, and therefore I've lost the skill to do so. The car analogy can be applied to network equipment. Network staff are in danger of having increasingly superficial knowledge of equipment and may be at a loss when it comes to in-depth troubleshooting.
Remedy 3: Get the right people in-house or outsource it. IT is known for attracting a certain type of personality, but within IT there are specializations, and networking is one that requires a certain obsession that is difficult for outsiders to understand. To build and maintain a reliable network, at least two of these people are required. The alternative is to outsource the problem.
While some may feel uncomfortable about outsourcing such a key element of IT operations, the reality is that there are many outsourcing companies with specialists who can help you address critical needs. The choice will vary by organization, but in an increasingly industrialized IT landscape, the case for maintaining such skills in-house is becoming harder to make.
While these risks and remedies primarily concern the operational level, IT of course has to operate within the broader organizational business continuity plans, driven by business requirements. Gartner, for example, has identified five principles of organizational resilience, of which only one relates tightly to IT (Systems, with the other four being Leadership, Culture, People and Settings). Using scenario planning, our intention is to analyze some of the broad risks we face, understand the potential impact at the business level, and then identify options for reducing those risks.
The identification of these three risks and remedies has helped us move forward after our outage. I hope they can help you avoid such outages in your organization.