Once you have a working and proven DR solution (see the first post in this series), you will need to create and document your procedure for declaring a disaster. Detail the following procedure in your DR Runbook:
Start with who has the authority to declare a disaster. I always recommend that only members of the senior leadership team be authorized to declare a disaster. Why? Simply because the act of declaring a disaster will have a significant impact on the business. Whether that impact comes from additional costs or disruption to customers, the decision should come from someone who can see the entire field.
Next, have an authorized member of your team contact your support team (ideally, that is an experienced partner, like Contegix) to communicate the situation and request that a disaster be declared. You will need to be ready to communicate the scope of the disaster and which protection groups you need online at the recovery location (more on that soon). Keep in mind this is the point where the RTO (recovery time objective) timer starts.
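To make the declaration step concrete, here is a minimal Python sketch of how a runbook tool might record a declaration and start the RTO clock. Everything in it is hypothetical and for illustration only: the AUTHORIZED_DECLARERS roster, the DisasterDeclaration fields, and the declare_disaster helper are assumptions, not part of any Contegix tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical roster; the real list belongs in your DR Runbook.
AUTHORIZED_DECLARERS = {"ceo", "cto", "coo"}


@dataclass(frozen=True)
class DisasterDeclaration:
    declared_by: str
    declared_at: datetime      # the moment support was contacted
    scope: str                 # plain-language description of the outage
    protection_groups: tuple   # groups requested at the recovery location
    rto: timedelta             # agreed recovery time objective

    @property
    def rto_deadline(self) -> datetime:
        # The RTO timer starts when the disaster is declared.
        return self.declared_at + self.rto


def declare_disaster(declared_by, scope, protection_groups, rto, now=None):
    """Validate the caller and return a timestamped declaration record."""
    if declared_by not in AUTHORIZED_DECLARERS:
        raise PermissionError(f"{declared_by} is not authorized to declare a disaster")
    if not protection_groups:
        raise ValueError("name at least one protection group to bring online")
    declared_at = now or datetime.now(timezone.utc)
    return DisasterDeclaration(declared_by, declared_at, scope,
                               tuple(protection_groups), rto)
```

Recording the timestamp at the moment of the call, rather than when the first VM boots, keeps the RTO measurement honest.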
From there, you are going to decide what you need to fail over. If everything is down, your answer is easy. On the other hand, if you are experiencing a disaster that impacts a subset of your environment, then you will need to figure out which predefined protection groups need to come online at the DR location to restore your critical systems. The option to bring online only parts of the environment comes with its own set of challenges and requires planning and testing before the disaster, but it is often the way to go. For example, if you have 50 VMs under protection, you may not want to fail over all 50 when only 5 are negatively affected by the unplanned outage.
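The partial-failover decision can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real product API: the group and VM names are made up, and it assumes the protection group is the smallest unit you can fail over, so one affected VM pulls its whole group along.

```python
def groups_to_fail_over(protection_groups, affected_vms):
    """Return the names of groups that contain at least one affected VM."""
    affected = set(affected_vms)
    return sorted(name for name, vms in protection_groups.items()
                  if affected & set(vms))


def vms_in_groups(protection_groups, group_names):
    """Return every VM that comes online when the given groups fail over."""
    return sorted({vm for name in group_names for vm in protection_groups[name]})


# Hypothetical layout: 50 protected VMs in 10 groups of 5.
protection_groups = {
    f"group-{g:02d}": [f"vm-{g * 5 + i:02d}" for i in range(5)]
    for g in range(10)
}
affected = ["vm-00", "vm-01", "vm-02", "vm-03", "vm-07"]

selected = groups_to_fail_over(protection_groups, affected)
print(selected)                                          # ['group-00', 'group-01']
print(len(vms_in_groups(protection_groups, selected)))   # 10
```

Note that the 5 affected VMs here span two groups, so 10 VMs move rather than 5. That is one reason the grouping deserves planning and testing before the disaster.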
Once you have contacted your provider or otherwise started the failover process, and your virtual machines are coming online at the recovery location, the fun starts!
As these servers come online at the recovery location, I want you to imagine for a second that you are in their place instead. Just go with it! Imagine you were bonked over the head, flown at light speed to a different country, and a dark hood was pulled off your head – just like the movies. You are going to be a bit confused, and so are these servers! The good news is that your DR Runbook will guide you to get back to work.
You will need to tackle networking next, and yes, that is because networking is the center of the universe. Hopefully, during the initial configuration of your environment, the DR networking was tested and perfected. However, let’s pause here a moment and talk about changing IPs at the recovery location. I have been through this with many customers, and my advice is always the same: if your application supports IP changes, you are going to want to change IPs at the recovery location. I could go on for a while here, but the main reason is that doing so allows for a partial failover of an environment.
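One common way to plan an IP change is to keep each server's host offset and swap only the subnet. The sketch below uses Python's standard ipaddress module; the remap_ip helper and the example subnets are hypothetical, and it assumes the production and recovery subnets are the same size.

```python
import ipaddress


def remap_ip(address, source_subnet, recovery_subnet):
    """Map an address into the recovery subnet, preserving its host offset."""
    src = ipaddress.ip_network(source_subnet)
    dst = ipaddress.ip_network(recovery_subnet)
    ip = ipaddress.ip_address(address)
    if ip not in src:
        raise ValueError(f"{address} is not in {source_subnet}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("source and recovery subnets must be the same size")
    # Distance of this host from the start of its production subnet.
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


print(remap_ip("10.10.20.15", "10.10.20.0/24", "172.16.20.0/24"))  # 172.16.20.15
```

A predictable mapping like this makes it easier to prepare DNS and firewall changes ahead of time, whatever scheme you end up using.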
Once the VMs are coming back online, it is time to make them work at the recovery location. Occasionally a VM comes online and requires zero interaction to work, but that is most definitely the exception, not the norm. Therefore, your runbook will require details that guide your team through the changes necessary to restore functionality of the systems. If that discovery and documentation work was completed before the disaster, this should go smoothly.
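If you keep those runbook details as structured data rather than free text, tracking progress during a failover gets easier. Here is a small Python sketch; the RunbookEntry class, the step names, and the VM names are all hypothetical examples, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    vm: str
    steps: list                        # ordered post-failover changes
    done: set = field(default_factory=set)

    def complete(self, step):
        if step not in self.steps:
            raise ValueError(f"{step!r} is not a documented step for {self.vm}")
        self.done.add(step)

    def pending(self):
        # Preserve the documented order of the remaining steps.
        return [s for s in self.steps if s not in self.done]


def outstanding(entries):
    """Return {vm: remaining steps} for every VM that still needs work."""
    return {e.vm: e.pending() for e in entries if e.pending()}


web = RunbookEntry("web-01", ["update IP", "update DNS record", "verify login page"])
db = RunbookEntry("db-01", ["update IP", "verify replication"])
web.complete("update IP")
print(outstanding([web, db]))
```

Rejecting undocumented steps is a deliberate choice in this sketch: if the team needs a step that is not in the runbook, that is a gap to record and fix after the event.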
A Note about Business Continuity
Business Continuity (BC) is very important to your business, but it is NOT a substitute for DR. They work together, but are very different. Business Continuity is the act of planning how your company will react in the event of a disaster. This can involve (but is not limited to) a succession plan for your leadership team, a communication plan for your customers and the media, and a logistics plan for how your employees will work should they no longer have an office to work in. You can compare it to your personal will and testament. This is a bit gruesome, I know, but think about it: the idea is to document how to go on after a major event. The DR piece of BC/DR is just the part that brings the systems back online (using the DR Runbook). A proper, thorough BC/DR plan will guide your team to restore all functions of the organization, not just your IT systems.
I’ve read that Warren Buffett requires all his leaders to have a successor named and ready to step in should the leader no longer be able to fulfill his or her duties. This is an example of a greater BC/DR plan that includes actions to take if the organization’s leader is lost at the same time as the office building and data center.
Join me in the next part of this series, where I discuss what happens once you declare a disaster.
Brian Frank is Product Delivery Director at Contegix, owning the vision, execution, and management of the product delivery strategy and roadmap. In this role, Brian works with all functional areas of the operations team to develop product releases.
Brian’s responsibilities also include product selection guidance, leading requirement gathering efforts with key stakeholders, taking part in product solution architecture, and successful delivery of early adopter solutions.