Doing It Yourself versus Purchasing a Service
What do you need in order to have an offsite business continuity / disaster recovery environment? It depends on your target RPO (Recovery Point Objective) and RTO (Recovery Time Objective): respectively, how much data you are willing to lose in a disaster, and how long it will take before the organization’s core IT systems are back online.
If your organization can afford many hours, sometimes days, before the systems are back online, then you may need nothing more than a remote backup. Just bear in mind that backing up files is usually not the difficult part; restoring the backed-up data is the big issue, starting with the bandwidth needed to download gigabytes upon gigabytes over the network, or the wait for a tape to be shipped from a remote location.
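To make this concrete, here is a minimal back-of-the-envelope sketch; the backup size, link speed, and backup interval are illustrative assumptions, not figures from any particular vendor or study.

```python
# Back-of-the-envelope restore-time and RPO estimates.
# All figures are illustrative assumptions.

backup_size_gb = 500        # assumed size of the backup set
link_mbps = 100             # assumed usable download bandwidth

# 500 GB = 500 * 8 * 1000 megabits (decimal units, protocol overhead ignored)
transfer_hours = (backup_size_gb * 8 * 1000) / link_mbps / 3600
print(f"Raw restore transfer: ~{transfer_hours:.1f} hours")   # ~11.1 hours

# RPO is bounded by how often data leaves the building:
backup_interval_hours = 24  # assumed nightly backup
print(f"Worst-case data loss (RPO): up to {backup_interval_hours} hours")
```

Eleven-plus hours of raw transfer time, before a single server has been rebuilt, is exactly the kind of RTO surprise this exercise is meant to expose.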
Now, for your core systems, you need a business continuity solution that provides an RTO and an RPO very close to zero. To achieve such aggressive RPO and RTO targets, you need a disaster recovery environment with real-time data replication and the ability to fail over, and later fail back, with no manual intervention. Building such an environment requires acquiring software (OS licenses, DR software), hardware (servers, storage, switches, routers, firewalls), secondary datacenter space (power, cooling, security), networking (Internet, VPNs, private connectivity), and people. You will need 24×7 support, monitoring, and maintenance.
Once you have been through the strenuous process of selecting vendors and acquiring each one of these components, you will have to assemble it all yourself. No single vendor in the market is going to do the assembly work for you. Each may support the implementation of its own component, but I am sure that none is going to implement other vendors’ products.
Now think about a solution that is ready for use. Install software agents that do not even require a restart of the production servers, synchronize the data between the primary and secondary sites, and there you go: you’re protected, and so is your organization, along with its valuable information.
Secure my Data
“OK, so, tell me: How do you ensure that my data is secured with your services?”
The world of technology is full of risks. We know that. Data security is nothing more than managing risks. If you invest too much, you are going to be wasting precious financial resources. On the other hand, if you invest too little, you’re exposing your organization to unnecessary risks that could have been easily avoided.
There are, however, four major groups of factors that will enhance security if handled properly, or jeopardize it even further if overlooked.
First is technology: Using best-of-breed technology helps reduce the risk of an unexpected failure; these are just some of the vendors that come to mind: servers from Cisco/IBM/HP, firewalls from Fortinet/Check Point, storage from Hitachi/EMC, switches from Cisco/Juniper, hypervisors from VMware, and so on. If a failure does happen, you will have the proper support to get the problem solved as quickly as possible. Then you ask: if this is true, why does Google build its own servers? Should they not use top-of-the-line hardware and software? Well, not necessarily. They have so many servers that the risk of one or two failing and disrupting Google’s services is negligible. The question is: do you have as many servers as Google?
In addition, the biggest foes of business continuity are not major events such as hurricanes, earthquakes, fire, or any other overarching event. Minor events that occur frequently, such as disk crashes, burned-out servers, or corrupted data, can wreak havoc on the lives of system administrators and IT operations. By using the highest-grade equipment, backed by a robust maintenance and support contract, you will be able to minimize such issues as much as possible.
Second is people: According to the Uptime Institute, human error is responsible for 70 percent of datacenter downtime. Someone will trip on a power cable, unplug the wrong network cable, or misconfigure a router or a switch, and there you go: IT systems are down.
So, how do you reduce this risk? Experienced, certified people will ensure that power cables do not run across the floor and that network cables are properly tagged, color-coded, neatly tied, and organized. I recently visited a colocation facility of a major datacenter operator in the United States, and the differences in organization from one cage to the next were staggering. While some cages were clean, with neatly arranged cables and equipment, others looked like a multicolored mess of network cables intertwined with power cables running in all possible directions. Well, people do this, not machines.
Third is processes: Processes are extremely important and should be followed, period. Well-designed processes ensure that your hardware and software are properly maintained to minimize failures, and they guarantee that people are following best practices.
Fourth, and final, is legal: Various studies show that data security breaches, intentional and unintentional alike, often come from internal sources. Remember Edward Snowden? Even though he was a contractor, he was working internally, as a member of the NSA staff.
Replicating Errors?
If you’ve worked long enough in IT, you have already experienced blue screens… that “nice” feeling at the end of server maintenance when your operating system starts to load and then, all of a sudden, boom: it crashes into a blue screen, losing data. “I am glad I have a DR environment, so I can fail over and have users migrated to the offsite environment while I fix the problems in the production environment.”
Well, if you are using VM replication, this may not be the case. With VM replication, the operating system, applications, and data are all part of the same image being replicated, meaning that your blue-screen problem was replicated to your DR environment as well.
Effective BC/DR solutions separate the data from the OS and applications, ensuring that you have a perfect, workable copy of your environment to which you can fail over in minutes if there is a problem with your production systems. Not only that: your DR solution should allow you to perform maintenance on the replica site first, running live compatibility and stability tests before you apply the same patches and updates to your production site. This significantly reduces the risk of unexpected problems in production.
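Here is a minimal sketch of that separation, assuming a hypothetical replication agent API; the volume paths, job fields, and image name below are invented for illustration, not any real product’s interface.

```python
# Sketch: replicate application data continuously, but never the OS image.
# All names here are hypothetical, for illustration only.

REPLICATED_VOLUMES = ["/data/db", "/data/files"]  # application data only
EXCLUDED_VOLUMES = ["/", "/boot"]                 # OS and applications stay out

def build_replication_job() -> dict:
    """Describe a job where the replica boots from a separately
    validated 'golden' OS image, so a corrupted or blue-screening
    production OS is never copied to the DR site."""
    return {
        "source_volumes": REPLICATED_VOLUMES,
        "exclude": EXCLUDED_VOLUMES,
        "mode": "continuous",                        # near-zero RPO for data
        "replica_boot_image": "golden-os-baseline",  # tested OS/app baseline
    }

if __name__ == "__main__":
    print(build_replication_job())
```

Because the replica’s OS image is maintained and tested independently, a bad patch or a corrupted boot volume in production cannot propagate to the DR site through replication.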
The Details
Think of your daily activities in the IT production environment and ask yourself: what most frequently causes downtime in the IT infrastructure? I can confidently affirm that you will be shocked (or maybe not) to notice that, by a significant margin, the small day-to-day problems affect your operations much more than major issues such as fire, hurricanes, earthquakes, or even major power outages.
Your real problems are small hardware failures, such as disk crashes, and human errors, such as a misconfigured router or someone who simply tripped on a power cable. Much more than disaster recovery, you need business continuity solutions: solutions that give you the power to fail over to secondary sites and keep your environment up and running, in a matter of seconds, without any major manual intervention.
You should be able to access a web portal and click failover.
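As an illustration, such a portal would typically expose the same action through an API. The endpoint, fields, and token below are hypothetical, invented purely to show the shape of a one-click failover; any real BC/DR portal will have its own interface.

```python
# Hypothetical "click failover" driven through a web API.
# URL, token, and JSON fields are invented for illustration.
import requests

PORTAL = "https://dr-portal.example.com/api/v1"  # hypothetical portal URL
TOKEN = "REDACTED"                               # your API credential

resp = requests.post(
    f"{PORTAL}/failover",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"service": "erp-production", "target": "replica-site"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"status": "failover-started"}
```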
Currently, Software as a Service (SaaS) is the big trend in IT.
What is the limiting factor in all of these architectures? Network bandwidth. Without enough bandwidth, data simply cannot travel fast enough.
This should be a big red flag in any business continuity / disaster recovery solution you are evaluating. If the company providing your solution cannot increase bandwidth at a moment’s notice and relies on third-party telecom providers for access to your DR systems, you may run into major performance issues when running applications from your secondary datacenter. Either that, or you are paying too much for telecom resources you barely use.
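To get a feel for the numbers, here is a rough sizing sketch; the daily change rate and burst factor are assumptions for illustration, not recommendations.

```python
# Rough sizing of a replication link. All figures are assumptions.

daily_change_gb = 40       # assumed data changed per day
replication_window_h = 24  # continuous replication spreads changes out
burst_factor = 3           # changes cluster during business hours

avg_mbps = daily_change_gb * 8 * 1000 / (replication_window_h * 3600)
print(f"Average replication rate: ~{avg_mbps:.1f} Mbps")          # ~3.7 Mbps
print(f"Plan for bursts of:       ~{avg_mbps * burst_factor:.1f} Mbps")
```

Note that this only covers steady-state replication; the bandwidth needed to actually run users against the secondary site after a failover is a separate, usually much larger, requirement.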
You should be able to access the replica servers, apply patches, test compatibility with applications, and run stress, volume, and load tests, all in the replica environment, with no impact on the production servers.
At the end of all the tests, with everything performing as it should, you can then do the same in the production environment with minimal risk of failure.
Can you click a mouse and fail over your production environment to your replica environment? No? Then your maintenance window is a lot longer than it should be.
When you’re performing maintenance on the production servers, you need to take them offline and have time to implement all the necessary changes, right? However, you also need a maintenance window that includes time for a roll-back strategy, meaning enough time to bring your servers back to their original state in case anything goes wrong during the maintenance, e.g., OS/application incompatibility or instability.
If you can fail over to a perfectly available replica environment when anything goes wrong with your maintenance, then your maintenance window does not need to include time for a roll-back strategy. Your strategy now is: click, fail over to the replica environment, and take all the time you need to restore the production environment.
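The arithmetic is simple but worth spelling out; the durations below are illustrative assumptions.

```python
# Maintenance windows with and without a standby replica.
# Durations are illustrative assumptions.

change_h = 2    # time to apply patches and updates
rollback_h = 3  # time reserved to restore servers if something goes wrong

window_without_replica = change_h + rollback_h  # must reserve roll-back time
window_with_replica = change_h                  # roll-back = click and fail over

print(f"Window without replica: {window_without_replica} h")
print(f"Window with replica:    {window_with_replica} h")
```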
Your disaster recovery and business continuity tool should enable you to access a web portal and manage the failover/failback processes online, via the web, from anywhere in the world with Internet access. You could do this from your smartphone!
What if you don’t have Internet access where you are or you don’t have a smartphone? Not a problem. Your disaster recovery and business continuity solution provider should be able to support you 24×7, performing the migration for you, right?
Oh, they can’t, because all they sold you was hardware or software? Too bad… business continuity and disaster recovery solutions should be delivered as a service, so your suppliers can actually provide you a service when you need it. You should be able to call a customer service center, go through the identification process, and request that the center perform the switch from the production to the replica environment.
Summary
The bottom line is that with the proper BC/DR service you can significantly reduce the risk of failure and markedly enhance operational efficiency.
- You will try before you buy. Does it meet your expectations?
- No upfront CAPEX. Lower risk of losing your capital to a solution that doesn’t work properly.
- You will fail over and fail back service by service. Lower risk of moving all services to the replica environment and not being able to activate your SQL Server, Oracle, Exchange, file server, and so on.
- You will be able to perform maintenance tasks on your replica environment before the production environment. Compatibility, stability, stress, volume, and all other tests can be run in the replica environment before you do the same in production.
- You do not need a vendor. You need a technology partner that will perform the switch for you.
- You will grow at your own pace. No huge amount of CAPEX upfront; you can start by protecting one critical service.
- Don’t spend nights and evenings stuck inside a datacenter. Because you’ve tested changes in the replica environment, you will migrate users over to the updated replica environment and perform changes in the production environment during regular business hours.