
Business as usual

By CBR

IT professionals finally see back-up and recovery as critical to the business process, but what lessons have been learnt when planning new disaster recovery procedures, and what are the rules for best practice? John O'Brien investigates.

We have all been there: you are just about to file an important document when suddenly your PC crashes and the data is seemingly lost forever. The mental strain can be much, much worse than the technical problem. Responding to the need for a little TLC, disaster recovery firm DriveSavers recently employed a former suicide hotline worker, Kelly Chessin, as a crisis counsellor to deal with distressed customers driven to the brink by loss of data. DriveSavers is thought to be the first company to use the "shoulder to cry on" approach to helping resolve DR problems, and the move underlines the importance of people at a time of crisis, a fact that is sometimes overlooked when putting together back-up and recovery plans.

The terrorist attacks of September 11, 2001, exposed serious flaws in the way businesses would react to major systems failures. Following the invocation of disaster recovery procedures, many companies were at a loss as to how to manage IT and personnel issues side by side, with displaced staff and resources thrown into chaos. The failure to get businesses back online as quickly as they had expected was a result of mismanaged logistics, human resources and supply chains, and ultimately of a lack of foresight about what would be critical to recovery following a crisis. Kevin Nixon, senior security director at Cable & Wireless, says that many large organisations had failed to consider what external factors would halt the recovery process: "Although many customers' energy generators managed to kick in, their disaster recovery failed because diesel tankers to supply the fuel couldn't get into Manhattan."

John Kersley, VP for business continuity at SchlumbergerSema, says IT managers have reacted to increased exposure to system failures, which has driven three key trends in the market: "There has been a marked increase in the number of invocations over recent years. Five years ago it was considered good if you could recover in 48 hours, but now there is a need to recover in minutes, and no need to do this on-site. This has resulted in a change in buying patterns. Firstly the availability of bandwidth has meant that customers can have data in mirrored environments, second there is a shift towards dedicated resources away from shared, and thirdly, awareness in purchasing DR has increased since it is now a board level discussion."

MGIC INVESTMENT

Insurance giant MGIC cites its business continuity contract with SunGard as an example of high availability services.
Jim Stirling, VP information services for MGIC, says the company transferred across from IBM Global Services last September for the proximity to its primary site and faster recovery times: "We use tapes, and restore to Unix operating systems and mainframes, but the IBM recovery site was 1,600 miles away and it would have taken days if not a week to get online and restore from there. We needed to bring down the recovery time to a 24-hour window and take out the distance to the recovery site," he says.
The SunGard recovery centre in Chicago is 100 miles from MGIC's primary location: "This is a one and a half hour journey for the key people in the event of a disaster, and this means that the distance concerns have been taken out for people getting there," Stirling says.
Through the project, SunGard is responsible for non-mission-critical services such as computer and co-location hosting space and tier two servers, while internal MGIC IT staff are responsible for managing critical applications around EMC hardware, including data from IBM DB2, Oracle and SQL Server databases. Stirling says: "At the moment we are mirroring these applications on a weekly basis, and by May we want to do this daily using high-speed bandwidth."

Risks versus costs

Ironically, although IT professionals now recognise the importance of disaster recovery, implementing best practice seems a long way off. A recent survey by the Storage Networking Industry Association Europe (SNIA-E), which interviewed 100 storage professionals across Europe between September 2002 and January 2003, showed that although well over half of respondents saw disaster recovery as of primary importance to their business, fewer than a quarter had any procedure in place, although they felt that their staff would know what to do in an emergency. Most concerning is that fewer than half backed up and restored data every six months, and 12% did not even have a back-up and restore procedure, which would leave the business paralysed in the event of a system outage.

Inadequate back-up and recovery plans can have potentially catastrophic ramifications for the bottom line. Datamonitor estimates that a financial brokerage room could lose $8.11 million per hour, a credit card sales room $3.29 million per hour, a catalogue sales retailer $1.13 million per hour, an airline reservation firm $1.1 million per hour and a telesales service $86,000 per hour.
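To put those hourly figures in the context of the recovery windows discussed in this article, the short Python sketch below multiplies each estimate by an outage length. It assumes losses accrue linearly over the outage, which is a simplification and not something Datamonitor states; the function name `outage_cost` and the sector labels are illustrative only.

```python
# Hourly loss estimates quoted in the article (Datamonitor), in US dollars.
HOURLY_LOSS_USD = {
    "financial brokerage": 8_110_000,
    "credit card sales": 3_290_000,
    "catalogue sales": 1_130_000,
    "airline reservations": 1_100_000,
    "telesales": 86_000,
}


def outage_cost(sector: str, hours: float) -> float:
    """Estimated loss for an outage of the given length.

    Assumption: losses accrue at a constant hourly rate for the whole outage.
    """
    return HOURLY_LOSS_USD[sector] * hours


if __name__ == "__main__":
    # Compare a four-hour, 24-hour and 48-hour recovery for each sector.
    for sector in HOURLY_LOSS_USD:
        costs = ", ".join(
            f"{hours}h: ${outage_cost(sector, hours):,.0f}" for hours in (4, 24, 48)
        )
        print(f"{sector:<22} {costs}")
```

On these assumptions, the gap between a 48-hour and a four-hour recovery is 44 hours of losses, which goes some way to explaining why buyers are pressing for shorter recovery windows.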

Lack of available budget is proving a major bugbear for IT managers weighing up the perceived risks versus the overhead costs. Gartner, for example, surveyed 205 US firms last November, and found that almost a quarter of firms had not implemented a disaster recovery plan due to lack of funds, and that 40% were unable to afford third-party risk assessments for their businesses and were effectively having to guess at the appropriateness of their strategy.

John Sharp, CEO of the Business Continuity Institute (BCI), says the picture is little better in the UK: "Top listed companies in the UK tend to already have business continuity in place. Just under half of other firms have BC awareness, and of these companies 55% test their systems every 12 months, which means that about 25% of companies in the UK have BC plans that we would consider of value."

David Quinn, storage manager at Dell, says that avoiding costly downtime should be the primary factor when considering a new recovery system architecture, and that the cost of downtime can be driven down by investing wisely: "The ultimate driver for disaster recovery is cost saving, how this relates to support services, return on investment and the total cost of ownership of infrastructure," he says.

Phil Carter, director of strategy planning at SunGard Availability Services, advises firms to consider the impact of failure, with damage to a company's image posing the single biggest threat following a crisis: "This is a major problem", he says. "When eBay went offline for 24 hours a few years ago, the company lost 50% of its share price overnight, and it caused a serious loss of business. If a customer can't get access, they will go elsewhere. Direct and indirect losses can be insured against, but you can't get insurance for loss of business."

Sharp at the BCI warns IT professionals against complacency when backing up data: "Companies don't want to get caught in the backlog trap due to the high volumes of data build up. If you leave back-up for two days, it is possible that the data may never be recovered since there aren't the resources to cover it", he says.

Kevin Nixon, senior director for security business strategy at Cable & Wireless, agrees, but is baffled at the continuing lack of awareness by IT professionals who fail to evaluate which data is critical: "There is a lack of education. We have seen companies down for up to four days as a result of not thinking their email is critical. Lots of companies keep their financial details initially on email while they are checking over statements, and if the system goes down this represents a big risk to the business," he says.

Rapid response

However, Don DeMarco, VP and general manager of IBM Business Continuity and Recovery services, insists that users are no longer willing to wait for their supplier to kick-start the recovery process: "Most companies now aren't satisfied with being up and running in 48 hours. They are demanding dedicated rather than shared network connectivity, mirrored data recovery services, and pushing for sub 24-hour recovery to within a four-hour timeframe."

One problem, however, is that IT managers get so engulfed in ironing out IT problems that the human aspect, involving the people who actually make the recovery procedures tick, gets overlooked. Issues such as what duties everyone performs in the event of an invocation, and how practical it is logistically for them to reach the recovery site, seem obvious and yet are given too little consideration. SunGard's Carter says: "Getting people in place is a problem. On September 11 companies lost whole departments of staff, and still aren't addressing this issue."

Pressing issues

Another factor post 9/11 is that IT professionals are demanding exclusion zones around their businesses that ensure a limited number of customers would be relocated to the same site should a multiple invocation occur. Jim Stirling, VP for Information Services at insurance giant MGIC, a customer of SunGard, says his main concern is other customers being located nearby: "We worry that we are the last to make the phone call, and hear the provider say 'we can't house you here'. It is effectively an act of faith. In the event of a multiple declaration at the same site, you have to be cognizant of the fact that you might have to move between sites," he says.

Recent corporate fraud in the US looks set to change the way businesses are required to store and retrieve data. Last year, the US Government introduced the Sarbanes-Oxley Act in response to the Enron corporate accounting scandal, which led to the collapse of the energy giant and the demise of its auditor Andersen. Nixon at C&W says the act will force public companies to take responsibility for their own risk mitigation, and the effect it has on shareholders: "The CEO and CIO need to know they have sufficient procedures in place to protect the business against disaster, for example are their outsourced HR processes linked into a disaster recovery plan, and do they have the skills in-house to continue payroll in the event of a system failure?"

SUCCESSFUL BACK-UP AND RECOVERY

Al Decker, executive director of EDS Security and Privacy Services, outlines 10 golden rules for successful back-up and recovery:

1. Review the business continuity plan for accuracy and currency, paying attention to new technologies and equipment not included in the original plan.
2. Check the procedures for backing up data across all devices. These should be done frequently and backed up offsite. Then test and restore the back-ups randomly to ensure they can be accessed when needed.
3. Ensure physical security plans are up to date and tested, including instructions to local emergency services.
4. Review security procedures around intrusion detection systems, passwords, anti-virus and networks to ensure the right people have access to the right systems.
5. Review HR to ensure the company can communicate with all staff in an emergency. Consider distributing workers, vendors, facilities and processes so that a crisis can be contained to one location.
6. Ask critical vendors about their recovery plans, since a crisis at their end could have a knock-on effect on your business.
7. Ensure the business has executive protection plans.
8. Review insurance coverage, including property, Internet and key personnel.
9. Perform a thorough risk assessment for all external and internal threats to the business.
10. Review budgets, since these are often cut back during cost-cutting drives.
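
Rule 2 above, testing restores at random, lends itself to automation. The Python sketch below is one possible way to do it and is not taken from Decker or EDS: it picks a random sample of files, restores each to a scratch directory and compares SHA-256 checksums with the originals. The `restore` callback is a hypothetical hook standing in for whatever back-up tool is in use, and the comparison assumes the live files have not changed since the back-up was taken.

```python
import hashlib
import random
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check_restores(
    source_dir: Path,
    restore: Callable[[Path, Path], None],
    sample_size: int = 5,
    seed: Optional[int] = None,
) -> List[Path]:
    """Restore a random sample of files and return those that do not match.

    `restore(relative_path, destination)` is a hypothetical hook: it should
    fetch one file from the back-up and write it to `destination`.
    """
    files = sorted(p for p in source_dir.rglob("*") if p.is_file())
    sample = random.Random(seed).sample(files, min(sample_size, len(files)))
    failures: List[Path] = []
    with tempfile.TemporaryDirectory() as scratch:
        for original in sample:
            relative = original.relative_to(source_dir)
            restored = Path(scratch) / relative
            restored.parent.mkdir(parents=True, exist_ok=True)
            try:
                restore(relative, restored)
                ok = restored.is_file() and sha256_of(restored) == sha256_of(original)
            except OSError:
                # A missing or unreadable back-up copy counts as a failed restore.
                ok = False
            if not ok:
                failures.append(relative)
    return failures
```

An empty list means every sampled file came back intact; anything else is a list of files to investigate before a real invocation makes the question urgent.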

Best practice

Naturally, best practice for back-up and recovery very much depends on the individual business. At the low end, companies continue to back up data to tape and then manually drive it offsite for both retrieval and archiving. At the other end of the spectrum, high availability services such as offsite mirroring to disk have become widespread among firms that require fast restore. It is now also possible to replicate data to tape across high-speed networks for deep archive, automating the onerous task of couriering data offsite. As part of this process, service providers can take snapshots, that is, point-in-time logical copies of databases or file systems, which usually reside on the same storage system as the original copy. Snapshots are ideal for taking swift online back-ups that can either be restored from disk or backed up slowly to tape, with little impact on the application.

Dave Dignam, services development director for Synstar, says this has a significant impact at the top end of the market: "Snapshots are perhaps the most appropriate way for companies to take fast and accurate copies of data that can then be backed up. The service is particularly useful for transactions, which can be backed up and then stored offsite. However, the majority of firms continue to back up their data to tape at the end of the day," he says.
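
For readers unfamiliar with how a snapshot can be a point-in-time copy without duplicating all the data, the toy Python sketch below illustrates copy-on-write, one common technique. It is an assumption for illustration only: the article does not say which mechanism any supplier uses, and the `Volume` class and its methods are hypothetical.

```python
from typing import Dict, List


class Volume:
    """Toy block store with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks: List[bytes]) -> None:
        self._blocks = list(blocks)
        # One dict per snapshot: block index -> contents at snapshot time.
        self._snapshots: List[Dict[int, bytes]] = []

    def snapshot(self) -> int:
        """Take a point-in-time snapshot; no data is copied at this moment."""
        self._snapshots.append({})
        return len(self._snapshots) - 1

    def write(self, index: int, data: bytes) -> None:
        """Overwrite a block, first preserving the old contents for every
        snapshot that has not already saved its own copy of that block."""
        for saved in self._snapshots:
            saved.setdefault(index, self._blocks[index])
        self._blocks[index] = data

    def read(self, index: int) -> bytes:
        """Read the live contents of a block."""
        return self._blocks[index]

    def read_snapshot(self, snapshot_id: int, index: int) -> bytes:
        """Read a block as it was when the snapshot was taken."""
        saved = self._snapshots[snapshot_id]
        return saved.get(index, self._blocks[index])


if __name__ == "__main__":
    volume = Volume([b"ledger-v1", b"orders-v1"])
    nightly = volume.snapshot()
    volume.write(1, b"orders-v2")  # the application keeps writing
    print(volume.read(1), volume.read_snapshot(nightly, 1))
```

Because only overwritten blocks are preserved, taking the snapshot is near-instant, and a slower copy to tape can then read from the snapshot while the application carries on writing to the live volume.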

CBR OPINION

Back-up and recovery may have achieved critical importance among IT professionals, but frozen budgets have made convincing the board to sign off on new recovery services an uphill struggle.

End users fear major disasters occurring in their local vicinity following the events of 9/11, and are now demanding exclusion zones from their service provider that can guarantee access to their designated recovery sites in the event of a multiple disaster.

Nevertheless, companies should beware of focusing purely on the IT elements of a recovery process, and remember that it is ultimately the people who will determine whether it sinks or swims should the unthinkable happen.
