Achieving Rapid Data Recovery for IBM AIX® Environments

Introduction

Planning for recovery is a requirement in businesses of all sizes. In implementing an operational plan that ensures that both data and applications can be recovered, IT personnel are generally confronted with several challenges:

  • How can I ensure my applications and data are recoverable without impacting business operations?
  • Do I have data protection approaches available to me that meet my recovery point and recovery time objectives?
  • Can I afford to implement a comprehensive plan that covers both my local and remote (disaster) recovery requirements?
  • Are there cost-effective alternatives that meet my requirements?

Business requirements are not the only mandates that may be driving the evolution of your recovery plan. Various industry-specific regulatory mandates, including Sarbanes-Oxley, HIPAA and SEC, specify requirements for data retention and recoverability. In meeting these requirements, businesses have to deal with a variety of risks to data: inadvertently deleted files or records (operator error), viruses or hackers that can cause data corruption or deletion, and natural disasters that may put much more than just your data at risk. Distributed or branch offices may also have ease-of-use requirements that may not apply to larger, more centralized businesses.

Do you have a plan that meets your recovery requirements to your satisfaction across these areas?

Issues with Legacy Recovery Technologies

If you’re like most businesses, you’re using some form of data protection today, probably tape-based backup. Periodically, someone shuts applications down to perform a backup to tape. Depending on the volume of data that is being copied, this may take several hours and requires manual intervention to set up the backup job, run it, confirm that it occurred, and then return the application to operation. The backup copy may be kept locally in case data needs to be recovered in the near term, and eventually (after several weeks) it may be moved to an offsite location for archival storage purposes. The reason to make and keep copies of your data is so that, if some event or catastrophe deletes or destroys data, you have a clean copy safely tucked away to use for recovery purposes.

Tape is used for backup and archive because it is very inexpensive, but it is an old technology that has been available almost since the dawn of computing. There are several issues with tape-based backup:

  • Tape-based backup is a time-intensive process that is potentially disruptive to your applications; this issue is commonly referred to as the backup window problem.
  • Because of their impact on applications and resources, tape-based backups are usually taken no more than once a day, and often only once every several days, meaning that there are very few tape-based recovery points available over the course of a week. This is problematic because your data changes very frequently (on the order of seconds or minutes), and the fewer points in time you have a copy of, the more data is lost on average in a given recovery; this issue is commonly referred to as the recovery point objective (RPO) problem.
  • Once it is clear that a recovery needs to occur, it takes time to perform the recovery (e.g. finding the right tape, transporting it if it’s offsite, restoring it to disk, restarting the application on top of the data, etc.); this issue is commonly referred to as the recovery time objective (RTO) problem.
  • As a storage medium for backup, tape is not entirely reliable; in fact, leading analyst groups such as the Gartner Group, the Enterprise Strategy Group and the Taneja Group state that as many as 1 in 4 backup tapes suffer from some sort of problem that precludes performing a recovery.
  • Transporting tapes to offsite facilities for archival purposes also has inherent risk. Widely publicized tape losses during physical transport (by truck) have hit large companies like Bank of America, Citigroup Inc., ChoicePoint Inc. and LexisNexis and resulted in the theft of hundreds of thousands of company records. Replication of data across secure IP-based networks is a much faster, easier and safer way to transport data to offsite locations for archival storage purposes. If you are driven by either business or regulatory requirements to deploy a disaster recovery solution, a pure tape-based data protection strategy can subject you to undue risk.
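
The recovery point problem described in the list above can be put in rough numbers. The following is a minimal sketch, assuming a failure is equally likely at any moment between two consecutive backups (an assumption made here for illustration, not a claim from the analysts cited). The function names are hypothetical.

```python
def worst_case_loss_hours(backup_interval_hours: float) -> float:
    """Worst case: the failure hits just before the next backup would run."""
    return backup_interval_hours


def average_loss_hours(backup_interval_hours: float) -> float:
    """Average case under the uniform-failure assumption: half the interval."""
    return backup_interval_hours / 2


if __name__ == "__main__":
    # Illustrative schedules only: one backup per day, one every three days.
    for label, interval in [("daily", 24), ("every 3 days", 72)]:
        print(label,
              "worst case:", worst_case_loss_hours(interval), "h,",
              "average:", average_loss_hours(interval), "h")
```

Under that assumption, halving the interval between recovery points halves the expected loss, which is why a continuous stream of recovery points is attractive for data that changes every few seconds.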

The Proven Solution

“By 2011, some form of CDP
will be deployed in 80 percent
of the Fortune 2000.”
— Gartner

Double-Take Availability for AIX® is designed to resolve the following common problems:

  • Backup window: Data is continuously and transparently copied from designated servers throughout the day as changes occur, so you never again have to concern yourself with backup windows.
  • Recovery Point Objective: Using a technology called “continuous data protection” (CDP), Double-Take Availability will allow you to retroactively pick any previous point and generate a readable, writable snapshot of what the data looked like at the selected point; this effectively presents you with all possible recovery points to minimize data loss on recovery in a way that tape, with its limited number of recovery points, never can.
  • Recovery Time Objective: Double-Take Availability restores directly from disk, providing you with fast, reliable restores in a way that tape cannot. And your ability to pick the optimal recovery point to minimize data loss means that you will spend less time restoring the entire application environment; this effectively shortens the downtime associated with data recovery and hence the impact and cost of an outage.
  • Redundant Application Server: The backup server provides a manual failover target that will allow a critical application to be rapidly restarted with access to current data (to allow processing to continue if the primary server for some reason cannot be restarted).
  • Remote replication: Double-Take Availability includes the ability to replicate data across IP networks so you can migrate your aged data to a remote facility without exposing it to the risks associated with the physical transport of tape-based media.

Double-Take Availability is already in use with referenceable customers across different vertical markets, including financial services and healthcare. It runs on IBM Power Systems servers with AIX 5.3 and above and is applicable to any application running on AIX. Applications for which Double-Take Availability is a good fit meet the following profile:

  • A 7 x 24 application environment with a small to non-existent backup window.
  • Critical applications (from a business point of view) that have high rates of data change (where fewer recovery points translate to significant amounts of lost data on recovery).
  • Applications with stringent recovery time requirements that are not currently being met with existing data protection technologies.

How Does the Double-Take Availability Solution Work?

Figure 1. Double-Take RecoverNow mirrors data to a local backup server, which can then retroactively present snapshots for recovery, analysis, testing or development purposes with no impact on the production server(s).

A dedicated backup System p server is established, and can protect one or more applications running on other IBM AIX production servers that are connected to it locally via an IP-based network. Double-Take Availability mirrors production server writes destined for the designated “Protected Storage” (the storage where the application you want to protect resides) to the backup server over IP. These writes are stored in the “Recovery Storage” that is directly attached to the backup server. Double-Take Availability runs continuously in the background and does not noticeably impact the performance of the “Protected Application.”

At any time, an administrator can go into the management interface of the backup server, running on a separate Windows-based PC, and generate a historical view (a snapshot of the data at any arbitrarily selected previous point in time). These historical views can then be presented to any other server on the network for the purposes of recovery or to perform any type of off-host processing. These historical views are fully read/write capable, which means that they can support off-host processing tasks like data analysis, testing, development or backup, all without imposing any impact whatsoever on the Protected Application. A historical view can be presented back to the production server as well, but note that there is another option with respect to the production server, called “Production Restore”, which uses differencing technology to modify the Protected Storage to look like the historical view selected on the backup server.
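
The idea of a historical view can be illustrated with a toy write journal. This is a minimal sketch of the general continuous-data-protection concept, not a description of how Double-Take Availability is implemented. The class `WriteJournal`, the function `blocks_to_rewrite` and the block-number/bytes data model are all hypothetical.

```python
from __future__ import annotations

from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class WriteJournal:
    """Time-ordered log of block writes on top of an initial baseline copy."""
    baseline: dict[int, bytes] = field(default_factory=dict)
    times: list[float] = field(default_factory=list)
    writes: list[tuple[int, bytes]] = field(default_factory=list)

    def record(self, timestamp: float, block: int, data: bytes) -> None:
        # The log must stay sorted by time so views can be cut at any point.
        if self.times and timestamp < self.times[-1]:
            raise ValueError("writes must be recorded in time order")
        self.times.append(timestamp)
        self.writes.append((block, data))

    def view_at(self, timestamp: float) -> dict[int, bytes]:
        """Return an independent, writable copy of the data as of `timestamp`."""
        view = dict(self.baseline)
        upto = bisect_right(self.times, timestamp)  # writes at or before timestamp
        for block, data in self.writes[:upto]:
            view[block] = data
        return view


def blocks_to_rewrite(current: dict[int, bytes],
                      target: dict[int, bytes]) -> dict[int, bytes | None]:
    """Differencing: list only the blocks that differ from the target view.

    A value of None means the block exists now but not in the target view.
    """
    changes: dict[int, bytes | None] = {}
    for block in current.keys() | target.keys():
        if current.get(block) != target.get(block):
            changes[block] = target.get(block)
    return changes


if __name__ == "__main__":
    journal = WriteJournal(baseline={0: b"A", 1: b"B"})
    journal.record(10.0, 0, b"A2")
    journal.record(20.0, 1, b"corrupt")   # the unwanted change
    journal.record(30.0, 2, b"C")

    good = journal.view_at(15.0)          # state just before the corruption
    now = journal.view_at(30.0)
    print(good)
    print(blocks_to_rewrite(now, good))
```

Because each view is a separate copy, writing to it during testing or analysis leaves the journal, and therefore every other recovery point, untouched. The differencing function shows why a restore in the style of “Production Restore” only needs to touch the blocks that changed rather than copy the full data set.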

Double-Take Availability also supports asynchronous replication. This allows you to replicate the continuous data stream or selected historical views to a remote facility, as long as it is connected to the primary facility by an IP-based network. Replication of the continuous data stream provides full “any point in time” recovery capabilities at the remote site. This configuration is optimal for disaster recovery capability, since historical views created at the remote site can only be presented to servers at the remote site that are on the same LAN as the remote backup server. Replication represents a much faster, much more secure way to get your data to an offsite storage facility. To use this feature, you will need to purchase another System p server running AIX and the backup server software license at the remote site.
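
The term “asynchronous” means the local write completes without waiting for the remote site to confirm it. The following minimal sketch shows that general pattern with an in-memory queue. The class `AsyncReplicator` and its methods are hypothetical and do not represent the product's replication engine or wire protocol.

```python
from collections import deque


class AsyncReplicator:
    """Local writes return immediately; a queue carries them to the remote copy."""

    def __init__(self):
        self.local = {}
        self.remote = {}
        self.pending = deque()

    def write(self, block, data):
        # The local copy is updated at once; the remote update is only queued.
        self.local[block] = data
        self.pending.append((block, data))

    def drain(self, max_items=None):
        """Ship queued writes to the remote copy in order; return how many were sent."""
        shipped = 0
        while self.pending and (max_items is None or shipped < max_items):
            block, data = self.pending.popleft()
            self.remote[block] = data
            shipped += 1
        return shipped

    @property
    def lag(self):
        """Number of writes the remote site has not yet received."""
        return len(self.pending)


if __name__ == "__main__":
    rep = AsyncReplicator()
    rep.write(0, b"order-1")
    rep.write(1, b"order-2")
    print("lag before drain:", rep.lag)
    rep.drain()
    print("in sync:", rep.remote == rep.local)
```

The trade-off in any asynchronous design is that the remote copy can trail the local one by whatever is still in the queue, so the remote recovery point depends on link bandwidth and write rate.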

The Correlation Between Data Age and Possibility of Re-Use/Restore

It has been proven over time that most data recovery requests are for relatively recent data, and that there is a direct correlation between the age of data and the possibility that it would be required for restore purposes. Most restore requests are driven by issues such as an inadvertently deleted file or data corruption that is introduced by a virus or a hacker. Typically these problems are discovered within several hours or at most a few days from when they first occur, resulting in restore requests for more recent data.

In general, the only time you may need to restore data that has already been archived would be in the event of a disaster that physically destroys computer equipment and facilities, such as an earthquake or a tornado. While it pays to be prepared against these occurrences, they are very rare. The slope of the red line in Figure 3 varies by company type, but it reflects the general relationship in all industries between the age of data and the chance that it would need to be restored.

Another key factor to note is that as data ages, it becomes less important to support the ability to restore to any point in time. Note the inflection point in the red line in Figure 3 that occurs around Day 3. Restore requests for data drop off significantly after that point. This might suggest that you would want to manage roughly 3 days’ worth of your most recent data with Double-Take Availability, migrating it to less flexible but less expensive media locally thereafter for several weeks, and then eventually storing it in an off-site facility after about 30 days. This 3 day window is referred to as the optimized recovery window.
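
The age-based placement suggested above can be written as a simple rule. This is a sketch that uses the approximate thresholds from the text (about 3 days and about 30 days) as defaults. The function name and tier labels are hypothetical, and the right thresholds vary by company.

```python
def storage_tier(age_days: float,
                 cdp_window_days: float = 3,
                 offsite_after_days: float = 30) -> str:
    """Pick where data of a given age should live under the suggested policy."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= cdp_window_days:
        return "cdp"            # any-point-in-time recovery from disk
    if age_days <= offsite_after_days:
        return "local-archive"  # less flexible, less expensive local media
    return "offsite"            # archived at an off-site facility


if __name__ == "__main__":
    for age in (1, 10, 45):
        print(age, "days ->", storage_tier(age))
```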

Two Sample Use Cases

Using Double-Take Availability to Provide Zero Impact Data Protection and Rapid Local Recovery

In this scenario, we assume the customer wants to solve the rapid recovery problem at the local level. They have chosen, however, not to replicate and will continue to migrate data to tape for physical shipment to an offsite location.

The customer is running an Oracle database as an order entry system on an IBM Power Systems server with AIX and 600GB of internal storage. This server will become the production server. 10% of the data changes on a monthly basis, and the overall rate of data growth is forecast at 30% per year. Based on past experience, the customer knows that restore requests tend to drop off significantly after seven days. The customer currently does daily incremental backups and weekly full backups using a 100 Mbit Ethernet LAN. Incremental backups take roughly 90-120 minutes per day, while the full backup takes between ten and fifteen hours using a small tape cartridge autoloader.
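
The full-backup duration quoted above is consistent with simple link arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and treats the LAN as the only bottleneck. The source does not say whether the network or the autoloader is the limiting factor, so this is a plausibility check only. The function name and the `efficiency` parameter are hypothetical.

```python
def transfer_hours(data_gb: float, link_mbit_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move `data_gb` over a link at the given usable fraction."""
    bits = data_gb * 1e9 * 8
    seconds = bits / (link_mbit_per_s * 1e6 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # 600 GB over 100 Mbit/s at full line rate.
    print(round(transfer_hours(600, 100), 1), "hours")
```

At full line rate the transfer takes a little over 13 hours, which falls inside the ten to fifteen hour range reported for the weekly full backup.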

To install the Double-Take Availability solution, the customer purchases a second IBM Power Systems server on AIX to act as the backup server. Based on the rate of data change and forecast database growth, 1.5 TB of Recovery Storage is housed internally to the backup server. This backup server is attached to the same LAN as the production server. Double-Take Availability is installed on the production server, while the relevant storage which underlies the Oracle application is designated as the Protected Storage. An optimized recovery window of seven days is configured on the backup server. An initial synchronization between the production server and the backup server is performed while the production server continues to run (it is run as a background process) so that database access is not impacted. Once the initial synchronization is complete, continuous data protection is enabled.
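
The source does not state how the 1.5 TB figure was derived. One plausible reading, sketched below, is a full copy of the projected data set plus a journal covering the recovery window. The model treats the 10% monthly change as the volume of data the journal must hold, which is an assumption: a real journal records every write, so actual requirements could be higher. All function names are hypothetical.

```python
def projected_data_gb(current_gb: float, annual_growth: float, years: int) -> float:
    """Data set size after compound growth."""
    return current_gb * (1 + annual_growth) ** years


def journal_gb(data_gb: float, monthly_change: float, window_days: float,
               days_per_month: float = 30) -> float:
    """Changed data accumulated over the recovery window."""
    return data_gb * monthly_change * window_days / days_per_month


def recovery_storage_gb(current_gb: float, annual_growth: float, years: int,
                        monthly_change: float, window_days: float) -> float:
    """Baseline copy of the projected data plus the window's journal."""
    data = projected_data_gb(current_gb, annual_growth, years)
    return data + journal_gb(data, monthly_change, window_days)


if __name__ == "__main__":
    for years in range(0, 5):
        need = recovery_storage_gb(600, 0.30, years, 0.10, 7)
        print(years, "years:", round(need), "GB")
```

Under these assumptions the requirement stays below 1.5 TB through year three (roughly 1,349 GB) and exceeds it in year four, so 1.5 TB would correspond to about three years of headroom.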

To take advantage of the capabilities of their newly implemented Double-Take Availability solution, the customer makes some changes to their data protection processes. With seven days of data included in the optimized recovery window, the customer no longer needs to perform daily incrementals. Any restore requirement during that seven day period is performed instantaneously from disk and without the need to “build up” a restore image from multiple incremental backups, thus cutting recovery time to minutes.
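
To see what “building up” a restore image involves, consider the conventional weekly-full, daily-incremental scheme. The sketch below models each backup as a dictionary of changed items. The data model and function names are hypothetical and stand in for whatever the customer's tape backup software actually does.

```python
def restore_from_incrementals(full: dict, incrementals: list, target_day: int) -> dict:
    """Rebuild the state on `target_day` by layering incrementals over the full backup.

    Day 0 is the full backup; incrementals[0] holds the changes made on day 1.
    """
    if not 0 <= target_day <= len(incrementals):
        raise ValueError("target_day is outside the backup set")
    image = dict(full)
    for changes in incrementals[:target_day]:
        image.update(changes)   # each incremental overwrites older values
    return image


def backups_to_read(target_day: int) -> int:
    """One full backup plus one incremental per elapsed day."""
    return 1 + target_day


if __name__ == "__main__":
    full = {"orders": "v0", "customers": "v0"}
    incrementals = [{"orders": "v1"}, {"invoices": "v1"}, {"customers": "v2"}]
    print(restore_from_incrementals(full, incrementals, 2))
    print("backups to read:", backups_to_read(2))
```

Every additional day between the full backup and the desired point adds another backup that must be located, mounted and applied in order, which is the overhead that presenting a point-in-time view directly from disk avoids.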

A weekly tape backup is still desirable to prepare for the eventual archiving of data offsite, but the Oracle application no longer needs to be shut down to perform backups. Once a week, a historical view is created by the backup server, which then uses it to perform a tape backup. The customer continues to use its existing tape backup software to perform this backup. Double-Take Availability is compatible with all backup software packages for the purposes of historical view presentation for off-host backup. These tapes are kept onsite for two weeks, and then sent to an offsite facility for archival storage.

Implemented in this way, Double-Take Availability for AIX provides the following benefits:

  • Backups to tape are now completely decoupled from the production application so they can now be scheduled to occur when it is convenient for the administrator, without concern for impact on business processes.
  • Backups are only taken once a week now (instead of daily), taking less administrative time.
  • Restores within the optimized recovery window occur rapidly and reliably from disk, completely resolving tape media integrity issues for near term restores.
  • Data loss on recovery is minimized because the administrator now has access to the optimal recovery point to minimize data loss for every conceivable failure scenario (this is the RPO issue).
  • Recovery time is shortened in several ways:

          • no restore from tape to disk is required (the application can just be started right up on the selected historical view).
          • a recovery point never needs to be “built up” from incrementals so there is less administrative overhead associated with recovery (the selected point is just immediately presented from disk).
          • there is less time spent preparing the application for production use again after the recovery because the best recovery point to resolve the problem can be selected (e.g. if the problem is a file deletion or data corruption problem, the point right before that event occurred can be chosen).

  • Recovery time is considerably shortened in the event of a problem with the production server: the Protected Application is simply started on the backup server, using the latest, current copy of the production data (the latest historical view). It can continue to run there until such time as the production server can be repaired and restarted.

In addition to these benefits, there is another advantage that did not exist with the previous tape-based approach. Patched and upgraded applications can be tested against current production data in a manner completely decoupled from the production environment. A historical view of the current data state is created and presented to a staging server (also on the LAN) where the patched or upgraded application can be tested. Once the administrator is satisfied with the stability of the new environment, it can be deployed in production. Double-Take Availability makes it easy to create these historical views for testing purposes, ensuring more reliable patch and upgrade processes against production environments.

Archiving To Tape with A Multi-Site Double-Take Availability Configuration

Tape-only backups are no longer a feasible data
protection strategy in today’s business environment

In this scenario, we assume the customer wants to solve three problems (backup window, RPO and RTO) but they also want to migrate their archival data to a remote facility with minimal risk. For the purposes of this example, we’ll assume they are running an IBM DB2 UDB database on an IBM Power Systems server with AIX.

Adding to their production server, the customer purchases a local backup server with an appropriate amount of storage, and the Double-Take Availability software licenses. Then, to enable the remote replication capability, the customer purchases another IBM Power Systems server, to be located at the remote site, running the same operating system.

The customer wants to take the weekly “full tape backup” from disk at the remote site for archival storage. Both the local and remote backup servers are connected via an IP network. With this configuration, the only change to their former backup processes is that they now keep no tape at all at the local (production) site, only at the remote site.

Once a week, a historical view that represents the full backup is created on the remote-site backup server. The remote backup server then backs up the data to tape. Recoveries of data that is already archived can be restored from tape to disk on the remote backup server, and then replicated back to the local (production site) backup server. At that point, the view can be manipulated for any recovery or off-host processing purposes in the same manner as any locally created view.

This solution provides the following benefits:

  • All of the benefits of the local configuration example accrue here, including removal of the backup window, minimized data loss and much more rapid, reliable recoveries (due to rapid restores direct from disk and to the availability of the backup server as a manual failover platform).
  • The additional advantages that accrue with the remote configuration include a fast, easy and secure way to migrate data from a local site to a remote site without incurring any of the risk associated with physical transport, and a fast, easy and secure way to get that data back to a local site on those rare occasions when a recovery from older data is required.

Recovery Time Comparisons

When downtime costs you money, a rapid recovery capability presents a quantifiable return on investment opportunity. By offering a much faster and easier way to perform data recovery than that offered by tape, savings accrue not only in the area of downtime but in terms of administrative time and expense. As shown in Figure 4 below, Double-Take Availability can shorten recovery times by hours and even days in some cases.

Summary

Any business that is experiencing rapid growth or consolidation is very likely using a suboptimal data recovery solution built around tape-based backup. This type of legacy solution potentially interrupts business processes, due to the requirement for a “backup window”, subjects the business to potentially significant data loss when recoveries are required, and is time consuming and labor intensive for both data protection operations and recoveries.

Double-Take Availability for AIX is a proven solution to the data recovery problem that is in use at a variety of referenceable accounts today. Double-Take Availability leverages CDP technology to support instantaneous recoveries from disk, resulting in minimal data loss (due to its ability to present all possible recovery points), rapid, reliable recovery (due to its ability to restore immediately from disk), all while not imposing any downtime on production applications (zero impact data protection).

Because Double-Take Availability ensures that data on the backup server is always current, it can be relied upon as a manual failover platform that allows application processing to be rapidly restarted in the event of a catastrophic production server failure. In addition, Double-Take Availability supports asynchronous replication that will allow businesses to establish cost-effective and secure multi-site disaster recovery strategies that support rapid recovery, even from archived data. Double-Take Availability runs on IBM Power Systems servers with AIX and is applicable to any AIX application, but is applied most often for use with business- or mission-critical applications such as enterprise databases or file systems.
