Level: Intermediate
Chris Stakutis (chris.stakutis@us.ibm.com), CTO VitalFile & SANergy, IBM/Tivoli
17 Oct 2005
Continuous Data Protection (CDP) is a new style of data protection ("backup"). Traditional
backup occurs once per day (or far less frequently for mobile and home users) and only
captures files as they existed at the time of backup. Lost are the changes occurring throughout
the work day. There are many different flavors of CDP starting to emerge in the market, and
each has a different value proposition.
Introduction
The computer "backup" world is going through some exciting innovations. A few years
ago we started hearing about "disk based backup" which greatly improved the overall
performance of backup (faster backups, faster restores). Oftentimes customers would
use some type of replication as a "style" of data protection. Continuous Data Protection
(CDP) is somewhat of a blend of those two approaches. Specifically:
- CDP continually captures all changes (akin to replication)
- But it also tags (versions) objects so that they can be rolled back to a particular time
While disk-based backup offers faster backup and restore times, it does nothing to
improve the recovery-point objective (RPO); that is, the backup interval is still typically once
per day. The main point of CDP is to provide nearly infinite recovery points so that any
change can be recovered. This is proving to be of huge interest to all computer users,
from corporate IT administrators down to individual home users. The traditional style of
backup is now starting to look like seat belts, which only work if you use them and only
in limited cases; the new world wants air bags.
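The two properties listed above (capture every change, and tag each captured version with a time so it can be rolled back) can be sketched in a few lines. This is only an illustration of the idea, not how any particular product works; the class name `VersionedStore` and its methods are made up for this example, and times are plain integers.

```python
import bisect


class VersionedStore:
    """Keeps every captured version of an object, tagged with its capture time."""

    def __init__(self):
        self._times = {}     # name -> list of capture times (assumed increasing)
        self._contents = {}  # name -> list of contents, parallel to _times

    def capture(self, name, timestamp, content):
        """Record a change as a new version instead of overwriting the old one."""
        self._times.setdefault(name, []).append(timestamp)
        self._contents.setdefault(name, []).append(content)

    def as_of(self, name, timestamp):
        """Return the content as it existed at `timestamp`, or None if none existed."""
        times = self._times.get(name, [])
        idx = bisect.bisect_right(times, timestamp)
        return self._contents[name][idx - 1] if idx else None


store = VersionedStore()
store.capture("report.doc", 900, "draft at 9:00")
store.capture("report.doc", 1130, "edit at 11:30")
store.capture("report.doc", 1500, "bad overwrite at 15:00")

# Roll back to how the file looked at noon, before the bad overwrite.
recovered = store.as_of("report.doc", 1200)
```

A once-per-day backup would hold at most one of these three versions; here every intermediate save remains a recovery point.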
Taxonomy
There are three main types of CDP:
- Block Based
- File Based
- Application Based
All have merits for various situations and customer needs. Let's start with a look at
block-based solutions. Block-based approaches are very transparent; applications need not
know that they are present. Most block-based solutions are "in fabric" (in the SAN fabric)
and thus also work without regard to the type of server or storage. Quite simply, they see
every block-write go across the storage network and logically keep a time-ordered cache
of those writes. Some solutions are quite sophisticated in their management of that cache
such that they can instantly present a "view" of disk/LUN at any time represented in their
cache (versus having to re-assemble or roll-through transactions in a costly manner).
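A time-ordered cache of block writes can be pictured roughly as follows. This is a deliberately naive sketch: it rebuilds a past view by replaying the log, which is exactly the costly approach that the more sophisticated solutions avoid. The names `BlockJournal`, `record_write`, and `view_at` are assumptions made for illustration.

```python
class BlockJournal:
    """Time-ordered log of block writes for a single LUN (illustrative only)."""

    def __init__(self):
        self._log = []  # (timestamp, block_number, data), appended in time order

    def record_write(self, timestamp, block_number, data):
        """Remember a write as it passes by; nothing is ever overwritten."""
        self._log.append((timestamp, block_number, data))

    def view_at(self, timestamp):
        """Rebuild the block map as of `timestamp` by replaying the log."""
        view = {}
        for ts, block, data in self._log:
            if ts > timestamp:
                break  # the log is time-ordered, so later writes can be skipped
            view[block] = data
        return view


journal = BlockJournal()
journal.record_write(1, 0, b"A1")
journal.record_write(2, 1, b"B1")
journal.record_write(3, 0, b"A2")  # block 0 is overwritten at time 3

view = journal.view_at(2)  # the disk as it looked before the overwrite
```

Note that the journal knows nothing about files, tables, or applications; it only sees numbered blocks.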
Block based CDP
Block solutions are great at capturing the data transparently, and great at presenting a
"view" of some past point in time, but sometimes require additional work on the part
of the application or user in order to make use of that historical view. For example,
imagine a database application that is constantly streaming I/O to a storage element.
To roll back to a view at some arbitrary time that wasn't coincident with a database
synchronize or quiesce point would likely mean that the database would have to perform its own
"crash recovery" from that time view. Often block CDP solutions will support a tagging
operation which allows the CDP device to tag specific "times" that are perhaps matched
with application-side quiescing to allow for discrete recovery points. In between those discrete
points the solution will still be able to provide useful views but perhaps at the expense of an
application resync of some sort (but when you need to truly go to an arbitrary time, the
value is exceedingly high).
So, the charms of a block-based solution are: high application transparency, no performance
effect on the application, and typically no dependence on particular hardware or platforms.
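The trade-off between tagged recovery points and arbitrary times can be made concrete with a small helper. It is a sketch under simple assumptions: tagged (application-quiesced) times are kept in a sorted list, and any other time is assumed to require the application's own crash recovery. The function name `recovery_options` and the dictionary keys are hypothetical.

```python
import bisect


def recovery_options(tagged_times, requested_time):
    """Describe the choices for rolling back to `requested_time`.

    tagged_times: sorted list of times at which the application was quiesced
    and the CDP device was asked to tag a recovery point.
    """
    idx = bisect.bisect_right(tagged_times, requested_time)
    latest_tag = tagged_times[idx - 1] if idx else None
    return {
        # The exact time is always available as a block-level view...
        "exact_time": requested_time,
        # ...but unless it falls on a tag, the application must resync itself.
        "exact_needs_app_recovery": latest_tag != requested_time,
        # The nearest earlier tag gives a clean view, at the cost of lost changes.
        "latest_tagged_time": latest_tag,
    }


tags = [100, 200, 300]
options = recovery_options(tags, 250)
```

For a request at time 250 the caller can either take the clean view tagged at 200 or take the exact view at 250 and let the database run its crash recovery.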
Application based CDP
At the other end of the spectrum there is Application-based CDP. In this scheme, specific
applications (e.g. DB2 or some other database or similar application) are completely responsible
for doing all of the journaling necessary to roll back to any time. Being tightly integrated into
the application means the solution can provide a far richer set of recovery capabilities.
For example, a database could perhaps recover a row or even a column in a table as it appeared
3 hours earlier and do so on the live system without disturbing the running application.
A block based solution, by contrast, has no visibility of tables and rows and columns and
only sees raw blocks. A block based solution would have to present a view of the entire disk
(or disk set) and the application (such as a database) would have to be able to "mount"
that view for use or manipulation.
The charm of an application-based solution is extreme application awareness for powerful
recovery capabilities. The downsides are that it works only with that application and likely
adds significant overhead and resource use on the application servers.
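To illustrate the kind of granularity that application awareness allows, here is a toy table that journals its own changes and can put back a single column of a single row on the live data. It is not modeled on DB2 or any real database; `JournaledTable` and its methods are invented for this sketch, and times are hours expressed as integers.

```python
class JournaledTable:
    """A toy table that journals every row change itself (illustrative only)."""

    def __init__(self):
        self.rows = {}      # live data: key -> row dict
        self._journal = []  # (timestamp, key, copy of the row after the change)

    def update(self, timestamp, key, **changes):
        """Apply a change to the live row and journal the resulting row."""
        row = self.rows.setdefault(key, {})
        row.update(changes)
        self._journal.append((timestamp, key, dict(row)))

    def restore_column(self, key, column, as_of):
        """Put back one column of one row as it appeared at time `as_of`."""
        past = None
        for ts, journaled_key, snapshot in self._journal:
            if journaled_key == key and ts <= as_of:
                past = snapshot
        if past is None or column not in past:
            raise KeyError(f"no value for {key!r}.{column} at time {as_of}")
        self.rows[key][column] = past[column]  # other columns stay untouched


table = JournaledTable()
table.update(9, "cust-1", name="Ann", balance=100)
table.update(11, "cust-1", balance=0)        # mistaken change
table.update(12, "cust-1", name="Ann Lee")   # later, legitimate change
table.restore_column("cust-1", "balance", as_of=10)
```

Only the mistaken balance is rolled back; the later name change survives. The price is visible too: every update does extra journaling work inside the application.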
File based CDP
Next up: File-based CDP solutions. File based CDP solutions run on the application hosts
(file servers or workstations) and are somewhat similar to Application-based CDP solutions
(in that file serving is essentially an application) but broader in value since many
applications and users use file-based data naturally. Whereas a policy in a block-based
solution can only be set per LUN/disk, a file based solution can have different policies
per file or file group. Perhaps a set of files on a given machine simply do not need
CDP-style protection, or another set of files might need a longer history of time captured,
and so forth. Furthermore, file-based CDP solutions add only a modest amount of overhead
because when a file is naturally written out to disk (saved) it is very convenient to make
an instant copy since the data is already in various caches. Restoring is smoother in a file
based solution as well. You do not need to present or mount an entire volume view of some
past time point; rather, you can see individual saved instances of each file and pick the
desired ones by hand (or request that a given time be restored for a set of files or directories).
The charms of file-based solutions: light weight, file-based policies and granularity,
more natural recovery scenarios, and broad application/user value.
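Per-file policies, such as keeping a longer history for one group of files than for another, might look something like the following. The patterns, retention periods, and function names are all assumptions for the sake of the example; times are day numbers.

```python
from fnmatch import fnmatch

# Hypothetical retention rules in days, per file group; the first match wins.
RETENTION_DAYS = [
    ("*/contracts/*", 365),  # long history for contracts
    ("*.doc", 30),           # a month of history for ordinary documents
]
DEFAULT_DAYS = 7


def retention_for(path):
    """Look up how many days of history to keep for this file."""
    for pattern, days in RETENTION_DAYS:
        if fnmatch(path, pattern):
            return days
    return DEFAULT_DAYS


def prune(path, version_days, today):
    """Keep only the versions that fall inside the file's retention window."""
    limit = retention_for(path)
    return [day for day in version_days if today - day <= limit]


kept_contract = prune("/srv/contracts/acme.doc", [1, 200, 390], today=400)
kept_notes = prune("/home/user/notes.txt", [380, 395], today=400)
```

A block-based solution could not express this, since it sees only one policy per LUN and has no notion of which blocks belong to which file.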
Choosing the right type of CDP
So what type of CDP solution is best? Classic answer: It depends.
If your data is strictly files and those files are being used by typical office workers
(creating and editing documents) or by automated business applications (perhaps XML packages),
then a file-based solution is quite likely best (particularly if you are interested in protecting
user-end-point systems such as workstations and laptops). If your machine is mostly serving
a variety of applications (such as DB2 or Oracle or mail), a block-based solution is
probably best. Last, if the application you are running has its own application-based CDP
capabilities built-in, consider using that provided the overhead seems acceptable.
Table 1. Various types of CDP and their uses
| Platform | Use | CDP approach | Comments |
|---|---|---|---|
| Unix | Databases | HW True Block CDP with marked recovery pts | Vol consistency; performance; non-app impacting |
| | File serving | SW True FILE CDP | Per file policies; ease of use; on-line nature; ease of deployment |
| Windows | File & Print | SW True FILE CDP | |
| | Desktops | SW True FILE CDP | Lightness; flexibility; ease of deployment |
| | Databases | SW Frequent Snaps | Synchronized with app; very fast & efficient; cost effective; rapid recovery |
| | Email | SW Frequent Snaps | |
Tivoli CDP for Files: What is it?
Tivoli CDP for Files is, quite simply, a file-based CDP solution. IBM/Tivoli will likely
have a variety of CDP solutions over time and CDP for Files is the first one brought forward.
Why start with files? Because files are the most prevalent business asset and growing at
the fastest pace and arguably the least protected (especially on smaller end-point machines
such as departmental file servers and workstations/laptops). Furthermore, loss of file data
(due to accidental overwrites or erasures) causes tremendous lost productivity in our expensive
labor force. While impressive tabulations exist for the cost of help-desk calls to restore files,
it is far more impressive to imagine all the losses that never even reach the help desk, such as
when a user corrupts a file mid-day.
Tivoli CDP for Files is designed with two major use cases in mind:
- Corporate or departmental file servers
- End users (workstations and laptops)
Corporate file servers are typically backed up once per day, which is not nearly enough
protection for our modern users (who are pressured more than ever to work on more things
at once and under tighter deadlines). Adding CDP to those file servers (perhaps still
keeping the existing backup solution intact) will dramatically improve RPO and end-user
productivity.
Direct end-user workstation or laptop protection has rapidly become a concern among IT
managers. Just a handful of years ago corporate IT managers forced users to store their
material on mapped network volumes and specifically would not back up end-user workstations.
Most end users that had some files stored locally would back them up themselves by using writable
CDs. Today, users are walking around with 60 or 100 gigabyte disk drives that are a
veritable sponge soaking up all their corporate data, which never makes it back to a controlled
file server. While it might still be the corporate "policy" to have users push their data to a
corporate file server, it is becoming more and more impractical. Thus, a per-user semi-personal
backup solution is an easy-to-embrace notion, particularly if it automates that back-end
"pushing" of data back to the corporate file server.
Enter Tivoli CDP for Files
Tivoli CDP for Files is an extraordinarily small and easy-to-use backup application, equally
valuable on file servers, workstations, and traveling laptops. Tivoli CDP
embodies a unique combination of continuous protection along with a more traditional scheduled
to-disk protection capability. The product breaks down files into three very sensible categories:
- Files that you truly know are valuable and warrant the special CDP style of protection
- Files that you truly know are not valuable and should never be backed up
(various system areas, temporary areas, replicated email, etc.).
- And files that fall into the gray area in between: files whose existence or importance
you might not even know about until you've lost one of them.
Traditional backup is similar to the third category above; that is, it is indiscriminate
in nature, which was considered the safest approach to backup. Yet such an approach
was too lackadaisical for important material and far too ambitious for less important material.
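The three categories lend themselves to a simple classifier. The sketch below assumes that the gray-area files are the ones handled by the traditional scheduled protection, which is one reading of the description above rather than a documented rule. The pattern lists and the name `classify` are made up; a real deployment would define its own.

```python
from fnmatch import fnmatch

# Hypothetical pattern lists for the first two categories.
CONTINUOUS = ["*.doc", "*.xls", "*.ppt"]      # known to be valuable
EXCLUDED = ["*.tmp", "*/temp/*", "*/cache/*"]  # known not to be valuable


def classify(path):
    """Sort a file into one of the three categories described above."""
    if any(fnmatch(path, pattern) for pattern in EXCLUDED):
        return "never"       # never backed up
    if any(fnmatch(path, pattern) for pattern in CONTINUOUS):
        return "continuous"  # CDP-style protection on every change
    return "scheduled"       # gray area: picked up by the scheduled pass


category_doc = classify("/home/user/plan.doc")
category_tmp = classify("/home/user/temp/plan.doc")
category_other = classify("/home/user/data.bin")
```

Exclusions are checked first in this sketch, so a document sitting in a temporary area is skipped even though its extension looks valuable; that ordering is a design choice of the example.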
The Tivoli CDP product combines replication with versioned instances. As files are changed,
the software can take several actions automatically and transparently:
- Create a local versioned instance, which allows for restore opportunities regardless of connectivity to any network (e.g. while in an airplane)
- Optionally queue the file for transmission to a corporate server and be tolerant of network disconnect situations
- Remember the file has changed and at a later scheduled time push a copy to some off-machine target
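The three actions in the list above can be sketched as a small change handler. This is a guess at the general shape of such logic, not a description of the product's internals; the class `ChangeHandler`, its methods, and the `send`, `connected`, and `copy` callbacks are all hypothetical.

```python
from collections import deque


class ChangeHandler:
    """Reacts to file changes with the three actions listed above (illustrative)."""

    def __init__(self):
        self.local_versions = {}   # path -> list of saved contents, oldest first
        self.send_queue = deque()  # changes waiting for a network connection
        self.dirty = set()         # paths to copy at the next scheduled run

    def on_change(self, path, content, replicate=True):
        # 1. Local versioned instance: needs no network at all.
        self.local_versions.setdefault(path, []).append(content)
        # 2. Optionally queue the change for the corporate server.
        if replicate:
            self.send_queue.append((path, content))
        # 3. Remember the file for the later scheduled push.
        self.dirty.add(path)

    def drain(self, send, connected):
        """Send queued changes while connected; anything unsent stays queued."""
        sent = 0
        while self.send_queue and connected():
            path, content = self.send_queue[0]
            send(path, content)
            self.send_queue.popleft()  # removed only after a successful send
            sent += 1
        return sent

    def scheduled_push(self, copy):
        """Copy the latest version of every changed file to the off-machine target."""
        for path in sorted(self.dirty):
            copy(path, self.local_versions[path][-1])
        self.dirty.clear()


handler = ChangeHandler()
handler.on_change("a.doc", "v1")
handler.on_change("a.doc", "v2")

# On the airplane: nothing is sent, but both versions are restorable locally.
sent_offline = handler.drain(send=lambda p, c: None, connected=lambda: False)
```

Because a queued item is removed only after `send` returns, a dropped connection leaves the queue intact for the next attempt.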
More than 95% of restore requests are for material recently created and altered. Keeping the most
recent data locally allows for unparalleled end-user protection. That said, it is still of
paramount importance to migrate data off-machine even if infrequently. Tivoli CDP for Files
is designed to support any file-class device as a target, such as a file server, a closed
architecture NAS device, another LUN, or simply a removable Firewire or USB drive. Corporate
administrators will gravitate to using a corporate file server as the back-end data store which
allows them to once again take control of those wandering digital assets.
The corporate risk of leaving end users unprotected, or of leaving file servers without
modern continuous protection, is too high in today's world of data everywhere. The
opportunity for an easy-to-use file-based CDP solution is vast.
CDP "configuration for continuous protection" screen
CDP "restore" interface screen
About the author
Chris Stakutis is a renowned data storage industry inventor, technologist, and author with over 20 years of industry experience. He holds over 6 U.S. patents (8 more filed) across various data and networking inventions. Currently working for IBM managing cutting-edge data storage research and development, he was the founder and CTO of SANergy (high speed data sharing), which was sold to IBM in 2000. Mr. Stakutis is often published in industry journals and seen speaking at industry events. Mr. Stakutis graduated from Worcester Polytechnic Institute in an accelerated three-year program and then went on to obtain an MBA from Babson College at a leisurely 10-year pace. He has held key engineering and product management roles in various high technology companies including Mercury Computer Systems, Precision Robots, MIT Lincoln Laboratories, and many startups.