Computerworld Storage Networking World Online
The Online Storage Networking Magazine for IT Leaders

Meeting the Deduplication Needs of the Enterprise: Key Considerations
by Miklos Sandorfi, Chief Technology Officer, SEPATON, Inc.
msandorfi@sepaton.com

Enterprise IT and data center managers are facing a crisis. The volume of data generated by most companies has grown at such an explosive rate that many data centers have simply run out of space, power, cooling, and storage capacity to handle it. Fundamental issues of insufficient capacity are being compounded by increasingly stringent regulatory requirements and business initiatives demanding higher service levels, longer online retention times, and higher levels of data protection. For enterprise data centers, meeting these demands is particularly challenging. In these large organizations, the sheer volume and variety of data to be protected requires a level of performance and scalability that few technologies can deliver.

As a result, enterprise IT staff members have several technical objectives to address to handle this exponential growth and meet regulatory and business requirements. They need to:

  • Reduce capacity requirements by using compression and deduplication technologies
  • Meet service level agreements and regulatory requirements for fast data recovery by keeping more data online longer
  • Minimize WAN usage and end-user disruption by completing backups within backup windows
  • Control the cost of adding capacity and performance by avoiding technologies that require complete “forklift” upgrades to scale
  • Ensure data integrity throughout backup, retention, and restore processes
  • Minimize downtime while meeting all regulatory requirements

The first step to ensuring reliable data backup is to look beyond single point solutions to implement solutions that address root causes of data growth and risk in the data center. For example, instead of simply adding more and more tape libraries, investigate new technologies such as virtual tape libraries with deduplication that help to offset data growth at one of the key sources.

Throughout the data lifecycle, the same data is backed up numerous times. In fact, a single megabyte of data can require more than 52 MB of capacity to handle incremental and full backups. Deduplication technology removes this redundancy, which enables you to keep more data online longer while reducing the footprint of equipment needed to protect data in the data center. This technology is typically a software application used with a disk-based virtual tape library solution.

However, there are several different types of deduplication technologies and few have the capabilities to meet the specific needs of enterprise data centers.

The Basics of Deduplication
There are two basic categories of data deduplication technology: hash-based and byte-level comparison deduplication. The hash-based approach runs incoming data through a hashing algorithm to create a hash, a small representation of the data that serves as a unique identifier for it. It then compares the hash to previous hashes stored in a lookup table. If a match is found, the duplicate data is replaced with a pointer to the existing copy. If a match is not found, the data is stored and its hash is added to the lookup table.
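
The hash-based approach can be illustrated with a minimal Python sketch. It is not any vendor's implementation: the fixed 4 KB chunk size, the choice of SHA-256, the in-memory dictionary used as the lookup table, and the function names `deduplicate` and `restore` are all assumptions made for illustration.

```python
import hashlib


def deduplicate(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and return one hash pointer per chunk.

    `store` plays the role of the lookup table: it maps hash -> chunk bytes.
    """
    pointers = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # No match: keep the new data and record its hash.
            store[digest] = chunk
        # Match or not, the backup itself only holds a pointer.
        pointers.append(digest)
    return pointers


def restore(pointers: list, store: dict) -> bytes:
    """Rebuild the original data by following each pointer."""
    return b"".join(store[p] for p in pointers)


store = {}
first = deduplicate(b"A" * 8192 + b"B" * 4096, store)
second = deduplicate(b"A" * 8192 + b"C" * 4096, store)
unique_chunks = len(store)  # six chunks were written, three are stored
```

In this sketch the second backup adds only one new chunk to the store, because its first two chunks hash to a value that is already in the lookup table.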

Enterprise-class byte-level comparison uses built-in intelligence about the actual file content, called ContentAware™ technology, to compare data as objects (e.g., Word document to Word document or Oracle database to Oracle database) and to identify likely redundancies. It then uses byte-level pattern matching to find duplicate data. Unlike other technologies that use the first instance of a file as the reference copy, this ContentAware byte-level method uses the most recent copy and replaces older duplicate data with pointers to it. As a result, this technology eliminates the need to reconstitute newer data from numerous points and is able to restore files nearly instantaneously.

In-Line vs. Out-of-Line
A key distinction between deduplication technologies is whether the deduplication process is done in-line as part of the backup process or as an out-of-line process. Deduplication performed in-line requires slightly less capacity and is adequate for relatively small backup requirements. However, this method has a significant negative impact on performance and cannot complete the large backups required by enterprise organizations within typical backup windows. The alternative, out-of-line method completes backups at full, unimpeded performance. The deduplication process is started as soon as the backup process begins and continues in parallel with the backup in a fully integrated operation. The main benefit of this out-of-line method is that it can handle much larger backup volumes within a typical eight-hour backup window. In addition, because it backs up a full set of data before any duplicate data is removed, the out-of-line method enables more rigorous data integrity checking.

Backup and Restore Petabytes of Data
A primary consideration in choosing a backup technology for an enterprise or large enterprise is its ability to handle terabytes and even petabytes of data and stay within your backup window without creating dozens of separately managed “silos” of storage.

Many deduplication solutions top out at backup rates of 800 GB/hr per appliance. At this rate, to back up 10 TB of data in an eight-hour backup window, you would need multiple appliances. That would add significant complexity and require you to modify your backup infrastructure and policies. As your data grows, more appliances need to be deployed and managed. This creates “silos of deduplication” and a management challenge. Overall efficiency of deduplication is also dramatically reduced because the data comparisons that identify duplicate data are only performed within individual devices. Truly enterprise-class ContentAware deduplication solutions can back up data as fast as 17 TB/hr and handle petabytes of data in a single appliance.
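
The sizing arithmetic behind these figures can be checked with a short Python sketch. The function name `appliances_needed` is hypothetical, the 100 TB data set is an assumed example of growth rather than a figure from this article, and the calculation assumes each appliance sustains its quoted rate for the whole window.

```python
import math


def appliances_needed(data_tb: float, rate_tb_per_hr: float, window_hr: float) -> int:
    """Smallest number of appliances that can ingest data_tb within the window."""
    per_appliance_tb = rate_tb_per_hr * window_hr
    return math.ceil(data_tb / per_appliance_tb)


# 800 GB/hr is 0.8 TB/hr, so one appliance ingests 6.4 TB in eight hours.
at_800_gb_hr = appliances_needed(10, 0.8, 8)          # 2
at_17_tb_hr = appliances_needed(10, 17, 8)            # 1
grown_to_100_tb = appliances_needed(100, 0.8, 8)      # 16 (assumed data set size)
```

Each additional appliance in the first and third cases is a separate deduplication domain, which is where the silo problem described above comes from.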

Performance Over Time
Many solutions see a marked degradation in performance over time as data becomes more fragmented across the disk and as the database where duplicate data is stored grows. Choose a solution that delivers the same level of performance regardless of how long it has been in service.

Realistic Expectations for Capacity Reduction
Deduplication approaches and results vary widely among solutions, as does the time required to achieve maximum deduplication. The effectiveness of deduplication technology also depends heavily on your specific backup policies, backup application, and the mix of data types you are backing up.
For example, a typical backup contains about 75 percent file data, 15 percent email data, and 10 percent database data. A backup containing primarily database data will typically have a less-efficient deduplication ratio than a backup that is primarily file data. An enterprise-class, ContentAware deduplication solution should be able to reduce the typical mix described above by 25:1, and by 50:1 when combined with standard hardware compression.
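
How the data mix drives the overall ratio can be shown with a small Python sketch. The per-type ratios below are hypothetical values chosen only to illustrate the calculation; they are not measurements or vendor figures. The function name `blended_ratio` and the database-heavy mix are likewise assumptions. Only the 75/15/10 mix and the 25:1 and 50:1 figures come from the text above.

```python
def blended_ratio(mix: dict, ratios: dict) -> float:
    """Overall reduction ratio for a mix of data types.

    mix    maps data type -> share of the backup (shares sum to 1.0)
    ratios maps data type -> reduction ratio for that type (e.g. 25 for 25:1)
    Each type shrinks to share / ratio, so the overall ratio is 1 / total stored.
    """
    stored = sum(share / ratios[kind] for kind, share in mix.items())
    return 1.0 / stored


typical_mix = {"file": 0.75, "email": 0.15, "database": 0.10}
database_heavy_mix = {"file": 0.10, "email": 0.15, "database": 0.75}  # assumed

# Hypothetical per-type ratios, for illustration only.
example_ratios = {"file": 40, "email": 25, "database": 10}

typical = blended_ratio(typical_mix, example_ratios)
database_heavy = blended_ratio(database_heavy_mix, example_ratios)

# 25:1 from deduplication alone and 50:1 with compression implies that
# compression contributes a further factor of two.
implied_compression = 50 / 25  # 2.0
```

With any set of per-type ratios in which database data reduces least, the database-heavy mix produces the lower overall ratio, which is why testing against a sample of your own backup data matters.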

Deduplication ratios advertised for some technologies only apply to the most favorable file type. Look for vendors that will test and characterize samples of your backup data and provide clear expectations of the levels of deduplication you can expect from their technology before you buy.

Also be aware that the deduplication ratios many vendors claim only apply to full backups. Some deduplication technologies perform far less efficiently on incremental backups or “incrementals forever” backup scenarios such as those performed by Tivoli Storage Manager. Read the fine print on the data sheets. Ask for references from customers that are using the same backup application, similar policies and data types as you.

Restore Performance
Backing up data efficiently is only half the challenge. To be successful, you need to restore data quickly and efficiently. In fact, one of the key drivers for adopting deduplication technology is the ability to keep data on disk longer in order to simplify and accelerate restores.

Before adopting a new deduplication technology, be sure to test restore times and efficiency. Most restore requests are for data that is less than two weeks old. Solutions that use the first backup as the reference copy must recreate the most recent backup from weeks or months of pointers. In contrast, solutions that use the most recent backup as the reference copy can restore that data nearly instantaneously.
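
The difference between the two reference strategies can be sketched with a deliberately simplified Python model. It assumes one backup per night and treats every backup other than the reference copy as one layer of pointers to its neighbor. Real products are more sophisticated, and the function name `layers_to_restore` and the 60-night retention period are hypothetical.

```python
def layers_to_restore(backup_index: int, total_backups: int, reference: str) -> int:
    """Pointer layers that must be resolved to rebuild one backup (toy model).

    backup_index 0 is the oldest backup; total_backups - 1 is the newest.
    reference is "first" (oldest backup kept whole) or "latest" (newest kept whole).
    """
    if not 0 <= backup_index < total_backups:
        raise ValueError("backup_index out of range")
    if reference == "first":
        return backup_index
    if reference == "latest":
        return total_backups - 1 - backup_index
    raise ValueError("reference must be 'first' or 'latest'")


nightly_backups = 60                      # assumed retention period
newest = nightly_backups - 1

first_as_reference = layers_to_restore(newest, nightly_backups, "first")    # 59
latest_as_reference = layers_to_restore(newest, nightly_backups, "latest")  # 0
```

In this model the cost moves to the oldest backups when the latest copy is the reference, which suits a workload where most restore requests are for recent data.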

Ensuring Data Integrity
Enterprise deduplication requires guaranteed data integrity. Some deduplication algorithms can result in data integrity issues, so look for solutions that guarantee data integrity. Enterprise-class solutions perform a data integrity check that compares the deduplicated data to the original data set at the byte level before any duplicate data is deleted or disk space is redeployed. This comparison needs to ensure that when deduplicated data is reconstructed, it is byte-for-byte identical to the original backup.
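
The verify-before-delete step can be sketched in a few lines of Python. This is only an illustration of the principle, assuming that deduplicated data is held as a list of pointers into a chunk store; the names `reconstruct`, `safe_to_reclaim`, and `chunk_store` are hypothetical.

```python
def reconstruct(pointers: list, chunk_store: dict) -> bytes:
    """Rebuild a backup by following its pointers into the chunk store."""
    return b"".join(chunk_store[p] for p in pointers)


def safe_to_reclaim(original: bytes, pointers: list, chunk_store: dict) -> bool:
    """True only if the rebuilt data is byte-for-byte identical to the original.

    The original copy should be deleted, and its space reused, only when
    this check passes.
    """
    try:
        return reconstruct(pointers, chunk_store) == original
    except KeyError:
        # A pointer refers to a chunk that is not in the store.
        return False


chunk_store = {"a": b"hello ", "b": b"world"}
original = b"hello world"

good = safe_to_reclaim(original, ["a", "b"], chunk_store)      # True
wrong = safe_to_reclaim(original, ["a", "a"], chunk_store)     # False
missing = safe_to_reclaim(original, ["a", "z"], chunk_store)   # False
```

Comparing the actual bytes, rather than comparing hashes, means the check does not depend on the hashing algorithm being collision-free.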

Enterprise Class Reliability
Since the deduplication solution is going to be the primary recovery source for weeks or months of data, the base platform should have the same type of reliability and availability features as those found in enterprise-class disk solutions, including:

  • Redundant power and cooling
  • Redundant data paths with automatic failover
  • RAID-protected storage
  • The ability to maintain nearly full backup and restore performance, even when a node is lost
  • The ability to add capacity or performance independently and without disruption to existing infrastructure
  • Management software that reports any faults by email, page, or similar alert

Tuning to Your Environment
Choose a solution that can be tuned to support your policies and procedures as well as your specific environmental requirements. Some solutions, particularly low-end solutions, are designed for smaller, simpler infrastructures and therefore have few parameters that can be adjusted to your needs.

People Factor
Before you trust your data to a new technology, choose a vendor with experience in the specific data protection requirements of enterprise-size organizations. To be effective, you need to work closely with a company that will help you configure a solution that best meets your needs and addresses the specific requirements of your backup applications.

Beware of solutions that are built on off-the-shelf servers and low-end storage without enterprise-class reliability and availability features.