ARTICLE POSTED November 1st, 2004
What you need to know about cryptographic hashes and enterprise storage
By Jered Floyd
Recent research results have raised enterprise storage customers' curiosity regarding a previously obscure topic: the design and security of cryptographic hash functions. Cryptographic hashes are often used to verify data integrity, and recently uncovered weaknesses in certain hash technologies could allow unauthorized users to modify or delete important records.
A thorough understanding of this technology is important, as enterprise administrators increasingly integrate content-addressed storage (CAS), a storage architecture that incorporates secure cryptographic hashing, as a repository for electronic records for regulatory compliance. IT managers are also using CAS systems that incorporate hashing as a tier in their information lifecycle management (ILM) strategies.
Other storage components also use hashes for locating data and protecting data integrity. This article outlines the best practices for evaluating secure hash technology.
Hash functions
A hash function is simply a mathematical way of converting data of any size (a file, for example) into a short, fixed-length value. Because hash functions reduce a larger data set into a shorter value, these hash outputs are also often called "message digests" or "fingerprints."
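As a brief illustration of the fixed-length property, the following Python sketch hashes a one-byte input and a one-million-byte input. The choice of SHA-256 and the helper name fingerprint are assumptions made for this example only, not something a particular storage product is known to use.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_input = b"a"
long_input = b"a" * 1_000_000  # one million bytes

short_digest = fingerprint(short_input)
long_digest = fingerprint(long_input)

# Both digests are 64 hex characters (256 bits), whatever the input size.
print(len(short_digest), len(long_digest))
# Different inputs produce different fingerprints here.
print(short_digest != long_digest)
```

Whatever the size of the input, the digest has the same length, which is what makes it usable as a compact fingerprint.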
Hash functions are used extensively in storage systems. The most familiar application is the "hash table," a tool for organizing data so it can be located quickly. File systems frequently use hash tables to speed access.
An example of a hash used for protecting data integrity is the CRC, or cyclic redundancy check. Modems, ATA drive controllers (both parallel and serial) and hard drives use a CRC to verify that data has not been accidentally damaged during transfer. Hash functions for these applications are chosen to be fast and convenient, but they're not robust against malicious attacks.
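The sketch below shows the kind of accidental-damage check a CRC provides, using the CRC-32 routine in Python's standard zlib module. The sample message and the helper name is_intact are assumptions for illustration. Real devices compute their CRCs in hardware and may use different polynomials.

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog"
checksum = zlib.crc32(original)  # 32-bit CRC recorded alongside the data

# Simulate accidental damage in transit: flip one bit in the first byte.
damaged = bytes([original[0] ^ 0x01]) + original[1:]


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Recompute the CRC and compare it with the recorded value."""
    return zlib.crc32(data) == expected_crc


print(is_intact(original, checksum))  # undamaged data passes
print(is_intact(damaged, checksum))   # a single flipped bit is detected
```

A check like this catches transmission errors, but because a CRC is fast and has a simple mathematical structure, someone who wants to alter the data deliberately can adjust it so the CRC still matches.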
Cryptographic hash functions
Cryptographic hashes, however, are designed to be robust against malicious interference. An attacker might try to find a piece of data that results in a specific hash value or, alternatively, two pieces of data that result in the same hash value. Vulnerability to attack has severe implications, and with a good hash function, an attack should be computationally infeasible. For this reason, cryptographic hashes are often used in data storage to verify data integrity and identify identical data for single-instancing.
Many applications use cryptographic hashes to verify that a file has not been modified since its point of origin. For example, Java JAR archives, Sun Solaris packages and Red Hat Linux packages are all distributed with MD5 hashes as part of their content. By re-computing the hash and comparing it with the authoritative value, end users can verify that the archive has not been accidentally damaged or maliciously modified.
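A minimal sketch of that re-compute-and-compare step might look like the following. The function names file_digest and matches_published_digest are hypothetical, and the default algorithm is set to SHA-256 as an assumption. A package that publishes an MD5 value would be checked by passing "md5" instead.

```python
import hashlib
import hmac


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so large archives need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path: str, published_hex: str, algorithm: str = "sha256") -> bool:
    """Return True if the file's digest equals the authoritative published value."""
    actual = file_digest(path, algorithm)
    # Normalize the published value before comparing the two hex strings.
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

The check is only as trustworthy as the source of the published value: the authoritative digest has to reach the user through a channel an attacker cannot also modify.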
If the hash used to verify data integrity were vulnerable to attack, a software author or third party could undetectably replace a distributed piece of software with one that performed some other function.
Document management systems and storage systems often use cryptographic hashes to demonstrate that a stored record has not been modified. A hash of the document is recorded in a separate location, and it is checked against the data at a later time. If a weakness were to be found in the hash, a record could be undetectably modified or replaced.
Cryptographic hashes are not a substitute for other forms of WORM protection. Instead, hash verification and application design should be used together to ensure that records are non-modifiable.
Time savings
In addition to verifying data integrity, CAS systems also use cryptographic hashes to perform single-instancing (sometimes called "coalescence" or "commonality factoring") of data. These systems quickly identify whether a given item exists in a very large library, which may be tens or hundreds of terabytes in size, to eliminate the storage of redundant data. Without a cryptographic hash, this would require comparing the item with every other item stored in the entire system, a very time-consuming process.
Much as hash tables accelerate search in file system directories, this hash search can be performed very quickly, in a mere fraction of a second. Unlike normal hash tables, most CAS systems also assume that no two pieces of data stored will ever have the same hash, so the actual data content need not be compared.
In such a system, a weakness in the hash could be dangerous. If the system compares hashes but not content, a malicious attacker could overwrite a previously stored record, prevent an expected future record from being successfully stored, or upset internal system record-keeping in unexpected ways.
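To make single-instancing concrete, here is a toy in-memory sketch, not a description of any vendor's implementation. The class name ContentAddressedStore, the use of SHA-256 and the verify_content switch are all assumptions for illustration. With the switch on, the store compares content whenever an address already exists instead of trusting the hash alone.

```python
import hashlib


class ContentAddressedStore:
    """Toy single-instancing store: each unique item is kept once, keyed by its hash."""

    def __init__(self, verify_content: bool = True):
        self._blobs = {}  # maps hex digest -> stored bytes
        self._verify_content = verify_content

    def put(self, data: bytes) -> str:
        """Store data if it is new and return its content address."""
        address = hashlib.sha256(data).hexdigest()
        existing = self._blobs.get(address)
        if existing is None:
            self._blobs[address] = data
        elif self._verify_content and existing != data:
            # Same hash but different bytes: refuse instead of silently aliasing.
            raise ValueError("hash collision: same address, different content")
        return address

    def get(self, address: str) -> bytes:
        """Return the bytes stored at the given content address."""
        return self._blobs[address]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentAddressedStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")  # duplicate: no new storage used
print(first == second, len(store))
```

Looking up an address is a single dictionary probe no matter how many items are stored, which is the time savings described above. The optional content comparison trades some of that speed for protection if the hash is ever weakened.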
Attacks on cryptographic hashes
Because a hash function converts data of any size into a fixed-length value, it is mathematically impossible for every piece of data to have a unique fingerprint. There are an infinite number of inputs but only a finite number of outputs, so multiple inputs must result in the same hash. Such an occurrence is called a "hash collision." We can, however, make sure that such a collision is statistically unlikely to occur.
The strength of a hash depends on how difficult it is to find inputs that can be used to attack it. If a hash behaved like a truly random function, this difficulty would depend only on the length of the hash output. An attacker trying to find a collision would have to generate hashes for a very large number of inputs and wait to encounter a collision. Such an attack is called an "exhaustive search" or "brute force" attack.
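A brute-force collision search can be tried at small scale by deliberately shortening the hash. The sketch below keeps only the first three bytes (24 bits) of a SHA-256 digest, an artificial weakening assumed purely for demonstration, and hashes the decimal strings "0", "1", "2" and so on until two of them share a truncated value. The helper names are hypothetical.

```python
import hashlib
from itertools import count


def truncated_hash(data: bytes, num_bytes: int) -> bytes:
    """Deliberately weakened hash: the first num_bytes of a SHA-256 digest."""
    return hashlib.sha256(data).digest()[:num_bytes]


def find_collision(num_bytes: int = 3):
    """Hash successive inputs until two distinct ones share a truncated hash."""
    seen = {}  # maps truncated hash -> the input that produced it
    for attempt in count():
        message = str(attempt).encode("ascii")
        tag = truncated_hash(message, num_bytes)
        if tag in seen:
            return seen[tag], message, attempt + 1
        seen[tag] = message


first, second, hashes_computed = find_collision(num_bytes=3)
print(first, second, hashes_computed)
```

For an output of n bits, a collision is expected after roughly 2^(n/2) hashes, so the 24-bit version typically falls after a few thousand attempts. The same search against a full-length digest is the infeasible case discussed in this section.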
For a 128-bit hash, such as MD5, an attacker would have to generate 2^64 hashes before expecting to find a hash collision. That is a large number (a bit above 18 quintillion, or 18,446,744,073,709,551,616). An attacker with a machine that could check one billion values per second would take more than 500 years to find a collision. This is a task that is easily distributed, however, so with 1,000 such machines, the time would drop to only 213 days.
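The arithmetic behind those figures can be reproduced in a few lines. This is a back-of-the-envelope sketch: the rate of one billion hashes per second comes from the text, while the 365.25-day year and the variable names are assumptions of the example.

```python
HASH_BITS = 128
RATE_PER_SECOND = 1_000_000_000  # one billion hashes per second per machine
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumed average year length

# Expected work to find a collision: about 2^(n/2) hashes for an n-bit output.
expected_hashes = 2 ** (HASH_BITS // 2)


def search_time_days(machines: int) -> float:
    """Days needed when the search is split evenly across the given machines."""
    return expected_hashes / (RATE_PER_SECOND * machines) / SECONDS_PER_DAY


one_machine_years = search_time_days(1) / DAYS_PER_YEAR
thousand_machine_days = search_time_days(1000)

print(f"{expected_hashes:,} hashes")
print(f"1 machine: about {one_machine_years:.0f} years")
print(f"1,000 machines: about {thousand_machine_days:.1f} days")
```

One machine needs roughly 585 years, and 1,000 machines need about 213.5 days, in line with the estimates above.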
The good news is that an exhaustive search becomes exponentially harder with a longer hash, as each additional bit doubles the space to search. For a 256-bit hash, such as SHA-256, an attacker would have to generate 2^128 hashes, a number 2^64 times larger than for the 128-bit hash. That is a very, very large number. For a 512-bit hash, such as SHA-512, an attacker would have to generate 2^256 hashes. That is greater than the number of atoms that make up our galaxy (about 10^68, or 2^226), a truly astronomical number.
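These comparisons are easy to check with Python's arbitrary-precision integers. In the sketch below, the figure of 10^68 atoms is the rough estimate quoted in the text, not an independently sourced value, and the variable names are chosen only for this example.

```python
import math

# Expected brute-force collision work, about 2^(n/2) hashes for an n-bit hash.
collision_work = {bits: 2 ** (bits // 2) for bits in (128, 256, 512)}

# Moving from a 128-bit to a 256-bit hash multiplies the work by 2^64.
ratio = collision_work[256] // collision_work[128]
print(ratio == 2 ** 64)

ATOMS_IN_GALAXY = 10 ** 68  # rough estimate quoted in the article
print(round(math.log2(ATOMS_IN_GALAXY), 1))  # about 225.9, so roughly 2^226

# The work for a 512-bit hash exceeds even that number.
print(collision_work[512] > ATOMS_IN_GALAXY)
```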
The bad news is that hashes are complicated algorithms, and certain hashes have recently been shown to be vulnerable to attacks other than exhaustive search. Through analysis, cryptographers can find shortcuts that reduce the work needed to find a collision.
At the recent CRYPTO 2004 conference, hash collisions were demonstrated for several functions, including MD5. While details of the attacks were not made public, the researchers claim that these could be quickly found with a modern supercomputer.
Common cryptographic hashes
Of hashes in wide use today, the most common are MD5 and the SHA versions. MD5 was developed by cryptographer Ron Rivest in 1991 and has an output 128 bits in length. It is used widely and, until recently, has proved robust against cryptanalytic attack. While a general-purpose computer would take a very long time to find a hash collision in MD5, custom chips can be built to accelerate this task.
Recognizing the need for a stronger hash, the National Institute of Standards and Technology in 1995 published SHA-1, an algorithm created by cryptographers at the National Security Agency. SHA-1 produces a hash of 160 bits, has thus far held up to detailed scrutiny and is widely deployed.
A stronger hash is required for cryptographic authenticity in very long-term archival systems and storage. To address this need, a new family of hashes was standardized by NIST in 2002. These hashes are SHA-256, SHA-384 and SHA-512, which produce 256-, 384- and 512-bit output, respectively. Barring substantial breakthroughs in computation, these hashes will be secure against exhaustive attacks for a very long time. In the four years since these algorithms were first introduced, no weaknesses have yet been found, and these versions are beginning to enjoy widespread use, replacing earlier choices such as MD5 and SHA-1.
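The output lengths of the hashes named in this section can be listed with Python's standard hashlib module, as in the sketch below. It assumes the local Python build exposes all five algorithms (some locked-down builds disable MD5), and the brute-force figure it prints is the generic 2^(n/2) estimate, which says nothing about analytic shortcuts.

```python
import hashlib

ALGORITHMS = ["md5", "sha1", "sha256", "sha384", "sha512"]


def output_bits(name: str) -> int:
    """Return the digest length, in bits, of the named hashlib algorithm."""
    return hashlib.new(name).digest_size * 8


for name in ALGORITHMS:
    bits = output_bits(name)
    print(f"{name:7s} {bits:4d}-bit output, "
          f"about 2^{bits // 2} hashes for a brute-force collision")
```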
What to look for
New cryptographic algorithms are subject to intense scrutiny by experts worldwide. Over time, as attempted attacks are shown to fail, confidence in the robustness of the algorithm builds. Given this, what should you look for in a product that uses a cryptographic hash?
Always make sure that the product uses public, standard algorithms. Some products may claim "secure, revolutionary new technology." The only assurance of robustness in a hash is that experts worldwide have not found a way to break it. Any proprietary algorithms or modifications will not have been subject to such scrutiny.
Choose a system that uses a hash strong enough for the application at hand. If the hash is used only to verify data integrity against accidental damage, any hash will do. When verifying data integrity against malicious attack, a much stronger hash function must be used.
If the hash is being used for data security, consider federal standards. NIST publishes Federal Information Processing Standards (FIPS) that recommend or require certain algorithms and practices. FIPS 180-2 documents the Secure Hash Standard (SHS), describing the SHA family of hashes. FIPS 180-2 is a compulsory and binding standard on federal agencies, and any system purchased that uses hashes for information security must use one of the hashes outlined in this standard. Additionally, NIST plans to phase out SHA-1 by 2010, so consider choosing a system that already meets these standards today.
Finally, consider your long-range plans. How long do you expect to keep this system in operation? How critical is its security to your business? Because new weaknesses were recently demonstrated in MD5 and other hashes, it is important to understand how you would migrate to new hashing algorithms in the future.
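One way a system could leave room for such a migration, offered here only as an illustrative sketch and not as a description of any product, is to record the algorithm name alongside every digest. The "algorithm:hexdigest" format and the function names tagged_digest, verify and migrate are assumptions invented for this example.

```python
import hashlib
import hmac


def tagged_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return a digest labeled with its algorithm, e.g. 'sha256:ab12...'."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify(data: bytes, tagged: str) -> bool:
    """Check data against a tagged digest, using whichever algorithm it names."""
    algorithm, _, expected_hex = tagged.partition(":")
    actual_hex = hashlib.new(algorithm, data).hexdigest()
    return hmac.compare_digest(actual_hex, expected_hex)


def migrate(data: bytes, old_tagged: str, new_algorithm: str = "sha512") -> str:
    """Re-hash a record under a new algorithm, but only if the old digest still matches."""
    if not verify(data, old_tagged):
        raise ValueError("record does not match its stored digest")
    return tagged_digest(data, new_algorithm)


old = tagged_digest(b"archived record", "sha1")
new = migrate(b"archived record", old)
print(old.split(":")[0], "->", new.split(":")[0])
```

Because each stored digest names its own algorithm, records hashed before and after a migration can coexist while the archive is converted.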
About the author
Jered Floyd is vice president of technology at Permabit, Inc.