by Greg Roody
Deduplication is an advanced data reduction technique that can greatly reduce the amount of storage space your data requires. It is especially effective for backup, because the same data is typically sent to the backup store over and over. Almost every backup product on the market today offers deduplication-based backup to disk (B2D), and the rest have it on their roadmaps.
You can configure these backup-to-disk servers to write to a Cloud Storage appliance like CloudArray, and thus replicate your backup store to Cloud Storage (B2D2C). Because deduplication engines have unique I/O characteristics, how you configure that appliance will end up having a large impact on performance.
Think of the typical backup process, where a full backup of each server is done every week and incremental backups are done daily. If you have 100 Windows-based servers, you are backing up 100 nearly identical operating system volumes every week. You are also backing up full copies of any changed files every night, even if it’s only a 4-byte change to a 100 GB file. Once the archive bit gets flipped, that’s the signal to back the file up that night.
The traditional backup process can really eat through storage.
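To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The server count and the 100 GB file come from the example above; the 20 GB operating system volume and the six incremental nights per week are assumptions picked purely for illustration.

```python
def weekly_full_os_gb(servers, os_volume_gb):
    """Space used by one weekly full backup of every server's OS volume."""
    return servers * os_volume_gb


def nightly_changed_file_gb(file_size_gb, nights):
    """A file-level incremental copies the whole file each night it changes."""
    return file_size_gb * nights


servers = 100           # from the example above
os_volume_gb = 20.0     # assumed size of each OS volume
file_size_gb = 100.0    # the 100 GB file from the example above
nights = 6              # assumed: the file changes before each nightly incremental

os_total = weekly_full_os_gb(servers, os_volume_gb)
file_total = nightly_changed_file_gb(file_size_gb, nights)

print(f"Weekly full backups of OS volumes: {os_total:.0f} GB")
print(f"Nightly copies of one changed file: {file_total:.0f} GB")
```

Under those assumptions, a single week consumes about 2 TB for nearly identical operating system data, plus another 600 GB for a file in which only a few bytes changed each night.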
How deduplication solves this problem.
Deduplication aimed at backup environments is typically a block level, target deduplication mechanism. While implementations vary, the mechanism that most deduplication engines use will identify identical blocks of data and only store one copy. As new blocks come in, they are checked against what is already in the data store and if they are duplicates, the data itself is not stored, but a small pointer to the already stored data block is.
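The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration and not how any particular product works: the fixed 4 KB block size, the SHA-256 fingerprint, and the `DedupStore` class are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real engines vary


class DedupStore:
    """Toy block-level store: each unique block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes

    def write(self, data):
        """Store data; return the list of pointers (fingerprints) that describe it."""
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            # Check the incoming block against what is already stored.
            if fingerprint not in self.blocks:
                self.blocks[fingerprint] = block
            # Duplicate or not, only a small pointer is recorded for this backup.
            pointers.append(fingerprint)
        return pointers

    def read(self, pointers):
        """Rebuild the original data from its pointers."""
        return b"".join(self.blocks[p] for p in pointers)

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


# 100 servers sending the same three-block "OS volume".
os_volume = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
store = DedupStore()
backups = [store.write(os_volume) for _ in range(100)]

logical_bytes = 100 * len(os_volume)
print(f"Logical data written: {logical_bytes} bytes")
print(f"Physical data stored: {store.stored_bytes()} bytes")
```

In this contrived case all 100 volumes are byte-for-byte identical, so only three blocks are ever stored. Real volumes are only nearly identical, so real savings will be lower.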
What this means is that your backups can typically be reduced by a ratio of around 10:1. You can achieve higher ratios if you have a lot of common data, like the hundred Windows servers I mentioned above. Less data stored means less data that has to be transferred, and improved performance overall.
If your deduplication engine works at the file level (typical for a file system based device), you will see much lower deduplication rates, typically on the order of 3x instead of 10x. For that reason, you may still find it advantageous to use the backup application's block-level deduplication rather than the file-level deduplication you get through other devices.
Additionally, the whole concept of incremental backups goes away. After the first full backup, the system is only sending changed blocks to the data store, and it can retain a very large number of restore points so you have far more than the few backup sets offered under traditional backup policies to choose from. Every backup is a full backup, every backup is changed blocks only, and you can keep dozens of restore points.
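One way to picture this is that each restore point is just a list of pointers into a shared block store. The Python sketch below makes that concrete. The `backup` and `restore` helpers, the 4 KB block size, and the eight-block volume are hypothetical and chosen only to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def split_blocks(data):
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def backup(volume, block_store):
    """Back up a whole volume; return (restore point, number of new blocks sent)."""
    recipe = []
    new_blocks = 0
    for block in split_blocks(volume):
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in block_store:
            block_store[fingerprint] = block  # only unseen blocks are stored
            new_blocks += 1
        recipe.append(fingerprint)
    return recipe, new_blocks


def restore(recipe, block_store):
    """Every restore point rebuilds the complete volume."""
    return b"".join(block_store[fp] for fp in recipe)


block_store = {}
restore_points = []

# Day 0: first backup of a volume made of 8 distinct blocks.
volume_day0 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(8))
recipe, sent_day0 = backup(volume_day0, block_store)
restore_points.append(recipe)

# Day 1: a 4-byte change lands inside block 3.
changed = bytearray(volume_day0)
changed[3 * BLOCK_SIZE:3 * BLOCK_SIZE + 4] = b"\xff\xff\xff\xff"
volume_day1 = bytes(changed)
recipe, sent_day1 = backup(volume_day1, block_store)
restore_points.append(recipe)

print(f"Blocks sent on day 0: {sent_day0}")
print(f"Blocks sent on day 1: {sent_day1}")
```

The second backup sends a single block, yet both restore points rebuild a complete volume. Keeping dozens of restore points therefore costs little more than the pointer lists plus whatever blocks actually changed.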
So what does this have to do with performance?
Traditional backup applications are very nearly 100% sequential writes. The exception to this is if you choose to do a verify operation after you write the backup to tape or disk (a sequential read operation).
With deduplication, the I/O mix contains a much higher proportion of random reads, because before every write the system checks whether the block of data already exists in your datastore.
A high percentage of random reads in a Cloud Storage environment means that you will have long wait times unless the data is cached locally, completely, and on a dedicated cache volume. If the data is stored on a partial cache volume, or if the cache volume is shared among a number of data sources, then you run the risk of constantly having to go back out to the Cloud to satisfy reads. The effect is much like heavy swapping and paging on an overcommitted server. If there is no local cache volume, you are guaranteed to see long wait times.
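A small simulation shows why cache sizing matters so much for this workload. The sketch below is a toy model and says nothing about how CloudArray or any other appliance is implemented: it assumes uniformly random block lookups, a simple least-recently-used (LRU) cache, and a cache that starts out warm. The function name and all of the numbers are hypothetical.

```python
import random
from collections import OrderedDict


def count_cloud_fetches(total_blocks, cache_blocks, lookups, seed=0):
    """Count the random reads that miss the local cache and must go to the cloud."""
    rng = random.Random(seed)
    cache = OrderedDict()
    # Start with a warm cache holding as many blocks as will fit.
    for block in range(min(cache_blocks, total_blocks)):
        cache[block] = True

    misses = 0
    for _ in range(lookups):
        block = rng.randrange(total_blocks)
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1                    # this read goes out to the cloud
            if cache_blocks > 0:
                cache[block] = True
                if len(cache) > cache_blocks:
                    cache.popitem(last=False)  # evict the least recently used
    return misses


TOTAL_BLOCKS = 1000   # assumed size of the deduplicated store, in blocks
LOOKUPS = 10000       # assumed number of pre-write lookups in one backup run

full = count_cloud_fetches(TOTAL_BLOCKS, TOTAL_BLOCKS, LOOKUPS)
partial = count_cloud_fetches(TOTAL_BLOCKS, TOTAL_BLOCKS // 4, LOOKUPS)
none = count_cloud_fetches(TOTAL_BLOCKS, 0, LOOKUPS)

print(f"Fully sized cache: {full} cloud fetches")
print(f"25% cache:         {partial} cloud fetches")
print(f"No cache:          {none} cloud fetches")
```

With a fully sized cache no lookup ever leaves the appliance, and with no cache every lookup does. A partial cache lands in between, and because the reads are random, an LRU policy gives it little help: most lookups still go out to the cloud.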
Not all Cloud Storage appliances support fully sized and dedicated cache volumes that can be segregated from other data volumes or sources. CloudArray does.
So if you are going to use the deduplication feature of a modern backup application to save space at your Cloud Storage provider, you would be well advised to use a fully sized (100% of target storage capacity), dedicated cache volume for your backup application to write to. Either that, or plan for long backup times.
Tags: b2d, b2d2c, backup to disk, Cloud Storage, deduplication, performance


