June 9, 2015
Christophe Bertrand

Achieving 40:1 Dedupe Ratios

In recent years, many backup and recovery vendors offering data deduplication technology have been boasting high deduplication ratios for their appliances or software. For the uninitiated, a 20:1 ratio suggests that 20TB of data can be reduced to 1TB through the use of the technology. Sounds great, but is it just marketing?

The logic used by certain target deduplication providers is as follows. Assume you would need 20TB of storage to run full backups every day for 30 days. After applying vendor X's target deduplication engine, you would actually need only 1TB; therefore, the ratio is 20:1. Isn't it awesome? Since the deduplication happens at the target, this logic makes sense, right?

This calculation method is great news for Arcserve. Let me explain why: our global source-side deduplication, combined with our infinite incremental technology, gives us tremendous efficiencies. Specifically, Arcserve Unified Data Protection (UDP) can keep 30 daily recovery points in a deduplicated and compressed backup data store. With UDP, users typically see a 70 percent reduction in backup data size for standard 'office' data (databases, Office documents). This is based on real-life examples, and it is a conservative number. We have actual screenshots from customers' management consoles to prove it, so it isn't just a marketing guy saying it.

How did we get that figure? Take 1TB (1000GB) of source data with a 5 percent daily change rate, retained for 30 days: that is 1TB + 29 x 50GB = 2.45TB of backup data, reduced to 735GB in the data store. This equates to 30 (synthetic) full backups of 1TB each reduced to 735GB, or a theoretical reduction of (30 x 1000GB) / 735GB, which is roughly 41:1.
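For anyone who wants to check that arithmetic, here is a minimal Python sketch; the variable names and the 1TB = 1000GB convention are my own assumptions for illustration, not anything from UDP itself:

```python
# Minimal sketch of the dedupe-ratio arithmetic (illustrative only).

SOURCE_GB = 1000        # 1TB of source data (assuming 1TB = 1000GB)
DAILY_CHANGE = 0.05     # 5 percent daily change rate -> 50GB per day
RETENTION_DAYS = 30     # 30 daily recovery points
REDUCTION = 0.70        # 70 percent reduction in the data store

# One initial full backup plus 29 days of changed data
backup_gb = SOURCE_GB + (RETENTION_DAYS - 1) * SOURCE_GB * DAILY_CHANGE
store_gb = backup_gb * (1 - REDUCTION)   # deduplicated/compressed store

# "Traditional" ratio: 30 synthetic fulls vs. the actual store size
ratio = (RETENTION_DAYS * SOURCE_GB) / store_gb

print(f"Backup data: {backup_gb / 1000:.2f}TB")  # 2.45TB
print(f"Data store:  {store_gb:.0f}GB")          # 735GB
print(f"Ratio:       {ratio:.0f}:1")             # 41:1
```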

And Arcserve's deduplication is global, so it applies to all the nodes we back up. A competitor's solution claiming a 40:1 ratio may only achieve that saving on a single logical volume, not across the entire backup storage estate!

My point is this: comparing Arcserve using the traditional math of legacy deduplication vendors works out beautifully for us, but as one customer told me (back when I was selling a target deduplication system): "Your ratios are like gas mileage ads for cars; they're theoretical and unrealistic." This happens when marketing gets disconnected from reality.

In my view, the only accurate and proven measurement is simply to compare how much source data you have with how much you end up with on the target backup system (appliance or server/storage) after a period of time, taking data growth and change into account. That is the only real operational measurement that makes sense for gauging the efficiency of your deduplication. Who cares about a theoretical ratio that doesn't mean anything?
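To make that concrete, here is a hedged sketch of such an operational measurement, reusing the figures from earlier in this post; the function and its inputs are my own illustration, not an Arcserve tool:

```python
# Hedged sketch of the operational measurement described above.

def operational_ratio(logical_backup_tb: float, physical_target_tb: float) -> float:
    """Backup data actually ingested over a period (full plus incrementals,
    including growth and change), divided by the capacity it actually
    consumes on the target appliance or server/storage."""
    return logical_backup_tb / physical_target_tb

# Example with the figures from earlier: 2.45TB of backup data landing
# in a 735GB (0.735TB) deduplicated data store.
print(f"{operational_ratio(2.45, 0.735):.1f}:1")   # ~3.3:1
```

Measured this way, the example from earlier yields roughly 3.3:1 rather than 41:1, which is exactly the gap between operational reality and theoretical marketing ratios.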

You, the end user, care about how much money you need to invest in data protection infrastructure to protect all the data in your organization, making sure it is recoverable as needed and when needed, in a timely fashion. You want the most efficiency and functionality for your needs. The rest is marketing Bizarre Statements.