De-duplication as core infrastructure

It seems the subject of de-duplication is becoming more commonplace; it’s getting difficult to avoid conversations about it. Just the other day, another customer asked me for a recommendation on it. While their question was interesting, the assumptions behind it were more so.

It’s amazing the assumptions some people in my industry will make. A while back, the CEO of a company told me they wanted to get into the storage industry, but without using disk drives, because “We all know you can’t make any margin on storage.” Well, that’s just not the case.

Anyway, the customer presumed de-duplication could be a good infrastructure solution because they’d been told it could reduce storage consumption by as much as 50%. Really? 50%? I suppose if this were high school and plagiarism were pervasive, this might be so. But it’s not likely that people are replicating that many files. Having said that, de-duplication technology can be beneficial, and I gave them an example of how it can help.

In my example, I took a fairly typical scenario in my company: I generate a presentation that is roughly 25MB in size. It’s a corporate presentation, so I might shoot a copy of it to all of the employees. We’ll round the headcount to 20 just to make the math easy. Assuming each employee is using an email reader that downloads attachments to a local folder, we’re looking at 500MB of the same file being stored on laptops, desktops, etc.

In this day and age, 500MB isn’t very much. So is the issue the actual storage consumption? After all, if we’re talking about 20 unique individuals, on laptops, and working from different remote offices, we’re looking at 25MB of consumption each. But if we throw all of these people into a central office and decide we want to provide backup for all of them, the problem starts to take shape. On the local area network, backing up this much information to a file server should take roughly 5 seconds on a gigabit network if there’s nothing else going on. That’s hardly anything to be concerned about. It’s not until we wish to back things up remotely that we actually see the problem.

Let’s suppose our company has a T1 connection to the internet and we can upload files at 1.2Mb/s (megabits, not megabytes). All of a sudden we’re looking at backing up all instances of this file to the remote site over a fairly slow link. In this example, assuming we can get 100% utilization, we’re looking at a transit time of 3,333 seconds or roughly 56 minutes. Gee, nearly an hour just to back up all of the instances of one file over the internet. That’s a lot of utilization.
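For anyone who wants to check the math, here is a minimal sketch of the calculation. The function and variable names are mine, purely for illustration, and it assumes decimal units (1 megabyte = 8 megabits) and the same 100% utilization as above.

```python
def transfer_seconds(size_megabytes, link_megabits_per_s, utilization=1.0):
    """Seconds needed to push size_megabytes over a link (1 MB = 8 Mb)."""
    megabits = size_megabytes * 8
    return megabits / (link_megabits_per_s * utilization)


FILE_MB = 25        # size of the presentation
COPIES = 20         # one copy per employee
UPLINK_MBPS = 1.2   # usable upload rate on the T1, in megabits per second

total_mb = FILE_MB * COPIES
seconds = transfer_seconds(total_mb, UPLINK_MBPS)
print(f"{total_mb} MB at {UPLINK_MBPS} Mb/s: {seconds:,.0f} s ({seconds / 60:.0f} min)")
```

Lowering the `utilization` argument (say, to 0.8) shows how quickly the window grows once the link is shared with other traffic.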

If we simply backed up the first instance and replaced all of the duplicate copies with hard links, the total job would only take 2.8 minutes (25MB is 200 megabits, which is about 167 seconds at 1.2Mb/s). If only things were this easy. Actually, for the most part, they are.
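To show how little is involved, here is a rough sketch of the hard link idea in Python. It is not how any particular product or free tool works; the function names are hypothetical, and it assumes all of the copies live under one directory tree on a single filesystem that supports hard links (for example, a staging area on the backup server). Keep in mind that hard-linked files share their contents, so this suits read-only backup copies rather than files people are still editing.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory flat."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate files under root with hard links.

    Returns the number of bytes reclaimed.
    """
    first_seen = {}  # content digest -> first path seen with that content
    reclaimed = 0
    files = sorted(p for p in Path(root).rglob("*")
                   if p.is_file() and not p.is_symlink())
    for path in files:
        original = first_seen.setdefault(file_digest(path), path)
        if original is path or os.path.samefile(original, path):
            continue  # first copy, or already linked to it
        size = path.stat().st_size
        # Create the link under a temporary name, then swap it into place
        # so the duplicate is never missing if something fails midway.
        tmp = path.with_name(path.name + ".dedup-tmp")
        os.link(original, tmp)
        os.replace(tmp, path)
        reclaimed += size
    return reclaimed
```

Run against the 20 copies of the 25MB presentation, a pass like this would leave one real copy and 19 links, reclaiming 475MB, and a link-aware backup tool would then only need to ship the file once.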

There is definitely a case for de-duplication when it comes to cutting down on WAN transit times for remote backups. But does this actually warrant a product purchase when the products typically cost much more than the storage that will be housing the data? Perhaps, if you look at it from a network bandwidth perspective. But when you consider emerging technologies such as WiMAX, you should think twice before making expensive purchases.

There has been much hype about rolling out large WiMAX fabrics in key metro areas. With speeds as high as 72Mb/s, we’re talking about a backup window for the same 500MB of possibly as low as 56 seconds at full utilization. How real is this? Well, just in the last couple of weeks, we’ve read about Charter Communications rolling out service for their cable subscribers in the realm of 160Mb/s. The service is available in very select regions and at a premium. But that’s not the point. The point is that network bandwidth to the internet is increasing at a dramatic rate (no pun intended) and at very affordable prices.

When you consider these things, where does de-duplication really fall? Storage is getting cheaper. Network bandwidth is getting much faster and more affordable. And more and more companies are beginning to make de-duplication scripts or applications available free of charge. With these things coming and affecting much more than just backups, it’s hard to justify such capital expenses.

The reason I got on the subject is that I just read a news article on yet another FREE de-duplication tool called Dupe Manager. It’s only a 0.1 release, but it’s not the only one out there. And, it’s free. Did I mention that? Well, it is. This is yet another reason I love working with open source. Open source lives and breathes to solve simple, everyday problems like these, which makes a lot more sense considering the relative nature of the problem. It’s a big problem for many today. But will it continue to be as big a problem in 6 months? 12 months?

