How they fit: Hadoop, traditional data warehouses, and ETL

Hadoop is starting to come into mainstream consciousness. As a result, a lot of people are grappling with the relationship between Hadoop and traditional data warehouses, and with how ETL (Extract, Transform, and Load) technology fits into the picture. On one of the “Big Data” forums on LinkedIn, someone asked the question below; my answer follows it.


Q: Are companies looking at Hadoop for use cases beyond ETL?

Enterprises are currently looking at Hadoop as an ETL processing engine that will feed unstructured data into an Enterprise Data Warehouse to do traditional BI (Business Intelligence). But are companies looking beyond this for more value-added uses of Hadoop?

My Answer:

I work with a few vendors of Hadoop-related technology; their end customers are indeed looking to use Hadoop as more than a mere ETL engine that feeds data into a data warehouse.

Instead, these customers are looking for ways to run analytics directly on the raw data stored in Hadoop, in order to explore avenues of analysis that were not anticipated at all when the data warehouse was designed.

A (highly imperfect) analogy I sometimes use is that Hadoop is kind of like having a garage (or maybe even a garbage dump) of infinite size. You can throw everything in there, just in case you might need it one day. No need to clean up anything before you throw it in your infinite garage – just toss it in! This means, of course, that your garage will get messy and dirty quickly, and it will eventually get difficult to find what you need. You might even lose track of what’s in there. But at least you will have everything in case you need it later.

A data warehouse, in contrast, is like, well, a warehouse. It has a finite size (due to the expense) with nicely ordered shelves. Everything is labeled, and the way things are stored is optimized for the most common tasks, so warehouse workers don’t waste a lot of time.

In this analogy, ETL connecting Hadoop and a data warehouse is like “cleaning the garage”: going through it, finding all the things you need, dusting them off, putting them in order, and filing them neatly on the shelves of your data warehouse.

That’s all well and good, provided you actually know exactly what is in your very messy garage, which you probably don’t. What’s needed are some tools to help you take inventory of what is in your garage and conduct some basic analysis against it. That way, you’ll know what is actually worth moving into the data warehouse.
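To make “taking inventory” a bit more concrete, here is a rough sketch in plain Python. It assumes the raw data is newline-delimited JSON, which is purely my assumption for illustration; the function name `inventory` and the sample records are made up too. At real Hadoop scale the same counting would run as a distributed job rather than in one process, but the idea is the same: find out which fields actually show up, how often, and in what shape, before deciding what deserves a shelf in the warehouse.

```python
import json
from collections import Counter, defaultdict


def inventory(lines):
    """Count how often each field appears, and with which value types.

    `lines` is any iterable of raw text lines, assumed (for this sketch)
    to be one JSON object per line. Lines that are not JSON objects are
    counted as unparseable instead of being dropped silently.
    """
    field_counts = Counter()
    type_counts = defaultdict(Counter)
    unparseable = 0

    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            unparseable += 1
            continue
        if not isinstance(record, dict):
            unparseable += 1
            continue
        for key, value in record.items():
            field_counts[key] += 1
            type_counts[key][type(value).__name__] += 1

    return field_counts, type_counts, unparseable


# Hypothetical sample of what a messy "garage" might contain.
raw_lines = [
    '{"user": "alice", "amount": 12.5}',
    '{"user": "bob", "amount": "n/a", "coupon": "SPRING"}',
    'corrupted line ###',
    '{"user": "carol"}',
]

field_counts, type_counts, unparseable = inventory(raw_lines)

for field, count in field_counts.most_common():
    print(field, count, dict(type_counts[field]))
print("unparseable lines:", unparseable)
```

A field that appears in nearly every record with a single consistent type is a good candidate for the warehouse; one that is rare, or that mixes numbers and strings the way `amount` does here, probably needs more cleaning (or more thought) first.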

Anyway, that’s how I think about Hadoop; my view is shaped by what some of my clients’ customers want to do.