stephen.lloyd 63174 (6/6/2013)


    I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

    Thoughts?

    One thing that has bitten me very hard on the bum in the data warehouse arena is not having a long-term source of non-transformed data.

    We can provide high availability by keeping multiple copies of the data.

    We can provide disaster recovery by having a robust and tested backup strategy.

    What happens if you find out that a transformation in the data warehouse load missed something crucial? It could be an aggregation over a union query, a data quality dedupe, or anything else. The resulting data looks legitimate and doesn't cause errors, but it doesn't reconcile.
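
    To make that concrete, here is a minimal Python sketch (numbers entirely invented) of the kind of union/dedupe trap I mean. Two branches legitimately report identical rows, and a set-style dedupe quietly collapses them:

        # Toy data: (date, sku, amount) rows from two branches.
        branch_a = [("2013-06-01", "SKU1", 100.00),
                    ("2013-06-01", "SKU2", 250.00)]
        branch_b = [("2013-06-01", "SKU1", 100.00),  # same values, different branch
                    ("2013-06-01", "SKU3", 75.00)]

        # UNION ALL semantics: keep every row. This total reconciles.
        union_all = branch_a + branch_b
        print(sum(amount for _, _, amount in union_all))  # 525.0 -- correct

        # UNION / dedupe semantics: distinct rows only. No errors raised,
        # the output looks legitimate, but the total is quietly wrong.
        deduped = set(union_all)
        print(sum(amount for _, _, amount in deduped))    # 425.0 -- missing 100.00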

    If you don't have access to the original source data, you are stuck with a fix going forward only, and you have to ignore the bad data and anything derived from it.

    I think, if you are careful, you can use Hadoop as a glorified BCP repository of your source data (loaded through Sqoop), acting as a staging area. Hadoop was built to scale, but you are going to need quite a bit of hardware to strike the right balance between performance and scale.
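
    As a rough sketch of what I mean (the server name, credentials, table, and paths here are all invented), a scheduled job could drive a plain Sqoop import per load so the raw extract lands in HDFS untransformed, partitioned by load date:

        # Hypothetical wrapper: pull Orders into HDFS as-is, one directory
        # per load date, so the raw data is always there to reconcile against.
        import subprocess
        from datetime import date

        load_date = date.today().isoformat()
        subprocess.check_call([
            "sqoop", "import",
            "--connect", "jdbc:sqlserver://dwsource:1433;databaseName=Sales",
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop_pwd",  # keep the password off the command line
            "--table", "Orders",
            "--target-dir", "/raw/sales/orders/load_date=" + load_date,
            "--num-mappers", "4",   # parallel extract streams against the source
            "--as-textfile",
        ])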

    I am curious about Impala, which seems to offer MPP-like capability on top of Hadoop. One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files. Mechanically the wheels go around, but they don't do so with any urgency.
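
    The usual workaround is to compact the many small extract files into a few big ones before they land in HDFS. A hedged illustration (local paths and the 128 MB target are my assumptions; a real job would write through the HDFS API rather than the local file system):

        # Concatenate small extracts into output files sized near a
        # typical HDFS block, rolling to a new file when one fills up.
        import glob, os

        TARGET_BYTES = 128 * 1024 * 1024
        out_index, out_size = 0, 0
        out = open("compacted_%03d.txt" % out_index, "wb")

        for small_file in sorted(glob.glob("extracts/*.txt")):
            size = os.path.getsize(small_file)
            if out_size and out_size + size > TARGET_BYTES:
                out.close()          # this one is full; roll over
                out_index += 1
                out_size = 0
                out = open("compacted_%03d.txt" % out_index, "wb")
            with open(small_file, "rb") as f:
                out.write(f.read())
            out_size += size
        out.close()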

    With that in mind, I'm not sure how Impala/Hadoop would handle slowly changing dimensions.
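
    The concern, roughly, is that a Type 2 slowly changing dimension expires the current row and appends a new version for every change, and on HDFS's append-only storage that tends to mean rewriting the whole dimension data set rather than updating rows in place. A toy sketch with invented customer rows:

        # Hypothetical dimension rows: (key, tier, valid_from, valid_to).
        OPEN = "9999-12-31"
        dim = [(1, "Bronze", "2013-01-01", OPEN),
               (2, "Silver", "2013-01-01", OPEN)]
        changes = {1: ("Gold", "2013-06-06")}  # customer 1 changed tier today

        new_dim = []
        for key, tier, valid_from, valid_to in dim:
            if key in changes and valid_to == OPEN:
                new_tier, effective = changes[key]
                new_dim.append((key, tier, valid_from, effective))  # expire old version
                new_dim.append((key, new_tier, effective, OPEN))    # open new version
            else:
                new_dim.append((key, tier, valid_from, valid_to))

        # On HDFS, producing new_dim means writing the dimension out again in full.
        print(new_dim)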