• Eric M Russell (4/14/2014)


    I've got half a dozen retired PC boxes stacked in a corner under my desk on which I'm installing an Ubuntu / Hadoop cluster. My intention is to prove whether or not my skunkworks cluster can complete 3 concurrent aggregate queries on a TB of data in less time than the production SQL Server Enterprise instance that we currently use for a staging environment. This type of "project" would normally never get approved, which is just as well, because I've had to put it on the back burner for weeks at a time while I focus on my real work.

    It probably won't. If you've got an optimised, structured recordset, particularly if you've got page compression on, then a scan through it will be fast.

    Hadoop really comes into its own when you have much larger data volumes or you have data of a structure that doesn't naturally fit into a straight tabular format.

    I was quite disappointed to find that a query over one year's worth of web site page impressions (roughly 1 billion records) on a 3-node AWS EMR cluster took about 3x as long as the same query on SQL Server 2005. The problem is that the type of query, the type of data and the volume of data didn't present a Hadoop-shaped problem.

    The beauty of your setup is that you will get to play with a genuine Hadoop cluster and learn the tricks and pitfalls.

    One thing you learn quite early on is that having a process to merge smaller files up into bigger ones improves performance dramatically.
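
    To illustrate that merge step, here is a minimal sketch (not the actual process referred to above) that uses Hadoop's FileSystem API to concatenate the small files under one HDFS directory into a single larger file. The class name and the example paths in the comments are hypothetical.

        import java.io.InputStream;
        import java.io.OutputStream;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class SmallFileMerger {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path inputDir = new Path(args[0]);   // e.g. /data/impressions/incoming (hypothetical)
                Path mergedFile = new Path(args[1]); // e.g. /data/impressions/merged/2014-04.dat (hypothetical)

                // Stream every small file's contents into one larger output file.
                try (OutputStream out = fs.create(mergedFile, true)) {
                    for (FileStatus status : fs.listStatus(inputDir)) {
                        if (status.isFile()) {
                            try (InputStream in = fs.open(status.getPath())) {
                                IOUtils.copyBytes(in, out, conf, false);
                            }
                        }
                    }
                }
            }
        }

    For a one-off merge down to the local file system, hadoop fs -getmerge does much the same job. Either way, the point is that each mapper gets a decent chunk of data to work through rather than being spun up for one tiny file.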