I've been playing with Hadoop both locally and on AWS and, although I'm a newbie to it, I've had a few reality checks with it.
Firstly, it is still in the early phase of the Gartner Hype Cycle. It has yet to go through the "Trough of Disillusionment", let alone reach the "Slope of Enlightenment" or the "Plateau of Productivity".
I had it grinding up a few billion records on 4 large AWS nodes and the answer I wanted came back in 14 seconds.
Out of curiosity I took the same recordset and imported it into a modest SQL Server 2008 R2 instance. The same text crunching took 10 seconds!
The conclusions I draw from this are as follows:-
- There is obviously a threshold that has to be reached before Hadoop delivers a clear advantage.
- That threshold is going to depend on the complexity of what you are trying to do to that data. I was simply extracting parts of a web log.
- The big advantage of Hadoop is the fact that it runs on commodity kit and has been designed with the expectation that such kit will suffer failures.
- Hadoop clusters underutilize their CPU resources; it's disk I/O isolation that they excel at. RainStor have an interesting compression and data location awareness technology to boost the performance of Hadoop.
- Apache subprojects such as Hive and Pig are essential for wider scale Hadoop adoption.
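To give a feel for the kind of job I mean by "extracting parts of a web log", here is a minimal Python sketch of the map and reduce steps run in a single process. It is not the job I actually ran: the Apache Common Log Format, the sample lines and the function names (`map_line`, `reduce_counts`, `count_paths`) are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumption: lines are in Apache Common Log Format. The original job's
# log format is not stated, so adjust the pattern to your own logs.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def map_line(line):
    """Map step: emit (path, 1) for a parsable line, nothing otherwise."""
    match = LOG_PATTERN.match(line)
    if match:
        yield match.group("path"), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def count_paths(lines):
    """Run map then reduce in one process over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return reduce_counts(pairs)


# Made-up sample data, including one line that does not parse.
SAMPLE = [
    '10.0.0.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2012:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1024',
    '10.0.0.1 - - [10/Oct/2012:13:56:01 +0000] "GET /index.html HTTP/1.1" 304 -',
    'not a log line',
]

if __name__ == "__main__":
    for path, hits in count_paths(SAMPLE).most_common():
        print(path, hits)
```

On a cluster the map and reduce steps would run on different nodes over chunks of the file. For work this simple, the time spent distributing it can easily outweigh the time spent doing it, which is consistent with the threshold point above.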
Setting up Hadoop & Hive was a baptism of fire as I was, and still am, a Linux newbie.
These tools are 0.x releases so the instructions are of varying levels of completeness and accuracy.
There are loads of instructions out there, but they vary quite a bit.
You'll find that IF you have the prerequisites up and running, installing stuff on Linux is no worse than any other code deployment in your organisation.
If you don't have the prerequisites you will find yourself tracing through the dependencies or trying to work out what those dependencies might be. It isn't always clear, and the error messages are largely Java error reports. They are just too long to fit in a scrolling window, so the important bit has already fallen out of the scroll buffer by the time you look for it!
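One way round the scrolling problem is to capture the output to a file and pull out only the lines that usually matter. The sketch below is a rough Python helper of my own invention (`summarize_java_trace` is a hypothetical name, and the sample trace with its `com.example` classes is made up). It assumes the useful lines are the first exception line and every "Caused by:" line, which is a common pattern but not a guarantee.

```python
def summarize_java_trace(text):
    """Return the first exception line and every 'Caused by:' line.

    Assumption: these are the lines worth reading first; the frames
    starting with 'at ' are skipped.
    """
    summary = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Caused by:"):
            summary.append(stripped)
        elif (
            not summary
            and not stripped.startswith("at ")
            and ("Exception" in stripped or "Error" in stripped)
        ):
            # The first line naming an exception or error is the top-level failure.
            summary.append(stripped)
    return summary


# Made-up trace for illustration; the class names are hypothetical.
SAMPLE_TRACE = """\
Exception in thread "main" java.lang.RuntimeException: job failed
\tat com.example.Job.run(Job.java:42)
\tat com.example.Job.main(Job.java:10)
Caused by: java.io.IOException: cannot read input
\tat com.example.Reader.open(Reader.java:7)
\t... 2 more
Caused by: java.lang.ClassNotFoundException: com.example.MissingDependency
\t... 3 more
"""

if __name__ == "__main__":
    for line in summarize_java_trace(SAMPLE_TRACE):
        print(line)
```

The last "Caused by:" line is often the one that names the missing dependency, so it is usually the place to start.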
A basic understanding of Linux is an absolute must.
I was relieved to find that the Linux community is no longer a training ground for the special forces squadrons of the troll army.