Great article Dave, thanks!
I've also done some research on Hadoop recently. I've noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source comparing the two technologies. Have you analysed them?
My team have a proof-of-concept using it. I read Nathan Marz's book on Big Data, and a lot of the criticisms of his Lambda Architecture are answered by Spark.
The attraction from a development point of view is that it doesn't matter whether you want to do batch processing, streaming or graph analysis: the Spark framework lets you use common methods and properties to do radically different things.
Prior to Spark you would have been faced with implementing very similar functionality in radically different technologies to cater for a mixed stream/batch workload. That was hell to debug, and it demanded very different technologies and skillsets.
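To make that concrete, here's a minimal word-count sketch (the object name and local-mode setup are my own, purely for illustration). Spark Streaming's DStream exposes the same flatMap/map/reduceByKey methods as the RDD below, so the same pipeline body serves a batch job or a streaming one; only the input and output endpoints change.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Batch word count as plain RDD transformations. A DStream version
  // would reuse exactly these flatMap/map/reduceByKey calls on a
  // streaming source instead of a parallelized collection.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("word-count-sketch")
    val sc = new SparkContext(conf)
    try {
      sc.parallelize(lines)
        .flatMap(_.split(" "))     // split each line into words
        .map(word => (word, 1))    // pair each word with a count of 1
        .reduceByKey(_ + _)        // sum the counts per word
        .collect()
        .toMap
    } finally sc.stop()
  }
}
```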
Spark will work on any distributed data store. Its beauty is that it's a framework that can be applied to so many fields and endeavours. The DataStax stack combines Cassandra's ability to replicate data across "data centres" with Spark in order to provide analytics over Cassandra data. A "data centre" in Cassandra can be a physical thing or just a set of IP addresses that you choose to treat as one. This gets around Cassandra's usual limitation that you shouldn't run scans over your data.
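In case it helps, the replication trick looks roughly like this in CQL; the keyspace, data-centre names and replication factors here are all invented for illustration. The operational data centre keeps serving the live workload while the analytics one holds its own replicas for Spark to scan.

```sql
-- Hypothetical keyspace replicated across an operational and an
-- analytics "data centre" via NetworkTopologyStrategy.
CREATE KEYSPACE events
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'Operational_DC': 3,   -- serves the live read/write workload
    'Analytics_DC': 2      -- dedicated replicas for Spark scans
  };
```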
Spark is written in Scala, which presents a conundrum. Do you go with Scala as your main Big Data language, enjoy its elegance but struggle to recruit suitably skilled developers, or do you go with Java, where you can recruit more easily but it's much more verbose?
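To show what I mean by verbose, here's a toy comparison of my own (not from the article): the same trivial transformation written in Scala, with the equivalent pre-Java-8 anonymous-class version shown in a comment.

```scala
// Doubling a list of numbers. In Scala it's a one-line map; the
// pre-lambda Java equivalent against a JavaRDD would look like:
//
//   JavaRDD<Integer> doubled = rdd.map(new Function<Integer, Integer>() {
//       public Integer call(Integer x) { return x * 2; }
//   });
object VerbositySketch {
  def double(xs: List[Int]): List[Int] = xs.map(_ * 2)
}
```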
A couple of colleagues are experimenting with Spark for machine learning. The next "Holy War" is going to be a Clojure, Scala, Java punch-up.
Don't get me started on cascading frameworks. In short, it has never been a more interesting time to be in the data business.