• David.Poole (11/24/2015)


    maciej.skrzos (11/24/2015)


    Great article Dave, thanks!

    I've also done some research on Hadoop recently. I noticed a new trend among Big Data Engineers - replacing Hadoop with Spark. I can't find a reliable source of information comparing those technologies. Have you analysed it?

    Thanks,

    Maciej

    My team have a proof-of-concept using it. I read Nathan Marz's book on Big Data and a lot of the criticisms of his Lambda architecture are answered by Spark.

    The attraction from a development point of view is that it doesn't matter whether you want to do batch processing, streaming or graph analysis: the Spark framework allows you to use common methods and properties to do radically different things.

    Prior to Spark you would have been faced with implementing very similar functionality in radically different technologies to cater for a mixed stream/batch workload. Hell to debug, and very different technologies and skill sets were required.
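To illustrate the point about one vocabulary covering batch and streaming, here is a plain-Python analogy (no Spark involved, all names invented): the same pipeline of transformations runs unchanged over a finite in-memory batch and over an unbounded stream, much as Spark lets the same map/filter-style operations span RDDs and DStreams.

```python
import itertools

def pipeline(records):
    """One set of transformations, regardless of where records come from."""
    cleaned = (r.strip().lower() for r in records)
    words = (w for line in cleaned for w in line.split())
    return (w for w in words if len(w) > 3)  # keep only longer words

# Batch: a finite, in-memory collection.
batch = ["Spark unifies batch and stream", "one API to learn"]
batch_result = list(pipeline(batch))

# "Stream": an unbounded generator; take the first few results.
def endless():
    while True:
        yield "streaming records arrive forever"

stream_result = list(itertools.islice(pipeline(endless()), 4))
```

The same `pipeline` definition serves both callers; only the source differs, which is the property that makes a mixed batch/stream workload so much easier to reason about.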

    Spark will work on any distributed data store. Its beauty is that it's a framework that can be applied to so many fields and endeavours. The DataStax Cassandra stack uses Cassandra's ability to duplicate data across "data centres" and Spark in order to provide analytics from Cassandra. A "data centre" in Cassandra can be a physical thing or a set of IP addresses that you wish to consider a data centre. This gets around Cassandra's usual limitation that you shouldn't scan data.
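As a sketch of that pattern (keyspace and data centre names are invented for illustration), the replication is declared per logical data centre in the keyspace definition, so a separate "analytics" DC can absorb Spark's scans without touching the operational one:

```sql
-- Hypothetical keyspace: replicate to a logical "analytics" data centre
-- where Spark workers run, keeping scans off the operational "live" DC.
CREATE KEYSPACE sensor_data
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'live': 3,        -- serves the application
    'analytics': 2    -- reserved for Spark's analytical reads
  };
```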

    Spark is written in Scala, which presents a conundrum. Do you go with Scala as your main Big Data language, enjoy its elegance but struggle with recruiting suitably skilled developers, or do you go with Java, where you can recruit more easily but it's much more verbose?

    A couple of colleagues are experimenting with Spark for machine learning. The next "Holy War" is going to be a Clojure, Scala, Java punch-up.

    Don't get me started on cascading frameworks. In short, it has never been a more interesting time to be in the data business.

    PySpark not an option for you as a middle ground? I know it's not as fast as Scala Spark, but there is a lot more you can do with Python outside of Spark that is going to help with other data needs too.

    At least, that's where I'm heading in the ecosystem. Plenty of Python people around, and plenty of Python uses with all types of data, with Spark and without it - such as the other APIs for MapReduce with Python, Python APIs in general, and so forth.