Getting ready for Hadoop

  • Comments posted to this topic are about the item Getting ready for Hadoop

  • Excellent write-up, not just for Hadoop but for many other newer technologies.

    The more you are prepared, the less you need it.

  • Wow, this is a fantastic article, suitable for framing in the management conference room.

  • I second that. Fantastic read. Great insight from someone who has obviously spent a lot of time working with the technology. Thank you for sharing.

  • I third that emotion. Thanks for sharing.

  • Though it's a well-written article, I can't see myself ever using this.

  • Great article Dave, thanks!

    I've also done some research on Hadoop recently. I noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source of information comparing the two technologies. Have you analysed it?

    Thanks,

    Maciej

  • Here's a link talking about Spark.

    http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

    I'm no expert on this, but from the article, Spark is an alternative to Hadoop MapReduce, not really a replacement for the whole Hadoop framework.

  • In reply to maciej.skrzos (11/24/2015):

    My team have a proof-of-concept using it. I read Nathan Marz's book on Big Data, and a lot of the criticisms of his Lambda architecture are answered by Spark.

    The attraction from a development point of view is that it doesn't matter whether you want to do batch processing, streaming or graph analysis: the Spark framework lets you use common methods and properties to do radically different things.

    Prior to Spark you would have been faced with implementing very similar functionality in radically different technologies to cater for a mixed stream/batch workload. That was hell to debug and required very different skillsets.

    Spark will work on any distributed data store. Its beauty is that it's a framework that can be applied to so many fields and endeavours. The DataStax Cassandra stack combines Spark with Cassandra's ability to duplicate data across "data centres" in order to provide analytics from Cassandra. A "data centre" in Cassandra can be a physical thing or a set of IP addresses that you wish to treat as a data centre. This gets around Cassandra's usual limitation that you shouldn't scan data.

    Spark is written in Scala, which presents a conundrum. Do you go with Scala as your main Big Data language, enjoying its elegance but struggling to recruit suitably skilled developers, or do you go with Java, where you can recruit more easily but the code is much more verbose?

    A couple of colleagues are experimenting with Spark for machine learning. The next "Holy War" is going to be a Clojure, Scala, Java punch-up.

    Don't get me started on cascading frameworks. In short, there has never been a more interesting time to be in the data business.
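
    To make the batch/streaming point above concrete, here is a minimal sketch in plain Python. It is not Spark's API: the names count_words, run_batch and run_stream are hypothetical, and the "stream" is just a sequence of small batches. It only illustrates the idea of writing the logic once and reusing it for both kinds of workload.

```python
from collections import Counter
from typing import Iterable, Iterator


def count_words(lines: Iterable[str]) -> Counter:
    """Shared logic: count words in any iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: Iterable[str]) -> Counter:
    """Batch mode: process the whole data set in one pass."""
    return count_words(lines)


def run_stream(micro_batches: Iterable[Iterable[str]]) -> Iterator[Counter]:
    """Streaming mode (hypothetical): apply the same logic to each small
    batch and yield a snapshot of the running totals after each one."""
    running: Counter = Counter()
    for batch in micro_batches:
        running += count_words(batch)
        # Yield a copy so later updates don't change earlier snapshots.
        yield Counter(running)


if __name__ == "__main__":
    data = ["spark and hadoop", "spark streaming", "hadoop mapreduce"]
    print(run_batch(data))
    for snapshot in run_stream([data[:2], data[2:]]):
        print(snapshot)
```

    Once the stream has seen all the data, the last snapshot matches the batch result, which is the property that saves you from maintaining two separate code bases.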

  • In reply to David.Poole (11/24/2015):

    PySpark not an option for you as a middle ground? I know it's not as fast as Scala Spark, but there is a lot more you can do with Python outside of Spark that is going to help with other data needs too.

    At least, that's where I'm heading in the ecosystem. There are plenty of Python people around and plenty of uses for Python with all types of data, with and without Spark, such as the other APIs for MapReduce with Python, APIs in general with Python and so forth.

  • Great article.

    I would say that the things to highlight most in the cloud are bandwidth (in, out, regional, etc.), latency, managed (S3) vs. unmanaged (EMR) services, and making sure the ease of use doesn't lead you to sink yourself in cloud technology.

  • In reply to David.Poole (11/24/2015):

    Thanks for the answer. I agree with you that the timing couldn't be better. It seems we are in the "early adopters" phase for those technologies.

  • Excellent article.

    qh

    Who looks outside, dreams; who looks inside, awakes. – Carl Jung.
