Getting ready for Hadoop

  • David.Poole

    SSC Guru

    Points: 75199

    Comments posted to this topic are about the item Getting ready for Hadoop

  • Andrew..Peterson

    SSCertifiable

    Points: 6657

    Excellent write-up, not just for Hadoop but for many other newer technologies.

    The more you are prepared, the less you need it.

  • GeorgeCopeland

    SSCertifiable

    Points: 6896

    Wow, this is a fantastic article, suitable for framing in the management conference room.

  • qbrt

    SSCrazy

    Points: 2422

    I second that. Fantastic read. Great insight from someone who has obviously spent a lot of time working with the technology. Thank you for sharing.

  • andycao

    Say Hey Kid

    Points: 698

    I third that emotion. Thanks for sharing.

  • akljfhnlaflkj

    SSC Guru

    Points: 76202

    Though it's a well-written article, I can't see myself ever using this.

  • maciej.skrzos

    SSC Rookie

    Points: 40

    Great article Dave, thanks!

    I've also done some research on Hadoop recently. I noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source of information comparing those technologies. Have you analysed it?

    Thanks,

    Maciej

  • qbrt

    SSCrazy

    Points: 2422

    Here's a link talking about Spark.

    http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

    I'm no expert on this, but according to the article, Spark is an alternative to Hadoop MapReduce, not really a replacement for the whole Hadoop framework.

  • David.Poole

    SSC Guru

    Points: 75199

    maciej.skrzos (11/24/2015)


    Great article Dave, thanks!

    I've also done some research on Hadoop recently. I noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source of information comparing those technologies. Have you analysed it?

    Thanks,

    Maciej

    My team have a proof-of-concept using it. I read Nathan Marz's book on Big Data and a lot of the criticisms of his Lambda architecture are answered by Spark.

    The attraction from a development point of view is that it doesn't matter whether you want to do batch processing, streaming or graph analysis: the Spark framework allows you to use common methods and properties to do radically different things.

    Prior to Spark you would have been faced with implementing very similar functionality in radically different technologies to cater for a mixed stream/batch workload. That is hell to debug, and it requires very different skillsets.
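
    To illustrate the point about shared logic, here is a minimal sketch in plain Python. It is not Spark code and uses no Spark API; the function names (count_words, run_batch, run_stream) are hypothetical and exist only for this example. It shows one piece of business logic reused unchanged for a batch run and for a stream processed as a series of small batches, which is loosely the micro-batch idea.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Shared business logic: count words in any iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(dataset: List[str]) -> Counter:
    """Batch mode: process the whole data set in one pass."""
    return count_words(dataset)


def run_stream(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: apply the same logic to each micro-batch as it arrives,
    yielding the running totals after every micro-batch."""
    running: Counter = Counter()
    for batch in micro_batches:
        running.update(count_words(batch))
        yield Counter(running)  # copy, so each snapshot stays stable


if __name__ == "__main__":
    lines = ["spark and hadoop", "spark streaming", "hadoop mapreduce"]
    print(run_batch(lines))
    for snapshot in run_stream([lines[:2], lines[2:]]):
        print(snapshot)
```

    Because count_words is the only place the counting rule lives, a bug fixed there is fixed for both modes, which is the debugging advantage described above.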

    Spark will work on any distributed data store. Its beauty is that it's a framework that can be applied to so many fields and endeavours. The DataStax Cassandra stack combines Spark with Cassandra's ability to replicate data across "data centres" in order to provide analytics from Cassandra. A "data centre" in Cassandra can be a physical thing or a set of IP addresses that you wish to consider a data centre. This gets around Cassandra's usual limitation that you shouldn't scan data.

    Spark is written in Scala, which presents a conundrum. Do you go with Scala as your main Big Data language, enjoying its elegance but struggling to recruit suitably skilled developers, or do you go with Java, where you can recruit more easily but the code is much more verbose?

    A couple of colleagues are experimenting with Spark for machine learning. The next "Holy War" is going to be a Clojure, Scala, Java punch-up.

    Don't get me started on cascading frameworks. In short, there has never been a more interesting time to be in the data business.

  • xsevensinzx

    One Orange Chip

    Points: 25550

    Is PySpark not an option for you as a middle ground? I know it's not as fast as Scala Spark, but there is a lot more you can do with Python outside of Spark that will help with other data needs too.

    At least, that's where I'm heading in the ecosystem. There are plenty of Python people around and plenty of uses for Python with all types of data, with or without Spark, such as the Python APIs for MapReduce and Python APIs in general.

  • xsevensinzx

    One Orange Chip

    Points: 25550

    Great article.

    I would say that the things to highlight most in the cloud are bandwidth (in, out, regional, etc.), latency, managed (S3) vs. unmanaged (EMR) services, and making sure the ease of use of cloud technology doesn't sink you.

  • maciej.skrzos

    SSC Rookie

    Points: 40

    Thanks for the answer. I agree with you that the time couldn't be better. It seems we are in an "early adopters" period for those technologies.

  • quackhandle1975

    SSChampion

    Points: 10963

    Excellent article.

    qh

    Who looks outside, dreams; who looks inside, awakes. – Carl Jung.
