Getting ready for Hadoop

David.Poole

SSC Guru

Points: 76155
November 22, 2015 at 1:03 pm

#104259

Comments posted to this topic are about the item Getting ready for Hadoop

Andrew..Peterson SSCertifiable Points: 6838 · Answer 1

Excellent write-up, not just for Hadoop, but for many other newer technologies.

The more you are prepared, the less you need it.

GeorgeCopeland SSCertifiable Points: 6997 · Answer 2

Wow, this is a fantastic article, suitable for framing in the management conference room.

qbrt SSCrazy Points: 2737 · Answer 3

I second that. Fantastic read. Great insight from someone who has obviously spent a lot of time working with the technology. Thank you for sharing.

andycao Say Hey Kid Points: 698 · Answer 4

November 23, 2015 at 12:57 pm

#1841831

I third that emotion. Thanks for sharing.

akljfhnlaflkj SSC Guru Points: 76202 · Answer 5

Though it's a well-written article, I can't see myself ever using this.

maciej.skrzos SSC Rookie Points: 40 · Answer 6

Great article Dave, thanks!

I've also done some research on Hadoop recently. I noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source of information comparing those technologies. Have you analysed it?

Thanks,

Maciej

qbrt SSCrazy Points: 2737 · Answer 7

Here's a link talking about Spark.

http://aptuz.com/blog/is-apache-spark-going-to-replace-hadoop/

I'm no expert on this, but from the article, Spark is an alternative to Hadoop MapReduce, not really a replacement for the whole Hadoop framework.

David.Poole SSC Guru Points: 76155 · Answer 8

maciej.skrzos (11/24/2015)
Great article Dave, thanks!
I've also done some research on Hadoop recently. I noticed a new trend among Big Data engineers: replacing Hadoop with Spark. I can't find a reliable source of information comparing those technologies. Have you analysed it?
Thanks,
Maciej

My team have a proof-of-concept using it. I read Nathan Marz's book on Big Data and a lot of the criticisms of his Lambda architecture are answered by Spark.

The attraction from a development point of view is that it doesn't matter whether you want to do batch processing, streaming or graph analysis: the Spark framework allows you to use common methods and properties to do radically different things.

Prior to Spark you would have been faced with implementing very similar functionality in radically different technologies to cater for a mixed stream/batch workload. That was hell to debug and required very different skillsets.
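As a rough illustration of that point, here is a minimal sketch in plain Python. It is deliberately not the Spark API, and the names `count_words`, `run_batch` and `run_stream` are made up for this example. It only shows the general idea of writing the logic once and reusing it for both a batch run and a stream of micro-batches.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Shared logic: count words in any iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: List[str]) -> Counter:
    """Batch mode: process the whole data set in one pass."""
    return count_words(lines)


def run_stream(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming mode: apply the same logic to each micro-batch and
    yield a snapshot of the running totals after each one."""
    totals: Counter = Counter()
    for batch in micro_batches:
        totals.update(count_words(batch))
        yield Counter(totals)  # copy, so earlier snapshots are not mutated


if __name__ == "__main__":
    # Hypothetical sample data, split into two micro-batches for the stream.
    data = ["spark and hadoop", "spark streaming", "hadoop batch"]
    batch_result = run_batch(data)
    stream_result = list(run_stream([data[:2], data[2:]]))[-1]
    # Both modes should agree once the stream has seen all the data.
    print(batch_result == stream_result)
```

The word-counting function has no idea whether it is being fed a full data set or one slice of a stream, so there is only one piece of logic to debug. That is the property being described, on a toy scale.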

Spark will work on any distributed data store. Its beauty is that it's a framework that can be applied to so many fields and endeavours. The DataStax Cassandra stack combines Spark with Cassandra's ability to duplicate data across "data centres" in order to provide analytics from Cassandra. A "data centre" in Cassandra can be a physical thing or a set of IP addresses that you wish to consider a data centre. This gets around Cassandra's usual limitation that you shouldn't scan data, because the analytic scans can run against a replica data centre rather than the one serving operational traffic.

Spark is written in Scala, which provides a conundrum. Do you go with Scala as your main Big Data language, enjoy its elegance but struggle with recruiting suitably skilled developers, or do you go with Java, where you can recruit more easily but it's much more verbose?

A couple of colleagues are experimenting with Spark for machine learning. The next "Holy War" is going to be a Clojure, Scala, Java punch up.

Don't get me started on cascading frameworks. In short, there has never been a more interesting time to be in the data business.


xsevensinzx One Orange Chip Points: 25560 · Answer 9

David.Poole (11/24/2015)
Spark is written in Scala, which provides a conundrum. Do you go with Scala as your main Big Data language, enjoy its elegance but struggle with recruiting suitably skilled developers, or do you go with Java, where you can recruit more easily but it's much more verbose?

Is PySpark not an option for you as a middle ground? I know it's not as fast as Scala Spark, but there is a lot more you can do with Python outside of Spark that is going to help with other data needs too.

At least, that's where I'm heading in the ecosystem. There are plenty of Python people around, and plenty of uses for Python with all types of data, with or without Spark: the Python APIs for MapReduce, APIs in general, and so forth.

xsevensinzx One Orange Chip Points: 25560 · Answer 10

Great article.

I would say that the things to highlight most in the cloud are bandwidth (in, out, regional, etc.), latency, managed (S3) vs. unmanaged (EMR) services, and making sure the ease of use doesn't lead you to sink yourself in cloud technology.

maciej.skrzos SSC Rookie Points: 40 · Answer 11

David.Poole (11/24/2015)
In short, there has never been a more interesting time to be in the data business.

Thanks for the answer. I agree with you that the timing couldn't be better. It seems that we are in the "early adopters" phase for those technologies.

quackhandle1975 SSChampion Points: 11055 · Answer 12

Excellent article.

qh

Who looks outside, dreams; who looks inside, awakes. – Carl Jung.
