Navigating Hadoop Resources

Question

Hortonworks see their USP as being entirely open-source.
Cloudera see their USP as being the proprietary elements they bring to the open-source stack.
MapR see their USP as the re-engineering of the file system and HBase so it isn't JVM sitting on JVM.

These companies exist because the ecosystem surrounding Hadoop is complicated and fraught with version incompatibilities. A given distribution is a reasonable guarantee that the combination of versions it ships will play well together.

Administering a Hadoop ecosystem is a different thing to developing on a Hadoop ecosystem. You will earn your crust as an administrator.
I think your focus is generally in the right direction, but I would concentrate on Hive and Spark.

Forget the books. The Big Data technologies are evolving so fast that the books become out-of-date before they hit the shelves, and in some cases before the authors have finished writing them.
If you are not already a Java or Scala programmer, I would learn Python first and use it to get familiar with Spark, then decide if you want or need to shift across to Java and/or Scala.

I'm finding that strong SQL skills and an understanding of the principles of good design and data handling are just as relevant, if not more so, in the Big Data world. 70% of a Data Scientist's time is spent getting data into a form where they can do something with it that adds value. That 70% seems to be mainly selecting, joining, filtering, string processing and aggregation. That 70% also plays to the traditional strengths of a database developer.
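To make the "selecting, joining, filtering, string processing, aggregation" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names and sample rows are invented purely for illustration, and SQLite only stands in for whatever SQL engine (Hive, Spark SQL, Impala) you would actually use; the dialects differ in detail.

```python
import sqlite3

# In-memory database with two hypothetical tables (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, ' alice ', 'CA'), (2, 'BOB', 'US'), (3, 'carol', 'US');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 2, 35.0), (3, 2, 15.0), (4, 3, 5.0);
""")

query = """
SELECT upper(trim(c.name)) AS customer,   -- string processing
       sum(o.amount)       AS total       -- aggregation
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id  -- joining
WHERE c.country = 'US'                    -- filtering
GROUP BY c.id, c.name
ORDER BY total DESC
"""

rows = conn.execute(query).fetchall()
print(rows)
```

The same shape of query is what a database developer writes every day, which is the point being made: the skills transfer even when the engine underneath changes.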

Daniel Klionsky · Mr or Mrs. 500 · Points: 555 · Answer 1

chudman (10/1/2015)
What is the need for web log files, and therefore Flume?
Thanks
Jeff
StLouisMO

All your visits to a website's pages are logged in the weblog files that reside on the web servers.

These weblogs record every visit, and for frequently visited sites they tend to be huge.

By sifting through them we can learn about visitors' behavior, for example, and see the trends.

For sites like Twitter or Uber, the weblogs grow very quickly; the data is constantly 'streaming' in.

That's where Flume comes into play. It handles streaming data and loads it into HDFS (the Hadoop Distributed File System).
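As a rough illustration of what "sifting through" a weblog means, here is a small Python sketch that parses a few access-log lines and counts successful hits per page. It assumes the Common Log Format; the sample lines, the `LOG_PATTERN` regex and the `count_page_hits` helper are made up for this example. This is not Flume itself, only the kind of analysis you might run once Flume has landed the logs in HDFS.

```python
import re
from collections import Counter

# Regex for the Common Log Format (an assumption about how the server logs).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical sample lines; real logs would be read from files.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Oct/2015:10:09:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.9 - - [01/Oct/2015:10:09:02 +0000] "GET /articles/hadoop HTTP/1.1" 200 20480',
    '203.0.113.5 - - [01/Oct/2015:10:09:07 +0000] "GET /articles/hadoop HTTP/1.1" 200 20480',
    'garbage line that does not parse',
]

def count_page_hits(lines):
    """Count successful (HTTP 200) requests per path, skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    return hits

page_hits = count_page_hits(SAMPLE_LINES)
print(page_hits.most_common())
```

On a busy site the same counting would be done in parallel over many files, which is where Hadoop comes in.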

Daniel

ebooklub · Hall of Fame · Points: 3905 · Answer 2

Hi Daniel, thanks a lot for the article on the DBA career path.

A few months ago our company (an investment bank) came out with a plan to use Hadoop.

One part is clear: they want to use the Hortonworks distribution.

But the responsibilities are not defined. What exactly should the DBA team do?

We are a team of Oracle and MSSQL/Sybase DBAs.

Currently, if we have disk space issues or a Win/OS performance problem, we bring in or delegate the task to the system administrators.

Who is going to manage Hadoop in our company? My guess is that it will stay the same way:

1. DBAs will add/remove nodes to the cluster and monitor alerts, failed jobs and invalid code

2. Sys admins will take care of OS security, performance and disk space

When it is going to happen is a different story.

Your article raises interesting questions for anyone looking for their next job/position.

1. Would a company hire me/you as a Hadoop DBA if we have 15+ years of experience as SQL DBAs (and expect to be paid big $ for our knowledge), but our Hadoop experience is limited to “home” projects involving clusters of 3-6 nodes and a working knowledge of HDFS, Flume, Sqoop and Hive?

2. I did a “Hadoop” job search for Eastern Canada and the US, and most of the Hadoop-related jobs are for Data Scientists with knowledge of Hadoop.

So the question is: how do we position ourselves on the job market so that potential employers will see us?

Are we SQL DBAs with knowledge of Hadoop, Hadoop cluster administrators, or something else?

3. Currently I am looking for volunteering opportunities in Hadoop administration to gain practical troubleshooting experience. Did anyone succeed in finding any?

The few links below helped me better understand the role of a Hadoop DBA, but they date from 2013.

The Hadoop market might have changed.

http://www.pythian.com/blog/hadoop-faq-but-what-about-the-dbas/

https://www.linkedin.com/pulse/hadoop-admin-job-responsibilities-sudhaa-gopinath

Thank you

Alex

Daniel Klionsky · Mr or Mrs. 500 · Points: 555 · Answer 3

ebooklub (10/3/2015)

Hi Alex,

I believe the following activities are still relevant in the Hadoop world:

- working with complex SQL

- data modeling

- performance and tuning

In my opinion, a SQL Server DBA/developer loosely translates into the 'Data Engineer' position in the Hadoop ecosystem (and not into 'Data Scientist').

Here is a link to how Cloudera (a competitor to Hortonworks, very popular in the San Francisco Bay Area) defines "Data Engineer" duties.

http://certification.cloudera.com/CCP-DE.html

In the past 1-2 years, a new generation of SQL engines over Hadoop has become popular, notably Apache Spark.

It is very fast and supports SQL, but in order to use it correctly you also need to know Java (or another supported language such as Python or Scala).

So, in the case of Apache Spark (or Impala, another fast engine), you cannot avoid a steep learning curve!

Cheers,

Daniel

ianstirk · Ten Centuries · Points: 1310 · Answer 4

Hi,

Much Hadoop processing (MapReduce or In-memory) needs low-level programming knowledge. Typically batch MapReduce uses Java, but scripting languages like Pig can also be used. With in-memory processing, Spark is becoming increasingly popular. Spark can be programmed using Scala, Python, or Java (so you’ll need knowledge of object-oriented and functional programming).

Since many more people (BAs etc) are familiar with SQL than low-level programming languages, many Hadoop technologies have developed a SQL-like interface too.

Hadoop’s data can be processed via Hive (Hadoop’s data warehouse), which dynamically creates and runs MapReduce jobs. Impala can make use of Hive’s metastore to perform much faster in-memory processing. Both Hive and Impala use versions of SQL. Spark also has a version of SQL.

Hadoop also has databases e.g. HBase. HBase might be considered a denormalised database, a bit like a massive spreadsheet, potentially having millions of columns and billions of rows, with lots of sparse data.
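To picture the "massive spreadsheet with lots of sparse data" idea, here is a toy Python model in which each row stores only the cells it actually has. This is not the HBase client API; the `put`/`get` helpers, the row keys and the `family:qualifier` column names are invented for illustration only.

```python
from collections import defaultdict

# Toy model: row key -> {"family:qualifier": value}. Absent cells take no space.
table = defaultdict(dict)

def put(row_key, column, value):
    """Store one cell; only cells that are written exist."""
    table[row_key][column] = value

def get(row_key, column, default=None):
    """Read one cell, returning a default for cells that were never written."""
    return table.get(row_key, {}).get(column, default)

put("user#1001", "profile:name", "Alex")
put("user#1001", "visits:2015-10-01", 3)
put("user#1002", "profile:name", "Ian")
put("user#1002", "visits:2015-10-03", 7)

# Compare what is stored with what a dense rows-by-columns grid would need.
all_columns = sorted({col for cells in table.values() for col in cells})
stored_cells = sum(len(cells) for cells in table.values())
dense_cells = len(table) * len(all_columns)
print(f"columns={len(all_columns)} stored={stored_cells} dense={dense_cells}")
```

With millions of possible columns, the gap between stored cells and the dense grid is what makes this layout practical.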

I suspect much Hadoop data will be from relational databases (even if it’s just an archive store). It may be that Hadoop will contain archive data, and RDBMS contain the related transactional data (maybe for the current month or quarter etc). Alternatively, all the data could be stored in Hadoop.

I suspect in the future, the low-level work will be carried out using languages like Java/Scala, but much of the work (80%?) will be via SQL. The software lifecycle is often weeks for low-level programming languages, and minutes/hours for SQL.

I don’t think there is a role for the typical relational DBA in Hadoop, unless you retrain in one of the (many) NoSQL databases, or learn the various Hadoop technologies.

I would suggest a gentler introduction to Hadoop via the book “Big Data Made Easy”, you can see my review of it here: http://www.i-programmer.info/bookreviews/218-data-science/8414-big-data-made-easy-.html (the same website has reviews of 2 of the 5 books given in the article)

Thanks

Ian

Daniel Klionsky · Mr or Mrs. 500 · Points: 555 · Answer 5

I agree with Ian. Sadly, a typical SQL Server DBA will have to learn much more to stay relevant in the Hadoop/Spark business.

Daniel

GAURAV UPADHYAY · Right there with Babe · Points: 759 · Answer 6

Hi Daniel,

A very informative and well-written article indeed. I just wanted to know whether a lack of Java knowledge proved to be an impediment in exploring and learning Hadoop. Additionally, could you please specify the order in which the books were referenced and read?

ianstirk · Ten Centuries · Points: 1310 · Answer 7

Hi Gaurav/Daniel,

sorry to interrupt, but I’ve also gone through this process, so hopefully can add something of value.

I’ve been working with Hadoop for the last year or so. During this time I’ve written 14 Hadoop/big data detailed book reviews (they are really book summaries plus my thoughts). You can see many of them here: http://www.i-programmer.info/bookreviews/218-data-science.html I’m currently in the process of writing an article titled “Road Map to Hadoop and Big Data (from novice to competent)” based on these book reviews and my working knowledge – which you might find useful.

My thoughts on the books given:

Apache Sqoop cookbook – This is about moving data between relational databases and Hadoop. It’s an excellent read, lots of example code. However it is getting relatively old, and doesn’t cover everything.

Hadoop: The Definitive Guide – This covers Hadoop and its major components in some detail. Not really a book for beginners. You can see my review here: http://www.i-programmer.info/bookreviews/218-data-science/8806-hadoop-the-definitive-guide-4th-ed.html

Hadoop Application Architectures – Covers best practices and example architectures. It’s the book to read after “Hadoop: The Definitive Guide”. You can see my review here: http://www.i-programmer.info/bookreviews/218-data-science/8969-hadoop-application-architectures.html

Microsoft SQL Server 2012 with Hadoop – Covers Hadoop, Sqoop and Hive briefly. You’ll see how separate SQL Server and Hadoop are; SQL Server is typically used as a data source.

DevOps – I’ve not read

To get started, I would recommend you read “Big Data Made Easy”. You can see my review here: http://www.i-programmer.info/bookreviews/218-data-science/8414-big-data-made-easy-.html

My thoughts on Java etc

If you intend to become a programmer, then Java is probably important. You might find Spark/Scala becoming even more important as in-memory processing becomes the norm. (I've written book reviews on Spark and Scala)

There should also be jobs for SQL related people (BAs, complex SQL, reporting etc), so you might want to learn Hive, Impala, QlikView etc. (I've written book reviews on Hive and Impala)

Thanks

Ian

(I’ve also written around 30 SQL Server detailed book reviews, available on the same site).

Daniel Klionsky · Mr or Mrs. 500 · Points: 555 · Answer 8

GAURAV UPADHYAY (10/5/2015)

Hi Gaurav,

You don't need to know Java to follow the steps described in my article. I strongly recommend starting with the Cloudera VM installation. Once it is installed, you can run the simple examples provided by Cloudera. The order of the books is less important.

As for things 'in general', I defer to Ian's answer 🙂

Daniel

Paul Hernández · SSCarpal Tunnel · Points: 4964 · Answer 9

Hi, I am in a similar situation right now.

I don't know if someone has already mentioned it, but there are really good ways to get started. I would choose either HDInsight or the Hortonworks sandbox on Microsoft Azure. Then you can interact immediately with some of the most important technologies like Spark, Storm, Hive, HTables and so on.

There are other good non-Microsoft alternatives, like the Amazon, IBM or Google cloud-based solutions.

I think it also depends on what you want to learn: real-time analysis, machine learning, unstructured data processing, etc.

There is also a huge number of free courses and resources; you can visit Big Data University, MongoDB University, edX courses, etc.

I would not spend too much time reading books that are going to be obsolete in a few months.

Kind regards and enjoy this new journey,

Paul Hernández

ebooklub · Hall of Fame · Points: 3905 · Answer 10

From Hadoop theory, testing, improving your skills and working on your own projects to landing a position/contract in Hadoop

Mistakes/Methodology/Suggestions

A little history

1. Sybase

I worked for 2 years as a production SQL DBA at a large company (DBA team of 150+ people).

The SQL and Sybase DBAs are one team, the Oracle DBAs are another.

I had to learn Sybase.

Learning plan: SAP training, study at home, and working through the company's custom Sybase environment

Time spent: 3 months

Number of practical Sybase cases assigned to me in a 6-month period: 5

Conclusion: I would not hire myself as a Sybase DBA, since my practical experience is almost zero and it would take some time to refresh the theory and get up to speed with problem resolution.

2. Cassandra

The company wanted to use Cassandra; a few other DBAs and I were chosen to support it.

Learning plan: Online training, sandboxes at work, home study

Time spent: 3 months

Number of practical Cassandra cases resolved: 0

Conclusion: After 3 months of training I was able to install and support a Cassandra cluster, but without practical experience all the skills got really rusty 3 months later, and I would not qualify as a Cassandra DBA. The reason: lack of practical experience on a day-to-day basis.

3. Hadoop

The company wanted to use the Hortonworks distribution of Hadoop, and the DBAs were expected to support it.

Learning plan: I went through sandbox training tutorials, installations of standard and Hortonworks multi-node clusters, Udemy courses and blogs

Time spent: 5 months

Conclusion: The company is not going to implement the product in the next 6-8 months; number of cases resolved: 0

Was the time invested in learning Sybase and Cassandra completely wasted? Not necessarily,

but without practical experience the result is not enough to apply for a job in a new field.

I like Hadoop. I am learning its internals, Hive, Sqoop, Spark and Elastic MapReduce, but the keyword is “learning”.

You can build different solutions at home:

SQL Server -> Sqoop -> Hive; Flume -> HDFS -> Hive; Text -> Hive and Text -> Spark-Hive ...

But to get real experience and resolve problems you need a 5-10 node cluster and GBs of data; this is where you will start seeing memory/CPU/HDFS bottlenecks.

(1 hour of running an exercise with Elastic MapReduce will cost $5 on a 5-node cluster)

Questions

How do you manage to work 8 hours a day with SQL Server and also build and run a Hadoop environment at home or in the cloud to “simulate” real-time production, so that after 5-6 months of learning/"playing" you can apply, without bluffing, for a Hadoop DBA/Architect/Engineer position?

I am putting myself in the position of a manager who needs a Hadoop specialist :-)

“Good morning Joe/Alex/Ashish, our company has several clusters and we need a person to support them.

Tell me what you know about Hadoop and your practical experience with this technology.”

Paul Hernández · SSCarpal Tunnel · Points: 4964 · Answer 11

Hi ebooklub,

I find it somehow funny that companies (at least here in Germany) are looking for senior big data architects, data engineers and developers with AT LEAST 5 years of experience in different technologies. Some of these technologies are still quite new and have not even been on the market for 2 years.

I think companies want to use advanced analytics and process large amounts of unstructured data, but are also trying to minimize risks and costs. That is understandable, but they will probably find no one or simply won't succeed.

It may be a good opportunity for startups to sell services to other companies that want to outsource their big data projects.

I am also learning by myself and trying to figure out "cheaper" study cases. The key aspect is to generate business cases. You cannot effectively sell study cases.

Btw, my boss always laughs when I talk about these topics.

Kind regards and keep learning,

Paul Hernández

Daniel Klionsky · Mr or Mrs. 500 · Points: 555 · Answer 12

ebooklub (10/28/2015)
“Good morning Joe/Alex/Ashish, our company has several clusters and we need a person to support them.
Tell me what you know about Hadoop and your practical experience with this technology.”

Yes, you are correct. Hiring managers are asking exactly that. They may also ask: "What kind of performance problems did you experience on a 5-7 node Hadoop or Spark cluster?" You will not be able to answer those questions until you are exposed to practical issues in a work-type environment. I believe the best bet is to find a company that uses both SQL Server and Hadoop. Regretfully, there are not many of those at the moment.

David.Poole · SSC Guru · Points: 76361 · Answer 13

May 4, 2018 at 3:39 am

As an oversimplification

Naidu PK · SSCrazy · Points: 2481 · Answer 14

Daniel Klionsky - Thursday, October 1, 2015 10:09 AM
Yes, you are right. There is a variety of SQL frameworks on top of Hadoop (besides Hive). That is the subject of the next article I'm currently working on.

Perfect summary. I will be waiting for your next article.

Thanks,
Naveen.
Every thought is a cause and every condition an effect