A Beginners Look at Hadoop
Dave Poole
stephen.lloyd 63174 (6/6/2013)
I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

Thoughts?



One thing that has bitten me very hard on the bum in the data warehouse arena is not having a long-term source of non-transformed data.
We can provide high availability by keeping multiple copies of the data.
We can provide disaster recovery with a robust and tested backup strategy.

What happens if you find out that a transformation in the data warehouse load missed something crucial? It could be an aggregation from a union query, a data quality dedupe or anything. The resulting data looks legitimate and doesn't cause errors, but it doesn't reconcile.
If you don't have access to the original source data you are stuck with a fix going forward and having to ignore the poor data and anything related to it.

I think, if you are careful, you can use Hadoop as a glorified BCP repository (loaded via Sqoop) of your source data, acting as a staging area. Hadoop was built to scale, but you are going to need quite a bit of hardware to get the balance between performance and scale right.
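To make that concrete, here is a hedged sketch of the Sqoop-as-glorified-BCP idea: invoking Sqoop 1.4.x programmatically from Java to land a source table in HDFS untransformed. The connection string, credentials, table name and target directory are all hypothetical.

```java
// A minimal sketch only: pull a SQL Server table into HDFS as raw staging
// data. Assumes Sqoop 1.4.x and a JDBC driver on the classpath.
import org.apache.sqoop.Sqoop;

public class StageSourceTable {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:sqlserver://dbserver:1433;databaseName=Sales",
            "--username", "etl_reader",
            "--password-file", "/user/etl/.sqoop_pwd",  // password kept in HDFS
            "--table", "Orders",                        // hypothetical source table
            "--target-dir", "/staging/sales/orders",    // raw landing area
            "--num-mappers", "4"                        // parallel extract streams
        };
        // runTool parses the argument array exactly as the CLI would
        int exitCode = Sqoop.runTool(sqoopArgs);
        System.exit(exitCode);
    }
}
```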

I'm curious about Impala, which seems to offer MPP-like capability on Hadoop. One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files. Mechanically the wheels go around, but they don't do so with any urgency.

With that in mind I'm not sure how Impala/Hadoop would handle slowly changing dimensions.

LinkedIn Profile
www.simple-talk.com
Paul Hernández
One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files.


Hi,

Even if you have a lot of small files, you can still obtain good performance from your Hadoop system.

Background: a characteristic of Hadoop is that computing performance degrades significantly when data is stored as many small files in HDFS. That occurs because the MapReduce job launches multiple tasks, one for each separate file, and every task incurs some overhead (execution planning and coordination).

To overcome this drawback you simply need to consolidate the small files. There are several options for accomplishing this; the first could be the low-level HDFS API, which is in my opinion the hardest way, although some Java devotees may feel comfortable with it (see the sketch below). Another option is the Pail library, which is basically a Java library that handles the low-level filesystem interaction.
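For illustration, here is a minimal sketch of that low-level HDFS API approach: concatenating the small files in a directory into one large file, so a subsequent MapReduce job launches one task instead of hundreds. It assumes the Hadoop 2.x client libraries; the paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;
import java.io.OutputStream;

public class SmallFileConsolidator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/staging/incoming");        // many small files
        Path merged   = new Path("/staging/merged/part-0000"); // one big file

        try (OutputStream out = fs.create(merged)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    try (InputStream in = fs.open(status.getPath())) {
                        // append this file's bytes; 'false' leaves streams open
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }
}
```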

Kind Regards,

Paul Hernández
erin.north
Evil Kraig F (6/6/2013)
David,

Out of curiosity, did you also evaluate Mongo and are you planning one of these for that as well?


Funny you mention MongoDB; we are a Microsoft shop but we are currently evaluating MongoDB as a possible solution for storing blobs. Our production system generates a lot of blobs (PDF, HTML and images), which are currently stored as varbinary(max) or on the file system. We are looking for a better solution, and one of the products we are looking into is MongoDB.
babu.manoharan-1113385
Nice work, but a couple of things are wrong here. Being a SQL Server guy you should not focus on Cloudera, because Cloudera is blessed by Oracle while Hortonworks is blessed by Microsoft. The two ecosystems have some minor differences; Hortonworks works with Windows Server, so you have no need to learn Linux or Java. By learning Linux and going down the traditional big data route you are trying to be a master of both worlds, but my advice is that as a SQL Server person you will be fine if you master the Hortonworks big data offering.
trabun
Let me start by saying "Nice article!". Your descriptions were clear and concise and you did not assume knowledge on the part of the reader.
One statement may be a bit too generalized, however:
Up to 1 tonne the BMW wins hands down in all scenarios. From 1 to 5 tonnes it depends on the load and distance, and from 5 tonnes up to 40 tonnes the truck wins every time.


If the 1-ton truck is SQL Server, what version of SQL Server? I would suggest that the Parallel Data Warehouse (PDW) is also a 40-ton truck, with high compression (supposedly up to 15x, but I have seen 25x on purely numeric data) and unbelievable speed (far faster than Hadoop at accessing structured data). In addition, it can query Hadoop clusters using standard SQL syntax via a new technology called PolyBase.
I am not saying PDW replaces Hadoop; Hadoop is designed for unstructured data and PDW for structured. The two work together. But even Hadoop users aggregate Hadoop data into a structured RDBMS for analysis.

So perhaps the analogy would be better explained thus: SQL Server is a brand of trucks with capacities from 1 to 40 tons. These trucks are designed to carry optimal payloads in an ordered way. Hadoop is a train with infinite capacity but not infinite performance. It must stay on its tracks to be effective.

Well, perhaps someone more prosaic can address the analogy as I think this one is not much better than yours, but it is closer to the mark.


It is a privilege to see so much confusion. -- Marianne Moore, The Steeplejack
hennie7863
Very nice article; it confirms my thoughts. Thank you.
prince.rastogi
A good article for getting to know what Hadoop is.
Thanks to David Poole.
Dave Poole
I wrote this article just over 18 months ago, and in that time the entire Hadoop ecosystem has evolved considerably. The whole big data space is moving so fast that if you were to buy a book and spend one hour per day working through the exercises, the book would be obsolete by the time you reached the end. In my opinion, textbooks in this arena are only useful for teaching concepts. They are not reference books in the way they would be if written for SQL Server.

YARN = Yet Another Resource Negotiator. It allows YARN-compatible Hadoop plugins to be assigned proportions of the available resources, which is good for running multiple, mixed workloads at once. It is so much more than MapReduce 2.

Then there is Slider. Slider is a container technology that sits on top of YARN. This means that even if you have a Hadoop app that is not compatible with YARN, running it in a Slider container lets it be resource-managed by YARN.

A huge amount of work has gone into Tez and the Stinger initiative, which boost the speed of Hive queries dramatically. Claims are of a 100x improvement over the original Hive.

Vendors such as IBM and Pivotal have decided that HDFS is not robust enough for the enterprise so have replaced it with their own tech.

Hadoop 1.0 had the NameNode as a weak point. Hadoop 2.0 is much more tolerant of NameNode failure.

Apache projects such as Knox and Ranger address a number of security concerns.

The Apache Spark computational framework is an alternative to MapReduce and can use HDFS. Spark offers a layer of abstraction to make it much easier to write distributed compute code. A number of applications are being ported to make use of Apache Spark.
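As an illustration of that layer of abstraction, here is the classic word count in Spark's Java API; the equivalent raw MapReduce job needs several classes and far more boilerplate. A minimal sketch, assuming the Spark 2.x Java API; the HDFS paths are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/logs/big.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);   // distributed aggregation

        counts.saveAsTextFile("hdfs:///data/logs/wordcounts");
        sc.stop();
    }
}
```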

The main vendors are now no longer claiming that Hadoop replaces the traditional data warehouse; they are positioning it as a complementary technology. The genius of AND vs the tyranny of OR. Teradata have embraced Aster Data and Hadoop in a very impressive way, so each of the three capitalises on the strengths of the others.

All in all Hadoop has gone through the hype cycle and people are now gaining realistic expectations as to what the technology can give them.

LinkedIn Profile
www.simple-talk.com
Misha_SQL
Great article! Thank you.



erin.north
Since the last time I read this article I have started working in a big data environment with AWS, Hadoop (EMR) and Vertica.

As you mentioned, Hadoop is a complementary technology. We found it can be used as a great ETL tool to aggregate 50 columns across hundreds of millions of records and then send the result to Tabular. Doing the same in SQL Server was not practical.
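For a flavour of what that sort of ETL aggregation looks like as a plain MapReduce job, here is a hedged sketch that sums one numeric column grouped by a key column from comma-separated rows. This is an illustration under stated assumptions, not erin's actual job: the column positions and paths are hypothetical, and a real 50-column aggregate would emit a composite value rather than a single sum.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class AggregateJob {
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split(",");
            // col 0 = grouping key, col 7 = measure (hypothetical layout)
            ctx.write(new Text(cols[0]),
                      new DoubleWritable(Double.parseDouble(cols[7])));
        }
    }

    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable v : vals) sum += v.get();
            ctx.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "aggregate");
        job.setJarByClass(AggregateJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregate on the mappers
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/facts"));
        FileOutputFormat.setOutputPath(job, new Path("/data/aggregated"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```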

We found out that for our purposes Hadoop is not a great data store, and while some tools like Tableau can connect to HDFS and run MapReduce queries against it, the performance is not that great (at least in our case).

Also, AWS and MapR do not use HDFS.

I also started working with Vertica and I was amazed. It takes many minutes in SQL Server to move half a billion records from one table to another (via BCP out, BCP in); it takes seconds to do it in Vertica. I did not believe it had finished, so I had to do a count to verify that all the records were really transferred!

Vertica does not have many features, but its LOAD and SELECT commands are blazing fast.