
A Beginner's Look at Hadoop
Posted Friday, June 7, 2013 1:43 AM
stephen.lloyd 63174 (6/6/2013)
I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

Thoughts?



One thing that has bitten me very hard on the bum in the data warehouse arena is not having a long-term source of non-transformed data.
We can provide high availability by having multiple copies of the data.
We can provide disaster recovery by having a robust and tested backup strategy.

What happens if you find out that a transformation in the data warehouse load missed something crucial? It could be an aggregation from a union query, a data quality dedupe or anything. The resulting data looks legitimate and doesn't cause errors, but it doesn't reconcile.
If you don't have access to the original source data you are stuck with a fix going forward and having to ignore the poor data and anything related to it.

I think that, if you are careful, you can use Hadoop as a glorified BCP repository (loaded through Sqoop) for your source data, acting as a staging area. Hadoop was built to scale, but you are going to need quite a bit of hardware to get the balance between performance and scale right.

I am curious about Impala, which seems to offer MPP-like capability on top of Hadoop. One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files. Mechanically the wheels go around, but they don't do so with any urgency.

With that in mind, I'm not sure how Impala/Hadoop would handle slowly changing dimensions.


Post #1460986
Posted Friday, June 7, 2013 2:28 AM


One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files.


Hi,

Even if you have a lot of small files, you can still obtain good performance from your Hadoop system.

Background: a characteristic of Hadoop is that computing performance degrades significantly when data is stored in many small files in HDFS. That occurs because the MapReduce job launches multiple tasks, one for each separate file, and every task requires some overhead (execution planning and coordination).

To overcome this drawback you only need to consolidate the small files. There are several options for accomplishing this; the first attempt could be the low-level HDFS API, which is in my opinion the hardest way, but some Java developers may feel comfortable with it. Another option is the Pail library, which is basically a Java library that handles the low-level filesystem interaction for you.
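If it helps, here is a minimal sketch of the low-level HDFS API approach (my own illustration, not anything from the article): it concatenates every file found in a hypothetical staging directory into one large target file. It assumes the inputs are newline-delimited text that can simply be appended to one another; for binary or keyed data a SequenceFile is the more usual target.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileConsolidator {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: many small input files, one consolidated output file.
        Path sourceDir = new Path("/staging/incoming");
        Path target = new Path("/staging/consolidated/part-00000");

        try (FSDataOutputStream out = fs.create(target)) {
            for (FileStatus status : fs.listStatus(sourceDir)) {
                if (status.isFile()) {
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        // Stream each small file into the single large output file.
                        IOUtils.copyBytes(in, out, conf, false);
                    }
                }
            }
        }
    }
}

A single consolidated file like this gets processed by far fewer map tasks, which is the whole point of the exercise.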

Kind Regards,


Paul Hernández
Post #1460999
Posted Friday, June 7, 2013 10:55 AM
Evil Kraig F (6/6/2013)
David,

Out of curiosity, did you also evaluate Mongo, and are you planning one of these for that as well?


Funny you should mention MongoDB. We are a Microsoft shop, but we are currently evaluating MongoDB as a possible solution for storing blobs. Our production system generates a lot of blobs (PDF, HTML and images); they are currently stored as varbinary(max) or on the file system. We are looking for a better solution, and one of the products we are looking into is MongoDB.
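As an aside, a hedged sketch of what storing one of those blobs through GridFS with the modern MongoDB Java driver might look like (every name below, from the connection string to the file name, is a placeholder rather than anything from our system). GridFS chunks the file across two collections, which is what lets it hold files larger than MongoDB's 16 MB document limit.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import org.bson.types.ObjectId;

public class BlobStoreSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder connection string and database name.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("documents");
            GridFSBucket bucket = GridFSBuckets.create(db, "blobs");

            // Upload a (placeholder) PDF; GridFS splits it into chunks behind the scenes.
            try (InputStream pdf = new FileInputStream("invoice-0001.pdf")) {
                ObjectId id = bucket.uploadFromStream("invoice-0001.pdf", pdf);
                System.out.println("Stored blob with id " + id);
            }
        }
    }
}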
Post #1461164
Posted Monday, June 10, 2013 10:00 AM
Nice work, but a couple of things here: being a SQL Server guy you should not focus on Cloudera, because Cloudera is blessed by Oracle and Hortonworks is blessed by Microsoft. The two ecosystems have some minor differences; Hortonworks works with Windows Server, so you do not need to learn Linux or Java. By learning Linux and going through the traditional big data approach you are trying to be a master of both worlds, but my advice is that, as a SQL Server person, you will be fine if you try to master the Hortonworks big data offering.
Post #1461643
Posted Tuesday, June 11, 2013 7:19 AM
Let me start by saying "Nice article!". Your descriptions were clear and concise and you did not assume knowledge on the part of the reader.
One statement may be a bit too generalized, however:
Up to 1 tonne the BMW wins hands down in all scenarios. From 1 to 5 tonnes it depends on the load and distance, and from 5 tonnes up to 40 tonnes the truck wins every time.


If the 1-ton truck is SQL Server, what version of SQL Server? I would suggest that the Parallel Data Warehouse (PDW) is also a 40-ton truck, with high compression (supposedly up to 15x, but I have seen 25x on purely numeric data) and unbelievable speed (far faster than Hadoop in accessing structured data). In addition, it can query Hadoop clusters with standard SQL syntax through a new technology called PolyBase.
I am not saying PDW replaces Hadoop; Hadoop is designed for unstructured data and PDW for structured. The two work together. But even Hadoop users aggregate Hadoop data into a structured RDBMS for analysis.

So perhaps the analogy would be better explained thus: SQL Server is a brand of trucks with capacities from 1 to 40 tons. These trucks are designed to carry optimal payloads in an ordered way. Hadoop is a train with infinite capacity but not infinite performance. It must stay on its tracks to be effective.

Well, perhaps someone more prosaic can address the analogy, as I think this one is not much better than yours, but it is closer to the mark.



It is a privilege to see so much confusion. -- Marianne Moore, The Steeplejack
Post #1462101
Posted Friday, June 21, 2013 6:51 AM
Very nice article, and my thoughts are confirmed by it. Thank you.
Post #1466161
Posted Thursday, November 28, 2013 6:51 AM
Good article for getting to know what Hadoop is.
Thanks to David Poole.
Post #1518347
Posted Thursday, January 1, 2015 12:04 PM
I wrote this article just over 18 months ago, and in that time the entire Hadoop ecosystem has evolved considerably. The whole Big Data space is moving so fast that, if you were to buy a book and spend an hour per day working through the exercises, the book would be obsolete by the time you reached the end. In my opinion textbooks in this arena are only useful for teaching concepts. They are not reference books, as they would be if written for SQL Server.

YARN = Yet Another Resource Negotiator. It allows YARN-compatible Hadoop plugins to be assigned proportions of the available resources, which is good for allowing multiple and mixed workloads to run at once. It is so much more than MapReduce 2.

Then there is Slider. Slider is a container technology that sits on top of YARN. This means that even if you have a Hadoop app that is not compatible with YARN, if it runs in a Slider container it will still be resource-managed by YARN.

A huge amount of work has gone into Tez and the Stinger initiative, which boost the speed of Hive queries dramatically; claims are of a 100x improvement over the original Hive.

Vendors such as IBM and Pivotal have decided that HDFS is not robust enough for the enterprise, so they have replaced it with their own technology.

Hadoop 1.0 had the NameNode as a weak point. Hadoop 2.0 is much more tolerant of NameNode failure.

Apache projects such as Knox and Ranger address a number of security concerns.

The Apache Spark computational framework is an alternative to MapReduce and can use HDFS. Spark offers a layer of abstraction to make it much easier to write distributed compute code. A number of applications are being ported to make use of Apache Spark.
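To give a feel for that layer of abstraction, here is a minimal word-count sketch in Spark's Java API (assuming Spark 2.x and Java 8; the HDFS paths are hypothetical). What needs a mapper class, a reducer class and a driver in classic MapReduce collapses into a few chained transformations.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("hdfs:///data/input/*.txt")                          // read text files from HDFS
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split each line into words
              .mapToPair(word -> new Tuple2<>(word, 1))                      // pair each word with a count of 1
              .reduceByKey(Integer::sum)                                     // add up the counts per word
              .saveAsTextFile("hdfs:///data/output/wordcounts");             // write the results back to HDFS
        }
    }
}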

The main vendors are now no longer claiming that Hadoop replaces the traditional data warehouse. They are positioning it as a complementary technology: the genius of AND versus the tyranny of OR. Teradata have embraced Aster Data and Hadoop in a very impressive way, so that each of the three capitalises on the strengths of the others.

All in all, Hadoop has gone through the hype cycle and people are now gaining realistic expectations of what the technology can give them.


Post #1647746
Posted Thursday, January 1, 2015 10:15 PM
Great article! Thank you.


Post #1647770
Posted Saturday, January 10, 2015 1:11 PM
Since the last time I read this article I have started working in a big data environment with AWS, Hadoop (EMR) and Vertica.

As you mentioned, Hadoop is a complementary technology. We found it can be used as a great ETL tool to aggregate 50 columns across hundreds of millions of records and then send the result to Tabular. Doing the same in SQL Server was not practical.

We found that, for our purposes, Hadoop is not a great data store, and while some tools like Tableau can connect to HDFS and run MapReduce queries against it, the performance is not that great (at least in our case).

Also, AWS and MapR do not use HDFS.

I also started working with Vertica and was amazed: it takes many minutes in SQL Server to move half a billion records from one table to another (via BCP out, BCP in), whereas it takes seconds to do it in Vertica. I did not believe it had finished, so I had to do a count to verify that all the records really had been transferred!

Vertica does not have many features, but its LOAD and SELECT commands are blazing fast.
Post #1650065