
A Beginner's Look at Hadoop
Posted Friday, June 7, 2013 1:43 AM
stephen.lloyd 63174 (6/6/2013)
I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

Thoughts?



One thing that has bitten me very hard on the bum in the data warehouse arena is not having a long-term source of non-transformed data.
We can provide high availability by having multiple copies of the data
We can provide disaster recovery by having a robust and tested backup strategy

What happens if you find out that a transformation in the data warehouse load missed something crucial? It could be an aggregation from a union query, a data-quality dedupe, or anything else. The resulting data looks legitimate and doesn't cause errors, but it doesn't reconcile.
If you don't have access to the original source data, you are stuck with a fix going forward and having to ignore the bad data and anything related to it.

I think, if you are careful, you can use Hadoop as a glorified BCP-style repository of your source data (populated through Sqoop), acting as a staging area. Hadoop was built to scale, but you are going to need quite a bit of hardware to strike the right balance between performance and scale.
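
To make the staging idea concrete, here is a minimal sketch of driving a Sqoop import from Java via Sqoop.runTool, assuming Sqoop 1.x is on the classpath. The server, database, table, credentials and target directory are all hypothetical, purely for illustration.

import org.apache.sqoop.Sqoop;

public class StageToHdfs {
    public static void main(String[] args) {
        // Land a raw, untransformed copy of a SQL Server table in HDFS.
        // All names below (server, database, table, paths) are hypothetical.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:sqlserver://dwsource;databaseName=Sales",
            "--username", "etl_user",
            "--password", "secret",
            "--table", "FactOrders",
            "--target-dir", "/staging/FactOrders", // long-term non-transformed copy
            "--num-mappers", "4"                   // parallel extract streams
        };
        int exitCode = Sqoop.runTool(sqoopArgs);   // 0 indicates success
        System.exit(exitCode);
    }
}

The point is that the landed files are exactly what the source system gave you, so a bad transformation downstream can always be replayed against them.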

I am curious about Impala, which seems to offer MPP-like capability on top of Hadoop. One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files. Mechanically the wheels go around, but they don't do so with any urgency.

With that in mind, I'm not sure how Impala/Hadoop would handle slowly changing dimensions.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1460986
Posted Friday, June 7, 2013 2:28 AM
One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files.


Hi,

Even if you have a lot of small files, you can still obtain good performance from your Hadoop system.

Background: a characteristic of Hadoop is that computing performance degrades significantly when data is stored in many small files in HDFS. This happens because a MapReduce job launches multiple tasks, one for each separate file, and every task carries some overhead (execution planning and coordination).

To overcome this drawback, you only need to consolidate the small files. There are several ways to accomplish this. The first could be the low-level HDFS API, which is in my opinion the hardest way, although some Java developers may feel comfortable with it (see the sketch below). Another option is the Pail library, a Java library that handles the low-level filesystem interaction for you.
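
As a rough illustration of the low-level-API route, here is a minimal sketch that packs a directory of small files into a single Hadoop SequenceFile keyed by file name. The paths are hypothetical, and this is just one possible consolidation approach (not how the Pail library does it).

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileConsolidator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/staging/small-files");    // hypothetical source dir
        Path outputFile = new Path("/staging/consolidated.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, outputFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDir()) continue;                // skip subdirectories
                byte[] buffer = new byte[(int) status.getLen()];
                InputStream in = fs.open(status.getPath());
                try {
                    IOUtils.readFully(in, buffer, 0, buffer.length);
                } finally {
                    in.close();
                }
                // One record per original file: name -> raw bytes
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(buffer));
            }
        } finally {
            writer.close();
        }
        // One SequenceFile yields far fewer map tasks than thousands of
        // tiny files, which is the whole point of consolidating.
    }
}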

Kind Regards,


Paul Hernández
http://hernandezpaul.wordpress.com/
https://twitter.com/paul_eng
Post #1460999
Posted Friday, June 7, 2013 10:55 AM
Evil Kraig F (6/6/2013)
David,

Out of curiosity, did you also evaluate Mongo, and are you planning one of these for that as well?


Funny you should mention MongoDB. We are a Microsoft shop, but we are currently evaluating MongoDB as a possible solution for storing blobs. Our production system generates a lot of blobs (PDF, HTML and images); they are currently stored as varbinary(max) or on the file system. We are looking for a better solution, and one of the products we are looking into is MongoDB.
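
For what it's worth, here is a minimal sketch of the blob-storage idea using MongoDB's GridFS with the legacy Java driver. The host, database name, bucket name and file names are assumptions for illustration, not anything from our actual system.

import java.io.FileInputStream;

import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.gridfs.GridFS;
import com.mongodb.gridfs.GridFSDBFile;
import com.mongodb.gridfs.GridFSInputFile;

public class BlobStoreSketch {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("documents");      // hypothetical database name
        GridFS blobs = new GridFS(db, "pdfs");  // bucket for PDF blobs

        // Store: GridFS chunks the stream, so large PDFs avoid the
        // 16 MB single-document limit.
        GridFSInputFile stored = blobs.createFile(new FileInputStream("invoice.pdf"));
        stored.setFilename("invoice.pdf");
        stored.setContentType("application/pdf");
        stored.save();

        // Retrieve by filename and write the blob back to disk.
        GridFSDBFile fetched = blobs.findOne("invoice.pdf");
        fetched.writeTo("invoice-copy.pdf");

        client.close();
    }
}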
Post #1461164
Posted Monday, June 10, 2013 10:00 AM
Nice work, but a couple of things are wrong here. Being a SQL Server guy, you should not focus on Cloudera, because Cloudera is blessed by Oracle while Hortonworks is blessed by Microsoft. The two ecosystems have some minor differences: Hortonworks works with Windows Server, so you have no need to learn Linux or Java. By learning Linux and going through the traditional big data approach, you are trying to be a master of both worlds. My advice is that, as a SQL Server person, you will be fine if you master the Hortonworks big data offering.
Post #1461643
Posted Tuesday, June 11, 2013 7:19 AM
Let me start by saying "Nice article!". Your descriptions were clear and concise and you did not assume knowledge on the part of the reader.
One statement may be a bit too generalized, however:
Up to 1 tonne the BMW wins hands down in all scenarios. From 1 to 5 tonnes it depends on the load and distance, and from 5 tonnes up to 40 tonnes the truck wins every time.


If the 1-tonne truck is SQL Server, what version of SQL Server? I would suggest that the Parallel Data Warehouse (PDW) is also a 40-tonne truck, with high compression (supposedly up to 15x, though I have seen 25x on purely numeric data) and unbelievable speed (far faster than Hadoop at accessing structured data). In addition, it can query Hadoop clusters using standard SQL syntax through a new technology called PolyBase.
I am not saying PDW replaces Hadoop; Hadoop is designed for unstructured data and PDW for structured. The two work together. But even Hadoop users aggregate Hadoop data into a structured RDBMS for analysis.

So perhaps the analogy would be better explained thus: SQL Server is a brand of trucks with capacities from 1 to 40 tonnes. These trucks are designed to carry optimal payloads in an ordered way. Hadoop is a train with infinite capacity but not infinite performance. It must stay on its tracks to be effective.

Well, perhaps someone more eloquent can refine the analogy; I think this one is not much better than yours, but it is closer to the mark.



It is a privilege to see so much confusion. -- Marianne Moore, The Steeplejack
Post #1462101
Posted Friday, June 21, 2013 6:51 AM
Very nice article; my thoughts are confirmed by it. Thank you.
Post #1466161
Posted Thursday, November 28, 2013 6:51 AM
Good article for getting to know what Hadoop is.
Thanks to David Poole.
Post #1518347