A Beginner's Look at Hadoop

  • Greg_Della-Croce (6/6/2013)


    Thanks for clearing that up for me. Since Hadoop is a great scanning tool for when you have data over a TB, are there tools that can reach in and do discovery data mining on the results? I am not familiar with PIG and the other tools you talked about.

    My interest is in taking linguistic works, loading them into a database of some sort, and doing discovery mining on them. The data set is large enough for Hadoop to be a candidate to help, but it is the discovery tools that I am still missing ideas for right now.

    If you are mining text, then Splunk might be a good alternative. I haven't used it myself yet, but I have a colleague who is testing it on corporate network logs that total 500 GB-1 TB per day. Performance is very good. I think there are options for free downloads and testing.

  • David,

    I wanted to thank you for this well-laid-out walkthrough of your experiences with Hadoop. The article is rather informative and shows the difficulties in switching from a familiar environment to a less obvious one.

    Out of curiosity, did you also evaluate Mongo, and are you planning one of these for that as well?


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

    Thoughts?

    (I'm messing around with the Hortonworks VM and it seems to work quite well.)

  • Out of curiosity, did you also evaluate Mongo, and are you planning one of these for that as well?

    I haven't evaluated MongoDB as yet. I am more likely to look at Redis (for session state), Riak (for customer accounts) and Neo4j (as a potential CRM).

    I'm curious about VoltDB, which is one of Michael Stonebraker's children. As he has Ingres, Postgres and Vertica to his name, VoltDB should be worth a look. I'd also like to know what similarities there are between VoltDB and the SQL Server 2014 Hekaton stuff.

    It is quite hard to find the time, energy and resources to do a decent evaluation of such products. How much of my SQL Server knowledge can be leveraged in a comparison with NoSQL, and is it even fair to attempt such a comparison? The most I can hope to do is to state precisely what the experiment involved so that the methods and results are both open to scrutiny.

  • gclausen (6/5/2013)


    Great article!! Do you think it is worth learning a little bit of Java for this?

    I feel it is always useful to have at least one non-SQL development language under your belt. If you already have experience with a language then learning Java on top is a good move.

    If you haven't got experience yet and you are primarily a SQL Server guy, I'd start by learning C#; you will probably use it more. Java and C# have a lot of similarities, which isn't surprising given their history. Once you've learnt C#, applying your knowledge to Java should be relatively straightforward.

  • stephen.lloyd 63174 (6/6/2013)


    I agree that Hadoop should not be used as a replacement for a data warehouse. It does seem to be getting some use as a staging environment for data warehouses. The idea is that it is a good batch processor and can scale well as data grows.

    Thoughts?

    One thing that has bitten me very hard on the bum in the data warehouse arena is not having a long-term source of non-transformed data.

    We can provide high availability by having multiple copies of the data

    We can provide disaster recovery by having a robust and tested backup strategy

    What happens if you find out that a transformation in the data warehouse load missed something crucial? It could be an aggregation from a union query, a data quality dedupe or anything. The resulting data looks legitimate, doesn't cause errors but doesn't reconcile.

    If you don't have access to the original source data you are stuck with a fix going forward and having to ignore the poor data and anything related to it.

    I think, if you are careful, then you can use Hadoop as a glorified BCP (through Sqoop) repository of your source data acting as a staging area; a rough example of such an import is sketched below. Hadoop was built to scale, but you are going to need quite a bit of hardware to get the balance between performance and scale right.
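    As a rough, hypothetical illustration (the server, table and directory names are made up, and you would need the SQL Server JDBC driver on the Sqoop classpath), staging a single table into HDFS looks something like this:

        sqoop import \
          --connect "jdbc:sqlserver://dbserver;databaseName=SalesDB" \
          --username etl_user -P \
          --table SalesOrderDetail \
          --target-dir /staging/SalesOrderDetail \
          --num-mappers 4 \
          --as-textfile

    The import itself runs as a MapReduce job, so the copy scales out across the cluster along with everything else.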

    I am curious about Impala, which seems to offer MPP-like capability to Hadoop. One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files. Mechanically the wheels go around but they don't do so with any urgency.

    With that in mind, I'm not sure how Impala/Hadoop would handle slowly changing dimensions.

  • One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files.

    Hi,

    Even if you have a lot of small files, you can still obtain good performance from your Hadoop system.

    Background: a characteristic of Hadoop is that computing performance is significantly degraded when data is stored in many small files in HDFS. That occurs because the MapReduce job launches multiple tasks, one for each separate file, and every task incurs some overhead (execution planning and coordination).

    To overcome this drawback you only need to consolidate the small files. There are several options to accomplish this; the first could be the low-level HDFS API, which is in my opinion the hardest way, but some Java developers may feel comfortable with it (a rough sketch of that route follows below). Another option is the Pail library, which is basically a Java library that handles the low-level filesystem interaction for you.
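    If anyone is curious what the low-level route looks like in practice, here is a minimal, illustrative Java sketch (the directory and file names are assumptions, not from a real system) that packs many small HDFS files into a single SequenceFile, keyed by the original file names, so that a subsequent MapReduce job launches far fewer tasks:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SmallFilePacker {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path inputDir = new Path("/staging/small-files"); // hypothetical source directory
                Path packed   = new Path("/staging/packed.seq");  // hypothetical consolidated file

                try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(packed),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class))) {

                    // One key/value pair per original file: key = file name, value = raw bytes
                    for (FileStatus status : fs.listStatus(inputDir)) {
                        if (!status.isFile()) {
                            continue;
                        }
                        byte[] contents = new byte[(int) status.getLen()];
                        try (FSDataInputStream in = fs.open(status.getPath())) {
                            IOUtils.readFully(in, contents, 0, contents.length);
                        }
                        writer.append(new Text(status.getPath().getName()),
                                      new BytesWritable(contents));
                    }
                }
            }
        }

    Libraries such as Pail essentially wrap this kind of filesystem plumbing up for you.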

    Kind Regards,

    Paul Hernández

  • Evil Kraig F (6/6/2013)


    David,

    Out of curiosity, did you also evaluate Mongo, and are you planning one of these for that as well?

    Funny you mentioned MongoDB. We are a Microsoft shop, but we are currently evaluating MongoDB as a possible solution for storing blobs. Our production system generates a lot of blobs (PDF, HTML and images); they are currently stored as varbinary(max) or on the file system. We are looking for a better solution, and one of the products we are looking into is MongoDB.

  • Nice work, but a couple of things are off here. Being a SQL Server guy, you should not focus on Cloudera, because Cloudera is blessed by Oracle and Hortonworks is blessed by Microsoft. The two ecosystems have some minor differences; Hortonworks works with Windows Server, so you have no need to learn Linux or Java. By learning Linux and going through the traditional big data approach you are trying to be a master of both worlds. My advice is that, as a SQL Server person, you will be fine if you master the Hortonworks big data offering.

  • Let me start by saying "Nice article!". Your descriptions were clear and concise and you did not assume knowledge on the part of the reader.

    One statement may be a bit too generalized, however:

    Up to 1 tonne the BMW wins hands down in all scenarios. From 1 to 5 tonnes it depends on the load and distance, and from 5 tonnes up to 40 tonnes the truck wins every time.

    If the 1-ton truck is SQL Server, what version of SQL Server? I would suggest that the Parallel Data Warehouse (PDW) is also a 40-ton truck, with high compression (supposedly up to 15x, but I have seen 25x on purely numeric data) and unbelievable speed (far faster than Hadoop at accessing structured data). In addition, it can query Hadoop clusters using standard SQL syntax via a new technology called PolyBase.

    I am not saying PDW replaces Hadoop; Hadoop is designed for unstructured data and PDW for structured. The two work together. But even Hadoop users aggregate Hadoop data into a structured RDBMS for analysis.

    So perhaps the analogy would be better explained thus: SQL Server is a brand of trucks with capacities from 1 to 40 tons. These trucks are designed to carry optimal payloads in an ordered way. Hadoop is a train with infinite capacity but not infinite performance. It must stay on its tracks to be effective.

    Well, perhaps someone more prosaic can address the analogy as I think this one is not much better than yours, but it is closer to the mark.


    It is a privilege to see so much confusion. -- Marianne Moore, The Steeplejack

  • Very nice article, and my thoughts are confirmed by it. Thank you...

  • Good article for getting to know what Hadoop is.

    Thanks to David Poole.

  • I wrote this article just over 18 months ago, and in that time the entire Hadoop ecosystem has evolved considerably. The whole Big Data space is moving so fast that if you were to buy a book and spend one hour per day working through the exercises, the book would be obsolete by the time you reached the end. In my opinion, text books in this arena are only useful for teaching concepts. They are not reference books in the way they would be if written for SQL Server.

    YARN = Yet Another Resource Negotiator. This allows YARN-compatible Hadoop plugins to be assigned proportions of the available resources. It is good for allowing multiple and mixed workloads to run at once. It is so much more than MapReduce 2.

    Then there is Slider. Slider is a container technology that sits on top of YARN. This means that even if you have a Hadoop app that is not compatible with YARN, if it runs in a Slider container it will be resource managed by YARN.

    A huge amount of work has gone into Tez and the Stinger initiative, which boosts the speed of Hive queries dramatically. Claims are for a 100x improvement over the original Hive.

    Vendors such as IBM and Pivotal have decided that HDFS is not robust enough for the enterprise so have replaced it with their own tech.

    Hadoop 1.0 had the Name Node as a weak point. Hadoop 2.0 is much more tolerant of name node failure.

    Apache projects such as Knox and Ranger address a number of security concerns.

    The Apache Spark computational framework is an alternative to MapReduce and can use HDFS. Spark offers a layer of abstraction to make it much easier to write distributed compute code. A number of applications are being ported to make use of Apache Spark.
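    To give a flavour of that abstraction, here is a small, illustrative word count over files in HDFS written against the Spark Java API (recent Spark releases; the paths are made up). The equivalent hand-coded MapReduce job would need separate mapper, reducer and driver classes:

        import java.util.Arrays;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import scala.Tuple2;

        public class WordCount {
            public static void main(String[] args) {
                SparkConf conf = new SparkConf().setAppName("WordCount");
                JavaSparkContext sc = new JavaSparkContext(conf);

                // Read every line of every file under the (hypothetical) input directory
                JavaRDD<String> lines = sc.textFile("hdfs:///data/corpus/*.txt");

                // Split lines into words, count occurrences of each word
                JavaRDD<String> words =
                        lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
                JavaPairRDD<String, Integer> counts =
                        words.mapToPair(word -> new Tuple2<>(word, 1))
                             .reduceByKey((a, b) -> a + b);

                counts.saveAsTextFile("hdfs:///data/word-counts");
                sc.stop();
            }
        }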

    The main vendors are now no longer claiming that Hadoop replaces the traditional data warehouse. They are positioning it as a complementary technology: the genius of AND versus the tyranny of OR. Teradata have embraced Aster Data and Hadoop in a very impressive way, so each of the three capitalises on the strengths of the others.

    All in all Hadoop has gone through the hype cycle and people are now gaining realistic expectations as to what the technology can give them.

  • Great article! Thank you.

  • Since the last time I read this article I have started working in a big data environment with AWS, Hadoop (EMR) and Vertica.

    As you mentioned, Hadoop is a complementary technology; we found it can be used as a great ETL tool to aggregate 50 columns across hundreds of millions of records and then send the result to Tabular. Doing the same in SQL Server was not practical.

    We found out that, for our purposes, Hadoop is not a great data store, and while some tools like Tableau can connect to HDFS and run MapReduce queries against it, the performance is not that great (at least in our case).

    Also, AWS and MapR do not use HDFS.

    I also started working with Vertica and I was amazed: it takes many minutes in SQL Server to move half a billion records from one table to another (via BCP out, BCP in), while it takes seconds to do it in Vertica. I did not believe it had finished, so I had to do a count to really see that all the records had been transferred!

    Vertica does not have many features, but its LOAD and SELECT commands are blazing fast.
