The Power of Hadoop

  • Comments posted to this topic are about the item The Power of Hadoop

  • "Where traditional databases hit their limits, Hadoop starts to emerge as a much better fit for solving unique analytics challenges," Lockner says. "Because data can be incorporated from multiple sources with varying types of data structures, Hadoop enables more analysis across multiple data feeds in a single platform -- solving some of the toughest data integration challenges commonly associated with relational data warehouse architecture."

    The article doesn't provide details about what type of data they are working with; it's only described as being more or less unstructured, originating from multiple sources, and being analyzed for security purposes. If I had to guess, they are probably using Hadoop as a staging environment to ingest things like emails or web forum posts and searching for keywords and phrases that would indicate security threats. For example, there are websites where hackers go to trade or sell account numbers, logins, and other personal information.

    It makes sense to do this sort of document archiving and semantic data crunching outside the relational database.
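
    Just to make that guess concrete, here is roughly the sort of Hadoop Streaming mapper I have in mind -- purely hypothetical, since the keyword list, the input layout, and the job wiring are my own assumptions rather than anything from the article:

    #!/usr/bin/env python
    # Hypothetical Hadoop Streaming mapper: flag documents (emails, forum
    # posts) that mention terms of the kind traded on the sites described
    # above. Intended to run as the -mapper script of a streaming job, with
    # a simple counting reducer behind it.
    import sys

    # Assumed watch list -- in practice this would come from the security team.
    SUSPICIOUS_TERMS = {"account number", "cvv", "card dump", "login dump"}

    for line in sys.stdin:
        # Assume each input record is "<doc_id>\t<raw text>".
        doc_id, _, text = line.rstrip("\n").partition("\t")
        lowered = text.lower()
        for term in SUSPICIOUS_TERMS:
            if term in lowered:
                # Emit one hit per (term, document) for the reducer to count.
                print(term + "\t" + doc_id)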

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I was curious about Hadoop, and that led me to Cassandra and DataStax and tutorials and forums--and near the end of my search I still didn't really understand whether this would be useful for me. I saw a lot of big names using it, but I didn't grasp the non-relational database concept. Then I finally read a blog entry from Arin Sarkissian that (although rather crudely written) explained the concept. And I see how and why it works--for blogs and comments, or for social networking sites, or whatever. But it won't work efficiently for my job--providing a way to track huge amounts of store inventory data and generate very quick data mining reports for customers. If I were tracking data that was less structured but more inclined to be enormous in its scope, and I needed to ensure I could keep it all and keep it intact forever, Hadoop and all the things related to it would be, perhaps, ideal.

  • dshaddock (2/28/2012)


    I was curious about Hadoop, and that led me to Cassandra and DataStax and tutorials and forums--and near the end of my search I still didn't really understand whether this would be useful for me. I saw a lot of big names using it, but I didn't grasp the non-relational database concept. Then I finally read a blog entry from Arin Sarkissian that (although rather crudely written) explained the concept. And I see how and why it works--for blogs and comments, or for social networking sites, or whatever. But it won't work efficiently for my job--providing a way to track huge amounts of store inventory data and generate very quick data mining reports for customers. If I were tracking data that was less structured but more inclined to be enormous in its scope, and I needed to ensure I could keep it all and keep it intact forever, Hadoop and all the things related to it would be, perhaps, ideal.

    That's exactly the point, and one that should be brought up to management when they see Hadoop or Cassandra as the latest fad. It works for some places, not for others.

  • I am curious to see more of how Hadoop will play in the MS BI stack with big data. Is it more of a fad, or does it really have staying power?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (2/28/2012)


    I am curious to see more of how Hadoop will play in the MS BI stack with big data. Is it more of a fad, or does it really have staying power?

    There are many applications that have a need to store billions of entity-attribute-value type records, in addition to their more traditional relational data. An example of this would be Microsoft's own Entity Framework product. I think it's in Microsoft's best interest to demonstrate that, for this scenario, the best solution is for a relational database (like SQL Server) to co-exist side by side with a key-value-optimized database solution. Attempts at implementing large-scale key-value data models in SQL Server will result in frustrated customers.

    Essentially it's the same concept as moving OLAP cube data outside SQL Server and into a separate database product like Microsoft Analysis Services, or moving BLOBs into FILESTREAM.
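
    To put a rough shape on that, here is a toy illustration (the names and structures are invented for the example, not anything from Entity Framework or SQL Server) of the same record held as entity-attribute-value rows versus a single key-value entry:

    # Toy illustration: one product record as EAV rows (the shape that ends up
    # in a relational table) versus one key-value entry (the shape a key-value
    # store holds natively). All names here are made up.

    # EAV shape: one narrow row per attribute -- billions of these become
    # painful to index, join, and pivot in a relational engine.
    eav_rows = [
        ("product:1001", "name", "Road Bike"),
        ("product:1001", "colour", "Red"),
        ("product:1001", "wheel_size_in", "26"),
    ]

    # Key-value shape: the entity key maps straight to the whole attribute bag,
    # so a read is a single lookup by key rather than a multi-row pivot.
    kv_store = {
        "product:1001": {"name": "Road Bike", "colour": "Red", "wheel_size_in": "26"},
    }

    def get_attribute(entity_key, attribute):
        # Fetch one attribute the key-value way: one key lookup, no joins.
        return kv_store.get(entity_key, {}).get(attribute)

    print(get_attribute("product:1001", "colour"))  # -> Red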

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • You're right on track, I believe. Many years ago--before Windows and the Office suite--I constructed a roll-your-own database I called Megabase, because I wanted to store absolutely anything I felt like--pictures, code snippets, contact information, etc. It was hard to be all things to all people--even when the 'all people' was just me. I used InfoSelect for a while but was frustrated by its structured approach--although it was great for finding obscure notes. And when I landed on OneNote I found I loved it--but it's based on SQL Server, and I know when I store such a wide variety of junk in it that it has to be fairly bloated and less efficient than it might be if it were based on Hadoop. And perhaps the best architecture is to run something like OneNote with parts of it in SQL Server and parts in Hadoop...

    David Shaddock

  • I've been playing with Hadoop both locally and in AWS, and although I'm a newbie to it, I've already had a few reality checks.

    Firstly, it is still in the early phase of the Gartner Hype Cycle. It has yet to go through the "Trough of Disillusionment", let alone reach the "Slope of Enlightenment" or the "Plateau of Productivity".

    I had it grinding up a few billion records on 4 large AWS nodes and the answer I wanted came back in 14 seconds.

    Out of curiosity I took the same recordset and imported it into a modest SQL Server 2008 R2 instance. The same text crunching took 10 seconds!
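
    For context, the "text crunching" was nothing exotic -- roughly this kind of per-line field extraction (a simplified Python sketch; the combined-log format and the fields pulled out here are an example, not my exact logs):

    # Simplified sketch of the web log extraction: pull a few fields out of
    # each log line. The log format shown is an example, not my actual data.
    import re

    # Apache combined-log style line, e.g.:
    # 1.2.3.4 - - [28/Feb/2012:10:15:32 +0000] "GET /page HTTP/1.1" 200 5120 "-" "Mozilla/5.0"
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
    )

    def extract_fields(line):
        # Return (ip, path, status) for a log line, or None if it doesn't parse.
        match = LOG_PATTERN.match(line)
        if match is None:
            return None
        return match.group("ip"), match.group("path"), match.group("status")

    sample = '1.2.3.4 - - [28/Feb/2012:10:15:32 +0000] "GET /page HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
    print(extract_fields(sample))  # -> ('1.2.3.4', '/page', '200')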

    The conclusions I draw from this are as follows:

    • There is obviously a threshold that has to be reached before Hadoop delivers a clear advantage.
    • That threshold is going to depend on the complexity of what you are trying to do to that data. I was simply extracting parts of a web log.
    • The big advantage of Hadoop is the fact that it runs on commodity kit and has been designed with the expectation that such kit will suffer failures.
    • Hadoop clusters under-utilize their CPU resource; it's the disk I/O isolation they champion. Rainstor have an interesting compression and data-location-awareness technology to boost the performance of Hadoop.
    • Apache subprojects such as Hive and Pig are essential for wider-scale Hadoop adoption.

    Setting up Hadoop & Hive was a baptism of fire, as I was, and still am, a Linux newbie.

    These tools are 0.x releases, so the instructions are of varying levels of completeness and accuracy.

    There are loads of instructions out there, but they vary quite a bit.

    You'll find that IF you have the prerequisites up and running, installing stuff on Linux is no worse than any other code deployment in your organisation.

    If you don't have the prerequisites, you will find yourself tracing through the dependencies or trying to work out what those dependencies might be. It isn't always clear, and the error messages are largely Java error reports: just too long to fit in a scrolling window, and the important bit has already fallen out of the scroll buffer!

    A basic understanding of Linux is an absolute must.

    I was relieved to find that the Linux community is no longer a training ground for the special forces squadrons of the troll army.
