What Happened to Hadoop?

  • Comments posted to this topic are about the item What Happened to Hadoop?

  • My experience with Hadoop was that it was too complex, too slow, and too hyped up.  It didn't take long to work out that it wasn't going to replace anyone's data warehouse.  I doubt any data warehouse person would still be free of doubts about it by the end of a morning with it.

    Another issue was that a lot of technology gets hyped by people who don't really understand the problem that the inventors of that technology were trying to solve, or even whether the inventors themselves had correctly identified the problems being solved as symptoms or causes.

    Even though RDBMSs have been around for decades, I'd be surprised if the majority of non-database people think they are anything other than machines for running SELECT, INSERT, UPDATE, and DELETE statements.

    Hadoop isn't alone in being a technology that solves a specific data problem but was never designed for more general problems.  Hadoop was designed to help crawl the internet for publicly available information.  It was never designed with security, concurrency, or the whole host of other things expected of a data warehouse in mind.

    HDFS was a good concept, but other distributed file systems are available.  These days cloud blob storage such as AWS S3 is available and provides scalability and resilience Hadoop can only dream of.

    The benefit of Hadoop was that it taught us to think about approaches to data processing in different ways.

  • STRING_AGG ?? Where is it? Did a search of sqlservercentral and found nothing.

  • HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened.

    I don't think anyone was saying it would disappear completely, but more people over time came to realize it's just a niche product instead of a general purpose everything solution for "big data".  When it comes down to it though, real-time and eventual consistency are not very compatible models.

  • kentscode wrote:

    STRING_AGG ?? Where is it? Did a search of sqlservercentral and found nothing.

    Is this in the wrong topic? Not sure what you mean?

  • Jeff Moden wrote:

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    Map Reduce is a programming technique. It's a part of the Hadoop implementation, along with the HDFS filesystem and resource management (usually YARN).

    Using Map Reduce doesn't necessarily make things faster, though it does try to parallelize work that's repeatable. You'd have to investigate and test what they want to do.
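
    For anyone who hasn't seen the pattern, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop or EMR code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real cluster would spread the map and reduce steps across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse one key's values into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    # Each line is mapped independently, which is the part a cluster can parallelize.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big hype", "data warehouse"]))
```

    The map step has no dependency between lines, so that is where the parallelism comes from. The shuffle step is where data has to move between workers, which is one reason the pattern is not automatically faster.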

  • Jeff Moden wrote:

    Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    There are actually multiple map reduce capabilities within the Hadoop stack.  The obvious one is "Hadoop MapReduce"; the second (more popular) option is Spark.  From what I can tell, Spark tends to be more popular because of its in-memory processing.

    I will say - using map reduce isn't easy and tends to require that special developer who can THINK map reduce for it to work well.  We've had to disassemble rather horrible implementations of it because someone thought it was the old PC TURBO button that magically made everything go faster.

    We're still in the midst of rolling out our HDP, at a much slower pace than anticipated for many of the reasons Steve brought up.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Jeff Moden wrote:

    Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    No, I don't have practical experience with Hadoop. I read up on it a few years back, but better options arrived, like Azure Databricks and Azure SQL Data Warehouse as a service, which allow one to spin up a cluster in the cloud and throttle resources up and down as needed. If what you're looking for is a solution to staging and bulk loading large .csv files, then look at Azure Databricks.

    https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse

  • Matt Miller (4) wrote:

    Jeff Moden wrote:

    Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    There are actually multiple map reduce capabilities within the Hadoop stack.  The obvious one is "Hadoop MapReduce"; the second (more popular) option is Spark.  From what I can tell, Spark tends to be more popular because of its in-memory processing.

    I will say - using map reduce isn't easy and tends to require that special developer who can THINK map reduce for it to work well.  We've had to disassemble rather horrible implementations of it because someone thought it was the old PC TURBO button that magically made everything go faster.

    We're still in the midst of rolling out our HDP, at a much slower pace than anticipated for many of the reasons Steve brought up.

    I was afraid someone would say that... and glad at the same time.  When the process was explained to me, my question was "Why are you going to spend all of that time learning something new and rewriting ALL the code to make it work with all that new stuff (with no guarantee of improvement, BTW) instead of just fixing the old stuff"?

    --Jeff Moden



  • Eric M Russell wrote:

    Jeff Moden wrote:

    Eric M Russell wrote:

    HDFS is alive and well - just like SQL was never threatened. I'm reminded of those Reese's Cups commercials from the '80s where two people collide - one with a jar of peanut butter and another with a chocolate bar - but they discover the combined result (a Reese's cup) is even better than either option alone.

    https://www.youtube.com/watch?v=DJLDF6qZUX0

    Are you using Hadoop, Eric?  I ask because some folks at work are talking about using "map reduce" (AWS EMR to be specific) and I may have been misled into thinking that is a part of Hadoop.

    In any case, they think that it's going to be the answer to some slow tabular file imports and,  if you are using it, I'd be interested in your quick take on the subject.

    No, I don't have practical experience with Hadoop. I read up on it a few years back, but better options arrived, like Azure Databricks and Azure SQL Data Warehouse as a service, which allow one to spin up a cluster in the cloud and throttle resources up and down as needed. If what you're looking for is a solution to staging and bulk loading large .csv files, then look at Azure Databricks.

    https://docs.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse

    Thanks, Eric, especially for the link.

    Bulk loading large .csv files isn't a problem we have... BULK INSERT flies for such things.  It's all the other junk they think they need to do.  It'll be interesting to see what they end up with.

    --Jeff Moden



  • Jeff Moden wrote:

    <snipped for brevity>

    I was afraid someone would say that... and glad at the same time.  When the process was explained to me, my question was "Why are you going to spend all of that time learning something new and rewriting ALL the code to make it work with all that new stuff (with no guarantee of improvement, BTW) instead of just fixing the old stuff"?

    I might have an answer this time.  As you probably remember, we have a LOT of hierarchical, transactional XML coming in with a dizzying amount of content in it.  Most of that content isn't immediately useful enough for analytics to justify the very labor-intensive process of landing it in proper SQL structures (which would require a substantially larger data modelling effort than has been deemed appropriate); that additional data is, however, vital for those one-off queries to answer "what if" questions.

    As a result, our HDP is used to triage the high-value content into SQL structures for the known reporting, while retaining the rest of the content in a usable format that can be used for some analytics even if performance is less than ideal.

    That said, in an interesting twist of fate, it may not be necessary to hand-code MR code as Hadoop has previously required.  Microsoft has been putting some muscle behind these technologies and is starting to leverage them in the Azure data platform offerings.  In particular, the Azure Data Factory product allows for coding data flows using a guided IDE (try not to cringe, but it looks a bit like SSIS) on the cloud stack, with a lot of the processing being translated/generated into Spark or Hive runtimes behind the scenes.
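
    A rough sketch of the triage idea described above, assuming Python and made-up field names (order_id, customer_id, total). This is not the actual HDP pipeline, only an illustration of pulling a few known high-value fields into a flat row while keeping the whole document for later one-off queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical high-value fields; a real feed would define its own list.
HIGH_VALUE_FIELDS = ("order_id", "customer_id", "total")


def triage(xml_text):
    """Return (row, raw): a flat dict of known fields plus the untouched XML."""
    root = ET.fromstring(xml_text)
    # findtext returns the text of the first matching descendant, or None if absent.
    row = {name: root.findtext(f".//{name}") for name in HIGH_VALUE_FIELDS}
    # The raw document is kept as-is so nothing is lost for "what if" questions.
    return row, xml_text


if __name__ == "__main__":
    sample = (
        "<order><order_id>42</order_id>"
        "<customer><customer_id>7</customer_id></customer>"
        "<total>19.99</total><notes>gift wrap</notes></order>"
    )
    row, raw = triage(sample)
    print(row)
```

    The row would go to the relational tables used for known reporting, and the raw text would go to cheaper storage where it can still be queried, if more slowly.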

     


  • Thanks, Matt.  I appreciate the info.  I took a look a while back at Spark SQL... the syntax of even simple things like DATEADD is quite different, but I was actually able to help someone solve a problem in it just by looking at the syntax reference for it.

    --Jeff Moden



  • HDFS is a file system implemented using a JVM.

    HBase is a distributed database implemented using a JVM sitting on top of HDFS sitting on top of a JVM.

    The stuff that MapR did was to ask "Why don't we just write a distributed file system?" Then "Why don't we rewrite HBase to sit directly over the top of it?"

    I seem to remember they also looked at the resilience of Hadoop and asked why you'd need name nodes when you've already got the resilience in your data nodes.

    The early version of EMR was based on MapR Hadoop. These days you can choose either the HDP or the MapR implementation when you set up AWS EMR.

    I once joked that Hadoop encouraged you to think about new ways of solving data problems. Mainly it encouraged you to think of ways of solving those problems without Hadoop!

    Hadoop is a distributed file system, a resource negotiator, and a compute framework.  Everything else is just a plugin.  Spark does not need Hadoop.  It will work over AWS S3. I used it to process JSON files in an ETL pipeline.

    I am a firm believer that there is a threshold above which these technologies are good, even great, solutions. Below that surprisingly high threshold they just add cost and complexity.

     
