Hadoop and SQL Server

  • Comments posted to this topic are about the item Hadoop and SQL Server

  • I wrote about Hadoop in 2009 when it was a young project, and I suspected it would enhance and work with, rather than supplant, the RDBMS

    I agree with this. Many of the staunch opponents of NoSQL like to present it as an either/or choice against the RDBMS, so I'm glad to see a different take on this!

  • Hadoop is a primitive database management system in terms of indexing, query plans, and the usability of its query language(s). It's a major player only because of its ability to scale both storage and computing across multiple commodity (cheap) nodes. Run Hadoop on a single node, meaning take the (D) out of HDFS, and it's nothing, just a clunky full table scanner with a geeks-only, user-unfriendly set of application interfaces.

    What I predict is that within the next three years we'll see Microsoft implement a federated version of Hekaton and Clustered ColumnStore in the Enterprise Edition of SQL Server. They basically already did that years ago with SQL Server PDW Edition, but if they can put it in the Enterprise Edition so it runs distributed on commodity hardware instead of just a high-end PDW appliance, then Hadoop may end up being a footnote in data warehousing history.

    Can you imagine a ColumnStore table federated across multiple nodes?

    Now that's kick ***!

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (1/14/2015)


    Can you imagine a ColumnStore table federated across multiple nodes?

    Now that's kick ***!

    That would be something, though I'd also like to see the federation able to handle slower storage on some nodes and not time out. It's the same issue in Azure, so I'm sure they're working on having federations that work, but aren't so tightly bound that queries can fail from timeouts.

  • Steve Jones - SSC Editor (1/14/2015)


    Eric M Russell (1/14/2015)


    Can you imagine a ColumnStore table federated across multiple nodes?

    Now that's kick ***!

    That would be something, though I'd also like to see the federation able to handle slower storage on some nodes and not time out. It's the same issue in Azure, so I'm sure they're working on having federations that work, but aren't so tightly bound that queries can fail from timeouts.

    For a high-end MPP appliance like SQL Server PDW or Netezza, the nodes are balanced with a tightly integrated bus. Unbalanced loads would be more likely in an environment with commodity hardware, shared nodes, or perhaps a slow or flaky network connection in between. Hadoop has a balancer process that periodically redistributes data across nodes unevenly, based on each node's I/O and CPU performance statistics and its number of past recorded failures or timeouts.

    I'm not sure, but I think that Hadoop (or maybe it was PDW) sometimes sends the same unit of work to multiple nodes that happen to contain redundant copies of the same data partition. The first node to return its work unit wins and the other node's work is discarded. If that's not how it works, then that's how it should or could work.
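
    For what it's worth, Hadoop MapReduce has a feature along these lines called speculative execution, although it launches a backup attempt only for tasks that appear to be straggling rather than duplicating every work unit up front. A toy sketch of the "first replica to finish wins" idea, using plain Python threads and not Hadoop itself; the node names, delays, and function names are all made up for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_work_unit(node, delay, rows):
    """Simulate one node processing its replica of a data partition."""
    time.sleep(delay)  # stand-in for slow disk, busy CPU, or network lag
    return node, sum(rows)

def speculative_run(replicas, rows):
    """Send the same work unit to every replica and keep the first result."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(run_work_unit, node, delay, rows)
               for node, delay in replicas.items()]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()  # best effort; a running thread finishes and is ignored
    pool.shutdown(wait=False)
    return next(iter(done)).result()

# Hypothetical replicas: node-a is the slow one, node-b the fast one.
winner, total = speculative_run({"node-a": 0.5, "node-b": 0.01}, [1, 2, 3, 4])
print(winner, total)
```

    The loser's work is simply thrown away, which is the trade-off: you spend extra cluster capacity to stop one slow node from holding up the whole query.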

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Another thing about Hadoop is that the files contain data in raw unstructured format (perhaps not necessarily but at least typically) and the MapReduce process parses and performs complex computing just-in-time. I can see how that's practical for some specific applications, like ad-hoc analysis of documents or emails, and Hadoop may be a natural fit there.
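
    A rough sketch of what that parse-at-read-time style looks like, written in plain Python rather than against any Hadoop API. The sample lines, the regular expression, and the function names are assumptions made up for illustration:

```python
import re
from collections import Counter

# Hypothetical raw lines, stored as-is with no schema and no index.
raw_lines = [
    "From: alice@example.com Subject: Q4 forecast",
    "garbage line with no structure",
    "From: bob@example.org Subject: RE: Q4 forecast",
    "From: carol@example.com Subject: lunch?",
]

SENDER = re.compile(r"From:\s*\S+@(\S+)")

def map_line(line):
    """Map step: parse the raw text at read time, emit (domain, 1) on a match."""
    match = SENDER.search(line)
    if match:
        yield match.group(1).lower(), 1

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return dict(counts)

pairs = (pair for line in raw_lines for pair in map_line(line))
domain_counts = reduce_counts(pairs)
print(domain_counts)
```

    Every query pays the parsing cost again and has to scan every line, which is exactly the trade-off against a typed, indexed table.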

    But my belief is that most enterprise business users actually would prefer their data warehouse contain structured, relational, indexed data that can be retrieved using a familiar SQL interface. They don't want some MapReduce programming geek to get between them and their data.

    Sure, Hadoop can follow up with indexing, strong data typing, and a more fully featured SQL interface, but then they'd have to compete head to head with SQL Server in the deep end of the engineering pool. For SQL Server to take their existing relational engine and add federation is actually less of a stretch.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I think some of those geeks would also say that they want to supplant Microsoft licensing from their data.

    Microsoft handles the needs of small numbers (fewer than 100) of CPUs on an economical basis. They don't want to compete in areas where the CPU and license count goes over 1000. It's a successful business model that does well for both Microsoft and its customers.

  • Robert.Sterbal (1/14/2015)


    I think some of those geeks would also say that they want to supplant Microsoft licensing from their data.

    Microsoft handles the needs of small numbers (fewer than 100) of CPUs on an economical basis. They don't want to compete in areas where the CPU and license count goes over 1000. It's a successful business model that does well for both Microsoft and its customers.

    The geeks don't own the data; the organizations that employ them do. Scalability and licensing costs of a platform are artificial barriers that Microsoft and its customers can negotiate and evolve over time to suit their interests. In a world where organizations can spin up as many database instances as they need in the cloud using 3rd-party SaaS providers, per-CPU licensing becomes less of an issue. The choice is reduced to the technical merits and usability of the specific database platform.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Interesting perspective, thanks.

  • Eric M Russell (1/15/2015)


    Robert.Sterbal (1/14/2015)


    I think some of those geeks would also say that they want to supplant Microsoft licensing from their data.

    Microsoft handles the needs of small numbers (fewer than 100) of CPUs on an economical basis. They don't want to compete in areas where the CPU and license count goes over 1000. It's a successful business model that does well for both Microsoft and its customers.

    The geeks don't own the data; the organizations that employ them do. Scalability and licensing costs of a platform are artificial barriers that Microsoft and its customers can negotiate and evolve over time to suit their interests. In a world where organizations can spin up as many database instances as they need in the cloud using 3rd-party SaaS providers, per-CPU licensing becomes less of an issue. The choice is reduced to the technical merits and usability of the specific database platform.

    I too enjoy vacations in Fantasyland!

  • Licensing isn't an issue, except that often we as the geeks do make recommendations. While it can be cheaper to spin things up, it isn't always cheaper. We've done analysis a few times for cloud providers and it's not necessarily cheaper, especially if you constantly run xx cores.

    However, overall it's not so much the technical merits of the platform by itself. It's the merits of what can be done by me, not anyone. I don't hesitate to run PostgreSQL or MongoDB, but I do note that I might not do so well, and I'm hesitant to commit to a project and imply that I'll have success with this particular application.

  • patrickmcginnis59 10839 (1/15/2015)


    Eric M Russell (1/15/2015)


    Robert.Sterbal (1/14/2015)


    I think some of those geeks would also say that they want to supplant Microsoft licensing from their data.

    Microsoft handles the needs of small numbers (fewer than 100) of CPUs on an economical basis. They don't want to compete in areas where the CPU and license count goes over 1000. It's a successful business model that does well for both Microsoft and its customers.

    The geeks don't own the data; the organizations that employ them do. Scalability and licensing costs of a platform are artificial barriers that Microsoft and its customers can negotiate and evolve over time to suit their interests. In a world where organizations can spin up as many database instances as they need in the cloud using 3rd-party SaaS providers, per-CPU licensing becomes less of an issue. The choice is reduced to the technical merits and usability of the specific database platform.

    I too enjoy vacations in Fantasyland!

    Licensing is just a tool for generating revenue. If Microsoft feels it is starting to lose market share to non-Microsoft Hadoop providers, not for technical reasons but simply for licensing reasons, then they could simply overcome that revenue challenge by making their licensing model just as elastic as their storage model. That's an economics 101 concept.

    As for anyone who thinks that it's we the database programming geeks who own the data, they're living in fantasy-land. A pink slip from HR is all it takes to separate you from "your" data. The organization can always find someone else to query their database or implement a solution that allows the business to query it without IT support.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Run Hadoop on a single node...

    True, but you might as well say "Chop a horse's legs off and all you've got is pedigree chum." The whole point of Hadoop is its ability to scale out and localise processing across a large number of nodes.

    I work with data scientists. Most of their interaction with the data boils down to "Take this column of data and apply some complex aggregation across a subset of its values" and the number of rows they'll do it across tends to be massive. Hadoop uses MapReduce to push the bulk of that processing out to the nodes where the data resides and then does a final aggregation of the results. That means it can offer a level of parallelisation, and therefore performance, that traditional databases simply can't match at present. Maybe they will in the future or maybe they won't, but for now this particular niche is one that Hadoop fills much better.
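
    To make the "push the work to the data, then aggregate" pattern concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the `nodes` dictionary, the sample rows, and the function names are hypothetical, and a real cluster would run the per-node step in parallel on each machine rather than in a loop.

```python
from functools import reduce

# Hypothetical data: each "node" holds its own slice of (region, amount) rows.
nodes = {
    "node-1": [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    "node-2": [("west", 7.5), ("east", 1.0)],
    "node-3": [("north", 4.0), ("west", 2.5)],
}

def map_and_combine(rows):
    """Runs where the data lives: build (sum, count) per key for the local slice."""
    partial = {}
    for key, value in rows:
        s, n = partial.get(key, (0.0, 0))
        partial[key] = (s + value, n + 1)
    return partial

def merge(a, b):
    """Final reduce step: merge two dictionaries of partial results."""
    out = dict(a)
    for key, (s, n) in b.items():
        s0, n0 = out.get(key, (0.0, 0))
        out[key] = (s0 + s, n0 + n)
    return out

# Only the small per-node summaries travel over the network, not the raw rows.
partials = [map_and_combine(rows) for rows in nodes.values()]
totals = reduce(merge, partials, {})
averages = {key: s / n for key, (s, n) in totals.items()}
print(averages)
```

    Carrying (sum, count) pairs rather than per-node averages is what lets the final step combine the partial results correctly.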

    I don't see Hadoop replacing relational databases, particularly for transactional processing, but for massive data mining tasks it really does have a role to play.

  • Eric M Russell (1/15/2015)


    patrickmcginnis59 10839 (1/15/2015)


    Eric M Russell (1/15/2015)


    Robert.Sterbal (1/14/2015)


    I think some of those geeks would also say that they want to supplant Microsoft licensing from their data.

    Microsoft handles the needs of small numbers (fewer than 100) of CPUs on an economical basis. They don't want to compete in areas where the CPU and license count goes over 1000. It's a successful business model that does well for both Microsoft and its customers.

    The geeks don't own the data; the organizations that employ them do. Scalability and licensing costs of a platform are artificial barriers that Microsoft and its customers can negotiate and evolve over time to suit their interests. In a world where organizations can spin up as many database instances as they need in the cloud using 3rd-party SaaS providers, per-CPU licensing becomes less of an issue. The choice is reduced to the technical merits and usability of the specific database platform.

    I too enjoy vacations in Fantasyland!

    Licensing is just a tool for generating revenue. If Microsoft feels it is starting to lose market share to non-Microsoft Hadoop providers, not for technical reasons but simply for licensing reasons, then they could simply overcome that revenue challenge by making their licensing model just as elastic as their storage model. That's an economics 101 concept.

    As for anyone who thinks that it's we the database programming geeks who own the data, they're living in fantasy-land. A pink slip from HR is all it takes to separate you from "your" data. The organization can always find someone else to query their database or implement a solution that allows the business to query it without IT support.

    I will still take a few hundred thousand of those "artificial barriers" when you can manage it though. Thanks in advance! :hehe: :hehe: :hehe:

  • FunkyDexter (1/15/2015)


    Run Hadoop on a single node...

    True, but you might as well say "Chop a horse's legs off and all you've got is pedigree chum." The whole point of Hadoop is its ability to scale out and localise processing across a large number of nodes.

    I work with data scientists. Most of their interaction with the data boils down to "Take this column of data and apply some complex aggregation across a subset of its values" and the number of rows they'll do it across tends to be massive. Hadoop uses MapReduce to push the bulk of that processing out to the nodes where the data resides and then does a final aggregation of the results. That means it can offer a level of parallelisation, and therefore performance, that traditional databases simply can't match at present. Maybe they will in the future or maybe they won't, but for now this particular niche is one that Hadoop fills much better.

    I don't see Hadoop replacing relational databases, particularly for transactional processing, but for massive data mining tasks it really does have a role to play.

    Microsoft already has an edition of SQL Server (Parallel Data Warehouse) that supports Hadoop-style federation and has been available for years. Today the federation feature set doesn't exist in the Enterprise edition engine, but by 2016 that could change. For example, Clustered ColumnStore and Hekaton features were introduced in the PDW edition first, and then introduced into the Enterprise edition when the product and financial departments at Microsoft felt the timing was right. My point is that the fundamental difference between relational databases and Big Data NoSQL databases is actually not technical or even philosophical; Microsoft and Oracle crossed that bridge years ago. The difference boils down to marketing and economics, which can change overnight.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 15 posts - 1 through 15 (of 24 total)
