SQL-On-Hadoop: Hive - Part I

  • Frank Banin

    Ten Centuries

    Points: 1323

    Comments posted to this topic are about the item SQL-On-Hadoop: Hive - Part I

    Frank Banin
    BI and Advanced Analytics Professional.

  • stan.geiger

    SSChasing Mays

    Points: 647

    I admire all the work put in to accomplish this stuff. But now that PolyBase is part of SQL Server, why wouldn't you connect directly to Hadoop from SQL Server? We have found it much more efficient than Sqoop for transferring data, and it also allows you to use most, if not all, of the existing T-SQL constructs. You don't have to know Linux, only where on the cluster the data is located. Another benefit is that PolyBase creates all the MapReduce jobs needed to generate the query set and runs them on Hadoop, bringing back only the data.

  • rgagne99

    SSC Enthusiast

    Points: 176

    Hadoop demystified for SQL Server Users. Well done and thanks!
    ...Ray

  • Frank Banin

    Ten Centuries

    Points: 1323

    stan.geiger - Wednesday, July 26, 2017 7:29 AM

    I admire all the work put in to accomplish this stuff. But now that PolyBase is part of SQL Server, why wouldn't you connect directly to Hadoop from SQL Server? We have found it much more efficient than Sqoop for transferring data, and it also allows you to use most, if not all, of the existing T-SQL constructs. You don't have to know Linux, only where on the cluster the data is located. Another benefit is that PolyBase creates all the MapReduce jobs needed to generate the query set and runs them on Hadoop, bringing back only the data.

    As Ray put it, one objective is to demystify Hadoop for SQL Server users and to show them the SQL-on-Hadoop options that are out there.
    Beyond that, depending on the processing objective, there may be benefits to using one option over another.
    For instance, because Hive is built directly on top of Hadoop and is part of the Apache framework, you can map an HBase table (a NoSQL database table) as an EXTERNAL Hive table.
    Hive also integrates directly with Apache Spark and Spark SQL, an option we will look at later. Spark SQL uses a nested data model based on Hive for tables and DataFrames, which makes that option better suited to interactive and real-time processing.
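
    To make the HBase mapping mentioned above a little more concrete, here is a minimal sketch in Python that only assembles the HiveQL text for such an EXTERNAL table; it does not connect to Hive or HBase. The storage handler class and the 'hbase.columns.mapping' / 'hbase.table.name' properties come from Hive's HBase integration, while the helper name hbase_external_table_ddl and the table, column, and column-family names (customer_hive, customer, info) are hypothetical and chosen purely for illustration.

```python
# Sketch only: builds the HiveQL text; it does not talk to Hive or HBase.
# The helper name and all table/column/family names are hypothetical.

def hbase_external_table_ddl(hive_table, hbase_table, columns):
    """Return HiveQL that maps an existing HBase table as an EXTERNAL Hive table.

    columns: list of (hive_column, hive_type, hbase_source) tuples, where
    hbase_source is ":key" for the row key or "family:qualifier" for a cell.
    """
    # This sketch insists on exactly one row-key mapping.
    key_count = sum(1 for _, _, src in columns if src == ":key")
    if key_count != 1:
        raise ValueError("exactly one column must map to the HBase row key (:key)")

    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type, _ in columns)
    # The mapping is positional: the n-th entry belongs to the n-th Hive column.
    mapping = ",".join(src for _, _, src in columns)

    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


ddl = hbase_external_table_ddl(
    "customer_hive",   # hypothetical Hive table name
    "customer",        # hypothetical existing HBase table
    [
        ("id", "STRING", ":key"),
        ("name", "STRING", "info:name"),
        ("city", "STRING", "info:city"),
    ],
)
print(ddl)
```

    Because the table is EXTERNAL, dropping it in Hive should remove only the Hive metadata and leave the underlying HBase table in place, which is usually what you want when HBase remains the system of record.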

    Frank Banin
    BI and Advanced Analytics Professional.
