Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

  • Comments posted to this topic are about the item Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

  • Great article, but I have to disagree with this:

    "In the classic SQL Server stack, Analysis Services (SSAS) would be used to house the Data Warehouse"

    A data warehouse is a database with a specific design based on a methodology. SSAS is a presentation layer above this (the DW can also be seen as a presentation layer) and should in no way "house" a DW.

    I know it's picky but it bothers me. I won't sleep tonight now. Thanks. :crying:



  • Hi there PB_BI

    Just trying to make the article fairly general for readers who don't know the stack inside out. I do agree with your point.

    Hope I don't ruin your sleep too much!

    Cheers,

    Mike.

  • mike.mcquillan (6/15/2016)


    Hi there PB_BI

    Just trying to make the article fairly general for readers who don't know the stack inside out. I do agree with your point.

    Hope I don't ruin your sleep too much!

    Cheers,

    Mike.

    I'll live 😀



  • Good article, Mike. Informative, precise and to the point.

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks Alan, glad you liked it!

    Mike.

  • Hi Mike,

    Great article, these were things that I knew zero about before now, but I have 2 questions about your article:

    1.

    The challenge this approach doesn’t resolve is: what happens if the questions the users are asking change?

    Isn't this the inherent challenge of Data Warehousing? Isn't that the thing that separates the men/women from the boys/girls? I am not a seasoned expert by any means, but I hope to be one day. Heck, I'm only taking my first swing at designing a data warehouse with the BI team that I'm on, but, that seems to be the elephant in the room, that you are attempting to (at the end of a rigorous process) create a system that will "be able to answer the questions that haven't been thought of yet". This question is not in an argumentative tone, but more to make sure that I haven't missed something. If we had all decided that the changing questions in the future would be unanswerable once we built a DW, then maybe I am not pursuing the most effective solution.

    2. Isn't the Big Data arena (including this data lakes concept) really more suited to unstructured or less structured data? I thought that was the main benefit. Put another way: if your data is highly structured, it makes little difference whether you put it into an RDBMS or a Big Data apparatus in terms of what you can or can't do. If your data is less structured, though, you would be basically crippled trying to handle it in an RDBMS, whereas the advantage of using Big Data for structured data would be negligible.

    Once again, both of these are not meant as critical of your article, just want to see if I can confirm my own understanding. Your walkthrough of the Azure Data Lakes product is exceptional, and I know it took you a lot of your own personal time to put that together. You should know that your effort is appreciated. Thanks!

    Clint

  • Great article! Thanks for taking the time to write it.

    Shifting gears and without having anything to do with the article, I think U-SQL only being available to the cloud is a real shame. It's what some of us have been asking for in the local instance world for a long time and it would really be cool if they pushed it down from the cloud to us lowly Earthers that are grounded by necessary requirements.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • Hi Clint

    Thanks for the kind words, glad you enjoyed the article.

    I agree on both your points. I was trying to make the point that it should be easier to respond to the changing user questions in a Big Data area, precisely because the data is unstructured. You don't need to spend time modifying cubes, dimensions etc - you can "just" change the query.

    So you definitely haven't misunderstood anything (in my view), I fully agree with both of your points.

    Regards,

    Mike.

  • Hi Jeff

    Glad you liked the article, thanks for the kind words.

    It is a shame U-SQL isn't available locally, although who knows what Microsoft will do in the future. There may be some possibilities on that front; if I come across anything, I'll let you know.

    Regards,

    Mike.

  • Maybe I missed something...

    How does the U-SQL query know what field to use if there are no headers?

    "IMPORTANT NOTE: Before you upload the files, open them in Excel and remove the first row (the header row). U-SQL does not recognise headers at the time of writing."

    Great post by the way!

  • Hi MCDB

    It's up to the developer to know what columns are in the file, and then apply them in the EXTRACT statement. As per this statement:

    @results = EXTRACT postcode string,
                       total int,
                       males int,
                       females int,
                       numberofhouseholds int
               FROM "/Postcode_Estimates_1_M_R.csv"
               USING Extractors.Csv();

    You have to specify all columns in the file, in the order they appear. You can then filter out unwanted columns in a later SELECT statement. This is discussed in more detail in the second part of the series.
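
    As a rough sketch of that filtering step: the @filtered rowset name and the output path below are made up for illustration and do not come from the article.

    ```usql
    // Every column in the file must be declared, in file order.
    @results =
        EXTRACT postcode string,
                total int,
                males int,
                females int,
                numberofhouseholds int
        FROM "/Postcode_Estimates_1_M_R.csv"
        USING Extractors.Csv();

    // Keep only the columns of interest (hypothetical rowset name).
    @filtered =
        SELECT postcode,
               total
        FROM @results;

    // Write the reduced rowset back to the lake (hypothetical path).
    OUTPUT @filtered
    TO "/output/postcode_totals.csv"
    USING Outputters.Csv();
    ```

    The EXTRACT describes the file as it is; the SELECT is where you decide what you actually want from it.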

    Regards,

    Mike.

  • Good intro, I learned some stuff!

    The lack of header support, or of the ability to store/relate stronger metadata, seems like a serious weakness. I'm thinking about a lake with thousands of files, and the plan is to open each one up to figure out the structure?

    Data lake does sound cooler than "the data file share".

  • Hi Andy

    A feature is coming called SkipFirstNRows, which will, er, let you skip a specified number of rows. That should sort out the header issue, which is a massive problem at the moment.

    It is possible to add better structure to the data. That's all coming soon!
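
    Here is a sketch of how that might look, assuming the feature arrives as a skipFirstNRows parameter on the built-in extractors. Treat the exact name and casing as an assumption until it ships.

    ```usql
    // Assumption: skipFirstNRows is accepted by Extractors.Csv().
    // With a value of 1, the header row can stay in the file.
    @results =
        EXTRACT postcode string,
                total int,
                males int,
                females int,
                numberofhouseholds int
        FROM "/Postcode_Estimates_1_M_R.csv"
        USING Extractors.Csv(skipFirstNRows:1);
    ```

    You would still have to declare the columns yourself; skipping rows only stops the header text from being parsed as data.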

    Regards,

    Mike.

  • Question: is there an on-prem version of this? I'm in banking, which is heavily regulated and generally paranoid (and rightly so!). We have an on-prem cloud for server/database deployments and could use something like data lakes. For us, though, any off-prem cloud is automatically off the table.

    Gerald Britton, Pluralsight courses
