Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

Mike McQuillan

SSCertifiable

Points: 6020
June 14, 2016 at 9:55 pm

#317277

Comments posted to this topic are about the item Stairway to U-SQL Level 1: Introduction to U-SQL and Azure Data Lakes

PB_BI SSCoach Points: 17474 · Answer 1

Great article, but I have to disagree with this:

"In the classic SQL Server stack, Analysis Services (SSAS) would be used to house the Data Warehouse"

A data warehouse is a database with a specific design based on a methodology. SSAS is a presentation layer above this (the DW can also be seen as a presentation layer) and should in no way "house" a DW.

I know it's picky but it bothers me. I won't sleep tonight now. Thanks. :crying:

I'm on LinkedIn

Mike McQuillan SSCertifiable Points: 6020 · Answer 2

Hi there PB_BI

Just trying to make the article fairly general for readers who don't know the stack inside out. I do agree with your point.

Hope I don't ruin your sleep too much!

Cheers,

Mike.

PB_BI SSCoach Points: 17474 · Answer 3

mike.mcquillan (6/15/2016)
Hi there PB_BI
Just trying to make the article fairly general for readers who don't know the stack inside out. I do agree with your point.
Hope I don't ruin your sleep too much!
Cheers,
Mike.

I'll live 😀

I'm on LinkedIn

Alan Burstein SSC Guru Points: 61152 · Answer 4

Good article, Mike. Informative, precise and to the point.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Mike McQuillan SSCertifiable Points: 6020 · Answer 5

June 15, 2016 at 7:42 am

#1885149

Thanks Alan, glad you liked it!

Mike.

cwe424 SSC Veteran Points: 256 · Answer 6

Hi Mike,

Great article. I knew nothing about these topics before now, but I have two questions:

1. "The challenge this approach doesn’t resolve is: what happens if the questions the users are asking change?"

Isn't this the inherent challenge of Data Warehousing? Isn't that the thing that separates the men/women from the boys/girls? I am not a seasoned expert by any means, but I hope to be one day. Heck, I'm only taking my first swing at designing a data warehouse with the BI team that I'm on, but, that seems to be the elephant in the room, that you are attempting to (at the end of a rigorous process) create a system that will "be able to answer the questions that haven't been thought of yet". This question is not in an argumentative tone, but more to make sure that I haven't missed something. If we had all decided that the changing questions in the future would be unanswerable once we built a DW, then maybe I am not pursuing the most effective solution.

2. Isn't the Big Data arena (including this data lakes concept) really better suited to unstructured or less structured data? I thought that was the main benefit: if you put highly structured data into either an RDBMS or a Big Data apparatus, there wouldn't be much difference in what you could or couldn't do. With less structured data, however, you would be basically crippled trying to handle it in an RDBMS, whereas the advantage of using Big Data for structured data would be negligible.

Once again, neither of these is meant as criticism of your article; I just want to confirm my own understanding. Your walkthrough of the Azure Data Lakes product is exceptional, and I know it took a lot of your own personal time to put together. You should know that your effort is appreciated. Thanks!

Clint

Jeff Moden SSC Guru Points: 1004748 · Answer 7

Great article! Thanks for taking the time to write it.

Shifting gears and without having anything to do with the article, I think U-SQL only being available to the cloud is a real shame. It's what some of us have been asking for in the local instance world for a long time and it would really be cool if they pushed it down from the cloud to us lowly Earthers that are grounded by necessary requirements.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Mike McQuillan SSCertifiable Points: 6020 · Answer 8

Hi Clint

Thanks for the kind words, glad you enjoyed the article.

I agree on both your points. I was trying to make the point that it should be easier to respond to the changing user questions in a Big Data area, precisely because the data is unstructured. You don't need to spend time modifying cubes, dimensions etc - you can "just" change the query.

So you definitely haven't misunderstood anything (in my view), I fully agree with both of your points.

Regards,

Mike.

Mike McQuillan SSCertifiable Points: 6020 · Answer 9

Hi Jeff

Glad you liked the article, thanks for the kind words.

It is a shame U-SQL isn't available locally, although who knows what Microsoft will do in the future. There may be some possibilities on that front; if I come across anything I'll let you know.

Regards,

Mike.

MCDB Valued Member Points: 63 · Answer 10

Maybe I missed something...

How does the U-SQL query know what field to use if there are no headers?

"IMPORTANT NOTE: Before you upload the files, open them in Excel and remove the first row (the header row). U-SQL does not recognise headers at the time of writing."

Great post by the way!

Mike McQuillan SSCertifiable Points: 6020 · Answer 11

Hi MCDB

It's up to the developer to know what columns are in the file, and then apply them in the EXTRACT statement. As per this statement:

@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv();

You have to specify every column in the file, in the order in which it appears; you can then filter out unwanted columns in a later SELECT statement. This is discussed in more detail in the second part of the series.
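As a rough sketch of that pattern (not taken from the article), the script below extracts all five columns and then keeps only two of them. The rowset name @postcodeTotals and the output path are made up for illustration; the file and column names are the ones from the statement above.

```usql
// Schema-on-read: every column in the file is declared in the EXTRACT,
// in file order, because the file itself carries no header row.
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv();

// Unwanted columns are dropped here, in a later SELECT.
// @postcodeTotals is a hypothetical rowset name.
@postcodeTotals =
    SELECT postcode,
           total
    FROM @results;

// The output path is an assumption for this example.
OUTPUT @postcodeTotals
TO "/output/postcode_totals.csv"
USING Outputters.Csv();
```

The names in the EXTRACT are whatever the developer chooses; only the number, order and types of the columns have to match the file.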

Regards,

Mike.

Andy Warren SSC Guru Points: 119922 · Answer 12

Good intro, I learned some stuff!

The lack of header support, or of any ability to store or relate stronger metadata, seems like a serious weakness. I'm thinking about a lake with thousands of files, and the plan is to open each one up to figure out the structure?

Data lake does sound cooler than "the data file share".

Andy
Connect with me on LinkedIn

Mike McQuillan SSCertifiable Points: 6020 · Answer 13

Hi Andy

A feature is coming called SkipFirstNRows, which will, er, let you skip a specified number of rows. That should sort out the header issue, which is a massive problem at the moment.
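As a sketch only, and assuming the feature ends up as a skipFirstNRows parameter on the built-in extractors (the exact name, casing and behaviour are assumptions until it is released), skipping a single header row might look something like this. The output path is made up for illustration.

```usql
// Assumption: skipFirstNRows is accepted as a named parameter by
// Extractors.Csv() and skips that many rows at the start of the file.
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Hypothetical output path.
OUTPUT @results
TO "/output/postcode_estimates_no_header.csv"
USING Outputters.Csv();
```

Note that this would only skip the header row rather than read it, so the column names would still have to be declared in the EXTRACT.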

It is possible to add better structure to the data, that's all coming soon!

Regards,

Mike.

g.britton SSChampion Points: 13863 · Answer 14

Question: is there an on-prem version of this? I'm in banking, which is heavily regulated and generally paranoid (and rightly so!). We have an on-prem cloud for server/database deployments and could use something like data lakes. For us, though, any off-prem cloud is automatically off the table.

Gerald Britton, Pluralsight courses