This article just popped into my inbox this morning. But I see from the comments it's at least 2 years old. In any case I've heard of data lakes but did not know what they were or how to use them. Note of disclosure: I haven't fully read the article YET. But I take issue with the opening paragraph.
If you are designing a data warehouse around a set of user questions, you're doing it wrong. All software and software/database solutions are a model of some reality and/or process. The data models, be they operational or BI, should model that reality. Naturally, a DW designed around the initial questions will quickly become obsolete as questions beget more questions. Most users don't really even understand their own questions.
I design my data warehouses as a model of the business,not the initial questions. I work in the transportation industry. My data warehouses typically have 2 main fact tables: moves and financial transactions. Dimensions include the ever present date as well as location, customer, hauler (driver), freight type, etc. With the details of every move and every financial transaction (invoice line item, voucher line item, commission line item, etc.) there is literally no question the business can ask that the DW can't answer.
At my current employer my DW replaces something called the "cost file". This was their original "data warehouse" from the mainframe computer days. It fell into the trap of being designed around a question: "what are our costs?" It has a bad reputation because it is difficult to maintain and is not timely. It's a kluge of trying to shoehorn different attributes to answer different questions.
My data warehouse, designed around the core of our business, is much more robust. The business may change. A big change I see coming is driverless vehicles. I believe a well designed DW will be impervious to even that kind of paradigm shift. We are a transportation company. The hauler (driver) dimension may fall away, but we will always move freight from point A to point B. We'll always have moves and we'll always have to pay for those moves (financial transactions). A data warehouse (or any BI solution) designed around this reality will be far more robust than one designed around initial user questions.
Hall of Fame
Thanks for the series. How many levels will there be in this series? I've got lots of questions but thought I should wait until I get the full picture.
One Orange Chip
steveo250k - Friday, January 19, 2018 5:52 AM
Essentially, data lakes are just globs of structured and unstructured data. Structured in most cases in meaning, a structure the business acknowledges and can agree on, at least IMHO. It can also mean something that is consistently the same versus a structure that is constantly changing.
Regardless, I think most are using them to land data before it goes into a data warehouse. I've personally been using them for that. Since I last commented on this article, I have started to shift more to Azure Data Lake myself. Mostly to land data there first and then use Azure Data Lake Analytics to process that data as part of my ETL using U-SQL as the article suggests. This allows me to use an entirely different set of machines completely detected from the data warehouse to do work for the data warehouse without clogging up the resources the warehouse needs. That way it can focus on just loading approved structured data.
The other benefit is the fact that data warehouses are commonly built with those questions in mind as well questions you haven't asked. The issue for some teams is the data is cleaned, conformed, and structured in such a way that removes it from being RAW to PROCESSED. The raw state is something users are looking for in order to have more control of the data before the warehouse purifies it to something the business wanted. But as you know, the business can change and often the data warehouse is a roadblock for that change because of the turnaround time to get something re-processed. This is where the data lake shines too because it often stores the raw untouched data and then can become an access point for the business user to run analysis directly on the raw data and completely circumvent the warehouse all together UNTIL the warehouse can re-process it.
Note, most of these have nothing to do with quote-on-quote BIG DATA. This is mostly utility options for the end user as well the data architects building the data warehouse.
Apologies for my delayed reply, thanks for your response to the article.
I fully agree that designing a warehouse around a set of questions isn’t an optimal way of doing things, but that’s the reality I’ve seen when working for various organisations down the years. Of course, that’s just an introduction to the article – the rest of the series doesn’t particularly concern itself about data warehouses!
Your way of doing things sounds well thought-out to me!
Thanks for posting. There are currently 18 levels in the U-SQL series, with plenty more planned (I haven't posted for a while as I've been working for Pluralsight).
You can see all articles in the series here:
Feel free to post any questions you have!
Your experience with Data Lakes sounds similar to what I’ve seen out in the field – most people use the Data Lake to transform some data before pushing it into a data warehouse. That’s a perfectly valid use case. I suspect as the library expands we may see the Data Lake become the data warehouse – but that’s some time away.
One Orange Chip
mike.mcquillan - Friday, February 2, 2018 7:03 AM
Well, yes and no. Data Lakes are still document stores and not always in the most optimized model. That's where the data warehouse is going to come into play. It can optimize that model better than a Data Lake can and also ensure it's quality is of the standard of the organization. It also has a better access layer for users and interacts better with typical BI tools like Power BI versus Data Lake. It's also generally faster on smaller data sets and not every report is big data.
I don't see the data warehouse going anywhere anytime soon. But I do see data lakes becoming more useful to organizations where there is long turnaround times on getting new data or models into the warehouse or marts. If data is constantly changing, I also see the use cases there. But to replace it entirely and rely on the data lake is from what I can see, still a pain to work with. It also mostly shifts that ETL from the back to the front even though Azure Data Lake has good ways of transforming the data in Azure Data Lake Analytics.
Fully agree with you...I made that comment as Microsoft seem to be converging a lot of technologies from Data Lakes to Data Warehouses and vice versa. Be interesting to see where it all ends up!
Viewing 8 posts - 31 through 37 (of 37 total)