Data Lakes

  • Comments posted to this topic are about the item Data Lakes

  • As implied, the concept has been around for a long time and tends to be shaped somewhat by whichever technology is in vogue at the time. In my opinion, tossing in another technology/trend is more likely to turn it into a Muddy Data Puddle rather than a Lake.

    😎

  • I guess I'd have to ask why anyone might think this is a new idea. Isn't this the purpose of things like databases?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I disagree on this one. Finding fixes for problems is what we all do if we are designing systems. These days a single sensor array can deliver 100 TB A DAY of data, and that data needs to be handled and knowledge extracted from it.

    I think most businesses would be heading down the wrong path to invest in Big Data initiatives in 2015. The big companies and government will lead the way with the technology, but with enough people solving the problem of how to extract the knowledge from unimaginably large "Data Lakes", I expect successful companies, even mid-sized ones, to be mining the knowledge in the very near future.

  • It is similar to a database, but the scale of data in most companies (overall) exceeds what a single RDBMS can handle. I'm not sure that even in the smaller companies I've worked for we'd have stored all our data in a single db.

    The "lake" is all of the data you collect. Imagine inventory, sales, CRM, accounting, and more in a "lake" that any application (with access) can query. It's a challenge.

  • hans.chambers.ctr (1/12/2015)


    I disagree on this one. Finding fixes for problems is what we all do if we are designing systems. These days a single sensor array can deliver 100 TB A DAY of data, and that data needs to be handled and knowledge extracted from it.

    I think most businesses would be heading down the wrong path to invest in Big Data initiatives in 2015. The big companies and government will lead the way with the technology, but with enough people solving the problem of how to extract the knowledge from unimaginably large "Data Lakes", I expect successful companies, even mid-sized ones, to be mining the knowledge in the very near future.

    Not sure what you mean? You disagree with Jeff? Your statement seems to be saying two things. We need to solve the big data issue, but we shouldn't invest?

  • Unsupervised digital landfill. Doesn't sound as nice as a lake, does it?

    Actually, Hadoop fulfils a data lake need reasonably well. Let's suppose that you have data that you want to retain. You suspect that it has value, so you don't want to throw it away. You want to be able to evaluate it to find out its worth, but suspect that you may need several iterative experiments to determine its value.

    Landing this data in a traditional database assumes that it can be represented in a tabular format. If you have to transform it into tabular format, then the danger is that the raw data will be thrown away. Should you discover that your transformations could have extracted more valuable information from the raw stuff that has been thrown away, then you are in a bit of a bind.

    The reason we would throw the raw stuff away in the traditional world is that the cost of storing and processing the data is simply too high for something whose worth we can't assess up front.

    Hadoop, with its commodity hardware, provides cheap, resilient storage for data that does not have to be in a tabular format.

    We can use Hive, Pig, Spark, MapReduce, etc. to experiment with the data. Having found its worth, we can use Hadoop as a distributed ETL system and provide files that we can Sqoop into a more traditional database (see the sketch at the end of this post). The lake is the natural raw stuff; the traditional DB is the refined, drinkable stuff.

    Hadoop, with its cheapness and scalability, makes it financially possible to retain and process data far beyond what we could do before.

    The tricky bit is making sure that any data landed in the lake has sufficient metadata so that when you come to process it you have a good idea what it is. This is where serialization frameworks such as Avro, Kryo and Thrift come in. Apache Falcon keeps track of data lineage once data has landed within Hadoop.

    A lot of traditional vendors are putting a lot of investment into Hadoop. We think of the obvious ones like IBM, Microsoft and Oracle, but some of the not-so-obvious ones, such as Intel, are investing heavily as well.

    Is Hadoop a mature and finished product like Oracle and SQL Server? No, of course not, but it is evolving extremely quickly. The community is coalescing around Apache Spark, so machine learning algorithms, Hive and streaming products are being ported to Spark. Again, massive investment by a cast of thousands.
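
    To make the experiment-then-refine workflow above concrete, here is a minimal PySpark sketch. The paths, field names and sensor-readings layout are all hypothetical; the point is that the raw files stay in the lake untouched while a tabular extract is produced for Sqoop to export.

    # Minimal sketch (hypothetical paths and fields): explore raw JSON in
    # the lake, then write a refined, tabular extract for a Sqoop export.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lake-experiment").getOrCreate()

    # Schema-on-read: the raw JSON stays in the lake as-is; Spark infers
    # structure only at query time.
    raw = spark.read.json("hdfs:///lake/raw/sensor_readings/2015/01/")

    # Iterative experiment: does hourly aggregation carry any signal?
    hourly = (raw
              .withColumn("event_time", F.col("event_time").cast("timestamp"))
              .filter(F.col("reading").isNotNull())
              .groupBy(F.window("event_time", "1 hour"), F.col("sensor_id"))
              .agg(F.avg("reading").alias("avg_reading")))

    # Once the experiment proves its worth, persist a delimited extract;
    # a scheduled Sqoop export can then push it into the traditional DB.
    (hourly
     .select("sensor_id", F.col("window.start").alias("hour"), "avg_reading")
     .write.mode("overwrite")
     .option("sep", "\t")
     .csv("hdfs:///lake/refined/sensor_hourly/"))

    If an experiment fails, nothing is lost: the raw files are still there to be re-read with a different transformation.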

  • We're actually building out our "lake" using CDC and cherry-picking the pieces and parts to fill a data warehouse. The warehouse is then exposed via the newer Power BI tools provided by Microsoft. So far, the results have been stunning, and the "lake" is still fairly small. But this allows users to get at the exposed data via role-based access in the warehouse with tools they are already familiar with, e.g. PowerPivot, Excel, SSRS, etc.

    I think the idea is ominous and unwieldy when you try to pack an ocean into a lake. Everyone doesn't need access to everything, but a lot of people need limited access to a lot of data. Figuring out what is needed and who needs it is the other half of making it accessible, once you've figured out how to move and transform that data into something meaningful. Furthermore, creating security models so lakes appear as puddles is key (sketched below). Very few have access to the lake as a whole or even a half! But many have limited views to data that has been deemed meaningful to their teams.
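
    A rough sketch of that "lakes appear as puddles" idea, using Spark SQL views to scope a wide table down per team. The table and column names are invented for illustration; the same pattern applies equally to warehouse views and role-based permissions.

    # Hypothetical sketch: expose team-sized "puddles" of a wide sales
    # table via restricted views, rather than granting access to the lake.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("puddle-views").getOrCreate()

    # The full lake-side table (name and layout are illustrative).
    sales = spark.read.parquet("hdfs:///lake/refined/sales/")
    sales.createOrReplaceTempView("sales_all")

    # The EMEA team sees only its region, and only non-sensitive columns.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW sales_emea AS
        SELECT order_id, order_date, product_id, amount
        FROM sales_all
        WHERE region = 'EMEA'
    """)

    # Downstream tools query the puddle, never the lake.
    spark.sql("SELECT COUNT(*) FROM sales_emea").show()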

  • Steve Jones - SSC Editor (1/12/2015)


    It is similar to a database, but the scale of data in most companies (overall) exceeds what a single RDBMS can handle. I'm not sure that even in the smaller companies I've worked for we'd have stored all our data in a single db.

    The "lake" is all of the data you collect. Imagine inventory, sales, CRM, accounting, and more in a "lake" that any application (with access) can query. It's a challenge.

    I read an interesting book the other day (well, actually it was back in the early 1990s) by a guy named Kimball. His ideas sounded similar to this "data lake" thing, but he was calling it a "data warehouse".

    Going with the lake motif, there are natural lakes that accumulate organically over time, there are engineered lakes, and then there are lakes that pool up overnight when some beaver decides to dam up a stream and claim it as his own.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Interesting thoughts, David. Good to hear how you're viewing things. You up for a second piece on Hadoop?

  • sgtmango333 (1/12/2015)


    We're actually building out our "lake" using CDC and cherry-picking the pieces and parts to fill a data warehouse. The warehouse is then exposed via the newer Power BI tools provided by Microsoft. So far, the results have been stunning, and the "lake" is still fairly small. But this allows users to get at the exposed data via role-based access in the warehouse with tools they are already familiar with, e.g. PowerPivot, Excel, SSRS, etc.

    I think the idea is ominous and unwieldy when you try to pack an ocean into a lake. Everyone doesn't need access to everything, but a lot of people need limited access to a lot of data. Figuring out what is needed and who needs it is the other half of making it accessible, once you've figured out how to move and transform that data into something meaningful. Furthermore, creating security models so lakes appear as puddles is key. Very few have access to the lake as a whole or even a half! But many have limited views to data that has been deemed meaningful to their teams.

    Interesting. Any chance you'd like to (or can) share details in an article?

  • Eric M Russell (1/12/2015)


    I read an interesting book the other day (well, actually it was back in the early 1990s) by a guy named Kimball. His ideas sounded similar to this "data lake" thing, but he was calling it a "data warehouse".

    Going with the lake motif, there are natural lakes that accumulate organically over time, there are engineered lakes, and then there are lakes that pool up overnight when some beaver decides to dam up a stream and claim it as his own.

    That's what I'd seen it as, at least initially. However, the points about the cost of storage and the lack of transformation needed to get data into tabular form make some sense. There's a lot of engineering in RDBMSes that needs to happen for the data to be useful.

    I tend to agree. A data lake can be well put together, or it can be a cesspool.

  • Eric M Russell (1/12/2015)


    ...I read an interesting book the other day (well, actually it was back in the early 1990s) by a guy named Kimball. His ideas sounded similar to this "data lake" thing, but he was calling it a "data warehouse"...

    I have to admit that this was how I read it.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Gary Varga (2/6/2015)


    Eric M Russell (1/12/2015)


    ...I read an interesting book the other day (well, actually it was back in the early 1990s) by a guy named Kimball. His ideas sounded similar to this "data lake" thing, but he was calling it a "data warehouse"...

    I have to admit that this was how I read it.

    Basically, yes, although it seems the differentiation is that a "data lake" is an accumulating pool of unstructured and unfiltered data which is parsed at runtime (Hadoop / MapReduce). In contrast, a traditional "data warehouse" is contained in an RDBMS in either 3rd normal form (Inmon), star schema (Kimball), or OLAP (SSAS, MicroStrategy).
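
    As a small illustration of the "parsed at runtime" half of that distinction, here is schema-on-read in PySpark: structure is supplied when the data is read, not when it is stored. The path and fields are invented for the example.

    # Schema-on-read: raw text sits in the lake as-is; structure is
    # imposed only at read time. (Path and fields are hypothetical.)
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

    schema = StructType([
        StructField("sensor_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("reading", DoubleType()),
    ])

    # The same raw files could be re-read tomorrow with a richer schema;
    # a warehouse table would have required the schema before loading.
    readings = spark.read.schema(schema).json("hdfs:///lake/raw/sensor_readings/")
    readings.groupBy("sensor_id").avg("reading").show()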

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (2/6/2015)


    Gary Varga (2/6/2015)


    Eric M Russell (1/12/2015)


    ...I read an interesting book the other day (well, actually it was back in the early 1990s) by a guy named Kimball. His ideas sounded similar to this "data lake" thing, but he was calling it a "data warehouse"...

    I have to admit that this was how I read it.

    Basically, yes, although it seems the differentiation is that a "data lake" is an accumulating pool of unstructured and unfiltered data which is parsed at runtime (Hadoop / MapReduce). In contrast, a traditional "data warehouse" is contained in an RDBMS in either 3rd normal form (Inmon), star schema (Kimball), or OLAP (SSAS, MicroStrategy).

    Thanks for the clarification. Much appreciated.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!
