Should the Data Lake be Immutable?

2019-02-26

There's a concept in computer science called immutability. At a high level, this means that once something is set, it isn't changed. Various programming languages apply this to variables: values don't change, though variables can be destroyed and recreated.
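As a small illustration of that idea, here is a minimal sketch in Python, chosen only for the example. The names `point` and `original` are made up for illustration. A tuple rejects in-place changes, so the "update" is done by binding the name to a newly created tuple.

```python
# An immutable value: a tuple cannot be modified in place.
point = (1, 2)

try:
    point[0] = 10            # attempt an in-place change
    changed = True
except TypeError:
    changed = False          # tuples reject item assignment

# "Destroy and recreate": keep a reference to the old value, then
# bind the name to a brand-new tuple instead of editing the old one.
original = point
point = (10, point[1])
```

The old value is untouched; only the name now refers to something new.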

In the PASS keynote, Dr. Ramakrishnan pointed out that we have silos of data, often in disparate systems where we keep our information. We want to query this data together, so we transfer it to a data warehouse or data lake (the future view). In this view, items in the data lake are immutable: they aren't allowed to change in the way that we update values in our relational databases. We should just read the most recent version of any data, and if there is an update, just add a new set of data.
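To make the "add a new set of data and read the most recent version" idea concrete, here is a rough sketch. It is not how any particular data lake product works. `AppendOnlyLake`, `Record`, and the `customer-42` key are hypothetical names, and an in-memory list stands in for real storage.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """One stored version of a keyed item; fields cannot be reassigned."""
    key: str
    version: int
    payload: Dict[str, Any]


class AppendOnlyLake:
    """Hypothetical in-memory stand-in for an immutable data lake."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, payload: Dict[str, Any]) -> Record:
        # Never edit an existing record; write a new version instead.
        version = 1 + sum(1 for r in self._records if r.key == key)
        record = Record(key, version, dict(payload))
        self._records.append(record)
        return record

    def latest(self, key: str) -> Optional[Record]:
        # Readers only ever look at the highest version for a key.
        matches = [r for r in self._records if r.key == key]
        return max(matches, key=lambda r: r.version) if matches else None

    def history(self, key: str) -> List[Record]:
        # Older versions remain available because nothing is overwritten.
        return [r for r in self._records if r.key == key]


lake = AppendOnlyLake()
lake.append("customer-42", {"city": "Denver"})
lake.append("customer-42", {"city": "Boulder"})  # an "update" is a new version
current = lake.latest("customer-42")
```

Here `current` is the second version, while the first one is still there in `history`. The trade-off is that storage grows with every correction.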

That's an interesting concept, but I'm not sure I agree. I think that while we might often want to use a simpler process, there are cases where we do need the capability to edit. Imagine I had a large set of data, say GBs in a file. Would I want to download this and change a few values before uploading it again? Do we want a large ETL load process to repeat? Could we repeat the process and reload a file again? I don't think so, but it's hard to decide. After all, the lake isn't the source of data; that is some other system.

Maybe reloading is the simplest solution, and one that reduces complexity, downtime, or anything else that might be involved with locking and changing a file. After all, we wouldn't want queries that could read the data in between our deleting a value and adding back a new one.
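One way to picture a reload that avoids that in-between state is to write the complete replacement file first and then swap it into place in a single step. The sketch below assumes a local file system and a CSV file; real data lakes often sit on object storage with different semantics. `reload_file` and `sales.csv` are hypothetical names.

```python
import csv
import os
import tempfile


def reload_file(path, rows):
    """Replace the file at `path` with a freshly written copy of `rows`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the full replacement to a temporary file in the same directory,
    # so the final rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
        # Swap the new file into place; on POSIX systems this rename is
        # atomic, so a reader sees the old file or the new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "sales.csv")
reload_file(demo_path, [["id", "amount"], ["1", "100"]])
reload_file(demo_path, [["id", "amount"], ["1", "125"]])  # corrected value

with open(demo_path, newline="") as handle:
    loaded = list(csv.reader(handle))
```

Nothing is edited in place here: the correction arrives as a whole new file, which is the same reload-rather-than-update choice the editorial is weighing.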

If you're a data warehouse or analysis person, what do you think? Does it make sense to keep the data lake as immutable and reload data that might not be clean? Let us know today.




