Should the Data Lake be Immutable?


Steve Jones
Comments posted to this topic are about the item Should the Data Lake be Immutable?

Follow me on Twitter: @way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com
Dave Poole
In my world a data lake is not a data warehouse.
The data lake contains data in as near to the raw source format as possible at the time it was captured. The general idea is that it is cheap storage in which data can be kept indefinitely. It isn't a database as traditionally described. I'd say that, yes, it should be immutable.

The data warehouse is a thing of rigour and discipline where facts become available at different times and therefore updates may be required. If I am using the Kimball model then I have the various slowly changing dimension (SCD) approaches to choose from, and the choice has to be driven by business requirements. Because the data lake holds the raw data, I can revisit the SCD strategy as those requirements change.
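
To make the point about revisiting the SCD strategy concrete, here is a minimal sketch, assuming the raw zone holds one untouched row per captured change. The record layout and the function names are hypothetical, not taken from any particular tool. The same immutable input yields either a Type 1 or a Type 2 dimension, so the choice can be changed later by rerunning the build.

```python
from datetime import date

# Hypothetical raw-zone extract: one row per captured change, never edited.
RAW_CUSTOMER_EVENTS = [
    {"customer_id": 1, "city": "Leeds", "captured_on": date(2019, 1, 5)},
    {"customer_id": 1, "city": "York", "captured_on": date(2019, 2, 10)},
    {"customer_id": 2, "city": "Bath", "captured_on": date(2019, 1, 20)},
]


def build_scd_type1(events):
    """Type 1: keep only the latest value per customer (history is overwritten)."""
    latest = {}
    for event in sorted(events, key=lambda e: e["captured_on"]):
        latest[event["customer_id"]] = event["city"]
    return latest


def build_scd_type2(events):
    """Type 2: keep every version, each with a validity interval."""
    rows = []
    open_row = {}  # customer_id -> the row that is currently open
    for event in sorted(events, key=lambda e: e["captured_on"]):
        previous = open_row.get(event["customer_id"])
        if previous is not None:
            if previous["city"] == event["city"]:
                continue  # nothing changed, keep the open row as it is
            previous["valid_to"] = event["captured_on"]  # close the old version
        row = {
            "customer_id": event["customer_id"],
            "city": event["city"],
            "valid_from": event["captured_on"],
            "valid_to": None,  # None marks the current version
        }
        rows.append(row)
        open_row[event["customer_id"]] = row
    return rows
```

Neither function modifies RAW_CUSTOMER_EVENTS, so switching from one strategy to the other is a rebuild rather than a migration.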

Even if I tilt towards something more Inmon-esque I may still need the concept of updateable data due to facts becoming available out of sequence. For example, switching energy supplier is a lengthy process for many reasons, including legislated cooling-off periods. There is valuable insight in the early data (subject to GDPR permissions from the customer) even though it won't be complete until 2 months after the original application.

LinkedIn Profile
www.simple-talk.com
montgark
I am a fan of anchor modelling and temporal databases. I find powerful the idea that the database's current dataset includes all previous datasets, and that the current schema version includes all previous schema versions. An organization should be able to obtain today a report with the same results it obtained in the past, even if the underlying dataset has since been corrected, completed, or has evolved in any way. A data warehouse should be a temporal database, able to provide information both in its current state and in its past states (when it was still wrong or incomplete), as well as details on when and how it was later corrected, completed or even deleted. The same concept should apply to a data lake, whatever its implementation. Rather than "immutable", it should be updatable in a non-destructive way.
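
As a rough illustration of "updatable in a non-destructive way", here is a minimal in-memory sketch. The class and method names (AppendOnlyStore, record, as_of, history) are hypothetical and do not come from any specific temporal database. Corrections and deletions are appended as new versions, so a past report can be reproduced by asking for the state as of an earlier moment.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Version:
    key: str
    value: Optional[float]  # None records a logical delete (a tombstone)
    recorded_at: datetime


class AppendOnlyStore:
    """Hypothetical store: corrections and deletes are new versions, never overwrites."""

    def __init__(self):
        self._log = []

    def record(self, key, value, recorded_at):
        self._log.append(Version(key, value, recorded_at))

    def as_of(self, moment):
        """Return the dataset as it was known at `moment`."""
        state = {}
        for version in sorted(self._log, key=lambda v: v.recorded_at):
            if version.recorded_at > moment:
                break
            if version.value is None:
                state.pop(version.key, None)  # logically deleted at this point
            else:
                state[version.key] = version.value
        return state

    def history(self, key):
        """Show when and how a single key was corrected or deleted."""
        return [v for v in self._log if v.key == key]


store = AppendOnlyStore()
store.record("invoice-1", 100.0, datetime(2019, 1, 1))
store.record("invoice-1", 120.0, datetime(2019, 2, 1))  # later correction
store.record("invoice-1", None, datetime(2019, 3, 1))   # later logical delete

january_report = store.as_of(datetime(2019, 1, 15))  # the original, uncorrected value
current_report = store.as_of(datetime(2019, 3, 15))  # the invoice is gone by now
```

The January report can still be reproduced after the correction and the delete, and history("invoice-1") lists all three versions in the order they were recorded.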
dsor
Just because a dataset is immutable doesn't mean we can't amend it at the point where it is read.
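
One way to read this is as a correction overlay applied on read. The sketch below assumes the amendments are kept in a separate structure beside the raw data; the meter readings and the read_amended helper are hypothetical examples.

```python
from types import MappingProxyType

# Hypothetical raw records, wrapped read-only so they cannot be edited in place.
RAW_READINGS = MappingProxyType({
    "meter-1": 42.0,
    "meter-2": -7.0,  # known-bad reading, left exactly as captured
})

# Amendments live beside the raw data, not inside it.
CORRECTIONS = {"meter-2": 7.0}


def read_amended(raw, corrections):
    """Return a corrected view; the raw mapping itself is untouched."""
    return {key: corrections.get(key, value) for key, value in raw.items()}
```

Consumers call read_amended to get the corrected values, while the raw mapping still holds exactly what was captured.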
Eric M Russell
Is this more of a religious debate about whether everybody is 'sposta' make their data lake immutable?

Azure Blob Storage already has an optional immutable (Write Once, Read Many) feature. It's used by financial, legal and other industries that must comply with auditing and retention regulations.
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-immutable-storage


"The universe is complicated and for the most part beyond your control, but your life is only as complicated as you choose it to be."
roger.plowman
The first question that should be asked is, should you even have a data lake or data warehouse?

Harking back to the whole security issue, a data lake is precisely the kind of holy grail hackers would be salivating for. Since you're dumping (mostly) raw data into it, what are the chances that it contains PII? Or even sensitive information that could embarrass/seriously threaten your company?

Second, if you make the data immutable how do you update data that's erroneous? Or delete data in accordance with GDPR / some as yet unwritten law?

I suspect immutability should be asked after asking if you should even have the data lake or warehouse in the first place.
skeleton567
Eric M Russell - Tuesday, February 26, 2019 6:49 AM
Is this more of a religious debate about whether everybody is 'sposta' make their data lake immutable?

Azure Blob Storage already has an optional immutable (Write Once, Read Many) feature. It's used by financial, legal and other industries that must comply with auditing and retention regulations.
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-immutable-storage

Eric, you make an excellent point on the legal aspect, and it is the first thing that came to mind when I read the article. My wife and I are actually involved in a legal situation that is scheduled to go to trial in April, and the historical aspect of the original medical data is critical. In our situation, one of the first things I did three years ago was collect records as they existed at that time, under the guise of providing them to our other medical care providers. Of course, the argument of modification can work both ways, and we as technical people understand that, but at least we have a basis for argument in the event data has been 'interpreted'.

As another aside here, the current trend in attempting to 'rewrite history' in this nation presents an interesting situation in the effort to have it both ways, depending on motivation. As the current discussion of 'reparations' illustrates, financial motivation is a powerful force, and the victims are not always the perpetrators. Rewriting history can change who the victims are.

The legal ramifications of data are immense regardless of its veracity.


Rick
Simplicity is the ultimate sophistication.
- L. DaVinci
skeleton567
roger.plowman - Tuesday, February 26, 2019 7:24 AM
The first question that should be asked is, should you even have a data lake or data warehouse?

Harking back to the whole security issue, a data lake is precisely the kind of holy grail hackers would be salivating for. Since you're dumping (mostly) raw data into it, what are the chances that it contains PII? Or even sensitive information that could embarrass/seriously threaten your company?

Second, if you make the data immutable how do you update data that's erroneous? Or delete data in accordance with GDPR / some as yet unwritten law?

I suspect immutability should be asked after asking if you should even have the data lake or warehouse in the first place.

Having or not having data is a real problem in the legal arena, and can help or hurt. Obviously it can and does work both ways, and unfortunately will likely depend on the skills of your legal defense versus the opposition. Ideally I would have to favor the historical method of correcting rather than modifying the original, but either way carries its own risks. At least preserving history and making a real record of corrections offers an honest approach and would serve to remove suspicion of tampering from consideration of other problems.


Rick
Simplicity is the ultimate sophistication.
- L. DaVinci
ZZartin
roger.plowman - Tuesday, February 26, 2019 7:24 AM
The first question that should be asked is, should you even have a data lake or data warehouse?

Harking back to the whole security issue, a data lake is precisely the kind of holy grail hackers would be salivating for. Since you're dumping (mostly) raw data into it, what are the chances that it contains PII? Or even sensitive information that could embarrass/seriously threaten your company?

Second, if you make the data immutable how do you update data that's erroneous? Or delete data in accordance with GDPR / some as yet unwritten law?

I suspect immutability should be asked after asking if you should even have the data lake or warehouse in the first place.

In a lot of cases, yes. There is usually a lot of value for a company in being able to see what changes have been made to data over time. And, as was mentioned above, from an auditing perspective it might in fact be required to store the historical changes to PII.

kinzleb
I subscribe to the idea of using zones in a data lake, and having a "raw zone" that is immutable. The data in other zones should be replaceable by rerunning a repeatable process that "selects" data from the immutable raw zone and places/replaces it in the downstream zone. Melissa Coates has a great diagram showing a data lake with different zones for a visual: https://www.sqlchick.com/entries/2017/12/30/zones-in-a-data-lake
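
A minimal sketch of the raw zone and rebuildable downstream zone idea, using local folders as a stand-in for lake storage. The folder names, file layout and the name-cleaning step are all assumptions made for illustration. Raw files are written once and never overwritten, and the curated output is regenerated in full from them on every run.

```python
import json
import tempfile
from pathlib import Path


def land_raw(raw_dir, name, payload):
    """Write a file into the raw zone once; refuse to overwrite it."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / name
    with target.open("x") as handle:  # mode "x" fails if the file already exists
        handle.write(payload)
    return target


def rebuild_curated(raw_dir, curated_dir):
    """Replace the curated zone by re-reading everything in the raw zone."""
    curated_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for raw_file in sorted(raw_dir.glob("*.json")):
        record = json.loads(raw_file.read_text())
        # Example transformation (hypothetical): tidy up the name field.
        rows.append({"id": record["id"], "name": record["name"].strip().title()})
    (curated_dir / "customers.json").write_text(json.dumps(rows, indent=2))
    return rows


def demo():
    """Land two raw files, then build the curated zone twice."""
    with tempfile.TemporaryDirectory() as root:
        raw, curated = Path(root) / "raw", Path(root) / "curated"
        land_raw(raw, "001.json", json.dumps({"id": 1, "name": "  ada lovelace "}))
        land_raw(raw, "002.json", json.dumps({"id": 2, "name": "grace HOPPER"}))
        first = rebuild_curated(raw, curated)
        second = rebuild_curated(raw, curated)  # rerunning gives the same result
        return first, second
```

Because the curated file is always derived from the raw files, a bug in the transformation can be fixed and the zone rebuilt without touching what was originally captured.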
