Should the Data Lake be Immutable?

  • jj3pa - Tuesday, February 26, 2019 2:57 PM

    skeleton567 - Tuesday, February 26, 2019 1:53 PM

    jj3pa - Tuesday, February 26, 2019 1:19 PM

    Maybe I'm missing something, but the CS idea of immutable doesn't mean the variable can't change. It just means the value has to be copied, which can be expensive.

    For instance, in C#:

    string x = "abc";
    x = x + "def";

    is perfectly legit. But the runtime copies x to a temporary place, allocates a new string, then concatenates into it.
    The issue is efficiency, which is why C# (and I believe Java) have StringBuffer.  That is allocated dynamically.

    Hope this helps.

    JJ

    JJ3pa, I think we're thinking more along the lines of the pros and cons of 'fixing' original data that is proven defective, versus making adjustments that can be traced historically, keeping the original intact and unmodified.  For instance, if you identify that a system has a bug that is creating invalid data, how much do you fix by alteration and how much do you fix by traceable adjustments?  In other words, do you bury the evidence, or provide evidence of why the data is being modified?  One can always adjust results without altering the elements.

    My experience is that by far the greatest problem is getting management to admit that the original data is invalid and that something needs to be done.  All downhill from there.  🙂

    Regarding the example provided above, I like it overall, but might suggest that the original data actually be taken 'offline' and archived in the event it is appropriately needed.  This accomplishes availability AND protection.  Just because cloud space is available doesn't mean it MUST be used.  Offline data in the company vault is sometimes good, even off-premises.  Even back in the 70's I was trading off-premises bulk data storage with another company nearby.  It was secured in company vaults and extremely inexpensive, and we always had the unmodified version.

    I'm sorry, I wasn't speaking in db terms - I'm a newbie at that :).  That's why I'm here, to learn from you guys and gals.  I just meant that perhaps the term takes on a different meaning in CS vs. DB circles.
    I'm getting my head around window functions 🙂

    jj

    And then we get even further into the swamp when 'fixing' data that is 'invalid' (read: doesn't MATCH other data) would further invalidate OTHER related data from various sources that themselves would ideally also be immutable.

    For example, I work with data that involves stock and bond trades.  These calculations often provide six decimal places for both number of shares and per-share price.  Various software will then round and/or truncate these numbers to fewer decimal places, often doing the same calculations with different results.  The quandary then becomes which result you use.  In these instances, one must choose one source or the other as the one to be 'immutable'.  One common software package, for instance, presents you with the option of altering either the number of shares in the transaction OR the per-share price when reconciling your records.  Of course, in reality this is not a significant issue, but it serves to illustrate the problem of deciding which version prevails.  And it is further complicated by the need to consider FUTURE needs for reconciliation of more current data.  In a nutshell,

    Number of shares            X            Share price              =              Transaction total
    999.999999                                   999.999999                                999999.99

    OK, so your calculation total does not match that of the financial institution.  In this example, you want an accurate number of shares in your history for future reconciliation, and need an accurate transaction total for matching to other records such as account totals, taxes, etc.  Since you need two of the three numbers to match, you decide to sacrifice the 'accuracy' of the per-share price.  This is logical because the per-share price is, in the future, itself subject to historical averaging.  
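
    To make that trade-off concrete, here is a rough C# sketch (the figures are made up for illustration, not taken from any real reconciliation): keep the share count and the institution's transaction total as the 'immutable' pair, and re-derive a per-share price that makes them agree.

    // Illustrative sketch only - hypothetical numbers.
    using System;

    class ReconciliationSketch
    {
        static void Main()
        {
            decimal shares = 123.456789m;     // six decimal places
            decimal price  = 45.123456m;      // six decimal places

            decimal ourTotal  = Math.Round(shares * price, 2);  // what we compute: 5570.80
            decimal bankTotal = 5570.85m;                        // hypothetical figure from the institution

            // Keep shares and bankTotal exact; sacrifice the per-share price instead.
            decimal impliedPrice = Math.Round(bankTotal / shares, 6);

            Console.WriteLine($"our total:     {ourTotal}");
            Console.WriteLine($"bank total:    {bankTotal}");
            Console.WriteLine($"implied price: {impliedPrice} (original was {price})");
        }
    }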

    Obviously the decisions regarding immutability of data have many, many ramifications. 

    Isn't this fun?

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • skeleton567 - Wednesday, February 27, 2019 7:57 AM

    And then we get even further into the swamp when 'fixing' data that is 'invalid' (read: doesn't MATCH other data) would further invalidate OTHER related data from various sources that themselves would ideally also be immutable.

    For example, I work with data that involves stock and bond trades.  These calculations often provide six decimal places for both number of shares and per-share price.  Various software will then round and/or truncate these numbers to fewer decimal places, often doing the same calculations with different results.  The quandary then becomes which result you use.  In these instances, one must choose one source or the other as the one to be 'immutable'.  One common software package, for instance, presents you with the option of altering either the number of shares in the transaction OR the per-share price when reconciling your records.  Of course, in reality this is not a significant issue, but it serves to illustrate the problem of deciding which version prevails.  And it is further complicated by the need to consider FUTURE needs for reconciliation of more current data.  In a nutshell,

    Number of shares            X            Share price              =              Transaction total
    999.999999                                   999.999999                                999999.99

    OK, so your calculation total does not match that of the financial institution.  In this example, you want an accurate number of shares in your history for future reconciliation, and need an accurate transaction total for matching to other records such as account totals, taxes, etc.  Since you need two of the three numbers to match, you decide to sacrifice the 'accuracy' of the per-share price.  This is logical because the per-share price is, in the future, itself subject to historical averaging.  

    Obviously the decisions regarding immutability of data have many, many ramifications. 

    Isn't this fun?

    And hence the need for BCD :).  But we digress ...
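
    A quick aside on the BCD point: binary floating point can't represent most decimal fractions exactly, which is one source of the mismatches above, whereas C#'s decimal type (a scaled base-10 type, similar in spirit to BCD) keeps the decimal digits.  A tiny sketch:

    using System;

    class DecimalVsDouble
    {
        static void Main()
        {
            double d = 0.1 + 0.2;
            decimal m = 0.1m + 0.2m;

            Console.WriteLine(d == 0.3);   // False: d is 0.30000000000000004
            Console.WriteLine(m == 0.3m);  // True: exact decimal arithmetic here
        }
    }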

  • jj3pa - Tuesday, February 26, 2019 1:19 PM

    Maybe I'm missing something, but the CS idea of immutable doesn't mean the variable can't change. It just means the value has to be copied, which can be expensive.

    For instance, in C#:

    string x = "abc";
    x = x + "def";

    is perfectly legit. But the runtime copies x to a temporary place, allocates a new string, then concatenates into it.
    The issue is efficiency, which is why C# (and I believe Java) have StringBuffer.  That is allocated dynamically.

    Hope this helps.

    JJ

    The idea of data lake immutability is the same concept. You could change the file/data set, but it's expensive. You end up replacing it, so it appears "changed", but it's really an expensive and difficult process.
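
    A rough sketch of that "change by replacement" idea in C# (the paths and the clean-up rule are hypothetical, just for illustration): the original file is never edited in place; a corrected copy is written and then swapped in as the current version, with the old one kept around.

    using System;
    using System.IO;
    using System.Linq;

    class ReplaceNotModify
    {
        static void Main()
        {
            string original  = @"C:\lake\raw\trades_2019-02-26.csv";  // hypothetical path
            string corrected = original + ".v2";

            // "Changing" the data set really means producing a whole new copy...
            var fixedLines = File.ReadLines(original)
                                 .Select(line => line.Replace("N/A", ""));  // hypothetical fix
            File.WriteAllLines(corrected, fixedLines);

            // ...and swapping it in, keeping the prior version for traceability.
            File.Move(original, original + ".superseded");
            File.Move(corrected, original);
        }
    }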

    As data lakes have been maturing and we're starting to see best practices emerge, one of those practices I've seen is the concept of early ingestion and late processing.  So the concern of getting the data right/correct, and the difficulty and expense involved in that, would come after the original raw data (which I would argue should be our immutable data) has been ingested.  After the raw data has been ingested, subsequent data improvements should be applied to copies of the data so that the original detailed source data remains intact.

    I suppose a lot depends ultimately on what you're trying to accomplish: if you're doing streaming/real-time analytics or something along those lines, I could see it being more tempting to do some up-front improvement of the data as you're ingesting it and to try to minimize extra hops and copies of the data along the way.  In recent years, as new technologies like Spark have emerged in combination with new file formats like Parquet, I've become much more comfortable with the notion of processing and reprocessing copies of data on the regular (scaling out if needed), putting the emphasis on getting the processing code "right" and iteratively improving it, then rerunning the process to generate clean data that overwrites what the process generated previously, as opposed to cleaning a given copy of the data itself.
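
    As a rough sketch of that raw-versus-processed split (folder names and the Clean() rule are made up for illustration): the raw zone is only ever read, and the curated output is simply thrown away and regenerated by rerunning the processing code after it has been improved.

    using System;
    using System.IO;
    using System.Linq;

    class ReprocessCurated
    {
        const string RawZone     = @"C:\lake\raw";      // immutable after ingestion
        const string CuratedZone = @"C:\lake\curated";  // safe to delete and rebuild

        // Stand-in for the real processing logic that gets iteratively improved.
        static string Clean(string line) => line.Trim().ToUpperInvariant();

        static void Main()
        {
            if (Directory.Exists(CuratedZone))
                Directory.Delete(CuratedZone, recursive: true);  // discard the old derived copy
            Directory.CreateDirectory(CuratedZone);

            foreach (var rawFile in Directory.EnumerateFiles(RawZone, "*.csv"))
            {
                var cleaned = File.ReadLines(rawFile).Select(Clean);
                File.WriteAllLines(Path.Combine(CuratedZone, Path.GetFileName(rawFile)), cleaned);
            }
        }
    }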

    I mean, I often take the approach mentioned here so I can just INSERT new datasets of both new and existing values. The same applies to the data lake: the document just gets entirely replaced with the new one. If I could get only the few records that needed to be updated, rather than both the new and the old with each republish, I would likely opt to update just those few records, but I am never that lucky. Easier to just swap the data out by day.

  • ZZartin - Tuesday, February 26, 2019 7:57 AM

    roger.plowman - Tuesday, February 26, 2019 7:24 AM

    The first question that should be asked is, should you even have a data lake or data warehouse?

    Harking back to the whole security issue, a data lake is precisely the kind of holy grail hackers would be salivating over. Since you're dumping (mostly) raw data into it, what are the chances that it contains PII? Or even sensitive information that could embarrass or seriously threaten your company?

    Second, if you make the data immutable how do you update data that's erroneous? Or delete data in accordance with GDPR / some as yet unwritten law?

    I suspect the question of immutability should be asked only after asking whether you should even have the data lake or warehouse in the first place.

    In a lot of cases, yes: there is usually a lot of value for a company in being able to see historically what changes have been made to data over time.  And, as was mentioned above, from an auditing perspective it might in fact be required to store the historical changes of PII.

    The question of how you 'update' data that is erroneous is a good one.  I think one of the things to consider is the time factor.  When making corrections, as I often do, in financial data that is found to have been recorded incorrectly or differently, my preference has always been to preserve the original transaction and then to record the adjusting transaction in the same time frame, thereby making the periodic results correct while retaining the fact that the original was defective.  If designed correctly, we can be pretty sure that queries will include both the original and the correction.  If corrections are not kept in the correct time frame, and properly identified with the original, they would seem to lose a lot of their value.  And a further benefit is that this also allows analysis of exactly what and how much data needs correction in the first place.  It would also help with things like fraud and tampering detection.
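
    A minimal sketch of that adjust-don't-alter pattern (the record layout here is made up for illustration): the defective original stays exactly as recorded, and a correcting entry is posted in the same period, linked back to the entry it corrects, so period totals come out right and the corrections themselves can be analyzed.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record LedgerEntry(int Id, DateTime PostingDate, decimal Amount, int? Corrects, string Note);

    class AdjustDontAlter
    {
        static void Main()
        {
            var ledger = new List<LedgerEntry>
            {
                new(1, new DateTime(2019, 2, 26), 1000.00m, null, "original (recorded 100.00 too high)"),
                // Correction posted in the same time frame, pointing at the entry it fixes:
                new(2, new DateTime(2019, 2, 26), -100.00m, 1, "adjustment for upstream bug"),
            };

            // Period queries naturally pick up both the original and the correction.
            var februaryTotal = ledger
                .Where(e => e.PostingDate.Year == 2019 && e.PostingDate.Month == 2)
                .Sum(e => e.Amount);

            Console.WriteLine(februaryTotal);                         // 900.00 - the corrected figure
            Console.WriteLine(ledger.Count(e => e.Corrects != null)); // how much needed correcting
        }
    }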

    Just a further thought on the PII data: you're going to have to keep it somewhere in order to preserve its value.  The overriding issue here is not WHAT you keep, but HOW and WHERE you keep it.  I'm obviously an old-timer and hark back to the old days before lots of the security issues even developed, but to me it makes sense to 'not keep all your eggs in the same basket'.  In other words, you probably need TWO 'data lakes', one internal and one external, with very limited exposure and controlled access between them.  Sure, it gets lots more complex, but as Reagan said, 'It CAN be done'.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • jj3pa - Tuesday, February 26, 2019 1:19 PM

    Maybe I'm missing something, but the CS idea of immutable doesn't mean the variable can't change. It just means the value has to be copied, which can be expensive.

    For instance, in C#:

    string x = "abc";
    x = x + "def";

    is perfectly legit. But the runtime copies x to a temporary place, allocates a new string, then concatenates into it.
    The issue is efficiency, which is why C# (and I believe Java) have StringBuffer.  That is allocated dynamically.

    Hope this helps.

    JJ

    I'm thinking you mean stringbuilder!

  • patrickmcginnis59 10839 - Monday, March 4, 2019 1:15 PM

    jj3pa - Tuesday, February 26, 2019 1:19 PM

    Maybe I'm missing something, but the CS idea of immutable doesn't mean the variable can't change. It just means the value has to be copied, which can be expensive.

    For instance, in C#:

    string x = "abc";
    x = x + "def";

    is perfectly legit. But the runtime copies x to a temporary place, allocates a new string, then concatenates into it.
    The issue is efficiency, which is why C# (and I believe Java) have StringBuffer.  That is allocated dynamically.

    Hope this helps.

    JJ

    I'm thinking you mean stringbuilder!

    Yes StringBuilder ... thanks.

  • jj3pa - Monday, March 4, 2019 1:27 PM

    patrickmcginnis59 10839 - Monday, March 4, 2019 1:15 PM

    jj3pa - Tuesday, February 26, 2019 1:19 PM

    Maybe I'm missing something, but the CS idea of immutable doesn't mean the variable can't change. It just means the value has to be copied, which can be expensive.

    For instance, in C#:

    string x = "abc";
    x = x + "def";

    is perfectly legit. But the runtime copies x to a temporary place, allocates a new string, then concatenates into it.
    The issue is efficiency, which is why C# (and I believe Java) have StringBuffer.  That is allocated dynamically.

    Hope this helps.

    JJ

    I'm thinking you mean stringbuilder!

    Yes StringBuilder ... thanks.

    Also, I wonder if dynamic allocation is really tangential to the efficiency point; the way I read it, with StringBuilder the append operation is just tacking on to the end of the existing buffer without any need to copy everything to a new string.
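
    To illustrate (a small sketch, not from the earlier post): concatenating with + builds a brand-new string on every pass, while StringBuilder.Append writes into an internal buffer that only grows occasionally, so each append doesn't recopy everything that's already there.

    using System;
    using System.Text;

    class ConcatVsBuilder
    {
        static void Main()
        {
            string s = "";
            for (int i = 0; i < 10000; i++)
                s = s + "x";              // each iteration allocates a new, longer string

            var sb = new StringBuilder();
            for (int i = 0; i < 10000; i++)
                sb.Append('x');           // appends into the existing buffer
            string t = sb.ToString();

            Console.WriteLine(s.Length == t.Length);  // True
        }
    }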

