Data Overload

  • Comments posted to this topic are about the item Data Overload

  • Removing old data is something that I have heard discussed more and more over the last couple of years. However, almost every time, someone has said that they are not prepared to be the one to recommend it, let alone authorise it. There is so much fear. Not of the data being lost forever, but of being known as the person who lost the data. I guess that, as data professionals, that would be a far worse reputation for us to have than if a developer, for example, attained the same reputation.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • There are so many problems with retaining data with no clear use case.

    It has to be backed up, restored, go through DR rehearsals, all of which have an increasing impact on operational systems even when in separate tables or an obscure part of the database.

    An approach I've seen work is to have transactional replication forward the data on to a staging server and then have a historical data store facility grab that data for longer-term access. This allowed operational systems to be kept as lean as possible (a rough sketch of the idea is at the end of this post).

    Another side effect is that rarely used tables tend to come from rarely used systems. How do you know if that system has been deprecated or is simply hibernating? In the absence of decent communication and documentation you are forced to operate under a climate of fear.
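
    Purely as an illustration of that "grab from staging into a history store" step, here is a minimal sketch. All of the object names (StagingDB, HistoryDB, dbo.Orders, dbo.Orders_History, the watermark table) are hypothetical placeholders rather than anything from this thread, and the replication that feeds the staging table is assumed to already be in place.

    -- Hypothetical names throughout; assumes transactional replication already
    -- delivers rows into StagingDB.dbo.Orders on the staging server.
    DECLARE @LastCopiedId bigint;

    -- Watermark of the last row already copied into the history store.
    SELECT @LastCopiedId = LastCopiedId
    FROM   HistoryDB.dbo.CopyWatermark
    WHERE  TableName = N'dbo.Orders';

    BEGIN TRANSACTION;

    -- Copy anything newer than the watermark into the long-term store.
    INSERT INTO HistoryDB.dbo.Orders_History (OrderId, CustomerId, OrderDate, Amount)
    SELECT o.OrderId, o.CustomerId, o.OrderDate, o.Amount
    FROM   StagingDB.dbo.Orders AS o
    WHERE  o.OrderId > @LastCopiedId;

    -- Move the watermark forward so the next run starts where this one stopped.
    UPDATE HistoryDB.dbo.CopyWatermark
    SET    LastCopiedId = ISNULL((SELECT MAX(OrderId) FROM StagingDB.dbo.Orders), @LastCopiedId)
    WHERE  TableName = N'dbo.Orders';

    COMMIT TRANSACTION;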

  • I'm fighting an uphill battle to get old data archived.

    Archived data doesn't need to be backed up every day and it can be put somewhere relatively slow (not on our shiny new SSDs, for example). Archiving keeps the volume of hot data relatively static. It does come at a processing cost, to be sure, but it also prevents the DB from becoming progressively slower as the volume of data grows. Add to that that hardware is continually improving in performance, and there is a good chance that, with a working archival system, the DB becomes faster with each hardware upgrade (as against running to stand still in order to cope with index and query performance that keeps slowing as the data piles up). A sketch of the sort of batched archive job I have in mind is at the end of this post.

    It costs time and money to retrofit an application and database with an archival system that is accessible by the application. SPs and Entity-Framework queries (may they burn in Hell) have to be re-written or newly written. The application has to be changed and so on and so forth. It all must be tested. All this instead of developing something new.

    Whenever a new application is developed, I tell them of the need to build in an archival system from the beginning, but there is little interest. There is always an excuse: the application will never be big enough to need one, we don't have the budget (i.e. the interest), we'll build it later if one is needed (i.e. we don't care; it's not our problem).

    If the application gets slow, just oblige the company to throw more hardware at it. Space and performance aren't our problem, they're yours (namely the DBAs and sysadmins).
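
    That batched archive job, as a minimal and purely illustrative sketch: the table names (dbo.AuditLog, archive.AuditLog), the two-year cut-off and the 5,000-row batch size are assumptions of mine, not anything specified in the thread.

    -- Illustrative only: move audit rows older than two years into an archive
    -- table in small batches so the job never holds long blocking locks.
    DECLARE @Cutoff    datetime2 = DATEADD(YEAR, -2, SYSUTCDATETIME());
    DECLARE @BatchSize int       = 5000;
    DECLARE @Moved     int       = 1;

    WHILE @Moved > 0
    BEGIN
        BEGIN TRANSACTION;

        -- Delete one batch from the hot table and, via OUTPUT, insert the same
        -- rows into the archive table in a single statement.
        DELETE TOP (@BatchSize) FROM dbo.AuditLog
        OUTPUT deleted.AuditId, deleted.EventDate, deleted.EventData
            INTO archive.AuditLog (AuditId, EventDate, EventData)
        WHERE EventDate < @Cutoff;

        SET @Moved = @@ROWCOUNT;

        COMMIT TRANSACTION;
    END;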

  • I used to bang on to management about the vast amount of archived client data that was just sat there doing nothing, and about the big opportunities being missed. Some bright lads in consulting, sales and marketing would surely have been able to sell statistical data derived from it back to clients. There was never any appetite or interest, despite the low cost and resource investment compared to the potential returns.

  • Sean Redmond (8/9/2016)


    ...Whenever a new application is developed, I tell them of the need to build in an archival system from the beginning, but there is little interest. There is always an excuse: the application will never be big enough to need one, we don't have the budget (i.e. the interest), we'll build it later if one is needed (i.e. we don't care; it's not our problem)...

    Dependent upon the system, I am not sure it is the right sell.

    Cost now versus cost on the never-never? It is always going to be deferred. I believe what is important is to consider archival and have a design documented and in place. The UI might need replacing before the data volume and/or age requires archiving. It might not. The business processes might change. They might not.

    What is needed is agility, which requires careful consideration. In order to implement an archiving strategy as simply as possible, the technical issues need to have been thought through from day 1. The business requirements for archived data may be very different by the time it is actually needed.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • At a previous company I worked for, they instituted a nationwide archival/cleanup system. This had a lot to do with the costs associated with discovery in the event of a lawsuit. As long as you meet the various national/state/local requirements and it is an official company policy, you are able to delete data and not have that held against you in the event of a lawsuit. The cost of retrieving the data, presenting it in the required format to the concerned parties, having the content reviewed by lawyers and knowledgeable experts, and presenting it in court can be enormous.

  • One of the DBAs here refers to this as 'landfill data'. It's especially prevalent after a reorganization, when no one knows why the data was created and no one wants to be the one to say that it can be deleted.

    When I started building out my new project I added some maintenance that looks at the index usage statistics to find tables that haven't been accessed in the last X days (a sketch of the sort of query involved is at the end of this post). It gives us a start on finding those unused tables and hunting down the need and the owner. Then the data can be documented, deleted, archived, etc.

    It's not perfect but it's something. Storage space may be 'cheap' but leaving data sitting around for no good reason isn't something to encourage.
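
    Something along these lines, as a rough sketch only: it leans on sys.dm_db_index_usage_stats, which is cleared every time the instance restarts, so "no recorded activity" really means "none since the last restart". The 90-day threshold is just a placeholder.

    -- Sketch only: tables whose indexes show no user reads or writes in the last
    -- @DaysBack days (or no recorded activity at all since the last restart).
    DECLARE @DaysBack int = 90;   -- placeholder threshold

    SELECT  s.name AS SchemaName,
            t.name AS TableName,
            MAX(u.last_user_seek)   AS LastSeek,
            MAX(u.last_user_scan)   AS LastScan,
            MAX(u.last_user_lookup) AS LastLookup,
            MAX(u.last_user_update) AS LastUpdate
    FROM    sys.tables AS t
    JOIN    sys.schemas AS s ON s.schema_id = t.schema_id
    LEFT JOIN sys.dm_db_index_usage_stats AS u
           ON u.object_id = t.object_id
          AND u.database_id = DB_ID()
    GROUP BY s.name, t.name
    HAVING  ISNULL(MAX(u.last_user_seek),   '19000101') < DATEADD(DAY, -@DaysBack, GETDATE())
        AND ISNULL(MAX(u.last_user_scan),   '19000101') < DATEADD(DAY, -@DaysBack, GETDATE())
        AND ISNULL(MAX(u.last_user_lookup), '19000101') < DATEADD(DAY, -@DaysBack, GETDATE())
        AND ISNULL(MAX(u.last_user_update), '19000101') < DATEADD(DAY, -@DaysBack, GETDATE())
    ORDER BY s.name, t.name;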

  • If your job is to manage an ever expanding database containing a record of every eCommerce website mouse click or a record of every public toilet flush across a metropolitan city, then you certainly are a poor suffering bastard.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (8/9/2016)


    If your job is to manage an ever expanding database containing a record of every eCommerce website mouse click or a record of every public toilet flush across a metropolitan city, then you certainly are a poor suffering bastard.

    I am a developer...it is Premier CRU all the way 😉

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • At my old job this was an issue I was concerned about. Since it was a very small IT shop and we (all 2 of us) had to wear multiple hats, I was concerned with data storage, especially of older data. Federal regulations required us to hold onto data for something like 7 years. Data on adolescents had to be held onto for even longer, 15 years if my memory is correct. However, we didn't come up with any solution to this problem, and I'll take some of the blame for that; I could have done better at it. I'm still not sure what the solution would have been, but somehow we could have come up with something. However, at least in that situation it wasn't too large of an issue, as the data we did have, even after almost 20 years of being in business, was still only a few gigabytes in size.

    Now I work for a large state government agency with a much larger IT staff. This is no longer a consideration for me, as I don't work any longer as a DBA. But looking at it from a developer's point of view, I see that some of the software systems we work on do a good job of this, whereas others don't.

    Looking at both my old job and my new one, the thing that sticks out to me is that at the beginning of any new software solution there's a tendency to collect as much data as possible. The better systems seem to take the approach of collecting everything, including the proverbial kitchen sink, for a few years; then they analyze what it is that actually makes sense to collect and stop collecting all the other data points. Some data points give you insight, and the number of data points that do might be large. But not all possible data points give you insight.

    Kindest Regards, Rod
    Connect with me on LinkedIn.

  • I think there's a huge difference between archiving data and deleting data. Yes, that one audit log record out of millions from 10 years ago might mean it takes a lot to keep all that data alive, but when you actually find out you need it, it's worth it to have it around.

  • jimg-533118 (8/9/2016)


    At a previous company I worked for, they instituted a nationwide archival/cleanup system. This had a lot to do with the costs associated with discovery in the event of a lawsuit. As long as you meet the various national/state/local requirements and it is an official company policy, you are able to delete data and not have that held against you in the event of a lawsuit. The cost of retrieving the data, presenting it in the required format to the concerned parties, having the content reviewed by lawyers and knowledgeable experts, and presenting it in court can be enormous.

    This is a big driver. If you know what your retention limit is, legally, I'd adhere to it.

  • For some outfits that collect a lot of data (our GCHQ, your NSA) it's been suggested that they would be far more effective if they collected less data. Of course that may be incorrect. (But personally, I suspect it's actually spot on.) Anyway, as we extend the limits of our communications volume, I wonder how they will cope with all the intercepted traffic when they arrive at the situation depicted at XKCD - after all, if they get there, there could be a crazy amount of intercepts to filter and a crazy amount that cannot be discarded mechanically.

    Tom

  • I'm coming to the end of my career and, looking back, I can see a huge amount of data collected, allegedly for auditing purposes, that was never accessed and had no clear stakeholder for that audit data.

    I see developers and DBAs trying to second-guess requirements that have never been stated or clarified. To be brutally honest, the second-guessed requirements have two characteristics:

    • They are unnecessary in only about 20% of cases
    • They are probably the clearest set of requirements you have to work with

    One of the main reasons that outsourcing ends in tears is that the outsourcer can ONLY supply the requirements that are stated. They have little intrinsic knowledge of the particular business.

    The 20% of cases where the second-guessed requirement is wrong is where the ever-growing data comes from.

    I think data retention policies are best argued from the legal perspective. I'd say operational systems should be kept down to the minimum needed to fulfil their business function. For backend systems where there is no legal requirement, I'd put a line in the sand and state 3 years' retention on warm storage and 6 years on cold storage (a sketch of a simple policy-driven purge is at the end of this post). Beyond 3 years the chances of senior management being in the same position are ever diminishing, so the chances of a decision returning to haunt you are low.

    In 3 years' time the powers that be will be in a position to blame a plethora of people who have left the business, should an issue arise.
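
    Purely as a sketch of what that line in the sand might look like in practice: a small policy table drives a purge of anything past its warm-storage retention. Every name here (dbo.RetentionPolicy, dbo.WebClickLog and so on) is a made-up placeholder, and copying rows out to cold storage is assumed to happen in a separate step before this purge runs.

    -- Hypothetical policy table: which table, which date column, how long to keep.
    CREATE TABLE dbo.RetentionPolicy
    (
        SchemaName  sysname NOT NULL,
        TableName   sysname NOT NULL,
        DateColumn  sysname NOT NULL,
        RetainYears int     NOT NULL,   -- e.g. 3 for warm storage, per the post
        CONSTRAINT PK_RetentionPolicy PRIMARY KEY (SchemaName, TableName)
    );

    INSERT INTO dbo.RetentionPolicy VALUES (N'dbo', N'WebClickLog', N'ClickDate', 3);

    -- Build and run one DELETE per policy row; rows bound for cold storage are
    -- assumed to have been copied out already by a separate archival job.
    DECLARE @sql nvarchar(max);
    DECLARE policy_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT N'DELETE FROM ' + QUOTENAME(SchemaName) + N'.' + QUOTENAME(TableName)
             + N' WHERE ' + QUOTENAME(DateColumn)
             + N' < DATEADD(YEAR, -' + CAST(RetainYears AS nvarchar(10)) + N', GETDATE());'
        FROM dbo.RetentionPolicy;

    OPEN policy_cursor;
    FETCH NEXT FROM policy_cursor INTO @sql;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM policy_cursor INTO @sql;
    END;

    CLOSE policy_cursor;
    DEALLOCATE policy_cursor;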
