Query on a large database

  • I mean - you do not need to delete those 20 million rows.

    You may leave them where they are - their presence is not going to affect queries against the rows from the current periods.

    So, why bother?
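
    For illustration only (the table, index, and column names below are assumed, not from the thread): with an index on the period column, a query for the current period seeks straight past the historical rows.

    -- Minimal sketch; dbo.Sales, SaleDate and Amount are hypothetical names.
    CREATE INDEX IX_Sales_SaleDate ON dbo.Sales (SaleDate) INCLUDE (Amount);

    SELECT SUM(Amount)
    FROM dbo.Sales
    WHERE SaleDate >= '20151001' AND SaleDate < '20151101';
    -- The index seek touches only the current month's rows; the 20 million
    -- historical rows are never read.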

    _____________
    Code for TallyGenerator

  • Why?

    Why keep data that we don't need to keep in detail and only need to keep in an aggregated form?

    Why make the backups larger, the DB larger, the maintenance longer, the restores longer, etc. for data that doesn't need to be there at all?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (11/1/2015)


    Why?

    Why keep data that we don't need to keep in detail and only need to keep in an aggregated form?

    What kind of data do we not need?

    Health history?

    Credit history?

    Criminal records?

    The only area where detailed history is not required is accounting.

    But even there - "not required by law" does not mean "not needed".

    The only reason data users agree to aggregate historical data and remove the detailed information is the common belief that keeping it will degrade database performance - a belief fed by all-too-common bad database design patterns.

    As for database backups - I don't think there is a place for them in the current world.

    We must get over it.

    Restoring databases from backups is for "school projects" kind of databases only.

    _____________
    Code for TallyGenerator

  • We're going to have to agree to disagree here, on all of those points.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
    It's a funny and sometimes stupid thing at work. Everyone is required NOT to keep emails over a year, and so they set up automatic deletions. It almost killed us for audits because they wanted to see 14 months of enterprise reports (job runs, disk status, etc., ad infinitum). I had to set up my email to squirrel those morning reports away in a separate email folder (the information on disk was disallowed... they wanted the actual emails) that's immune from the daily purge of email over a year.

    My biggest single table is 350GB. It's telephone calls and I do mean recordings of the calls. We're also required to delete old audit data after a year... except for those clients that are still active and the inactive clients that haven't asked for the purge. In other words, we keep all the calls. It's the big thing that drove me into partitioning to drop the backup durations of 6-10 hours down to 6-10 minutes. I've got 5 years of calls in that baby.
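
    Something like the following sketch shows the idea (all names, boundary values, and filegroups are assumed for illustration, not the actual setup): once a closed year's filegroup is marked read-only it needs backing up only once, and the regular backup covers just the active filegroup.

    -- Hypothetical names throughout.
    CREATE PARTITION FUNCTION pfCallsByYear (datetime)
        AS RANGE RIGHT FOR VALUES ('20120101', '20130101', '20140101', '20150101');

    CREATE PARTITION SCHEME psCallsByYear
        AS PARTITION pfCallsByYear TO (FG2011, FG2012, FG2013, FG2014, FG2015);

    -- Back up a closed year once, then back up only the current filegroup.
    ALTER DATABASE Calls MODIFY FILEGROUP FG2014 READONLY;
    BACKUP DATABASE Calls FILEGROUP = 'FG2015' TO DISK = 'D:\Backup\Calls_FG2015.bak';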

    I normally wouldn't do the backups because of the things that Sergiy has spoken of. There are too many ways to get the data back, and doing backups for restore planning shouldn't be one of them. Unfortunately, I don't trust the cloud for privacy or continuity, and the offsite DR clustering and logging that has been done seems unreliable to me (a nice way of saying "not ready for prime time").

    So, I'm caught in the middle. It would be nice to do the deletes, if only because of the size of the legacy data we have, so that the backups - and the restores I'll ultimately be able to get over, but can't yet - still work. Then there are the restores to Dev and Staging, because they won't spend the nickel on SAN snapshots except for DR, and those can't be used for anything else.

    So, Push-me-pull-you. :blink: I agree with Sergiy, and that's the point I'm trying to get to, but I'm not there yet, so I have to agree with Gail on all points until it's no longer a point... except when someone at work thinks it is. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Gail, I'm talking from experience.

    I went through a "whole site" disaster.

    Not quite a "fire in the room", but with the same consequences - everything is down and non-recoverable.

    Yes, backups were all right and up to date. Less than 15 minutes loss of data.

    Good.

    Then what?

    We need a new server. No cloud.

    Luckily, a server of an appropriate scale was available from our service provider. A couple of hours and it was up and running (pure luck - for a new server from a shop we'd have needed a couple of days just to get it through the door into the premises).

    Restoring databases: 75 GB + 125 GB + 250 GB.

    Nothing extraordinary. But it took about 12 hours to complete the task.

    At the same time we needed to recreate the security environment, register our new site for access to the web services our system used, recreate the full infrastructure, etc.

    End of the day - we had 24 hours outage.

    Plus tidying up odds and ends for the whole following week.

    With all the backups and the hardware available on site.

    With less than a TB of data.

    Can anybody in serious online business afford such outages?

    The only way to provide a more or less reliable service is to have a DR site up and running. Mirroring, shipping transaction logs, or any other way - it does not matter.

    But with more or less big databases you must forget about restoring them from full backups.

    Well, you might need to do a full backup to create a new DR site. But that's pretty much it.
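
    A minimal sketch of that seeding step, with assumed database names and paths: one full backup restored WITH NORECOVERY, after which the DR copy is kept current from log backups (log-shipping style).

    -- Hypothetical names and paths.
    RESTORE DATABASE Prod
        FROM DISK = 'E:\DR\Prod_full.bak'
        WITH NORECOVERY,
             MOVE 'Prod'     TO 'E:\Data\Prod.mdf',
             MOVE 'Prod_log' TO 'E:\Log\Prod.ldf';

    RESTORE LOG Prod FROM DISK = 'E:\DR\Prod_log_001.trn' WITH NORECOVERY;
    -- ...repeat for each subsequent log backup; RESTORE ... WITH RECOVERY
    -- only at failover.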

    _____________
    Code for TallyGenerator

  • Sergiy (11/1/2015)


    Gail, I'm talking from experience.

    Yeah, well, so am I.

    Three times this year I've had to assist with the restore of a production backup to recover deleted data (2 cases) or corrupt file (one case). If all we'd had was a mirror/DR site, that would have been a 'well, you're screwed, hope the data wasn't too important' conversation.

    Backups are not the first line of defense (that's my HA setup). It's not the second (that's my DR), but that doesn't mean I'm going to throw away my 3rd line of defense just because I have two others. Defend in depth. It's not backups OR HA OR DR, it's a combination, whatever combination is needed to meet the RPO and RTO requirements.
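
    For the deleted-data case, the usual pattern (all names and paths here are assumed for illustration) is to restore the backup side by side under a new name and copy back just the missing rows:

    -- Hypothetical names and paths.
    RESTORE DATABASE Prod_Recover
        FROM DISK = 'D:\Backup\Prod.bak'
        WITH MOVE 'Prod'     TO 'D:\Data\Prod_Recover.mdf',
             MOVE 'Prod_log' TO 'D:\Log\Prod_Recover.ldf',
             RECOVERY;

    INSERT INTO Prod.dbo.Orders
    SELECT r.*
    FROM Prod_Recover.dbo.Orders AS r
    WHERE NOT EXISTS (SELECT * FROM Prod.dbo.Orders AS o
                      WHERE o.OrderID = r.OrderID);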

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
    I faced something similar with my billion+ record table. I had about 150 million records in that table all of a sudden become invalid due to a bad data push. I'm not keeping those records in that table. It's already super huge, and indexing those records is expensive due to the sheer size. They are getting removed from the table in a data warehousing environment.

    Another case is retention. After 90 days, most records in the database are not touched. They are not invalid, but they are surely adding to the storage, the backups, and so forth. They get removed and archived so we don't have 2-5 billion record tables sitting there with 70% of the records unused.
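
    A rough sketch of that purge pattern (table and column names assumed, batch size illustrative): archive and delete in small batches so locks and the transaction log stay manageable on a billion-row table.

    -- Hypothetical names; the archive table is assumed to match the source schema.
    DECLARE @rows int = 1;
    WHILE @rows > 0
    BEGIN
        DELETE TOP (50000) f
        OUTPUT deleted.* INTO Archive.dbo.FactSales_Archive
        FROM dbo.FactSales AS f
        WHERE f.LoadDate < DATEADD(DAY, -90, GETDATE());

        SET @rows = @@ROWCOUNT;
    END;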

  • Sergiy (11/1/2015)


    As for database backups - I don't think there is a place for them in the current world.

    How do you handle a fraud investigation with no backups?

    Let's say a company employee has done something dishonest, but it doesn't come to light for some time. You have audit trails etc. built into the database, but let's assume they don't entirely cover what the investigation wants to look at.

    That has happened to clients I have done work for. We painstakingly restored backups, investigated the data as of that time, then restored the next backup and ran comparisons, etc.

    However good the audits, there is the risk that the "insider" knows the extent of them and is working "outside" that safety net. Hence, for me, I think I need backups.

    Also (in addition to the "How do you recover accidentally deleted data" scenario) my software has run amok at times - not often I hasten to add! - and I have been able to fix-the-mess by knitting-spaghetti recovered from backups ...

  • Kristen-173977 (11/2/2015)


    Sergiy (11/1/2015)


    As for database backups - I don't think there is a place for them in the current world.

    How do you handle a fraud investigation with no backups?

    Logs, I'd say?

    For how long do you keep those backups if you were able to restore them to investigate something which "went unnoticed for some time"?

    _____________
    Code for TallyGenerator

  • GilaMonster (11/1/2015)


    Sergiy (11/1/2015)


    Gail, I'm talking from experience.

    Three times this year I've had to assist with the restore of a production backup to recover deleted data (2 cases)

    Oh, I see your point.

    DevOps, how could I forget they are on the rise...

    _____________
    Code for TallyGenerator

    Jeff and I have been having a sidebar on this discussion, and he will be posting his findings shortly.

    Jayanth Kurup

  • Jayanth_Kurup (11/23/2015)


    Me and Jeff have been having a side bar on this discussion and he will be posting his findings shortly.

    I have already posted my findings here:

    http://www.sqlservercentral.com/Forums/FindPost1738471.aspx

    _____________
    Code for TallyGenerator

    As mentioned in the blog, in a DW with over 100 million rows, most of the queries will end up performing scans.

    I don't know what qualifies as "most queries." One does not know others' workloads beyond one's own. If I only need to sum a date range, and not the whole table, how can a simple index not help? Most of the queries I see are sliced and diced, and it takes only one query to give a sum of all rows.

    About creating the clustered index on the ref_age column: you realize that the clustered column is going to sort the data on disk, and in my example new ages will be added later on that will fragment the index. In my original example I was correct in using a nonclustered index on the Age column. If ref_age were an ID column, or sequential in any way, I would understand the logic for making ref_age the clustered index, but in this example it doesn't work, because I will add a million customers today with ages between 1 and 100 and load another set of customers with ages 1 to 100 tomorrow, in which case my clustered index would be fragmented.

    When the data is altered by a large percentage, you are better off rebuilding the index. Forgoing it altogether is not in my playbook.
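
    For instance (object and index names assumed), after a large load:

    -- Check fragmentation, then rebuild if it is high.
    SELECT index_id, avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.Customers'), NULL, NULL, 'LIMITED');

    ALTER INDEX IX_Customers_ref_age ON dbo.Customers REBUILD;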

    ----------------------------------------------------

  • MMartin1 (11/25/2015)


    As mentioned in the blog, in a DW with over 100 million rows, most of the queries will end up performing scans.

    I don't know what qualifies as "most queries." One does not know others' workloads beyond one's own. If I only need to sum a date range, and not the whole table, how can a simple index not help? Most of the queries I see are sliced and diced, and it takes only one query to give a sum of all rows.

    Funnily enough, I'm working right now on a DW with one of the tables containing 105 million rows.

    If a single query among the many procedures we and online customers run against this table every day performed a scan, we would most likely be out of business.

    :hehe:

    _____________
    Code for TallyGenerator
