is there a reorg available in sql server?

  • hi i have a 67 million row fact table that needs about 30 million rows deleted after they've been archived somewhere/somehow.

    i'm not a dba, but things like the log filling up (even in simple recovery mode, after millions of deletes) and fragmentation (i've heard indexes are the big concern here, not data pages) worry me.

    the db is in simple recovery. backups are easy to do and in fact already available if necessary, i think from druva.

    tempdb seems to have 17.6 gig available.   the table occupies 27 gig in data and 26 gig in index.

    because of habits we had many years ago, my gut says to unload all of it (unless archiving goes a different way/technology, see below) to a dev or other server, without surrogate keys and without building indexes on the target, truncate the original fact table, drop the original indexes, use an etl tool to reload what won't be archived, and in a second ssis data flow component load what should be archived to an archive location, then (ouch) rebuild the indexes.

    can the community comment? is there a file format i might use for the archived data that is easily turned into a sql table later only if i need it, but otherwise is an alternative to loading the archived data right now into a new archive db? ...a format that has been shown to be agnostic to whatever sql version we are on at any given point in time? maybe analogous to the old ssis raw file type?

    i am already concerned about transferring 67 million (maybe just 37 million depending) records across the wire to our azure vm.  the db the table is on is 200 gig.   maybe our dba can restore a 2019 backup to a 2022 target as a viable alternative.   i'd still have to transfer at least 37 million records back over the wire.


  • A lot to unpack here and not entirely sure what you are trying to achieve, but will do my best to answer.

    First, if you are deleting 30 million rows, you can likely do that in a loop of 1,000 (or more, depending on your system) at a time and just let it run until the records are gone. Alternatively, you could move the useful rows to a new table, truncate the original, and move the records back (mind you, this MAY be slow and will likely impact users).
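    A batched delete could look something like this - a minimal sketch, untested, where the table name, column, cutoff, and batch size are all placeholders you'd adjust for your system:

        -- delete already-archived rows in small batches so locks and log use stay short-lived
        DECLARE @rows INT = 1;

        WHILE @rows > 0
        BEGIN
            DELETE TOP (50000)
            FROM dbo.FactSales                  -- placeholder table name
            WHERE InvoiceDate < '20200101';     -- placeholder rule for "already archived"

            SET @rows = @@ROWCOUNT;

            CHECKPOINT;   -- in simple recovery this lets the log space be reused between batches
        END;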

    As for the log filling up in simple recovery, I wouldn't worry about that. The log shouldn't fill up from this, especially if you are deleting things in batches. Just give your DBA a heads up first. It also depends on how big (KB) each row is. You say you are deleting less than half the data, so if we assume you are removing roughly 24 GB (half from the table, half from the indexes) from an SSD, that should be pretty quick to do. There will still be user impact, as you will lock the table and index once you start the delete, but if you do it in batches the locks will be taken and then released pretty quickly, and the log is very unlikely to fill up. The reason to give your DBA a heads up is just so they are aware you MAY need more log space, and they can pre-grow it before you start your work and shrink it when you are done.
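    If your DBA does want to pre-size the log for the work and give the space back afterwards, that's roughly the following (database name, logical log file name, and sizes are placeholders):

        -- pre-grow the log before the deletes so it isn't auto-growing mid-work
        ALTER DATABASE YourDW MODIFY FILE (NAME = YourDW_log, SIZE = 20GB);

        -- once the work is done and verified, shrink it back down (target size in MB)
        DBCC SHRINKFILE (YourDW_log, 2048);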

    As for unloading it to a new dev/test system, I don't see the point. IF that is the route you want to go, I'd just put the good data in a new table on the same server, truncate the source table and move the good data back, or drop the source table and rename the new table. Mind you, this approach requires that nobody is inserting or updating the original table while you are working on it AND that there are no foreign key constraints. Bringing SSIS into the mix feels like overkill to me.
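    In other words, something along these lines - a sketch with placeholder names, assuming no foreign keys and nobody writing to the table while you work:

        -- copy the rows you're keeping into a new table in the same database
        SELECT *
        INTO dbo.FactSales_Keep
        FROM dbo.FactSales
        WHERE InvoiceDate >= '20200101';   -- placeholder "keep" rule

        -- then either swap by renaming...
        EXEC sp_rename 'dbo.FactSales', 'FactSales_Old';
        EXEC sp_rename 'dbo.FactSales_Keep', 'FactSales';
        -- ...and recreate the indexes/constraints on the new dbo.FactSales,
        -- or TRUNCATE the original and INSERT the kept rows back into it.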

    I am not sure what you mean by file system format for the archived data that can be turned into a SQL table if you need it later. The file system format must be something SQL supports, so for windows the recommendation is NTFS and it must be uncompressed for SQL to use it. BUT if it's just archive data, why not move the archive data to a new table, backup the database with compression and a name that makes sense for the archive data, then drop/truncate the archive data table? Then if you need the archive data back, it's in the backup. The backup file SHOULD be able to be restored onto any newer SQL version, so that solves that problem. I say SHOULD because if the version difference is too high it MAY fail, but you can always test during your upgrade process and if it fails, find a lower SQL version to restore to and take a fresh backup for the newer instances. If you REALLY need it in a format that will work with any SQL version, then you could export the data to CSV but do you really want to keep a bunch of CSV files on disk?
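    The archive-to-a-backup idea is basically this (database name and backup path are placeholders):

        -- after moving the archive rows to their own table, take a named, compressed backup
        BACKUP DATABASE YourDW
        TO DISK = N'D:\Backups\YourDW_FactSales_Archive.bak'   -- placeholder path
        WITH COMPRESSION, COPY_ONLY, INIT;

        -- once the backup has been verified, the archive table can be dropped
        -- DROP TABLE dbo.FactSales_Archive;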

    I think the big question is: what problem are you trying to solve? Is it deleting the data efficiently? Is it archiving the data efficiently? At my company, we create a database to hold archive data for our large systems; we archive the data into it and back up the archive database, and then we can safely clean up the main database if needed, without any data loss and without extra clutter on the main system databases.

    Just my 2 cents. Might not hurt to reach out to your DBA to see how they would recommend cleaning up the data too.

    Also, I'm not sure why the tempdb comment was included. Do you need to do something with your tempdb? It grew to that size for a reason, so if you shrink it, it'll probably re-grow to a similar size over time UNLESS you had abnormal queries being run on the database, in which case shrinking it MAY make sense.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • I recommend archiving fact table rows to a Parquet file using Python. Parquet is a columnar storage format supported by many data analytics platforms and SQL Server can query Parquet files as external tables using PolyBase.
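    For example, on SQL Server 2022 with the data virtualization (PolyBase) feature enabled and an external data source already created over the storage holding the files, reading an archived Parquet file back is roughly the following - a sketch where the data source name and path are placeholders:

        -- query an archived Parquet file directly; ArchiveStore is a placeholder external data source
        SELECT TOP (100) *
        FROM OPENROWSET(
                 BULK '/factsales/factsales_2019.parquet',
                 DATA_SOURCE = 'ArchiveStore',
                 FORMAT = 'PARQUET'
             ) AS archived_rows;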

    For deleting rows from fact tables in SQL Server consider the following approach:

    1. Export Data: Use bcp to export the rows you want to retain in native (binary) format for efficiency.

    2. Create New Fact Table: Create a new fact table, optionally partitioned to simplify future maintenance and deletions.

    3. Import Data: Use bcp to import the retained rows into the new table. You can control the batch size to manage resource usage during the import.

    4. Swap Tables: Rename the original fact table (e.g., to OldFactTable), then rename the new table to the original name (see the sketch below).

    5. Cleanup: Once everything has been verified, drop the old fact table.
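    As a rough sketch of steps 3-5 (table names and paths are placeholders; bcp would do the actual native export, and BULK INSERT shown here is one way to do the batched import from T-SQL):

        -- 3. import the retained rows in batches from the native-format file produced by bcp
        BULK INSERT dbo.FactSales_New
        FROM 'D:\Staging\factsales_keep.dat'   -- placeholder path
        WITH (DATAFILETYPE = 'native', BATCHSIZE = 100000, TABLOCK);

        -- 4. swap the tables
        EXEC sp_rename 'dbo.FactSales', 'OldFactTable';
        EXEC sp_rename 'dbo.FactSales_New', 'FactSales';

        -- 5. cleanup, once everything has been verified
        -- DROP TABLE dbo.OldFactTable;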

    Important: Always test this process on a development system before applying it in production.

  • thx brian, we would do this on a sunday when users wouldn't be impacted and our level of potential embarrassment would be minimal.

    we might be a bit more frugal about space, even temporary space, than you guys are. i believe the way we requisition space automatically means we can't give it back to our 3rd party once it's been requested and provided.

    our dba didn't offer an opinion, he just asked when. see below.

    Here is what we are going to do...but i feel your post is a great answer that i'll mark as the answer shortly...btw we recently found some useless indexes that we've already started deprecating...

    Guys, this is what i want to do, let me know what you think. It's to avoid deleting 30 million records…maybe on a Sunday…it could be that deleting 30 million with the indexes dropped wouldn't make a big mess...not sure.

    1. Backup the dw db on production

    2. Sample 25 records that are to survive…put them in excel for later use

    3. Copy (with surrogate keys) the fact sales records that will survive to the vm (our dev sql server dw), database purgepoc (see #7 for an alternative). "DBA's name" using ssis works but may not be ideal for this…can a druva or manual backup of the production dw db be “restored” to/on the vm as a db called purgepoc, say, in less than half an hour? …such that the factsales table in purgepoc can then be read by etl for the steps you see below? That server has enough space i believe to fill >xxx gig. The dw db is yyy gig i believe.

    4. Copy (without keys) those factsales that will be archived to tab separated files as shown in image 2, a separate file (with datestamp) for each erp, i'm thinking on a share created by "infrastructure guy" if we are all in agreement and "our company name" is ok with such things. A million tsv records occupies 322.8 meg. Maybe we can archive to tape instead of a share?

    5. Truncate factsales on production

    6. Drop the indexes on the latter

    7. Load back the surviving (invoice date >= mm/dd/yyyy) records to production from the vm using ssis. Being 37 million records, i may have to find an alternative. It's possible unloading these initially to tsv's instead would be a good alternative, and reloading (from tsv's) could go more quickly.

    8. Rebuild the indexes, perhaps only those that look like they really have value.

    9. Verify the 25 records on the prod dw db against the excel spreadsheet created in step 2 (see the sketch below).
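    for steps 2 and 9 i'm thinking something like this (factsales, the column names, and the cutoff are just placeholders):

        -- step 2: grab 25 random surviving rows to paste into excel
        SELECT TOP (25) *
        FROM dbo.factsales
        WHERE invoicedate >= '20200101'   -- placeholder cutoff
        ORDER BY NEWID();

        -- step 9: after the reload, look the same rows up again and compare to the spreadsheet
        SELECT *
        FROM dbo.factsales
        WHERE factsaleskey IN (/* the 25 surrogate keys saved in step 2 */ 101, 102, 103);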

  • very informative, thx Ken.

    My opinion - if you have a 200 GB database and have 10-15% that you are wanting to replicate short term and you don't have enough space for that, you might want to look at getting more disk for the database. I usually try to aim for 20% free in my databases so I have room for data growth, and 15% free space on the actual disk so I have room to grow my databases. If either number drops below my thresholds, I take action.

    Some databases are tiny but they hardly ever grow, and bumping them up by 5-10% when they get below 20% free makes hardly any impact. Others are large (for my company, small for a lot of other companies), and bumping them up by 5-10% when they get full eats up the disk to below 15%, and I need to bug the infrastructure team to get more disk. Usually this is no problem, but if it does turn into a problem, I work with them to see if there is stuff we can purge from the database disk (accidental file storage such as backups stored on the wrong drive, or log files that we no longer need, for example). If we can purge stuff, we do that to get it above the 15% mark again. The worst thing that can happen is we fill a disk (which has happened due to poorly written queries and badly configured databases), but it is very rare and when it does happen, I am quick to fix it.

    IF the database growth is temporary, which it would be in this scenario, I would work with the DBA to pre-grow the database to handle the additional data and then shrink the database and log once complete. I would also ask the DBA if a 17.6 GB tempdb makes sense or if it is usually empty and it is an outlier. If it is an outlier and a smaller size makes sense, I would ask them to shrink it. If it is normal, I would look at some of your queries because tempdb being nearly 10% of your database size seems pretty large to me. It may be normal and expected, but it would raise some eyebrows with me. The exception being if the 200 GB database is not the only database on your system. MOST of my tempdb's are around 5% of the total size of all DB's on the system. It's not a hard rule and 10% or more may be expected and required for your specific scenario; just in my environment it usually doesn't get that big.
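    A quick way to see how full each file actually is, if you want to check for yourself (run it in the database in question):

        -- size vs. space actually used, per file, for the current database
        SELECT  name,
                size / 128.0 AS size_mb,
                CAST(FILEPROPERTY(name, 'SpaceUsed') AS INT) / 128.0 AS used_mb,
                (size - CAST(FILEPROPERTY(name, 'SpaceUsed') AS INT)) / 128.0 AS free_mb
        FROM sys.database_files;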

    I would also be a bit concerned if 322.8 MB on disk is enough of a space concern that you are thinking of putting it on tape backup. That's half of a CD. That's not much storage in today's world. If 322.8 MB is going to make the infrastructure team worry about disk space, it may be time to invest in more disks or some deduplication software in the event you have duplicate files stored on disk.

    Also, as for dropping indexes, it really depends on how many indexes you have. I'd keep the clustered index for sure when you repopulate the table and maybe disable the non-clustered indexes, but that may not be necessary either. I've done large ETL loads by disabling indexes rather than dropping them (due to the index quantity; I'm not sure they were all needed, I was just told not to drop any of them as they were considered essential). The advantage to disabling the indexes instead of dropping them is that you won't miss re-creating any. You may miss rebuilding one, but that is easier to fix in my opinion.
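    Disabling and rebuilding looks like this (index and table names are placeholders; don't disable the clustered index or the table becomes unreadable):

        -- disable each nonclustered index before the big load/delete
        ALTER INDEX IX_FactSales_InvoiceDate ON dbo.FactSales DISABLE;
        ALTER INDEX IX_FactSales_Customer ON dbo.FactSales DISABLE;

        -- ...do the load / delete...

        -- REBUILD brings the disabled indexes back (ALL also rebuilds the clustered index)
        ALTER INDEX ALL ON dbo.FactSales REBUILD;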

    As for Ken's suggestion about Parquet and PolyBase, I personally have not used Parquet, but I've found PolyBase similar to using a linked server to a non-SQL Server instance. It kind of works. It's not the worst thing out there, but it has quirks and can misbehave on you. Wherever possible, I prefer to keep things in a format that I know SQL can handle and handles well. That's just my experience, and it was a few years back when PolyBase was pretty new, so maybe it's improved and I'm just a grumpy old man stuck in my old ways, but I feel that wherever possible you should try to use the right tool for the job.

    If it is archive data and nobody is going to look at it, trust your backups and don't worry about keeping it in a DB. If anyone needs to look at it, restore from backup. Well, assuming you have tested your backup and restore process, because backups are only known good after a restore. Even if SQL verifies it and says "looks good", unless you do a restore you don't know it is good. That, and make sure you are running DBCC CHECKDB on your databases so you can catch corruption early. Nothing is worse than finding out you have corruption in a database with no recent known-good backup...
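    The checks themselves are simple enough (database name and path are placeholders), though only an actual test restore really proves the backup:

        -- quick sanity check of the backup file (not a substitute for a test restore)
        RESTORE VERIFYONLY FROM DISK = N'D:\Backups\YourDW_FactSales_Archive.bak';

        -- regular corruption check on the source database
        DBCC CHECKDB (YourDW) WITH NO_INFOMSGS;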

    If it is archive data and someone is going to be looking at it, put it in a separate database so it is easy to query but can sit on different disk (such as an HDD instead of an SSD), for example.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • stan wrote:

     3. Copy (with surrogate keys) the fact sales records that will survive to the vm (our dev sql server dw)

    You will not be able to use a Developer license on your dev vm, as this is production work.


    i'll check that Ken, thx. never thought of that. we might download the keepers to tsv then also, and ingest them straight from tsv's. our dba didn't catch that potential problem. this is why i use this forum, not just for things a dba might miss, but also things outside of his scope.

    thx brian, it's not the only db on that server. and that 322.8 meg you referenced is just a baseline/poc for a million tsv records. the forecasted total would actually be 10 gig.

  • I agree with what Ken said about the dev license - dev/visual studio licensed SQL Server can have prod data copied to it, but you can't copy prod data from it to prod. dev/vs licenses are for testing and development.

    I still feel like moving the data (10+ GB) to a secondary SQL instance is overkill. Copy it to a new table in the same database or to a different/new database on the same instance. That keeps you within license, reduces network I/O, keeps your data all on the same instance, and reduces the risk of an accidental data leak due to permission differences between instances. You can still have the potential for a data leak due to permission differences if you use a different database on the same instance, but that is usually easier to manage and maintain than copying across instances.

    Or if nobody is going to look at the archive data, do a backup, test the backup, disable the indexes, delete the data, rebuild indexes. The index work may not be required mind you.

    I would still test your plan on a test system so you don't get hit with any surprises, especially if you are moving the data to different instances. IF any of the data is encrypted (for example), then when copying it to a new SQL instance (or really anywhere) you'll likely want to make sure you decrypt it first and re-encrypt it at the destination, or possibly you want to keep it encrypted in motion too and only test decryption after the work is completed...

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
