• Doing a much deeper dive on everything that you've presented so far, and being a bit of a troubleshooter, I'm having a logical problem with a point or two here, and I want to help your company stay out of trouble and protect your customers in the process. I apologize if I've missed the point. I also didn't take the time to create a mountain of randomized test data because I realized that the only performance "problem" your posted code has is the 12 or so seconds spent making a copy of the unprocessed rows to another table, which might be unnecessary to begin with.

    The #TestDataDump is a permanent table that has 50,000-100,000 rows per day added to it. All of those rows are initially "unprocessed" and are marked as such so that some process can scan through them and pick them up for processing. I believe I get all that.

    The first problem that I'm having is that there is absolutely nothing in that table to prevent the accidental double charging of a customer's card for any given purchase, because the PurchaseDate is just that, a whole date. It has no time associated with it and no transaction number associated with it. The process that creates the file that you import has no guarantee that there are no bad duplicates, either within the file itself or from day to day.

    For example, while it may be perfectly legitimate that a customer bought ProductNumber 1 twice in the same day, there's nothing in the table that would allow us to determine whether that's a mistake in the data, either within the same file or in "tomorrow's" file where the process that creates the file improperly picked up the same transaction and included it again, effectively duplicating the single transaction and charging the customer twice for it.

    So, [font="Arial Black"]Problem #1 is that I see no way to prevent true duplicate entries being included either within the same file or across previously processed files[/font] and I truly believe that needs to be addressed. It can be done very simply (IMHO) by including the time of the purchase on the PurhaseDate in the file.

    Problem #2 is that it looks like you might be unnecessarily moving data for processing. Your final INSERT in the code you presented takes all the rows from the #TestDataDump table that haven't been processed and copies them to the #TestProcessedData table. What happens there? Are the rows processed in that table and then the #TestDataDump table updated to identify which rows were processed? I ask because, if there's code that displays only the processed rows in the #TestDataDump table, then no unprocessed rows should be allowed into that table in the first place. The file should be imported directly into the #TestProcessedData table, processed, and then only the rows that have been successfully processed should be moved to the #TestDataDump table.
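
    If I understand the intent correctly, the flow would look something like the following sketch. The file path, the column names, and the idea that the staging table carries its own HasProcessed flag are all placeholders and assumptions on my part, not your actual objects:

    [code="sql"]
    -- A sketch only; the path and column names are placeholders, not your actual objects.
    BULK INSERT #TestProcessedData
    FROM 'C:\Feeds\DailyPurchases.csv'                       -- hypothetical file location
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
    ;
    -- ... whatever processing you do happens here, against #TestProcessedData,
    --     marking each row's (assumed) HasProcessed flag as it succeeds ...

    -- Only rows that processed cleanly ever touch the big table, so there are no
    -- "unprocessed" rows to scan for in the eventual 9-12 million row table.
    INSERT INTO #TestDataDump (CustomerId, ProductNumber, PurchaseDate, HasProcessed)
    SELECT CustomerId, ProductNumber, PurchaseDate, 1
      FROM #TestProcessedData
     WHERE HasProcessed = 1
    ;
    [/code]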

    This will also prevent the full clustered index scans that are occurring on the eventual 9-12 million rows in the table, without the overhead of an index based on the low-selectivity HasProcessed column. No matter what you do for such an index, it will need to be rebuilt on a regular basis (at least daily) because rows will be updated as they are processed, which will change the value in the HasProcessed column and cause some pretty big page splits in the related index. Those page splits may be bad enough to cause timeouts during modifications and may cause substantial blocking during processing, if nothing else.

    So, [font="Arial Black"]Problem #2 may be unnecessary data movement and indexing that can be solved by only moving processed data to the #TestDataDump table.

    [/font]

    Problem #3 is based on the additional requirement to archive the data after 3 months. What is the plan there? Will this be done daily, or are you going to wait until the beginning of the next month and then attempt to archive a whole month's worth of rows (1.5-3.1 million rows in a month, according to the daily inputs you mentioned)? And how many months' worth of data are you going to keep in the "archive" table (assuming that's what you mean by "archive")? What I'm concerned about here is what happens to the log file. Right now, it's a monolithic table, and the archived rows will need to be DELETEd, which impacts the log file. All those rows would also need to be INSERTed into the "archive", and that will also affect the log file. Any effect on the log file will also affect Point-in-Time backups. Again, all of that is based on some form of data movement. Partitioning of both the "Archive" and the #TestDataDump tables would prevent all of that data movement. And, no... it doesn't matter what edition of SQL Server you have. We could use "Partitioned Views" in the Standard or Enterprise edition, or Partitioned Tables in the Enterprise edition. Both are quite effective and can also be set up in such a fashion that you don't have to back up data that will never change once processed and archived (and the "archive" table is going to get really big if you have to keep several years of data in it).
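
    To give you an idea of what I mean by "Partitioned Views", here's a sketch. The monthly table names and columns are made up for illustration only:

    [code="sql"]
    -- A sketch of a "Partitioned View"; the table and column names here are made up.
    -- One table per month, each with a CHECK constraint on PurchaseDate.
    CREATE TABLE dbo.DataDump201601
    (
        CustomerId    INT      NOT NULL,
        ProductNumber INT      NOT NULL,
        PurchaseDate  DATETIME NOT NULL
            CONSTRAINT CK_DataDump201601_Date
            CHECK (PurchaseDate >= '20160101' AND PurchaseDate < '20160201'),
        HasProcessed  BIT      NOT NULL
    );

    CREATE TABLE dbo.DataDump201602
    (
        CustomerId    INT      NOT NULL,
        ProductNumber INT      NOT NULL,
        PurchaseDate  DATETIME NOT NULL
            CONSTRAINT CK_DataDump201602_Date
            CHECK (PurchaseDate >= '20160201' AND PurchaseDate < '20160301'),
        HasProcessed  BIT      NOT NULL
    );
    GO
    -- The view covers whichever monthly tables should be "active".
    CREATE VIEW dbo.DataDump
    AS
     SELECT CustomerId, ProductNumber, PurchaseDate, HasProcessed FROM dbo.DataDump201601
      UNION ALL
     SELECT CustomerId, ProductNumber, PurchaseDate, HasProcessed FROM dbo.DataDump201602
    ;
    GO
    -- "Archiving" then becomes a matter of changing which tables the view covers...
    -- no DELETEs, no mass INSERTs, and no big hit on the log file.
    [/code]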

    Yes, I do agree that partitioning represents a serious "pre-optimization" (that's what some will call it, but I call it "proper database design" for what we know is going to happen), but I've seen this problem enough times to know that this type of "pre-optimization" is a whole lot easier to do while the data is small than when it becomes large enough to dictate that it must be done.

    So, [font="Arial Black"]Problem #3 indicates that, for the long haul, you might need some form of partitioning.[/font]

    Problem #4 is based on Scott's recommendation, and I absolutely agree with it. The clustered index on the #CustomerProduct table is based on an IDENTITY column. Ostensibly, the ProductNumber column will contain unique values. You can keep the IdCustProduct column as a PK if you really think it necessary (and I don't believe it is), but it would be much better if the clustered index were on the ProductNumber column and it were UNIQUE. To be honest, I'd drop the IdCustProduct column and assign the ProductNumber column as the PK, which automatically becomes the unique clustered index if no clustered index already exists (you don't usually need to define it twice).
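
    Here's a sketch of what I mean. The non-key columns are placeholders because I'm not looking at your actual DDL:

    [code="sql"]
    -- Sketch only; the non-key columns are placeholders, not your actual DDL.
    CREATE TABLE #CustomerProduct
    (
        ProductNumber INT          NOT NULL
            CONSTRAINT PK_CustomerProduct PRIMARY KEY CLUSTERED,  -- the PK doubles as the unique clustered index
        ProductName   VARCHAR(100) NOT NULL,
        Price         DECIMAL(9,2) NOT NULL
    );
    [/code]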

    Yes, I know that dropping the IdCustProduct column there would also require a modification to at least one other table, such as the #CustomerProductScore table, but Scott is absolutely correct that "not everything has to have a PK or Clustered Index on an IDENTITY column". The only reason NOT to make such changes is if the ProductNumber could be duplicated by revision over time, but then you'd better make sure to add SCD 2 (Slowly Changing Dimension Type 2) datetime columns and/or a revision number column to the product-related tables.
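
    If it turns out that ProductNumber CAN be duplicated by revision, then the shape would look more like the following sketch instead of the previous one (again, the column names are just illustrative):

    [code="sql"]
    -- Sketch only, and only for the case where ProductNumber repeats by revision.
    CREATE TABLE #CustomerProduct
    (
        ProductNumber  INT      NOT NULL,
        RevisionNumber INT      NOT NULL,
        ValidFrom      DATETIME NOT NULL,
        ValidTo        DATETIME NULL,       -- NULL = the current revision
        CONSTRAINT PK_CustomerProduct_Rev
            PRIMARY KEY CLUSTERED (ProductNumber, RevisionNumber)
    );
    [/code]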

    I DO think it's a good idea that there's an IDENTITY column on the #CrediCard (is that misspelled?) table so that you don't have to expose the credit card number in more than one place, but I believe I'd also make it contain just the credit card number and not the customer ID. There should probably be a "bridge table" between the customer table and the credit card table so that you can make the credit card number column a unique value, as well.
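
    Something along these lines is what I mean by a "bridge table" (again, just a sketch with guessed-at names):

    [code="sql"]
    -- Sketch only; the names are guesses based on the discussion, not your actual DDL.
    CREATE TABLE #CrediCard
    (
        IdCreditCard     INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_CrediCard PRIMARY KEY CLUSTERED,
        CreditCardNumber VARCHAR(25) NOT NULL
            CONSTRAINT UQ_CrediCard_Number UNIQUE   -- the card number lives here and only here
    );

    CREATE TABLE #CustomerCreditCard    -- the "bridge" between customer and credit card
    (
        IdCustomer   INT NOT NULL,
        IdCreditCard INT NOT NULL,
        CONSTRAINT PK_CustomerCreditCard PRIMARY KEY CLUSTERED (IdCustomer, IdCreditCard)
    );
    [/code]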

    If you come across more tables where it would make sense to NOT have the clustered index on an IDENTITY column, remember that SQL Server really likes narrow, ever-increasing, unique, NOT NULL clustered indexes for more reasons than I can list here, including (but not limited to) some serious space savings in the clustered index itself and in all the non-clustered indexes associated with that table. Again, if you haven't already spent a couple of hours studying it, the following video provides all of the reasoning behind that. Search for "The Clustered Index Debate" on that page when you get there...

    https://technet.microsoft.com/en-us/dn912438.aspx

    So, [font="Arial Black"]Problem #4 is to seriously reconsider the structure of at least the #CustomerProduct table and its related clustered index as Scott suggested.

    [/font]

    I believe that Problems #1 and #4 above are the most important. Problems #2 and #3 could be put off for the moment but I wouldn't wait until the data becomes huge to do them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)