Help Needed in Performance Tuning

  • Hello Friends,

    I have attached my sample script with this post, and I am trying to deal with 3+ million rows for testing. The script contains all of my sample data creation and the sample query that retrieves my actual data. My testing is working fine. I am planning to wrap this logic in a transaction so that if something goes wrong, nothing is processed.

    I have two questions

    1. Please take a look at my query and advise me if there is a better way to speed it up. Right now it takes 2-3 minutes to process the data.

    2. Since I am planning to keep the data processing inside a transaction, how do I avoid blocking the tables? My website uses the first 4 tables in this logic. To make the sample data easy to play with, I have provided them as temp tables, but in reality the first 4 tables are actual tables in my database with different names and columns. What I provided are sample table names with representative structure and columns.

    Note: My production card table has around half a million user cards, my users table has half a million users, and my production customerproduct table has 1 million products.

    Please help me achieve this in a better way with good performance.

  • Except for two foreign key errors (you can't put FKs on temp tables), your attached code, as you posted it, ran up to the final SELECT for display in 6 seconds on my little i5 laptop. The final SELECT for display took an additional 15 seconds.

    Other than the fact that all rows in your large test table are identical (and, therefore, only clustered index scans, which are really table scans, are the rule of the day), I'm not seeing a performance issue with the code. If it's taking substantially longer for the code to run in situ on your boxes, I'd start looking into the hardware and issues like network latency.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • The most critical element for performance is the best clustered index on every table. That is rarely an identity column.

    Therefore, the first thing you need to do is review the missing index stats and index usage stats, determine the best clustered index for every table, and put it in place. Until you do that, any other "tuning" is likely to result in wasted I/O with multiple covering indexes, etc.
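    As a rough starting point, here's a minimal sketch of pulling those two sets of stats; keep in mind the DMV counters reset when the instance restarts, so they're only meaningful after a representative workload has run.

    -- Missing-index suggestions recorded by the optimizer (treat as hints, not gospel).
    SELECT  d.statement AS table_name,
            d.equality_columns,
            d.inequality_columns,
            d.included_columns,
            gs.user_seeks,
            gs.avg_user_impact
    FROM    sys.dm_db_missing_index_details AS d
    JOIN    sys.dm_db_missing_index_groups AS g
            ON g.index_handle = d.index_handle
    JOIN    sys.dm_db_missing_index_group_stats AS gs
            ON gs.group_handle = g.index_group_handle
    WHERE   d.database_id = DB_ID()
    ORDER BY gs.user_seeks * gs.avg_user_impact DESC;

    -- How the existing indexes are actually being used (seeks vs. scans vs. update cost).
    SELECT  OBJECT_NAME(us.object_id) AS table_name,
            i.name AS index_name,
            us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
    FROM    sys.dm_db_index_usage_stats AS us
    JOIN    sys.indexes AS i
            ON  i.object_id = us.object_id
            AND i.index_id  = us.index_id
    WHERE   us.database_id = DB_ID()
    ORDER BY us.user_seeks + us.user_scans + us.user_lookups DESC;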

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • In the process of choosing a clustered index, you can't consider only SELECTs. You also have to consider page splits during inserts and updates, along with many other factors, such as what the table will be used for and how, and the uniqueness of the leading column of the clustered index so that you can avoid things like the overhead of the automatic addition of a "uniquifier" (de-duplication) column when the clustered index isn't unique.

    --Jeff Moden



  • Hi Scott, thanks for the reply.

    Hi Jeff, thanks for the reply, and yes, the dump table has 3 million rows of the same kind of data; I mentioned that in my thread, because this is sample data made for posting on the internet. Also, I am sorry that I ran this over VPN; that's why it took 2-3 minutes. After seeing your reply I ran it on the actual box and it took 26 seconds. Thanks for the clue.

    Is there any way to identify which columns need an index?

    Any suggestions, please?

  • I, too, await what Jeff will propose as the best method to identify the proper clustered indexes, since we clearly weigh what's most important in that regard very differently. I know what I would do, but I don't stress the things he does; edit: that is, I consider other factors more important than possibly having multiple places for INSERTs and/or duplicate clustering keys initially.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (8/29/2015)


    I, too, await what Jeff will propose as the best method to identify the proper clustered indexes, since we clearly weigh what's most important in that regard very differently. I know what I would do, but I don't stress the things he does.

    Heh... in that case, Scott, please feel free to go first. 😉

    --Jeff Moden



  • KGJ-Dev (8/29/2015)


    Hi Scott, thanks for the reply.

    Hi Jeff, thanks for the reply, and yes, the dump table has 3 million rows of the same kind of data; I mentioned that in my thread, because this is sample data made for posting on the internet. Also, I am sorry that I ran this over VPN; that's why it took 2-3 minutes. After seeing your reply I ran it on the actual box and it took 26 seconds. Thanks for the clue.

    Is there any way to identify which columns need an index?

    Any suggestions, please?

    Understood on the test data. We can fix that for testing purposes, but let's hold off on that and the indexes for a minute because, other than the join columns, there doesn't appear to be any criteria that would prevent a full table scan (full clustered index scan) other than the HasProcessed column of the #TestDataDump table. Here's the code that's doing the work for you according to the code you attached (everything else is setting up test data).

    INSERT INTO #TestProcessedData (IdCard, ProductName, ProductCustScore, CardNumber, CutomerName, CustomerId)
    SELECT CC.IdCard, CP.ProductName, CPS.ProductCustScore, CC.CardNumber, C.CustomerName, C.CustomerID
    FROM #TestDataDump TD
    LEFT JOIN #CrediCard CC             ON TD.CardNumber = CC.CardNumber
    LEFT JOIN #Customers C              ON C.CustomerID = CC.CustomerId
    LEFT JOIN #CustomerProduct CP       ON TD.ProductNumber = CP.ProductNumber
    LEFT JOIN #CustomerProductScore CPS ON CP.IdCustProduct = CPS.IdCustProduct
    WHERE TD.HasProcessed = 0;

    SELECT * FROM #TestProcessedData;

    Here are my questions (you may have already stated answers for some of these, but I wanted to try to get them all in one spot)...

    1. Does the #TestDataDump table represent all new data or is this maintained in a real table that you keep adding to? If you keep adding to it, how often and how many rows are typically added?

    2. Are there duplicates in the real table represented by the #TestDataDump table insofar as the CardNumber and ProductNumber are concerned?

    3. What percentage of rows will have HasProcessed = 0 in the table represented by #TestDataDump?

    4. You construct the ProductNumber in your test data using N+1000, which is 100% numeric. Is the ProductNumber totally numeric or does it have non-numeric characters in it or any leading zeros?

    --Jeff Moden



  • Hi Jeff,

    Please find my answers to your questions,

    1. Does the #TestDataDump table represent all new data or is this maintained in a real table that you keep adding to? If you keep adding to it, how often and how many rows are typically added?

    The structure of #TestDataDump is almost the same as the structure of my production table, and at the moment the table gets ~50,000 to 100,000 rows inserted every day, but in the future this may increase. In order to handle those kinds of situations I am testing with large sets of data.

    2. Are there duplicates in the real table represented by the #TestDataDump table insofar as the CardNumber and ProductNumber are concerned?

    Whenever a customer purchases a product at the store using the card, the information comes through a text file and I insert it into the actual production table (like #TestDataDump). So, for example, 100 purchases can be made using the same card per day, so obviously the card number will be repeated, and the product number as well. If I buy a product at 10 AM and the same product again at 12 using the same card, they will obviously be separate records.

    So I am grouping the data by card number and purchase date; I didn't include that logic in the sample query.

    3. What percentage of rows will have HasProcessed = 0 in the table represented by #TestDataDump?

    Every day it will get ~50,000-100,000 rows appended. Once the records are processed, HasProcessed is set to 1, and once every 3 months all the old data will be archived.

    4. You construct the ProductNumber in your test data using N+1000, which is 100% numeric. Is the ProductNumber totally numeric or does it have non-numeric characters in it or any leading zeros?

    The ProductNumber in my production table is varchar(30). I'm not sure why the original designer used varchar, but the values are all numbers, with no alphabetic or mixed characters. To avoid confusion, I kept it as int in my sample table.

    Any suggestions, please? I have also asked a transaction-related question in my original post. Please help me with that as well.

    Thanks

  • Thanks for the answers. It's kinda late tonight, but I'll build a shedload of data tomorrow based on your answers and we'll dive into indexes and the use of clustered indexes. Scott is spot on when he says that identity columns shouldn't automatically be the lead column of clustered indexes, but there are also times when that really is a good idea.

    --Jeff Moden



  • Thanks, Jeff, and I will wait for your workaround. Sorry for disturbing you on the weekend.

  • Doing a much deeper dive on everything that you've presented so far, and being a bit of a troubleshooter, I'm having a logical problem with a point or two on this and want to help your company stay out of trouble and protect your customers in the process. I apologize if I've missed the point. I also didn't take the time to create a mountain of randomized test data, because I realized that the only performance "problem" your posted code has is the 12 or so seconds spent making a copy of the unprocessed rows to another table, which might be unnecessary to begin with.

    The #TestDataDump table represents a permanent table that has 50,000-100,000 rows per day added to it. All of those rows are initially "unprocessed" and are marked as such so that some process can scan through them and pick them up for processing. I believe I get all that.

    The first problem I'm having with that is that there is absolutely nothing in that table to prevent accidentally double-charging a customer's card for any given purchase, because the PurchaseDate is just that: a whole date. It has no time associated with it and no transaction number associated with it. The process that creates the file you import has no guarantee that there are no bad duplicates within the file itself or from day to day.

    For example, while it may be perfectly legitimate that a customer bought ProductNumber 1 twice in the same day, there's nothing in the table that would allow us to determine whether that's a mistake in the data, either within the same file or in "tomorrow's" file where the process that creates the file improperly picked up the same transaction and included it in another file, effectively duplicating the single transaction and charging the customer twice for it.

    So, Problem #1 is that I see no way to prevent true duplicate entries from being included, either within the same file or across previously processed files, and I truly believe that needs to be addressed. It can be done very simply (IMHO) by including the time of the purchase in the PurchaseDate in the file.

    Problem #2 is that it looks like you might be moving data unnecessarily for processing. The final INSERT in the code you presented takes all rows from the #TestDataDump table that haven't been processed and copies them to the #TestProcessedData table. What happens there? Are the rows processed in that table and the #TestDataDump table then updated to identify which rows were processed? I ask because, if there's code that displays only processed rows from the #TestDataDump table, then no unprocessed rows should be allowed into that table. The file should be imported directly into the #TestProcessedData table, processed, and then only the rows that have been successfully processed should be moved to the #TestDataDump table.

    This will also prevent the full clustered index scans that are occurring on the eventual 9-12 million rows in the table, without the overhead of an index based on the low-selectivity HasProcessed column. No matter what you do with such an index, it will need to be rebuilt on a regular basis (at least daily), because rows are updated as they are processed, which changes the value in the HasProcessed column and causes some pretty big page splits in the related index. Those page splits may be bad enough to cause a timeout during modification, and may cause substantial blocking if nothing else.

    So, Problem #2 may be unnecessary data movement and indexing, which can be solved by moving only processed data to the #TestDataDump table.
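    A minimal sketch of that flow, using hypothetical names (StagingPurchase for the staging table, PurchaseMaster for the permanent table the website reads; the file path, delimiters, and the ProcessedOK flag are placeholders too):

    -- 1. Bulk load the day's file into an empty staging table (no explicit transaction needed).
    TRUNCATE TABLE dbo.StagingPurchase;

    BULK INSERT dbo.StagingPurchase
    FROM '\\fileserver\feeds\purchases_20150830.txt'   -- hypothetical path
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

    -- 2. Do all of the lookups/scoring against the staging table only.
    --    Nothing the website touches is locked while this runs.

    -- 3. Move only the successfully processed rows into the permanent table.
    INSERT INTO dbo.PurchaseMaster (CardNumber, ProductNumber, PurchaseDate /*, ... */)
    SELECT CardNumber, ProductNumber, PurchaseDate /*, ... */
    FROM   dbo.StagingPurchase
    WHERE  ProcessedOK = 1;   -- hypothetical flag set during step 2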

    Problem #3 is based on the additional requirement to archive the data after 3 months. What is the plan there? Will this be done daily, or are you going to wait until the beginning of the next month and then attempt to archive a whole month's worth of rows (1.5-3.1 million rows per month, according to the daily inputs you mentioned)? And how many months' worth of data are you going to keep in the "archive" table (assuming that's what you mean by "archive")? What I'm concerned about here is what happens to the log files.

    Right now it's a monolithic table, and the archived rows will need to be DELETEd, which impacts the log files. All of those rows would also need to be INSERTed into the "archive", which affects the log file as well. Any effect on the log file also affects point-in-time backups. Again, all of that is based on some form of data movement. Partitioning of the tables (both the "archive" table and the table represented by #TestDataDump) would prevent all of that data movement. And, no, it doesn't matter which edition of SQL Server you have: we could use "Partitioned Views" in the Standard or Enterprise edition, or Partitioned Tables in the Enterprise edition. Both are quite effective and can also be set up so that you don't have to back up data that will never change once it's processed and archived (and the "archive" table is going to get really big if you have to keep several years of data in it).

    Yes, I do agree that partitioning represents a serious "pre-optimization" (that's what some will call it, but I call it "proper database design" for something we know will happen), but I've seen this problem enough times to know that this type of "pre-optimization" is a whole lot easier to put in place while the data is small than when it becomes large enough to dictate that it must be done.

    So, for the long haul, Problem #3 indicates that you might need some form of partitioning.
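    If the Enterprise partitioned-table route is chosen, here is a minimal sketch of monthly partitioning on the purchase date; the boundary dates, filegroup, and the PurchaseMaster/PurchaseArchive names are illustrative only:

    -- Monthly partition function/scheme on the purchase date.
    CREATE PARTITION FUNCTION pfPurchaseMonth (date)
    AS RANGE RIGHT FOR VALUES ('2015-07-01', '2015-08-01', '2015-09-01');

    CREATE PARTITION SCHEME psPurchaseMonth
    AS PARTITION pfPurchaseMonth ALL TO ([PRIMARY]);

    -- If the active table and the archive table are built on the same scheme with aligned
    -- indexes, a whole month can be moved with a metadata-only SWITCH instead of INSERT/DELETE
    -- (the target partition must be empty).
    ALTER TABLE dbo.PurchaseMaster
    SWITCH PARTITION 2 TO dbo.PurchaseArchive PARTITION 2;   -- hypothetical table names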

    Problem #4 is based on Scott's recommendation, and I absolutely agree with it. The clustered index on the #CustomerProduct table is based on an identity column. Ostensibly, the ProductNumber column will contain unique values. You can keep the IdCustProduct column as the PK if you really think it necessary (and I don't believe it is), but it would be much better if the clustered index were on the ProductNumber column and were UNIQUE. To be honest, I'd drop the IdCustProduct column and make ProductNumber both the PK and the unique clustered index (usually an automatic assignment if no clustered index already exists; you don't usually need to define it twice).

    Yes, I know that dropping the IdCustProduct column would also require a modification to at least one other table, such as the #CustomerProductScore table, but Scott is absolutely correct that not everything has to have a PK or clustered index on an IDENTITY column. The only reason NOT to make such changes is if the ProductNumber could be duplicated by revision over time, but then you'd better make sure to add SCD 2 (Slowly Changing Dimension Type 2) datetime columns and/or a revision number column to the product-related tables.
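    A minimal sketch of that shape, using a hypothetical dbo.Product table patterned after the sample #CustomerProduct (the varchar(30) matches what was said about the production ProductNumber; the ProductName width is a guess):

    -- ProductNumber carries both the PK and the clustered index; no surrogate identity needed.
    CREATE TABLE dbo.Product
    (
        ProductNumber varchar(30)  NOT NULL,   -- matches the production data type
        ProductName   varchar(100) NOT NULL,   -- illustrative width
        CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductNumber)
    );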

    I DO think it's a good idea to have an IDENTITY column on the #CrediCard table (is that name misspelled?) so that you don't have to expose the credit card number in more than one place, but I believe I'd also make that table contain just the credit card number and not the customer ID. There should probably be a "bridge table" between the customer table and the credit card table so that you can make the credit card number column unique as well.

    If you come across more tables where it would make sense NOT to have the clustered index on an IDENTITY column, remember that SQL Server really likes narrow, ever-increasing, unique, NOT NULL clustered indexes for more reasons than I can list here, including, but not limited to, some serious space savings in the clustered index itself and in all the non-clustered indexes associated with that table. Again, if you haven't already spent a couple of hours studying it, the following video provides all of the reasoning behind that. Search for "The Clustered Index Debate" on that page when you get there...

    https://technet.microsoft.com/en-us/dn912438.aspx

    So, Problem #4 is to seriously reconsider the structure of at least the #CustomerProduct table and its related clustered index, as Scott suggested.

    I believe that Problems #1 and #4 above are the most important. Problems #2 and #3 could be put off for the moment but I wouldn't wait until the data becomes huge to do them.

    --Jeff Moden



  • Hi Jeff,

    Thank you so much for your elaborate explanation; it is really helpful for moving forward. I can now explain a few other points.

    Regarding point #1: I have a TransactionNumber column in the test dump table, and I am generating this number based on the file data. Sorry that I didn't mention it in the sample table structure; I was thinking only about tuning the query, and that's why I didn't mention that column. I apologize for missing it. So duplicates would be avoided; I am already taking care of them.

    Regarding point #2, I totally agree, because this is what I suggested to my customer: dump the data from the file into a staging table, do the processing, then move the processed data to the master table.

    Regarding point #3: So far I haven't been able to think that through, because this process won't be implemented in my production environment for another month, so I don't have any idea yet about the archive procedure. As far as I have heard, they will archive once every three months.

    Regarding point #4: I understand your point; right now more tables reference the CustProduct table and I can't modify IdCustProduct.

    I will watch the video you suggested. Thanks a lot for your time on this. One last question: I was asking about transactions in my original post.

    Question is below,

    1. As you have noticed, all of the logic is inside a transaction. Let's assume the process is dealing with more data and an exception occurs during processing; will the card and product tables stay locked until the rollback completes?

    I don't want those tables locked at any cost, because they will be used by my website. Please suggest the best option to tackle this situation, and if you have any sample, please share it.

    thanks

  • KGJ-Dev (8/30/2015)


    Regarding point #1: I have a TransactionNumber column in the test dump table, and I am generating this number based on the file data. Sorry that I didn't mention it in the sample table structure; I was thinking only about tuning the query, and that's why I didn't mention that column. I apologize for missing it. So duplicates would be avoided; I am already taking care of them.

    I believe you're missing the point. Just because something is in a separate file is no guarantee that it won't also be in another file. You could still end up with duplicates. The transaction number would have to come from the source, not from anything you do after you receive the file.

    1. As you have noticed, all of the logic is inside a transaction. Let's assume the process is dealing with more data and an exception occurs during processing; will the card and product tables stay locked until the rollback completes?

    I don't want those tables locked at any cost, because they will be used by my website. Please suggest the best option to tackle this situation, and if you have any sample, please share it.

    I don't believe anything in this needs to be wrapped in explicit transactions. If you import into a staging table and process the data there, the implicit transactions run very fast and there is no need for multi-query explicit transactions, because you're not yet affecting permanent data. The only thing I saw in your original code was an insert of unprocessed rows from the dump table into a process table. If you do the staging-table thing, then even that isn't "permanent" and you can skip explicit transactions altogether.

    Even the upcoming archive process is like that. You insert into the archive and don't delete until the insert has succeeded.
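    A minimal sketch of that order of operations, with hypothetical PurchaseMaster/PurchaseArchive tables and the TransactionNumber used as the matching key:

    -- 1. Copy the rows older than 3 months into the archive first.
    INSERT INTO dbo.PurchaseArchive (TransactionNumber, CardNumber, ProductNumber, PurchaseDate /*, ... */)
    SELECT TransactionNumber, CardNumber, ProductNumber, PurchaseDate /*, ... */
    FROM   dbo.PurchaseMaster
    WHERE  PurchaseDate < DATEADD(MONTH, -3, GETDATE());

    -- 2. Only after the INSERT succeeds, delete the rows that are now confirmed in the archive.
    DELETE pm
    FROM   dbo.PurchaseMaster AS pm
    WHERE  pm.PurchaseDate < DATEADD(MONTH, -3, GETDATE())
      AND  EXISTS (SELECT 1
                   FROM   dbo.PurchaseArchive AS pa
                   WHERE  pa.TransactionNumber = pm.TransactionNumber);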

    --Jeff Moden



  • If the tables are part of a live system, I'd always want to see at least the current missing index stats and index usage stats from SQL itself (cardinality can also be very useful to include in the analysis). I can look at DDL and try to divine the access patterns, based on what looks obvious, but sometimes that's not right. I admit, even if/when the developers insist they know what the patterns are, I still prefer to confirm it with SQL's own, known-to-be-accurate stats.

    Given that those stats are not available yet, my general analysis is as follows.

    "Customer" table is fine. CustomerId, like OrderId, is a natural "master" table where identity often is the best clustering key. [See, unlike Celko, I don't actually hate identities just for the sake of hating them.]

    "CreditCard" table. I very strongly prefer the idea of a separate master table for credit cards, as Jeff suggests, especially since the CreditCard master table needs extreme security on it. I'd key that table by an int identity (4 billion values (using negative numbers) seems enough to me). You would then create a separate CustomerCreditCard to link the two, clustered on ( CustomerId, CreditCardId ), no identity needed or wanted (the horror!).

    If you decide to stick with the single table, it should be named CustomerCreditCard to indicate that it is an intersection table, not a master table, and clustered as noted above. Btw, an int value won't hold a full credit card number, so the CardNumber column is somewhat confusing.
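    A minimal sketch of that two-table layout; the table/constraint names and the varchar(19) card-number width are assumptions for illustration:

    -- Credit card master table, keyed by a small surrogate so the card number
    -- only has to live (and be secured) in one place.
    CREATE TABLE dbo.CreditCard
    (
        CreditCardId int IDENTITY(-2147483648, 1) NOT NULL,  -- negative seed doubles the usable range
        CardNumber   varchar(19) NOT NULL,                   -- an int can't hold a full card number
        CONSTRAINT PK_CreditCard PRIMARY KEY CLUSTERED (CreditCardId),
        CONSTRAINT UQ_CreditCard_CardNumber UNIQUE (CardNumber)
    );

    -- Intersection table linking customers to cards; clustered on the natural pair, no identity.
    CREATE TABLE dbo.CustomerCreditCard
    (
        CustomerId   int NOT NULL,
        CreditCardId int NOT NULL,
        CONSTRAINT PK_CustomerCreditCard PRIMARY KEY CLUSTERED (CustomerId, CreditCardId)
    );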

    "CustomerProduct" table. This appears to be a product master table, but is named as if it were an intersection table. The table name should just be Product, and, again as Jeff noted, it should be clustered by a unique ProductNumber.

    "CustomerProductScore" table. I don't see a reason for this to be a separate table with only the columns that are shown. Why can't this be stored in the "CustomerProduct"/"Product"? Are there multiple values? If so, what key values distinguishes each Score value?

    Btw, this points out a very major issue with slapping "magic" identity columns on every table just to, allegedly, give you non-fragmenting clustering (falsely claimed to be the holy grail of table/key design). Without an identity, you can easily see that you don't have any key value(s) to distinguish multiple rows. That makes you realize that either (a) the detail column(s) should go back into the main table, since they are 1:1 values, or (b) you need to identify and use another distinguishing key value for this relationship/table. By delaying adding an identity unless/until it's actually required, you must first identify natural keys, which points out possible data-relationship issues.

    "TestDataDump" table. First, table must be renamed. Not sure what the new name should be, but that name is absolutely meaningless.

    Most likely this table should be clustered first on PurchaseDate, then perhaps by identity, but there's not enough here to tell; index stats are required. The product number should be stored, not the product name. If you need historical product names as of a point in time, create a separate table to retain those.
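    A minimal sketch of that clustering choice, with a hypothetical dbo.PurchaseMaster table and a PurchaseId identity column standing in for the real names:

    -- Cluster on the date the rows arrive in, with the identity as a tiebreaker
    -- so the key stays unique and ever-increasing.
    CREATE UNIQUE CLUSTERED INDEX CIX_PurchaseMaster
        ON dbo.PurchaseMaster (PurchaseDate, PurchaseId);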

    "#TestProcessedData" table. Interesting, as typically you'd want to first cluster this on date/datetime, but I don't see such a column in the table. I'll have to come back to this later, as I don't have time now.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.
