Wow, you guys are awesome with response times!
So here goes -- I have less than 2 years' experience, so forgive the odd design errors.
- The MD5 hash key was implemented on a suggestion from another forum, which said it would help with dupe checking and lookups since my combination keys were large strings (entryId and searchterm). I did not know that it would be problematic, and useless anyway, so I will remove it.
- So from what I gather then: is the identity field OK, or is it unnecessary as well?
- The purpose of this DB (which contains 3 other similar tables that are accessed from other applications) is to house the data for 2 days, after which it is removed; it serves only as temporary storage for dupe-checking the data I write to the XML files, which are then delivered to our API index. The tables are also used to deliver reports to a web application that shows daily counts per table and different stats on the searchterms, like counts per searchterm and a daily list, so the queries are run whenever someone runs the SQL Server 2008 report tool.
- The DB maintenance is a daily job that only includes cleaning up history, rebuilding indexes, and a full DB backup. So I will look into the defragmentation.
- I am not sure about non-clustered indexes or whether I need them here.
- The .NET app is not single-threaded: there are around 50+ threads running, each creating up to 100 tasks, where each task on its own handles about 10,000 potential new records for the table.
I assume (searchterm) is your "business key" that you're using to locate the existing records. Also, based on your further info, it sounds like you're single-threading the INSERTs. That being the case, I would change your indexing strategy to make (id) the clustered index and put a nonclustered index on (searchterm). The id column will automatically be appended to the end of the (searchterm) nonclustered index, which should better reflect how you query for the existence of records.
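A sketch of that indexing change -- note the table name (dbo.SearchEntries) and constraint/index names here are assumptions, since the actual schema isn't shown in the thread:

```sql
-- Assumed table/column names; adjust to your schema.
-- If a clustered PK already exists on other columns, drop it first.
ALTER TABLE dbo.SearchEntries
    ADD CONSTRAINT PK_SearchEntries PRIMARY KEY CLUSTERED (id);

-- Nonclustered index on the business key. The clustering key (id)
-- rides along in this index automatically, so an existence check
-- on searchterm can be answered from this index alone.
CREATE NONCLUSTERED INDEX IX_SearchEntries_searchterm
    ON dbo.SearchEntries (searchterm);
```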
- I use a combination key of entryId and searchterm, since I can have posts with the same entryId but different searchterms. How do I make id the clustered index when it is set up to auto-increment? I'm not sure I understand the logic behind it, or how I would make it a clustered index and searchterm a nonclustered index.
- I will correct my table design -- thanks for picking up on these things.
- I will also look at compression.
- The SQL Server instance on this server houses another DB with a table that is accessed even more than my application's: each task I create (which has +-10000 records to work with) hits that table, which is linked through a web application, to validate whether the username in each record exists and to receive a unique id for that username. So when I look at Task Manager, the CPU usage is at 98-100%.
- I will work on testing a bulk insert with NOT EXISTS and see how it pans out.
- I took a look at sp_who2 to check the transaction stats, and most of my associated transactions are sleeping, AWAITING COMMAND.
Agreed, the MD5 as implemented isn't going to help: it's only hashing one column. Even with 2 columns, it's not worth the extra processing power to generate the hash in this instance.
Based on your description above, here are my immediate suggestions:
* Check and fix your data types. As was pointed out, SEX (if it means "GENDER") does not require 50 characters to identify/code. After all, there are only 2 genders + Lady Gaga. Other data you have might be better typed as DATETIME, INTEGER, etc. Look at the domain (the possible values that can be held in any given field) and range (the min and max) for every field, and use the smallest adequate data type that covers it.
* Get rid of the MD5 hash.
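To illustrate the data-type advice, something along these lines -- the column names and old types here are hypothetical, so check your own domains and ranges first:

```sql
-- Hypothetical before: sex VARCHAR(50), created VARCHAR(50), hits VARCHAR(20).
-- Shrink each column to the smallest adequate type for its domain/range.
ALTER TABLE dbo.SearchEntries ALTER COLUMN sex     CHAR(1);   -- e.g. 'M'/'F'/'U'
ALTER TABLE dbo.SearchEntries ALTER COLUMN created DATETIME;  -- real dates, not strings
ALTER TABLE dbo.SearchEntries ALTER COLUMN hits    INT;       -- numeric, fits INT's range
```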
Now this one depends on a couple of factors:
* If you are going to stay with the current methodology (50 threads simultaneously pounding the same table with SELECT ... INSERT), then you're going to want to do what we were talking about earlier in the thread and change your indexing strategy, partitioning, or something else to remove the hotspot. My simple suggestion is to change the clustered index to (entryId, searchterm) and make the PK (id, the identity column) nonclustered. If your (entryId, searchterm) combinations are being sent to the server in a random fashion with a good distribution, then this will resolve the hotspot issue.
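Sketched out, that alternative looks like this (again, the table and index names are assumed; this is the opposite arrangement to the id-clustered layout, for the concurrent-insert case):

```sql
-- Assumed names. Clustering on the business key spreads concurrent
-- inserts across the index instead of piling them all onto the last
-- page (the hotspot you get when clustering on an ever-increasing id).
-- Drop any existing clustered index/PK on the table first.
CREATE CLUSTERED INDEX CIX_SearchEntries_entry_term
    ON dbo.SearchEntries (entryId, searchterm);

-- Keep the identity column as the PK, but make it nonclustered.
ALTER TABLE dbo.SearchEntries
    ADD CONSTRAINT PK_SearchEntries PRIMARY KEY NONCLUSTERED (id);
```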
However, since you don't have a "real-time" delivery SLA, I would recommend going a slightly different route:
* Create a staging table and populate it with Service Broker calls, individual inserts, or whatever method you prefer. You'll need to optimize the staging table as we described previously. Then schedule a regular push of this data to the final table with a single INSERT ... WHERE NOT EXISTS ... statement. You can also skip the initial SELECT existence check against the staging table, and just grab the most recent (or oldest) row if you get dupes (duplicate business key values) in there.
This is slightly more complex versus what you're doing now, but not significantly more so. It should be a lot more efficient though.
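The scheduled push could look something like this -- staging table name and the non-key column (created) are assumptions for illustration:

```sql
-- Assumed names: dbo.SearchEntries_Stage feeds dbo.SearchEntries.
-- One set-based statement replaces thousands of per-row SELECT+INSERTs.
-- MAX(created) keeps the most recent row when the staging table holds
-- duplicate (entryId, searchterm) business keys.
INSERT INTO dbo.SearchEntries (entryId, searchterm, created)
SELECT s.entryId, s.searchterm, MAX(s.created)
FROM dbo.SearchEntries_Stage AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.SearchEntries AS f
                  WHERE f.entryId    = s.entryId
                    AND f.searchterm = s.searchterm)
GROUP BY s.entryId, s.searchterm;

-- Clear the staging table once pushed (do both steps in one
-- transaction in practice so no staged rows are lost).
TRUNCATE TABLE dbo.SearchEntries_Stage;
```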
* If possible, you might queue up the changes at the client in a flatfile or some other storage and then bulk insert the data into the staging table. You can use the .NET SqlBulkCopy class to bulk load data this way.
* On the client side you might also consider playing around with the number of parallel threads running. Do some simple performance tests with the 50 you're running now, then try again -- but this time limit it to the number of physical processors on your server. You might play around with it a bit to find the optimal number of threads, but I doubt 50+ is optimal.
* As mentioned, since you have so much stuff running on your SQL Server box, make sure SQL Server itself is not starved for resources.