SQL DB table handling thousands of inserts a second

  • I have a SQL Server 2008 database that accepts insert transactions from a multithreaded Visual Studio application. If the code detects a timeout, it simply reattempts the insert. I want to know how SQL Server handles inserts that could potentially reach hundreds per second, and, if it can't keep up, whether there is some way I can code or configure my table to handle this load without bogging down my application.

    Someone told me about SQL Server Service Broker, but I am not sure if this is the best approach. The table has a composite primary key of two text fields, and I have a return value to check whether a record exists or not.

    Does anyone have suggestions, or is there no way I can improve this?

  • Hundreds per second usually isn't a significant issue, but it will depend on hardware and setup.

    You'll want to make sure your clustered index is controlled so that inserts append to the tail at almost all times. There are some other things you might look into, but this isn't a serious volume on today's hardware.

    If you'd like, post the table's DDL and your CRUD procs and we'll see if we can help you keep it clean and optimized.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • j_depp_99 (9/23/2011)


    I have a SQL Server 2008 database that accepts insert transactions from a multithreaded Visual Studio application. If the code detects a timeout, it simply reattempts the insert. I want to know how SQL Server handles inserts that could potentially reach hundreds per second, and, if it can't keep up, whether there is some way I can code or configure my table to handle this load without bogging down my application.

    Someone told me about SQL Server Service Broker, but I am not sure if this is the best approach. The table has a composite primary key of two text fields, and I have a return value to check whether a record exists or not.

    Does anyone have suggestions, or is there no way I can improve this?

    It depends. I've optimized this type of system before and there are a lot of variables:

    - Can you insert in bulk, or does it have to be one row at a time?

    - How are you indexing the table?

    - Is the data coming in with a good random distribution on the PK, or is it ordered?

    - What type of hardware do you have?

    - What's the SLA on inserts for this system?

  • Evil Kraig F (9/23/2011)


    You'll want to make sure your clustered index is controlled so that inserts append to the tail at almost all times. There are some other things you might look into, but this isn't a serious volume on today's hardware.

    I might disagree with this statement, but I'd need to know how many parallel threads the app is running INSERTs from.

  • Hundreds or even thousands per second isn't a lot, really, and if you are having timeouts I would say your existing hardware/structures are incorrectly spec'd. If you expect that volume to grow substantially, be sure to size your server appropriately, especially your disk subsystem.

    The clustered index, non-clustered indexes, and table width (how many fields there are and how wide those fields are) can have huge implications for insert performance. Also, if you can do bulk inserts, that would help immensely too.

    For very high insert volume tables, I recommend structuring them so records are always added to the end, NEVER in the middle (per the clustered index), that they be VERY lightly indexed, and that they be as narrow as possible.
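
    A minimal sketch of that shape, with hypothetical names (an illustration, not a table from the thread): the identity clustered key forces tail appends, the row stays narrow, and there is no indexing beyond the clustered key.

    -- Hypothetical narrow, append-only table: identity clustered key so new
    -- rows always land at the tail, and only the columns the feed needs.
    CREATE TABLE dbo.NarrowFeed
    (
        id         int IDENTITY(1,1) NOT NULL,
        entryId    varchar(100)  NOT NULL,
        searchterm nvarchar(400) NOT NULL,
        dateadded  datetime      NOT NULL
            CONSTRAINT DF_NarrowFeed_dateadded DEFAULT (GETDATE()),
        CONSTRAINT PK_NarrowFeed PRIMARY KEY CLUSTERED (id)
    );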

    CEWII

  • Mike C (9/23/2011)


    Evil Kraig F (9/23/2011)


    You'll want to make sure your clustered index is controlled so that inserts append to the tail at almost all times. There are some other things you might look into, but this isn't a serious volume on today's hardware.

    I might disagree with this statement, but I'd need to know how many parallel threads the app is running INSERTs from.

    Agreed, Mike. One comment about one of your concerns above, though: the PK is never the concern; the clustered index is.

    In most cases, though, when you're dealing with hundreds coming in at a time, you've got enough threads to make this worth it for a highly transactional system.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Thanks for everyone's responses.

    - I can only insert one record at a time, since I need to know whether that record already exists in my DB table before I write the record to an XML file.

    - the table is set up with a numerically incremented field, to answer the question about how the records are distributed.

    - the DB is on a Windows 2008 server: Intel Xeon E5620 CPU @ 2.40 GHz, 2.39 GHz (dual processors), 32 GB RAM

    - there are other DBs and tables on the machine, as well as web apps running, and there is no particular service agreement with the client since the data is not requested in real time

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO

    CREATE TABLE [dbo].[PostFeed](
        [id] [int] IDENTITY(1,1) NOT NULL,
        [entryId] [varchar](100) NOT NULL,
        [varchar](max) NULL,
        [entryContent] [nvarchar](max) NULL,
        [pubDate] [varchar](50) NOT NULL,
        [authorName] [nvarchar](100) NOT NULL,
        [authorId] [nvarchar](100) NULL,
        [age] [nvarchar](50) NULL,
        [sex] [nvarchar](50) NULL,
        [locale] [nvarchar](50) NULL,
        [pic] [nvarchar](100) NULL,
        [fanPage] [nvarchar](400) NULL,
        [faceTitle] [nvarchar](100) NULL,
        [feedtype] [varchar](50) NULL,
        [searchterm] [nvarchar](400) NOT NULL,
        [clientId] [nvarchar](100) NULL,
        [dateadded] [datetime] NULL,
        [matchfound] [nvarchar](50) NULL,
        [hashField] AS ([dbo].[getMd5Hash]([entryId])),
        CONSTRAINT [PK_Feed] PRIMARY KEY CLUSTERED ([entryId] ASC, [searchterm] ASC)
        WITH (PAD_INDEX = OFF,
              STATISTICS_NORECOMPUTE = OFF,
              IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON,
              ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    SET ANSI_PADDING OFF
    GO

    - what do you mean by controlling my clustered index, and how do I do this?

    I now understand that the DB itself should handle the load, but this is what I saw when running a trace in Profiler:

    Lock:Timeout 60468 sa 0X01 247 G01 sa 2011-09-21 16:00:44.750 1:4557807 3463314605 .Net SqlClient Data Provider 0 0XEF8B45000100000000000000070006 22164 7 postFeeds 0 2011-09-21 16:00:44.750 1 G02 0 - LOCK 8 - IX 0 72057594075152384 1 - TRANSACTION 0 6 - PAGE

    Lock:Timeout 60469 sa 0X01 478 G01 sa 2011-09-21 16:00:44.887 (7bf23fc490ce) 3463299315 .Net SqlClient Data Provider 0 0X3802000000017BF23FC490CE070007 17900 7 postFeeds 0 2011-09-21 16:00:44.887 1 G02 0 - LOCK 5 - X 0 72057594075152384 1 - TRANSACTION 0 7 - KEY

    Lock:Timeout 60470 sa 0X01 803 G01 sa 2011-09-21 16:00:44.887 (379349b72c77) 3463296982 .Net SqlClient Data Provider 0 0X380200000001379349B72C77070007 17900 7 postFeeds 0 2011-09-21 16:00:44.887 1 G02 0 - LOCK 5 - X 0 72057594075152384 1 - TRANSACTION 0 7 - KEY

    Lock:Timeout 60471 tdbuser 0X048D73EF643661429B907E6106F78358 93 G01 tdbuser 2011-09-21 16:02:41.333 1:14386936 3463346220 .Net SqlClient Data Provider 0

    I use a stored procedure to do the insert.

    Hope I covered the unknowns.
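
    A rough sketch of the general shape such a proc might take (the actual proc was not posted, so the name and body below are assumptions). It attempts the insert only when the key is absent and returns a flag saying whether the row already existed:

    CREATE PROCEDURE dbo.InsertPostFeed  -- hypothetical name
        @entryId       varchar(100),
        @searchterm    nvarchar(400),
        @pubDate       varchar(50),
        @authorName    nvarchar(100),
        @alreadyExists bit OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Insert only when the key is absent; the clustered PK still backstops
        -- any race between two threads passing the NOT EXISTS check together.
        INSERT INTO dbo.PostFeed (entryId, searchterm, pubDate, authorName)
        SELECT @entryId, @searchterm, @pubDate, @authorName
        WHERE NOT EXISTS (SELECT 1 FROM dbo.PostFeed
                          WHERE entryId = @entryId
                            AND searchterm = @searchterm);

        SET @alreadyExists = CASE WHEN @@ROWCOUNT = 0 THEN 1 ELSE 0 END;
    END
    GO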

  • Based on the OP's description, it sounds like he's checking to see if the record exists first, possibly via a SELECT prior to the INSERT.

    So here's a question: assume a CI on an identity column. Also assume 100 simultaneous threads all INSERTing at the same time.

    Question: is a CI on an ascending key the best option? I optimized a similar application not long ago and was able to increase performance by around 400% just by moving away from a monotonically increasing CI that guaranteed inserts at "the bottom".

  • j_depp_99 (9/23/2011)


    Thanks for everyone's responses.

    A little easier to read the definition like so: (EDIT: I see you fixed it in your original post)

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    SET ANSI_PADDING ON
    GO

    CREATE TABLE [dbo].[PostFeed](
        [id] [int] IDENTITY(1,1) NOT NULL,
        [entryId] [varchar](100) NOT NULL,
        [varchar](max) NULL,
        [entryContent] [nvarchar](max) NULL,
        [pubDate] [varchar](50) NOT NULL,
        [authorName] [nvarchar](100) NOT NULL,
        [authorId] [nvarchar](100) NULL,
        [age] [nvarchar](50) NULL,
        [sex] [nvarchar](50) NULL,
        [locale] [nvarchar](50) NULL,
        [pic] [nvarchar](100) NULL,
        [fanPage] [nvarchar](400) NULL,
        [faceTitle] [nvarchar](100) NULL,
        [feedtype] [varchar](50) NULL,
        [searchterm] [nvarchar](400) NOT NULL,
        [clientId] [nvarchar](100) NULL,
        [dateadded] [datetime] NULL,
        [matchfound] [nvarchar](50) NULL,
        [hashField] AS ([dbo].[getMd5Hash]([entryId])),
        CONSTRAINT [PK_Feed] PRIMARY KEY CLUSTERED
        (
            [entryId] ASC,
            [searchterm] ASC
        )
        WITH (PAD_INDEX = OFF,
              STATISTICS_NORECOMPUTE = OFF,
              IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON,
              ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    SET ANSI_PADDING OFF
    GO

    Can you confirm there are no non-clustered indexes on this table?

    So, to your points:

    - I can only insert one record at a time, since I need to know whether that record already exists in my DB table before I write the record to an XML file.

    I assume this is checked via the entryId VARCHAR(100) and the searchterm NVARCHAR(400)?

    - the table is set up with a numerically incremented field, to answer the question about how the records are distributed.

    Nope. You have an identity field which you could use to locate records, but it is not involved in the table's storage in any way. Your data is being physically stored, and moved around, according to the entryId/searchterm combination, because that's your clustered index, which is also copied into every non-clustered index.

    - there are other DBs and tables on the machine, as well as web apps running, and there is no particular service agreement with the client since the data is not requested in real time

    So, this is a warehouse?

    A few items of note. These fields:

    [varchar](max) NULL,

    [entryContent] [nvarchar](max) NULL,

    without moving the text data to another filegroup, you are making your life crazier than it needs to be. To the point of pure and utter pain. Do you really need TWO BLOB fields in this?

    How's your maintenance on this? Do you defragment and update statistics regularly?

    This table is VERY wide, is clustered on a randomly arriving set of fields, has LOB data, is getting a few hundred transactions per second, and has no fill factor to deal with new row inclusions at the page level?

    Yes, you're going to feel an incredible amount of pain.

    Please try to answer the above questions and we'll see what we can do to help you out here.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Mike C (9/23/2011)


    Based on the OP's description, it sounds like he's checking to see if the record exists first, possibly via a SELECT prior to the INSERT.

    So here's a question: assume a CI on an identity column. Also assume 100 simultaneous threads all INSERTing at the same time.

    Question: is a CI on an ascending key the best option? I optimized a similar application not long ago and was able to increase performance by around 400% just by moving away from a monotonically increasing CI that guaranteed inserts at "the bottom".

    That's going to depend on usage, the non-clustered indexes available for confirming previous existence, and the like. Identity isn't necessarily the only way to guarantee the tail is the primary drop point, but it is a common solution. Like anything else, "It Depends". If there were a datetime stamp associated with these records, that could be just as viable, provided it was used during the seek. It may still be viable; it depends on what's required to check for previous existence.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Elliott Whitlow (9/23/2011)


    Hundreds or even thousands per second isn't a lot, really, and if you are having timeouts I would say your existing hardware/structures are incorrectly spec'd. If you expect that volume to grow substantially, be sure to size your server appropriately, especially your disk subsystem.

    The clustered index, non-clustered indexes, and table width (how many fields there are and how wide those fields are) can have huge implications for insert performance. Also, if you can do bulk inserts, that would help immensely too.

    For very high insert volume tables, I recommend structuring them so records are always added to the end, NEVER in the middle (per the clustered index), that they be VERY lightly indexed, and that they be as narrow as possible.

    CEWII

    I agree with your recommendations above, but the logical PK of the inbound data (the "business key") can be hugely important in designing a new indexing strategy, depending on the distribution of the PKs across the multithreaded INSERTs.

  • SQL can guarantee the integrity of your individual SELECTs followed by INSERTs by itself, but it sounds like your .NET app is single-threaded? If that's the case, the "tail-end" theory holds and will be most efficient for individual INSERTs. Depending on your requirements, your best bet might be to bulk load the data into a staging table and do a set-based merge into the target table.

  • Addendum on a second look:

    [hashField] AS ([dbo].[getMd5Hash]([entryId])),

    As a calculated column, non-persisted?

    This table is just RIPE for pain.
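
    If persisting it is the fix, one hedged illustration (this assumes dbo.getMd5Hash is deterministic; a T-SQL scalar UDF must also be schema-bound before SQL Server will persist a column computed from it):

    ALTER TABLE dbo.PostFeed DROP COLUMN hashField;
    ALTER TABLE dbo.PostFeed
        ADD hashField AS ([dbo].[getMd5Hash]([entryId])) PERSISTED;

    -- Or skip the UDF entirely with the built-in hash function:
    -- ALTER TABLE dbo.PostFeed
    --     ADD hashField AS (HASHBYTES('MD5', [entryId])) PERSISTED;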


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I just saw that getMD5hash myself. What exactly is the point of that field? In addition to what Kraig pointed out, if you absolutely need to populate this MD5 hash, generate it in .NET using the System.Security.Cryptography.MD5 class and send it along in the INSERT.

    I assume (searchterm) is your "business key" that you're using to locate existing records. Also, based on your further info, it sounds like you're single-threading the INSERTs. That being the case, I would change your indexing strategy to make (id) the clustered index and put a nonclustered index on (searchterm). The id column will automatically be appended to the end of the (searchterm) nonclustered index, which should better reflect how you query for the existence of records.
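
    A sketch of what that change might look like (the constraint names are illustrative, and the rebuild should go in a maintenance window since it rewrites the table):

    ALTER TABLE dbo.PostFeed DROP CONSTRAINT PK_Feed;

    -- Cluster on the identity so inserts append at the tail.
    ALTER TABLE dbo.PostFeed
        ADD CONSTRAINT PK_PostFeed_id PRIMARY KEY CLUSTERED ([id]);

    -- Keeps the duplicate protection the old clustered PK provided, with
    -- searchterm leading; SQL Server appends the clustering key (id) to
    -- every nonclustered index automatically.
    ALTER TABLE dbo.PostFeed
        ADD CONSTRAINT UQ_PostFeed_key UNIQUE NONCLUSTERED ([searchterm], [entryId]);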

    I'm surprised you're mixing and matching varchar and nvarchar data types; normally you'd see one or the other in a table. That's a nitpicky thing, but it can cause you some pain in development later.

    If you have SQL 2008 R2, you might consider compression as well. When you apply page or row compression you automatically get Unicode compression for free. For the standard Latin alphabet you get 50% compression; for other alphabets the compression rate varies. The advantage of compression is that you save space and can get fewer physical IOs, which are very often a bottleneck. It does cost you a bit of CPU (something like +3% on average), but on most servers this tends to be a small amount.
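
    For reference, the general shape of that (data compression is an Enterprise-edition feature; the estimating proc below ships with SQL Server):

    -- Estimate the savings first, then rebuild with page compression.
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo', @object_name = 'PostFeed',
         @index_id = NULL, @partition_number = NULL,
         @data_compression = 'PAGE';

    ALTER TABLE dbo.PostFeed REBUILD WITH (DATA_COMPRESSION = PAGE);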

    While you're at it, considering all the other processes you're running on that box, make sure your SQL Server is not getting starved for resources.

    NOTE:

    Since you don't have a realtime SLA with this client, consider bulk inserting thousands of rows into a staging table and performing a server-side set-based MERGE or INSERT ... WHERE NOT EXISTS ... to get the data over into the table. I can pretty much guarantee it will be a lot more efficient than thousands of round trips to the server per second. .NET has a built-in class just for this: SqlBulkCopy.
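
    A hedged sketch of the pattern (the table and column names are illustrative; the app would SqlBulkCopy batches into the stage table and then run something like the statement below):

    -- Heap staging table for the bulk load; no indexes, so the load is fast.
    CREATE TABLE dbo.PostFeedStage
    (
        entryId    varchar(100)  NOT NULL,
        searchterm nvarchar(400) NOT NULL,
        pubDate    varchar(50)   NOT NULL,
        authorName nvarchar(100) NOT NULL
    );
    GO

    -- One set-based pass moves only the new rows into the real table
    -- (deduplicate the stage first if a batch can repeat a key).
    INSERT INTO dbo.PostFeed (entryId, searchterm, pubDate, authorName)
    SELECT s.entryId, s.searchterm, s.pubDate, s.authorName
    FROM dbo.PostFeedStage AS s
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.PostFeed AS f
                      WHERE f.entryId = s.entryId
                        AND f.searchterm = s.searchterm);

    TRUNCATE TABLE dbo.PostFeedStage;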

  • Evil Kraig F (9/23/2011)


    Mike C (9/23/2011)


    Based on the OP's description, it sounds like he's checking to see if the record exists first, possibly via a SELECT prior to the INSERT.

    So here's a question: assume a CI on an identity column. Also assume 100 simultaneous threads all INSERTing at the same time.

    Question: is a CI on an ascending key the best option? I optimized a similar application not long ago and was able to increase performance by around 400% just by moving away from a monotonically increasing CI that guaranteed inserts at "the bottom".

    That's going to depend on usage, the non-clustered indexes available for confirming previous existence, and the like. Identity isn't necessarily the only way to guarantee the tail is the primary drop point, but it is a common solution. Like anything else, "It Depends". If there were a datetime stamp associated with these records, that could be just as viable, provided it was used during the seek. It may still be viable; it depends on what's required to check for previous existence.

    Ahhh, my point is that the tail isn't always the best drop point. In the application I tuned not long ago, the tail was the drop point (identity-column CI). It was actually a Java app that pushed 100 simultaneous threads all trying to INSERT into the table at the same time. Essentially, 1 INSERT ran and 99 waited while that INSERT blocked the last page. When that one finished, the next one ran, and another thread on the Java side sent another INSERT to wait in line.

    I noticed the data being sent was in no particular order and had a nice random distribution, so I changed the CI to the business key of the inbound data. This caused a lot of page splits on the initial inserts, so that instead of 100 threads waiting to insert into one page we had 50 threads waiting on two pages, then 25 threads on four pages, then 12 threads on eight pages, etc. Blocking was all but eliminated pretty quickly due to the random distribution of the data and the new CI.
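
    In DDL terms, that kind of change amounts to something like this (names here are hypothetical, and it assumes the business key is unique; the fill factor leaves page-level room so the early splits settle down):

    ALTER TABLE dbo.InboundFeed DROP CONSTRAINT PK_InboundFeed;  -- was the identity CI

    -- Cluster on the randomly distributed business key to spread the inserts.
    CREATE UNIQUE CLUSTERED INDEX CIX_InboundFeed_BusinessKey
        ON dbo.InboundFeed (businessKey) WITH (FILLFACTOR = 70);

    -- Keep the surrogate key enforced, just not as the clustered index.
    ALTER TABLE dbo.InboundFeed
        ADD CONSTRAINT PK_InboundFeed PRIMARY KEY NONCLUSTERED (id);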

    Kimberly Tripp pointed out another solution to this problem, for when the inbound data does not have a good random distribution on the business key: use a persisted computed column that takes a modulo of the identity column, and use it to direct rows to different partitions. You lose the major advantage of partitions during reads, since the modulo of the surrogate key is not related to how people will query your data in the real world, but it should provide better performance on inserts.
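
    Roughly, that looks like the following (the names and the four-bucket count are assumptions; table partitioning requires Enterprise edition, and the partitioning column has to be part of the clustered key):

    CREATE PARTITION FUNCTION pfBucket (int)
        AS RANGE LEFT FOR VALUES (0, 1, 2);    -- four partitions for buckets 0-3

    CREATE PARTITION SCHEME psBucket
        AS PARTITION pfBucket ALL TO ([PRIMARY]);

    CREATE TABLE dbo.PostFeedPartitioned
    (
        id         int IDENTITY(1,1) NOT NULL,
        entryId    varchar(100)  NOT NULL,
        searchterm nvarchar(400) NOT NULL,
        -- Persisted modulo of the surrogate key spreads inserts across partitions.
        bucket     AS (id % 4) PERSISTED NOT NULL,
        CONSTRAINT PK_PostFeedPart PRIMARY KEY CLUSTERED (id, bucket)
    ) ON psBucket (bucket);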
