Fast inserting/reading/deleting of millions of records

  • Hi everyone,

    We are a software vendor who develops an application that executes engineering simulations. An individual simulation can run for several hours and potentially generate more than 100 million result records that are written to a database table. The table structure is as follows:

    [ValueIndex] [smallint] NOT NULL,

    [TimeIndex] [smallint] NOT NULL,

    [Result] [float] NOT NULL

    Each run of a simulation dynamically creates a new database table for that simulation (along with 8 other ancillary tables) so that the data is inserted quickly, and a clustered primary key (on ValueIndex, TimeIndex) is added after the data is inserted. Therefore, after hundreds or thousands of simulation runs we could have several thousand database tables, although this does not appear to be a problem for our customers, as the database is invisible to them.
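
    To illustrate the pattern, here is a simplified sketch (the Result_<n> naming convention and constraint names are just for illustration):

        DECLARE @SimNo int = 42;   -- illustrative simulation number
        DECLARE @sql nvarchar(max);

        -- Create the per-simulation results table as a heap so the load is fast
        SET @sql = N'CREATE TABLE dbo.Result_' + CAST(@SimNo AS nvarchar(10)) + N'
        (
            [ValueIndex] smallint NOT NULL,
            [TimeIndex]  smallint NOT NULL,
            [Result]     float    NOT NULL
        );';
        EXEC (@sql);

        -- ... the simulation bulk inserts its results here ...

        -- The clustered primary key is added only after the data is in
        SET @sql = N'ALTER TABLE dbo.Result_' + CAST(@SimNo AS nvarchar(10)) + N'
            ADD CONSTRAINT PK_Result_' + CAST(@SimNo AS nvarchar(10)) + N'
            PRIMARY KEY CLUSTERED ([ValueIndex], [TimeIndex]);';
        EXEC (@sql);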

    Performance of inserting, reading and deleting these results (no updates are done on this data) is generally acceptable at the moment.

    The dynamic tables cause more of an issue with the underlying stored procedures and their maintenance, as we have to use lots of dynamic SQL that becomes difficult to write, understand and debug.

    My preference would be to use a single table to hold the result records for all simulations. However, speaking to colleagues, this is how the software was originally developed; they encountered performance problems with inserting, reading and deleting data, which is why it was changed to the approach above.

    Also, many of our customers use SQL Express, which makes some SQL Server features, such as partitioning and parallelism, unavailable.

    I'd be interested to hear any thoughts on how you would approach this scenario.

    Should we be able to achieve fast inserts, reads and deletes on a table with millions, if not billions, of records?

    Are there any NoSQL options we should consider?

    Thanks for your help,

    Darren

  • In terms of straight data collection, many of the NoSQL options are wonderful. Hadoop would be perfect for collecting the data. But when it comes time to write reports, most people then migrate to relational storage because of all the advantages there.

    If you wanted to approach the idea of putting everything into a common table, my biggest suggestion would be ensuring that you pick a very good clustered key... I'd probably go with something compound that logically breaks the data storage up such that, at least in theory, no two tests are contending for the same pages. In other words, get away from using an identity column.

    How real-time does the reporting have to be? Loading everything into Hadoop, which will collect the data quickly, and then migrating it to relational storage for reporting will work. But that second step won't be fast. In fact, it'll be slow, or at least slow compared to just loading the tables directly. Still, that approach sounds a little more attractive in your situation than the single table. But I'd prefer a single table to what you have now.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thanks for your reply, Grant.

    Once a simulation has completed, the user would typically view the results (or a subset of the results) via our application. They might also want to delete all of that simulation's results, which also needs to be pretty quick.

    Thanks again,

    Darren

  • Ah, then you're in a tough spot. Sure, Hadoop will be excellent for gathering the data (as would some of the other NoSQL solutions). But reporting, depending on the reports... well, it sucks. No other way to say it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • darren_ferris (3/19/2014)


    Thanks for your reply, Grant.

    Once a simulation has completed, the user would typically view the results (or a subset of the results) via our application. They might also want to delete all of that simulation's results, which also needs to be pretty quick.

    Thanks again,

    Darren

    1) Writing the code to process dynamically named tables can be pretty tedious, but for each sproc or piece of code you wish to run, once you write it you are largely done with it forever. And when you do need to refactor, that is where your code comments come into play. 🙂 I have done numerous systems like this, especially for large open-ended search scenarios, and assuming you have rigorous naming conventions it really isn't that bad or tough. There are quite a few benefits here: if necessary you can spin up new databases as required (even for every run, believe it or not). That can be critical since you have SQL Express customers and the size limit there (10GB) can come into play with lots of data. Other benefits include the ability to spread data across different filegroups, elimination of most of the latch/lock issues you may see with the single-table approach, the ability to drop or truncate tables to clear out old data, and much easier index maintenance and statistics updates.

    2) If you do go with a single-table approach you really do paint yourself into some difficult corners, as mentioned above, but as Grant said, chief among your concerns for high throughput will be your clustering key. Since you are on SQL 2012, I highly recommend two or more SEQUENCE objects. Each number generated will be something important, such as test number, run number, machine number, row number or whatever you need to uniquely identify individual records. You should put these in the clustered index in least-to-most-specific order: Test#, Run#, Row#, for example. This will keep your inserts away from each other in the balanced tree (B-tree) that is the clustered index, mitigating latch contention at the end of the index chain. It will also cause fragmentation of the index at multiple spots, but at least each run will be mostly contiguously located, which is really what you need.
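
    In sketch form (object and column names here are just placeholders, not a prescription):

        -- Two sequences, fetched once per run
        CREATE SEQUENCE dbo.TestNoSeq AS int START WITH 1 INCREMENT BY 1;
        CREATE SEQUENCE dbo.RunNoSeq  AS int START WITH 1 INCREMENT BY 1;

        CREATE TABLE dbo.SimulationResult
        (
            TestNo int    NOT NULL,
            RunNo  int    NOT NULL,
            RowNo  bigint NOT NULL,
            Result float  NOT NULL,
            CONSTRAINT PK_SimulationResult
                PRIMARY KEY CLUSTERED (TestNo, RunNo, RowNo)   -- least to most specific
        );

        -- Each run grabs its identifiers up front, then numbers its own rows
        DECLARE @TestNo int = NEXT VALUE FOR dbo.TestNoSeq;
        DECLARE @RunNo  int = NEXT VALUE FOR dbo.RunNoSeq;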

    Sounds like a VERY fun project - I am envious!!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • darren_ferris (3/19/2014)


    The dynamic tables cause more of an issue with the underlying stored procedures and their maintenance, as we have to use lots of dynamic SQL that becomes difficult to write, understand and debug.

    That shouldn't be true. Dynamic SQL is actually pretty easy to write, debug, and maintain if you do it right. Do you have an example of the dynamic SQL you're using so that I might be able to make a suggestion?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the replies, guys.

    Kevin, you're right, customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'm planning to use, so hopefully that will help things.

    Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.

    Thanks again,

    Darren

  • I've tried changing our process so that all simulations are written to the same table, with a unique ID for each simulation:

    [SimId] [smallint] NOT NULL,

    [ValueIndex] [smallint] NOT NULL,

    [TimeIndex] [smallint] NOT NULL,

    [Result] [float] NOT NULL

    I've then added a clustered index across SimId, ValueIndex and TimeIndex (ValueIndex and TimeIndex are the clustered PK columns on the existing dynamically created tables).

    For a simulation that produces 25 million results, inserting the data takes over twice as long as previously.

    The speed of reading the data is pretty much the same as before, which is good.

    Deleting results takes a lot longer because, as Kevin mentioned, we were previously just dropping the tables for the simulations we wanted to delete, whereas now we're deleting batches of records from the single table.
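
    A batched delete along these lines is the usual way to keep the transaction log and locking manageable when clearing one simulation out of a single large table (the table name and batch size below are only illustrative):

        DECLARE @SimId smallint = 42;   -- simulation to remove (illustrative)

        WHILE 1 = 1
        BEGIN
            DELETE TOP (50000)
            FROM dbo.SimulationResult
            WHERE SimId = @SimId;

            IF @@ROWCOUNT = 0 BREAK;
        END;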

    Darren

  • Can you tell what is slowing down the inserts? I mean, other than the obvious "this index sucks". What does the execution plan look like? What are the wait stats on the inserts? Just curious, because that type of index, spreading the inserts, does work in some situations to speed up inserts.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • darren_ferris (3/20/2014)


    Thanks for the replies, guys.

    Kevin, you're right, customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'm planning to use, so hopefully that will help things.

    Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.

    Thanks again,

    Darren

    I think you're in deep Kimchi. 🙂 Let's take a look at your first IF block of code...

    IF @useAuditTrail = 1
    BEGIN
        SET @Sel =
            'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
            + ' From ' + @NW_Feature + ' as f ' +
            'INNER JOIN ' + @NW_Component + ' as nc ON f.ComponentNo = nc.ComponentNo ' +
            'INNER JOIN PCC_Component as pc ON nc.PhysID = pc.PhysID ' +
            'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 ' +
            ' INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo <= (Select MAX(SimulationNo) FROM ' + @NW_Feature + ' As nwf1
                Where f.ComponentNo = nwf1.ComponentNo
                AND f.SubFormAnalyID = nwf1.SubFormAnalyID
                AND f.FeatureAnalyID = nwf1.FeatureAnalyID
                AND f.RepeatNo = nwf1.RepeatNo) '
    END
    ELSE
    BEGIN
        SET @Sel =
            'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
            + ' From ' + @NW_Feature + ' as f ' +
            'INNER JOIN ' + @NW_Component + ' as nc ON f.ComponentNo = nc.ComponentNo ' +
            'INNER JOIN PCC_Component as pc ON nc.PhysID = pc.PhysID ' +
            'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 ' +
            ' INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo = (Select MAX(SimulationNo) FROM ' + @NW_Feature + ' As nwf1
                Where f.ComponentNo = nwf1.ComponentNo
                AND f.SubFormAnalyID = nwf1.SubFormAnalyID
                AND f.FeatureAnalyID = nwf1.FeatureAnalyID
                AND f.RepeatNo = nwf1.RepeatNo) '
    END

    My first step in trying to make something like this easier is to remove all the dynamic stuff from the first query and end up with some readable code containing "variable tokens" that I'd replace later on. The formatting becomes nicely vertically aligned, making it easier to see everything. Like a good DBA, I added aliases to everything that was missing them and renumbered the rest to make sense of it all. Like this...

    SELECT DISTINCT
         f1.ComponentNo
        ,f1.SubFormAnalyID
        ,f1.FeatureAnalyID
        ,f1.RepeatNo
        ,f1.SimulationNo
        ,f1.IsVarParam
        ,f1.LibItemID
        ,f1.LibVerNo
        ,f1.IntValue
        ,f1.RealValue
        ,f1.TextValue
        ,pc.CompAnalyID
        ,ft.EnumTypeID
        ,<<@ParentNo>>
      FROM dbo.<<@NW_Feature>>   AS f1
      JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo
      JOIN dbo.PCC_Component     AS pc ON nc.PhysID = pc.PhysID
      JOIN dbo.Feature           AS ft ON f1.FeatureAnalyID = ft.AnalyID
      JOIN dbo.<<@NW_Feature>>   AS f2 ON f1.SimulationNo <= ( SELECT MAX(SimulationNo)
                                                                 FROM <<@NW_Feature>> AS nwf1
                                                                WHERE f1.ComponentNo    = nwf1.ComponentNo
                                                                  AND f1.SubFormAnalyID = nwf1.SubFormAnalyID
                                                                  AND f1.FeatureAnalyID = nwf1.FeatureAnalyID
                                                                  AND f1.RepeatNo       = nwf1.RepeatNo
                                                              )
     WHERE ft.EnumTypeID NOT IN (14,15)
    ;

    That's when I saw the reason for the DISTINCT... the second JOIN to dbo.<<@NW_Feature>> (aliased as "f2") in the code above forms a nasty ol' CROSS JOIN on whatever table @NW_Feature will describe. Notice that the "f2" alias isn't used anywhere else... not even in one of the ON clauses like it needs to be.

    If you could fix that little problem in my "first step" code above, then I'd be happy to show you how to eliminate the massive amount of code duplication that you have in the text file you attached, while making all of the code much easier to read and troubleshoot.
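
    To give a rough idea of the token approach (abbreviated column list; the sample values assigned to the variables below are made up):

        DECLARE @NW_Feature   sysname = N'NW_Feature_0042';     -- made-up table name
        DECLARE @NW_Component sysname = N'NW_Component_0042';   -- made-up table name
        DECLARE @ParentNo     nvarchar(50) = N'0 AS ParentNo';  -- made-up expression

        -- One readable template; the <<tokens>> are swapped in just before execution
        DECLARE @SQL nvarchar(MAX) = N'
        SELECT DISTINCT
             f1.ComponentNo
            ,pc.CompAnalyID
            ,ft.EnumTypeID
            ,<<@ParentNo>>
          FROM dbo.<<@NW_Feature>>   AS f1
          JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo    = nc.ComponentNo
          JOIN dbo.PCC_Component     AS pc ON nc.PhysID         = pc.PhysID
          JOIN dbo.Feature           AS ft ON f1.FeatureAnalyID = ft.AnalyID
         WHERE ft.EnumTypeID NOT IN (14,15);';

        SET @SQL = REPLACE(@SQL, N'<<@NW_Feature>>',   @NW_Feature);
        SET @SQL = REPLACE(@SQL, N'<<@NW_Component>>', @NW_Component);
        SET @SQL = REPLACE(@SQL, N'<<@ParentNo>>',     @ParentNo);

        EXEC sys.sp_executesql @SQL;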

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Grant,

    I've done some further testing on the inserts, and it only takes about 15% longer, so not too bad. I'm bulk loading the data via a C# application, using SqlBulkCopy, in batches of around 10k records.

    The execution plan for a simple insert of one record shows that 100% of the operator cost is on the index.

    I've profiled the inserting of all of the data and have the following wait stats, although unfortunately this is not an area I'm totally familiar with.

    I've used the query provided by Paul Randal at http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/:

    WaitType                       | Wait_S | Resource_S | Signal_S | WaitCount | Percentage | AvgWait_S | AvgRes_S | AvgSig_S
    PREEMPTIVE_OS_WRITEFILEGATHER  |  66.59 |      66.59 |        0 |         6 |      28.71 |   11.0985 |  11.0985 |        0
    LATCH_EX                       |  58.84 |      58.84 |        0 |         8 |      25.37 |    7.3553 |    7.355 |   0.0003
    ASYNC_IO_COMPLETION            |  41.94 |      41.94 |        0 |        10 |      18.08 |    4.1937 |   4.1937 |        0
    ASYNC_NETWORK_IO               |  28.87 |      26.64 |     2.23 |     53798 |      12.45 |    0.0005 |   0.0005 |        0
    WRITELOG                       |  23.24 |      22.81 |     0.43 |      6255 |      10.02 |    0.0037 |   0.0036 |   0.0001
    IO_COMPLETION                  |   8.29 |       8.26 |     0.03 |     17390 |       3.57 |    0.0005 |   0.0005 |        0
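
    (For reference, the core of that script is sys.dm_os_wait_stats; a much-simplified version, without the benign-wait filtering and percentage/average calculations of the full script, looks roughly like this:)

        SELECT TOP (10)
               wait_type,
               wait_time_ms / 1000.0                         AS Wait_S,
               (wait_time_ms - signal_wait_time_ms) / 1000.0 AS Resource_S,
               signal_wait_time_ms / 1000.0                  AS Signal_S,
               waiting_tasks_count                           AS WaitCount
          FROM sys.dm_os_wait_stats
         WHERE waiting_tasks_count > 0
         ORDER BY wait_time_ms DESC;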

    Thanks again for your support.

    Darren

  • That looks like the hardware is the bottleneck. It's the writes that are going slow.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I can't speak to the INSERTs, but the accidental CROSS JOIN in the SELECT code will beat the hell out of the hardware, making it look like a hardware problem.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I've fixed the problem with the CROSS JOIN; it looks like we've had that code in there for a good few years. :rolleyes:

    Here's the revised code:

    SELECT
         f1.ComponentNo
        ,f1.SubFormAnalyID
        ,f1.FeatureAnalyID
        ,f1.RepeatNo
        ,f1.SimulationNo
        ,f1.IsVarParam
        ,f1.LibItemID
        ,f1.LibVerNo
        ,f1.IntValue
        ,f1.RealValue
        ,f1.TextValue
        ,pc.CompAnalyID
        ,ft.EnumTypeID
        ,<<@ParentNo>>
      FROM dbo.<<@NW_Feature>>   AS f1
      JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo
      JOIN dbo.PCC_Component     AS pc ON nc.PhysID = pc.PhysID
      JOIN dbo.Feature           AS ft ON f1.FeatureAnalyID = ft.AnalyID
     WHERE f1.SimulationNo <= ( SELECT MAX(SimulationNo)
                                  FROM <<@NW_Feature>> AS nwf1
                                 WHERE f1.ComponentNo    = nwf1.ComponentNo
                                   AND f1.SubFormAnalyID = nwf1.SubFormAnalyID
                                   AND f1.FeatureAnalyID = nwf1.FeatureAnalyID
                                   AND f1.RepeatNo       = nwf1.RepeatNo
                              )
       AND ft.EnumTypeID NOT IN (14,15)
    ;

    Thanks,

    Darren

  • So do you still need the DISTINCT?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
