Fast inserting/reading/deleting of millions of records
Posted Wednesday, March 19, 2014 10:37 AM
Forum Newbie
Hi everyone,

We are a software vendor that develops an application for running engineering simulations. Individual simulations can run for several hours and potentially generate more than 100 million result records, which are written to a database table. The table structure is as follows:

[ValueIndex] [smallint] NOT NULL,
[TimeIndex]  [smallint] NOT NULL,
[Result]     [float]    NOT NULL


Each run of a simulation dynamically creates a new database table for that simulation (along with 8 other ancillary tables) so that the data is inserted quickly; a clustered primary key (on ValueIndex, TimeIndex) is added after the data is inserted. Therefore, after hundreds or thousands of simulation runs we could have several thousand database tables, although our customers don't see this as a problem because the database is invisible to them.
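
In outline, each run does something like the following (a sketch only; the table name here is illustrative, our real code generates it dynamically):

DECLARE @SimTable sysname = N'Sim_00042';   -- illustrative per-run name
DECLARE @sql nvarchar(max);

-- Create the per-run results table (no index yet, so inserts are fast).
SET @sql = N'CREATE TABLE dbo.' + QUOTENAME(@SimTable) + N' (
    [ValueIndex] [smallint] NOT NULL,
    [TimeIndex]  [smallint] NOT NULL,
    [Result]     [float]    NOT NULL);';
EXEC sys.sp_executesql @sql;

-- ... bulk insert the simulation results here ...

-- Add the clustered primary key once the load has finished.
SET @sql = N'ALTER TABLE ' + QUOTENAME(@SimTable)
         + N' ADD CONSTRAINT ' + QUOTENAME(N'PK_' + @SimTable)
         + N' PRIMARY KEY CLUSTERED ([ValueIndex], [TimeIndex]);';
EXEC sys.sp_executesql @sql;
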
Performance of inserting/reading/deleting these results (no updates are done on this data) is generally acceptable at the moment.

The bigger issue is the underlying stored procedures and their maintenance: we have to use lots of dynamic SQL, which is difficult to write, understand, and debug.

My preference would be to use a single table to hold the result records for all simulations. However, speaking to colleagues, this was how the software was originally developed; they encountered performance problems with inserting/reading/deleting data, which is why it was changed to the approach above.

Also, many of our customers use SQL Express, which rules out SQL Server features such as partitioning and parallelism.

I'd be interested to hear how others would approach this scenario.
Should we be able to achieve fast inserts/reads/deletes on a table with millions, if not billions, of records?
Are there any NoSQL options we should consider?

Thanks for your help,

Darren
Post #1552704
Posted Wednesday, March 19, 2014 11:19 AM


SSChampion
In terms of straight data collection, many of the NoSQL options are wonderful. Hadoop would be perfect for collecting the data. But when it comes time to write reports, most people then migrate to relational storage because of all the advantages there.

If you wanted to approach the idea of putting everything into a common table, my biggest suggestion would be ensuring that you pick a very good clustered key... I'd probably go with something compound that logically breaks the data storage up such that, at least in theory, no two tests are contending for the same pages. In other words, get away from using an identity column.
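
Just to sketch the shape I mean (SimId here is a hypothetical run identifier, not necessarily your column):

CREATE TABLE dbo.SimulationResult (
    SimId      smallint NOT NULL,   -- which simulation run (hypothetical)
    ValueIndex smallint NOT NULL,
    TimeIndex  smallint NOT NULL,
    Result     float    NOT NULL,
    CONSTRAINT PK_SimulationResult
        PRIMARY KEY CLUSTERED (SimId, ValueIndex, TimeIndex)   -- compound key, no identity
);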

How real-time does the reporting have to be? Loading everything into Hadoop, which is going to collect the data quickly, then migrating it to relational storage for reporting will work. But that second step won't be fast. In fact, it'll be slow, or at least slow compared to just loading the tables directly. Still, that option sounds a little more attractive in your situation than the single table. But I'd prefer a single table to what you have now.


----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of: SQL Server Query Performance Tuning
SQL Server 2012 Query Performance Tuning
SQL Server 2008 Query Performance Tuning Distilled
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1552728
Posted Wednesday, March 19, 2014 11:45 AM
Forum Newbie
Thanks for your reply, Grant.

Once a simulation has completed, the user would typically view the results (or a subset of them) via our application. They might also want to delete all of that simulation's results, which also needs to be pretty quick.

Thanks again,

Darren
Post #1552742
Posted Wednesday, March 19, 2014 12:25 PM


SSChampion
Ah, then you're in a tough spot. Sure, Hadoop will be excellent for gathering the data (as would some of the other NoSQL solutions). But reporting, depending on the reports, well, it sucks. No other way to say it.

----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of: SQL Server Query Performance Tuning
SQL Server 2012 Query Performance Tuning
SQL Server 2008 Query Performance Tuning Distilled
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1552755
Posted Wednesday, March 19, 2014 12:54 PM


SSCarpal Tunnel
darren_ferris (3/19/2014)
Thanks for your reply, Grant.

Once a simulation has completed, the user would typically view the results (or a subset of them) via our application. They might also want to delete all of that simulation's results, which also needs to be pretty quick.

Thanks again,

Darren


1) Writing the code to process dynamically named tables can be pretty tedious, but for each sproc or piece of code, once you write it you are largely done with it forever. And when you do need to refactor, that is where your code comments come into play. I have done numerous systems like this, especially for large open-ended search scenarios, and assuming you have rigorous naming conventions it really isn't that bad or tough. There are quite a few benefits here: if necessary you can spin up new databases as required (even one per run, believe it or not). That can be critical since you have SQL Express customers and the size limit there (10GB) can come into play with lots of data. Other benefits include the ability to spread data across different filegroups, the elimination of most of the latch/lock issues you may see with the single-table approach, the ability to drop or truncate tables to clear out old data, and much easier index maintenance and statistics updates.
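
For example, clearing out an old run becomes a quick metadata operation instead of a big logged delete (a sketch; the naming convention here is hypothetical):

DECLARE @SimTable sysname = N'Sim_00042';   -- hypothetical per-run table name
DECLARE @sql nvarchar(max) = N'DROP TABLE dbo.' + QUOTENAME(@SimTable) + N';';
EXEC sys.sp_executesql @sql;   -- the whole run's data disappears at once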

2) If you do go with a single-table approach you really do paint yourself into some difficult corners, as mentioned above, but as Grant said, chief among your concerns for high throughput will be your clustering key. Since you are on SQL 2012, I highly recommend two or more SEQUENCE objects. Each number generated will be something important, such as test number, run number, machine number, row number, or whatever you need to uniquely identify individual records. You should put these in the clustered index in least-to-most-specific order: Test#, Run#, Row#, for example. This will keep your inserts away from each other in the B-tree that is the clustered index, mitigating latch contention at the end of the index chain. But this will also cause fragmentation of the index at multiple spots. At least each run will be mostly contiguously located, which is really what you need.
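
A minimal sketch of the SEQUENCE idea (names are illustrative, and the single table is assumed to look like the SimulationResult sketch earlier in the thread):

-- One sequence per identifying level; here just the run number (SQL 2012+).
CREATE SEQUENCE dbo.RunNoSeq AS smallint START WITH 1 INCREMENT BY 1;

-- Each run draws its number once, up front...
DECLARE @RunNo smallint = NEXT VALUE FOR dbo.RunNoSeq;

-- ...and every row for that run leads the clustered key with it, so
-- concurrent runs insert into different ranges of the index.
INSERT INTO dbo.SimulationResult (SimId, ValueIndex, TimeIndex, Result)
VALUES (@RunNo, 1, 1, 0.0);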

Sounds like a VERY fun project - I am envious!!!


Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1552769
Posted Wednesday, March 19, 2014 2:06 PM


SSC-Dedicated
darren_ferris (3/19/2014)
The bigger issue is the underlying stored procedures and their maintenance: we have to use lots of dynamic SQL, which is difficult to write, understand, and debug.


That shouldn't be true. Dynamic SQL is actually pretty easy to write, debug, and maintain if you do it right. Do you have an example of the Dynamic SQL you're using so that I might be able to make a suggestion?


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1552789
Posted Thursday, March 20, 2014 3:14 AM
Forum Newbie
Thanks for the replies guys.

Kevin, you're right: customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'll attempt to use, so hopefully that will help things.

Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.

Thanks again,

Darren


  Post Attachments 
Dynamic SQL.txt (25 views, 9.97 KB)
Post #1552909
Posted Friday, March 21, 2014 5:36 AM
Forum Newbie
I've tried changing our process so that all simulations are written to the same table with a unique ID for each simulation.

[SimId]      [smallint] NOT NULL,
[ValueIndex] [smallint] NOT NULL,
[TimeIndex]  [smallint] NOT NULL,
[Result]     [float]    NOT NULL

I've then added a clustered index across SimId, ValueIndex and TimeIndex (ValueIndex and TimeIndex are the clustered PK columns on the existing dynamically created tables).

Running a simulation that produces 25 million results now takes over twice as long as before to insert the data.
The speed of reading the data is much the same as before, which is good.
Deleting results takes a lot longer: as Kevin mentioned, we were previously just dropping the tables for simulations we wanted to delete, whereas now we're deleting a batch of records from the single table.
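
For reference, the delete I'm running now is along these lines (a sketch; the table name and batch size are illustrative):

DECLARE @SimId smallint = 42;   -- the simulation being deleted

WHILE 1 = 1
BEGIN
    -- Deleting in batches keeps each transaction (and the log growth) bounded.
    DELETE TOP (50000)
      FROM dbo.SimulationResult
     WHERE SimId = @SimId;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left for this run
END;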

Darren
Post #1553435
Posted Friday, March 21, 2014 6:07 AM


SSChampion
Can you tell what is slowing down the inserts? I mean, other than the obvious "this index sucks." What does the execution plan look like? What are the wait stats on the inserts? Just curious, because that type of index, spreading the inserts, does work in some situations to speed up inserts.
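
If you're not sure where to start, even a rough look at the top waits will tell you something (a sketch; sys.dm_os_wait_stats is cumulative since the last restart, so compare a snapshot from before the insert run with one from after):

SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms
  FROM sys.dm_os_wait_stats
 WHERE wait_type NOT LIKE N'SLEEP%'   -- skip some of the idle waits
 ORDER BY wait_time_ms DESC;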

----------------------------------------------------
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." Theodore Roosevelt
The Scary DBA
Author of: SQL Server Query Performance Tuning
SQL Server 2012 Query Performance Tuning
SQL Server 2008 Query Performance Tuning Distilled
and
SQL Server Execution Plans

Product Evangelist for Red Gate Software
Post #1553450
Posted Saturday, March 22, 2014 11:13 PM


SSC-Dedicated
darren_ferris (3/20/2014)
Thanks for the replies guys.

Kevin, you're right: customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'll attempt to use, so hopefully that will help things.

Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.

Thanks again,

Darren


I think you're in deep Kimchi. Let's take a look at your first IF block of code...

IF @useAuditTrail = 1
BEGIN
    SET @Sel =
        'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
      + ' FROM ' + @NW_Feature + ' AS f '
      + 'INNER JOIN ' + @NW_Component + ' AS nc ON f.ComponentNo = nc.ComponentNo '
      + 'INNER JOIN PCC_Component AS pc ON nc.PhysID = pc.PhysID '
      + 'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 '
      + 'INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo <= (SELECT MAX(SimulationNo) FROM ' + @NW_Feature + ' AS nwf1
            WHERE f.ComponentNo = nwf1.ComponentNo
              AND f.SubFormAnalyID = nwf1.SubFormAnalyID
              AND f.FeatureAnalyID = nwf1.FeatureAnalyID
              AND f.RepeatNo = nwf1.RepeatNo) '
END
ELSE
BEGIN
    SET @Sel =
        'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
      + ' FROM ' + @NW_Feature + ' AS f '
      + 'INNER JOIN ' + @NW_Component + ' AS nc ON f.ComponentNo = nc.ComponentNo '
      + 'INNER JOIN PCC_Component AS pc ON nc.PhysID = pc.PhysID '
      + 'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 '
      + 'INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo = (SELECT MAX(SimulationNo) FROM ' + @NW_Feature + ' AS nwf1
            WHERE f.ComponentNo = nwf1.ComponentNo
              AND f.SubFormAnalyID = nwf1.SubFormAnalyID
              AND f.FeatureAnalyID = nwf1.FeatureAnalyID
              AND f.RepeatNo = nwf1.RepeatNo) '
END


My first step in trying to make something like this easier is to remove all the dynamic stuff from the first query and end up with some readable code with some "variable tokens" that I'd replace later on. The formatting becomes nice and vertically aligned, making it easier to see everything. Like a good DBA, I added aliases to everything that was missing them and renumbered the rest to make sense of it all. Like this...

SELECT DISTINCT
        f1.ComponentNo
       ,f1.SubFormAnalyID
       ,f1.FeatureAnalyID
       ,f1.RepeatNo
       ,f1.SimulationNo
       ,f1.IsVarParam
       ,f1.LibItemID
       ,f1.LibVerNo
       ,f1.IntValue
       ,f1.RealValue
       ,f1.TextValue
       ,pc.CompAnalyID
       ,ft.EnumTypeID
       ,<<@ParentNo>>
   FROM dbo.<<@NW_Feature>>   AS f1
   JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo
   JOIN dbo.PCC_Component     AS pc ON nc.PhysID = pc.PhysID
   JOIN dbo.Feature           AS ft ON f1.FeatureAnalyID = ft.AnalyID
   JOIN dbo.<<@NW_Feature>>   AS f2 ON f1.SimulationNo <= (SELECT MAX(SimulationNo)
                                                             FROM <<@NW_Feature>> AS nwf1
                                                            WHERE f1.ComponentNo = nwf1.ComponentNo
                                                              AND f1.SubFormAnalyID = nwf1.SubFormAnalyID
                                                              AND f1.FeatureAnalyID = nwf1.FeatureAnalyID
                                                              AND f1.RepeatNo = nwf1.RepeatNo)
  WHERE ft.EnumTypeID NOT IN (14,15)
;

That's when I saw the reason for the DISTINCT... the join to <<@NW_Feature>> AS f2 in the code above forms a nasty ol' CROSS JOIN on whatever table @NW_Feature will describe. Notice that the "f2" alias isn't used anywhere else... not even in one of the ON clauses like it needs to be.

If you could fix that little problem in my "first step" code above, then I'd be happy to show you how to eliminate the massive amount of duplication of code that you have in the text file that you attached while making all of the code much easier to read and troubleshoot.
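
Just to show where I'm headed, the token substitution at the end boils down to a few REPLACEs (a sketch; the variable names and sample values here are assumptions, not your actual code):

-- Assume these arrive as sproc parameters; sample values for illustration only:
DECLARE @Template     nvarchar(max) = N'...the readable query above, tokens and all...';
DECLARE @NW_Feature   sysname       = N'NW_Feature_00042';
DECLARE @NW_Component sysname       = N'NW_Component_00042';
DECLARE @ParentNo     nvarchar(10)  = N'0';

-- Swap the <<tokens>> in one place, then execute.
DECLARE @sql nvarchar(max) = @Template;
SET @sql = REPLACE(@sql, N'<<@NW_Feature>>',   QUOTENAME(@NW_Feature));
SET @sql = REPLACE(@sql, N'<<@NW_Component>>', QUOTENAME(@NW_Component));
SET @sql = REPLACE(@sql, N'<<@ParentNo>>',     @ParentNo);

EXEC sys.sp_executesql @sql;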


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1553793