March 19, 2014 at 10:37 am
Hi everyone,
We are a software vendor who develops an application that executes engineering simulations. Individual simulations can run for several hours and can potentially generate more than 100 million result records, which are written to a database table. The table structure is as follows:
[ValueIndex] [smallint] NOT NULL,
[TimeIndex] [smallint] NOT NULL,
[Result] [float] NOT NULL
Each run of a simulation dynamically creates a new database table for that simulation (along with 8 other ancillary tables) so that the data is inserted quickly; a clustered primary key (on ValueIndex, TimeIndex) is added after the data is inserted. Therefore, after hundreds or thousands of simulation runs we could have several thousand database tables, although to our customers this does not appear to be a problem, as the database is invisible to them.
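For readers following along, the per-simulation workflow described above might look roughly like this (a minimal sketch; the @SimName value and PK_ naming convention are illustrative, not from the actual product):

```sql
-- Hypothetical sketch of the per-simulation pattern described above.
DECLARE @SimName sysname = N'Sim_0001';  -- one results table per simulation run
DECLARE @sql nvarchar(max);

-- 1) Create the results table with no indexes, so the bulk load goes into a heap
SET @sql = N'CREATE TABLE dbo.' + QUOTENAME(@SimName) + N'
(
    [ValueIndex] smallint NOT NULL,
    [TimeIndex]  smallint NOT NULL,
    [Result]     float    NOT NULL
);';
EXEC sys.sp_executesql @sql;

-- ... bulk insert the result records here ...

-- 2) Add the clustered primary key only after the data is loaded
SET @sql = N'ALTER TABLE dbo.' + QUOTENAME(@SimName) + N'
    ADD CONSTRAINT ' + QUOTENAME(N'PK_' + @SimName) + N'
    PRIMARY KEY CLUSTERED (ValueIndex, TimeIndex);';
EXEC sys.sp_executesql @sql;
```

Loading into an unindexed heap and building the clustered index afterwards avoids paying B-tree maintenance costs during the insert, which is presumably why the original developers chose this design.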
Performance of inserting/reading/deleting these results (no updates are done on this data) is generally acceptable at the moment.
The bigger issue is with the underlying stored procedures and their maintenance, as we have to use lots of dynamic SQL that is difficult to write, understand and debug.
My preference would be to use a single table to hold the result records for all simulations. Speaking to colleagues, though, this is how the software was originally developed; they encountered performance problems with inserting/reading/deleting data, which is why it was changed to the approach above.
Also, many of our customers use SQL Express which makes some SQL Server features such as partitioning or parallelism unusable.
I'd be interested to hear if anyone has any thoughts on how they would approach this scenario.
Should we be able to achieve fast inserts/reads/deletes for a table with millions, if not billions, of records?
Are there any NoSQL options we should consider?
Thanks for your help,
Darren
March 19, 2014 at 11:19 am
In terms of straight data collection, many of the NoSQL options are wonderful. Hadoop would be perfect for collecting this data. But when it comes time to write reports, most people then migrate to relational storage because of all the advantages there.
If you wanted to approach the idea of putting everything into a common table, my biggest suggestion would be ensuring that you pick a very good clustered key... I'd probably go with something compound that logically breaks the data storage up such that, at least in theory, no two tests are contending for the same pages; in other words, get away from using an identity column.
How real-time does the reporting have to be? Loading everything into Hadoop, which is going to collect the data quickly, then migrating it to relational storage for reporting, will work. But that second step won't be fast. In fact, it'll be slow. Or at least, slow compared to just loading the tables directly. However, that one sounds a little more attractive in your situation than the single table. But I'd prefer a single table to what you have now.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 19, 2014 at 11:45 am
Thanks for your reply Grant.
Once a simulation has completed the user would then typically view the results (or a subset of the results) via our application. They might also want to delete all of that simulation's results which also needs to be pretty quick.
Thanks again,
Darren
March 19, 2014 at 12:25 pm
Ah, then you're in a tough spot. Sure, Hadoop will be excellent for gathering the data (as would some of the other NoSQL solutions). But reporting, depending on the reports, well, it sucks. No other way to say it.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 19, 2014 at 12:54 pm
darren_ferris (3/19/2014)
Thanks for your reply Grant. Once a simulation has completed the user would then typically view the results (or a subset of the results) via our application. They might also want to delete all of that simulation's results which also needs to be pretty quick.
Thanks again,
Darren
1) Writing the code to process dynamically named tables can be pretty tedious. But each sproc or piece of code you write, you are largely done with forever, and when you do need to refactor, that is where your code comments come into play. 🙂 I have done numerous systems like this, especially for large open-ended search scenarios, and assuming you have rigorous naming conventions it really isn't that bad or tough. There are quite a few benefits here: if necessary you can spin up new databases as required (even one for every run, believe it or not). That can be critical since you have SQL Express customers and the size limit there (10GB) can come into play with lots of data. Other benefits include the ability to spread data across different filegroups, elimination of most of the latch/lock issues you may see with the single-table approach, the ability to drop or truncate tables to clear out old data, and much easier index maintenance and statistics updates.
2) If you do go with a single-table approach you really do paint yourself into some difficult corners, as mentioned above, but as Grant said, chief among your concerns for high throughput will be your clustering key. If you are on SQL 2012 I highly recommend 2 or more SEQUENCE objects. Each number generated would be something important, such as test number, run number, machine number, row number or whatever you need to uniquely identify individual records. You should put these in the clustered index in least-to-most-specific order: Test#, Run#, Row#, for example. This will keep your inserts away from each other in the B-tree that is the clustered index, mitigating latch contention at the end of the index chain. BUT it will also cause fragmentation of the index at multiple spots. At least each run will be mostly contiguously located, which is really what you need.
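A minimal sketch of the SEQUENCE-based approach Kevin describes (SQL Server 2012+); all object names here are illustrative, and the column list is simplified to the result data from the original post:

```sql
-- Sequences hand out run-level numbers up front, once per simulation run
CREATE SEQUENCE dbo.TestNoSeq AS int START WITH 1 INCREMENT BY 1 CACHE 100;
CREATE SEQUENCE dbo.RunNoSeq  AS int START WITH 1 INCREMENT BY 1 CACHE 100;

CREATE TABLE dbo.SimResult
(
    TestNo int    NOT NULL,  -- least specific key part...
    RunNo  int    NOT NULL,
    RowNo  bigint NOT NULL,  -- ...most specific key part
    Result float  NOT NULL,
    CONSTRAINT PK_SimResult PRIMARY KEY CLUSTERED (TestNo, RunNo, RowNo)
);

-- Each run grabs its numbers before inserting, so concurrent runs land in
-- different ranges of the clustered index instead of all at the rightmost page:
DECLARE @TestNo int = NEXT VALUE FOR dbo.TestNoSeq;
DECLARE @RunNo  int = NEXT VALUE FOR dbo.RunNoSeq;
```

Because the leading key columns differ per run, inserts from separate runs target separate index ranges, which is the latch-contention mitigation Kevin is describing, at the cost of fragmentation at multiple insert points.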
Sounds like a VERY fun project - I am envious!!!
Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service
March 19, 2014 at 2:06 pm
darren_ferris (3/19/2014)
This causes more of an issue with the underlying stored procedures and the maintenance of them as we have to use lots of dynamic SQL that becomes difficult to write, understand and debug.
That shouldn't be true. Dynamic SQL is actually pretty easy to write, debug, and maintain if you do it right. Do you have an example of the Dynamic SQL you're using so that I might be able to make a suggestion?
--Jeff Moden
Change is inevitable... Change for the better is not.
March 20, 2014 at 3:14 am
Thanks for the replies guys.
Kevin, you're right, customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'll attempt to use, so hopefully that will help things.
Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.
Thanks again,
Darren
March 21, 2014 at 5:36 am
I've tried changing our process so that all simulations are written to the same table with a unique ID for each simulation.
[SimId] [smallint] NOT NULL,
[ValueIndex] [smallint] NOT NULL,
[TimeIndex] [smallint] NOT NULL,
[Result] [float] NOT NULL
I've then added a clustered index across SimId, ValueIndex and TimeIndex (ValueIndex and TimeIndex are the clustered PK columns on the existing dynamically created tables).
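Put together, the combined table Darren describes would look roughly like this (table and constraint names are illustrative; only the columns listed above are from the post):

```sql
-- Single shared results table: SimId distinguishes simulations,
-- and leads the clustered key so each simulation's rows stay contiguous.
CREATE TABLE dbo.SimulationResult
(
    [SimId]      smallint NOT NULL,
    [ValueIndex] smallint NOT NULL,
    [TimeIndex]  smallint NOT NULL,
    [Result]     float    NOT NULL,
    CONSTRAINT PK_SimulationResult
        PRIMARY KEY CLUSTERED (SimId, ValueIndex, TimeIndex)
);
```

With SimId leading the key, reads and deletes for one simulation resolve to a single contiguous range of the clustered index.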
Running a simulation that produces 25 million results takes over twice as long as before to insert the data.
The speed of reading the data is pretty much the same as before, which is good.
Deleting results takes a lot longer: as Kevin mentioned, we were previously just dropping the tables of simulations we wanted to delete, whereas now we're deleting a batch of records from the single table.
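One common mitigation for slow bulk deletes (a suggestion from the editor, not from the thread) is to delete a simulation's rows in chunks, so each transaction stays small and the log doesn't balloon. A sketch, assuming the hypothetical single-table name used earlier:

```sql
DECLARE @SimId smallint = 42;   -- illustrative simulation id
DECLARE @BatchSize int = 50000;

WHILE 1 = 1
BEGIN
    -- Each batch is a contiguous clustered-index range because SimId
    -- leads the clustered key, so no full-table scan is needed.
    DELETE TOP (@BatchSize)
    FROM dbo.SimulationResult   -- illustrative table name
    WHERE SimId = @SimId;

    IF @@ROWCOUNT < @BatchSize BREAK;  -- last partial batch: done
END;
```

This will still never match the near-instant metadata operation of DROP TABLE, but it bounds transaction size and reduces lock escalation and log growth.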
Darren
March 21, 2014 at 6:07 am
Can you tell what is slowing down the inserts? I mean, other than the obvious "this index sucks". What does the execution plan look like? What are the wait stats on the inserts? Just curious, because that type of index, spreading the inserts, does work in some situations to speed up inserts.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 22, 2014 at 11:13 pm
darren_ferris (3/20/2014)
Thanks for the replies guys. Kevin, you're right, customers do tend to spin up new databases once they reach the size limit in SQL Express. I'll try the clustered index approach; the data we're inserting is already sorted in the order of the index I'll attempt to use, so hopefully that will help things.
Jeff, I've attached a file which contains one example of some of the dynamic SQL we have in place. Let me know what you think.
Thanks again,
Darren
I think you're in deep Kimchi. 🙂 Let's take a look at your first IF block of code...
IF @useAuditTrail = 1
BEGIN
Set @Sel =
'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
+ ' From ' + @NW_Feature + ' as f ' +
'INNER JOIN ' + @NW_Component + ' as nc ON f.ComponentNo = nc.ComponentNo ' +
'INNER JOIN PCC_Component as pc ON nc.PhysID = pc.PhysID ' +
'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 ' +
' INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo <= (Select MAX(SimulationNo) FROM ' + @NW_Feature + ' As nwf1
Where f.ComponentNo = nwf1.ComponentNo
AND f.SubFormAnalyID = nwf1.SubFormAnalyID
AND f.FeatureAnalyID = nwf1.FeatureAnalyID
AND f.RepeatNo = nwf1.RepeatNo) '
END
ELSE
BEGIN
Set @Sel =
'SELECT DISTINCT f.ComponentNo, f.SubFormAnalyID, f.FeatureAnalyID, f.RepeatNo, f.SimulationNo, f.IsVarParam, f.LibItemID, f.LibVerNo, f.IntValue, f.RealValue, f.TextValue, pc.CompAnalyID, Feature.EnumTypeID, ' + @ParentNo
+ ' From ' + @NW_Feature + ' as f ' +
'INNER JOIN ' + @NW_Component + ' as nc ON f.ComponentNo = nc.ComponentNo ' +
'INNER JOIN PCC_Component as pc ON nc.PhysID = pc.PhysID ' +
'INNER JOIN Feature ON f.FeatureAnalyID = Feature.AnalyID AND Feature.EnumTypeID <> 14 AND Feature.EnumTypeID <> 15 ' +
' INNER JOIN ' + @NW_Feature + ' ON f.SimulationNo = (Select MAX(SimulationNo) FROM ' + @NW_Feature + ' As nwf1
Where f.ComponentNo = nwf1.ComponentNo
AND f.SubFormAnalyID = nwf1.SubFormAnalyID
AND f.FeatureAnalyID = nwf1.FeatureAnalyID
AND f.RepeatNo = nwf1.RepeatNo) '
END
My first step in trying to make something like this easier is to remove all the dynamic stuff from the first query, ending up with readable code containing "variable tokens" that I'd replace later on. The formatting becomes nicely vertically aligned, making it easier to see everything. Like a good DBA, I added aliases to everything that was missing them and renumbered the rest to make sense of it all. Like this...
SELECT DISTINCT
f1.ComponentNo
,f1.SubFormAnalyID
,f1.FeatureAnalyID
,f1.RepeatNo
,f1.SimulationNo
,f1.IsVarParam
,f1.LibItemID
,f1.LibVerNo
,f1.IntValue
,f1.RealValue
,f1.TextValue
,pc.CompAnalyID
,ft.EnumTypeID
,<<@ParentNo>>
FROM dbo.<<@NW_Feature>> AS f1
JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo
JOIN dbo.PCC_Component AS pc ON nc.PhysID = pc.PhysID
JOIN dbo.Feature AS ft ON f1.FeatureAnalyID = ft.AnalyID
  JOIN dbo.<<@NW_Feature>> AS f2 ON f1.SimulationNo <= ( SELECT MAX(SimulationNo)
FROM <<@NW_Feature>> AS nwf1
WHERE f1.ComponentNo = nwf1.ComponentNo
AND f1.SubFormAnalyID = nwf1.SubFormAnalyID
AND f1.FeatureAnalyID = nwf1.FeatureAnalyID
AND f1.RepeatNo = nwf1.RepeatNo
)
WHERE ft.EnumTypeID NOT IN (14,15)
;
That's when I saw the reason for the DISTINCT... the "JOIN dbo.<<@NW_Feature>> AS f2" line in the code above forms a nasty ol' CROSS JOIN on whatever table @NW_Feature will describe. Notice that the "f2" alias isn't used anywhere else... not even in one of the ON clauses like it needs to be.
If you could fix that little problem in my "first step" code above, then I'd be happy to show you how to eliminate the massive amount of duplication of code that you have in the text file that you attached while making all of the code much easier to read and troubleshoot.
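A sketch of the token-replacement step Jeff is describing: keep one readable template and substitute the run-specific names at the end. The variable names match Darren's procedure; the template text is abbreviated for illustration, and the sample values are hypothetical:

```sql
DECLARE @NW_Feature   sysname = N'NW_Feature_0001',   -- illustrative values
        @NW_Component sysname = N'NW_Component_0001',
        @ParentNo nvarchar(20) = N'1';

-- The readable, vertically aligned template with <<tokens>> as placeholders
DECLARE @Template nvarchar(max) = N'
SELECT f1.ComponentNo, f1.SimulationNo, <<@ParentNo>> AS ParentNo
  FROM dbo.<<@NW_Feature>>   AS f1
  JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo;';

-- Substitute the tokens just before execution; QUOTENAME guards the
-- dynamically supplied object names against injection
DECLARE @sql nvarchar(max) =
    REPLACE(REPLACE(REPLACE(@Template
        ,'<<@NW_Feature>>',   QUOTENAME(@NW_Feature))
        ,'<<@NW_Component>>', QUOTENAME(@NW_Component))
        ,'<<@ParentNo>>',     @ParentNo);

EXEC sys.sp_executesql @sql;
```

The point of the technique is that the SQL stays readable and diffable as a single template, with all the string surgery confined to one REPLACE chain at the bottom.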
--Jeff Moden
Change is inevitable... Change for the better is not.
March 24, 2014 at 3:58 am
Hi Grant,
I've done some further testing on the inserts and it only takes about 15% longer, so not too bad. I'm bulk loading the data via a C# application, using SqlBulkCopy, in batches of around 10k records.
The execution plan on a simple insert of 1 record shows that 100% of the operator cost is in the clustered index insert.
I've profiled the inserting of all of the data and have the following wait stats although unfortunately this is not an area that I'm totally familiar with.
I've used the query provided by Paul Randal at http://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/:
WaitType                        Wait_S  Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
PREEMPTIVE_OS_WRITEFILEGATHER    66.59       66.59      0.00          6       28.71    11.0985   11.0985    0.0000
LATCH_EX                         58.84       58.84      0.00          8       25.37     7.3553    7.3550    0.0003
ASYNC_IO_COMPLETION              41.94       41.94      0.00         10       18.08     4.1937    4.1937    0.0000
ASYNC_NETWORK_IO                 28.87       26.64      2.23      53798       12.45     0.0005    0.0005    0.0000
WRITELOG                         23.24       22.81      0.43       6255       10.02     0.0037    0.0036    0.0001
IO_COMPLETION                     8.29        8.26      0.03      17390        3.57     0.0005    0.0005    0.0000
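For readers who want to reproduce this, the figures come from the sys.dm_os_wait_stats DMV. A simplified cousin of Paul Randal's query (his version additionally filters out benign waits and computes the percentage and average columns; see the linked post for the real thing):

```sql
SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0                         AS Wait_S,
    (wait_time_ms - signal_wait_time_ms) / 1000.0 AS Resource_S,
    signal_wait_time_ms / 1000.0                  AS Signal_S,
    waiting_tasks_count                           AS WaitCount
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```

Note the stats are cumulative since the last service restart (or manual clear), so capture them before and after the insert run to isolate its waits.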
Thanks again for your support.
Darren
March 24, 2014 at 4:49 am
That looks like hardware is the bottleneck. It's the writes that are going slow.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
March 24, 2014 at 5:16 am
I can't speak to the INSERTs but the accidental CROSS JOIN in the SELECT code will beat the hell out of the hardware making it look like a hardware problem.
--Jeff Moden
Change is inevitable... Change for the better is not.
March 24, 2014 at 5:27 am
Hi Jeff,
I've fixed the problem with the CROSS JOIN; it looks like we've had that code in there for a good few years 🙄.
Here's the revised code: -
SELECT
f1.ComponentNo
,f1.SubFormAnalyID
,f1.FeatureAnalyID
,f1.RepeatNo
,f1.SimulationNo
,f1.IsVarParam
,f1.LibItemID
,f1.LibVerNo
,f1.IntValue
,f1.RealValue
,f1.TextValue
,pc.CompAnalyID
,ft.EnumTypeID
,<<@ParentNo>>
FROM dbo.<<@NW_Feature>> AS f1
JOIN dbo.<<@NW_Component>> AS nc ON f1.ComponentNo = nc.ComponentNo
JOIN dbo.PCC_Component AS pc ON nc.PhysID = pc.PhysID
JOIN dbo.Feature AS ft ON f1.FeatureAnalyID = ft.AnalyID
WHERE f1.SimulationNo <= ( SELECT MAX(SimulationNo)
FROM <<@NW_Feature>> AS nwf1
WHERE f1.ComponentNo = nwf1.ComponentNo
AND f1.SubFormAnalyID = nwf1.SubFormAnalyID
AND f1.FeatureAnalyID = nwf1.FeatureAnalyID
AND f1.RepeatNo = nwf1.RepeatNo
)
AND ft.EnumTypeID NOT IN (14,15)
;
Thanks,
Darren
March 24, 2014 at 7:01 am
So do you still need the DISTINCT?
--Jeff Moden
Change is inevitable... Change for the better is not.