Can't Import Data Fast Enough - Optimize or rethink the design?

  • I've written a SQL application to collect application log files from 100+ servers. However, the servers are generating log files faster than I can import them. I'm almost at the break-even point, but we'll be adding more and more servers, so I think it is a losing battle. I'm starting to get the feeling I need to rethink my design, but I'm not a SQL expert by any means, so any suggestions would be appreciated.

    Existing Solution:

    Log files are standard CSV logs that exist on servers around the world. They are 10MB in size with about 50,000 rows in each file.

    (1) Copy all log files locally.

    (2) Console app runs locally on the SQL server and calls a stored procedure.

    (3) SP uses BULK INSERT to load into a staging table with very loosely typed columns (ints and nvarchars) and no indexes. I think there is a PK defined on this table.

    (4) SP then copies (using INSERT INTO with a nested SELECT) from the staging table into the final table with strongly typed columns and 8-10 indexes. There is some data cleansing done at this stage.

    I can import about 6 log files a minute. Looking at my server, I'm not sure where the bottleneck is. SQL isn't using very much CPU (between 30-50%) and my memory footprint is under 2GB (system has 4GB). Log files are on a separate drive from the DB. Each drive is made up of multiple disks in a RAID0 config. On top of how slow the import is, the console application gets timeout errors from time to time when calling the stored procedure. The timeout is set to 3 minutes, which should be plenty of time.

    I've reached the end of my knowledge, so I'm not sure whether I should be looking to optimize this solution or rethink it, e.g. break the data up into different tables (Server01-Server20 goes into Table1, and so on) and then have multiple instances of the console application, each handling imports for a different group of servers.

    Thanks for your help!

  • Does your console app allow for parallel processing?

    What is the hardware configuration of your server? (e.g. how many CPUs, and how many of them are usable for SQL Server?)

    Do you import into one single staging table or separate tables? If the latter, how are those separate tables selected? (Which file goes into which table?)

    I think the key is to run your processes in parallel, thereby utilizing more CPU.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • winterminute (9/13/2009)


    (4) SP then copies (using INSERT INTO with a nested SELECT) from the staging table into the final table with strongly typed columns and 8-10 indexes. There is some data cleansing done at this stage.

    I'd have to say that the holdup is probably shared between the 8-10 indexes and the "data cleansing". It might help if you posted the code for the "copy" and the CREATE TABLE statement, along with all of the indexes, constraints, and any triggers that may exist on the final table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There are 2 types of logs with slightly different schemas. Depending on the log type, each file goes into one of two staging tables, and then both end up getting reconciled into a single table. But to answer what I think you were asking, one log file = one staging table.

    It is a 4-core server; I don't know how many are dedicated to SQL. How can I check this? The App itself could run in parallel, but it might end up tripping over itself in the stored procedure. The first thing the stored proc does is clean out the staging table (which, I just realized, uses DELETE FROM instead of TRUNCATE, so there's one optimization), so I would need to figure out how to make sure one instance of the app isn't deleting while another is trying to import. I could have multiple staging tables so that multiple instances of the app could be importing at the same time.
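
    For illustration, a minimal sketch of that one optimization (the staging table name here is a placeholder for the two real staging tables). TRUNCATE is minimally logged and deallocates whole pages instead of logging each deleted row:

    -- Hypothetical staging table name; the real procedure picks one of two staging tables.
    TRUNCATE TABLE dbo.LogStaging1;
    -- instead of:
    -- DELETE FROM dbo.LogStaging1;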

  • Please see my previous post above.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    It is a 4-core server; I don't know how many are dedicated to SQL. How can I check this?

    In Management Studio, right-click on the server in the Object Explorer window and go to Properties -> Processors. Tell us what you see (are both check boxes for automatic processor affinity mask and I/O affinity mask checked? How many processors are checked?). Important note: don't make any changes, just verify!
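
    If a query is easier than the GUI, a small sketch (SQL Server 2005 and later) that lists one row per scheduler SQL Server can actually use:

    -- Each VISIBLE ONLINE scheduler corresponds to a CPU core SQL Server can schedule work on.
    SELECT scheduler_id, cpu_id, status, is_online
    FROM   sys.dm_os_schedulers
    WHERE  status = 'VISIBLE ONLINE';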

    The App itself could run in parallel, but it might end up tripping over itself in the stored procedure.

    It sounds like you're performing a staging table delete/truncation as well as the index-related changes for each and every file, so parallel processing is not really an option at this moment. There's most probably some room for improvement. In order to look deeper into it, please provide the data Jeff asked for.

    How "fast" (= how often) each server generates a log file?

    Did you consider a process where you'd import more than one file and perform the "data cleansing" on a block of files?
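
    Roughly, that could look like the sketch below (file paths and the staging table name are placeholders). The point is that the cleansing insert, and therefore the index maintenance on the final table, runs once per batch instead of once per file:

    -- Load several files into the same staging table first...
    BULK INSERT dbo.LogStaging1 FROM 'D:\Logs\server01_0001.csv'
        WITH (FIRSTROW = 6, MAXERRORS = 500, FIELDTERMINATOR = ',');
    BULK INSERT dbo.LogStaging1 FROM 'D:\Logs\server02_0001.csv'
        WITH (FIRSTROW = 6, MAXERRORS = 500, FIELDTERMINATOR = ',');
    -- ...then run the cleansing/INSERT step once for the whole batch.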



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • 1) Both I/O and processor affinity are set to automatic, so all 4 cores are checked (though the check boxes are disabled).

    2) No, I hadn't considered doing the data cleansing in bulk, but if that's the way to go, it should be pretty easy.

    3) On a heavily used server, we're looking at a new log file (i.e. 10MB before it rolls over) about every 10 minutes, but on less loaded servers it might be one per hour.

    4) The staging table gets cleared every time the SP is run, but there aren't any indexes or constraints on that table. The Users table has 2 indexes and the UserActivity table has 8. There are no triggers on either table.

    CREATE TABLE [dbo].[UserActivity](
        [DateTimeStamp] [datetime] NOT NULL,
        [SessionId] [int] NOT NULL,
        [SequenceNumber] [int] NOT NULL,
        [UserId] [int] NOT NULL,
        [ActivityType] [nvarchar](9) NOT NULL,
        [DeploymentId] [int] NULL,
        ...,
        CONSTRAINT [PK_UserActivity] PRIMARY KEY CLUSTERED
        (
            [DateTimeStamp] ASC,
            [SessionId] ASC,
            [SequenceNumber] ASC,
            [UserId] ASC,
            [ActivityType] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    5) Now that I look at the SP, I realize this one doesn't have much in the way of data cleansing. There is another one I'm working on which does a lot of DateTime manipulation before inserting, but this one is pretty straightforward. The only reason I went to a staging table was that the schemas are different, so someone on this forum said the "correct" way to solve it was to go to a staging table and then reconcile the changes.

    Here's the whole stored proc. I've sanitized it a bit to hide our internal details, but it should be pretty straightforward. By the way, I hate this dynamic SQL stuff, so if there is a better way to handle variables I'd appreciate a pointer in the right direction.

    -- SET NOCOUNT ON added to prevent extra result sets from interfering with SELECT statements.
    SET NOCOUNT ON;

    -- Set some variables based on the LogType
    DECLARE @StagingDb nvarchar(30), @SQLString nvarchar(1000)

    SET @StagingDb =
        CASE @LogType
            WHEN '1' THEN '1'
            WHEN '2' THEN '2'
        END;

    SET @SQLString = N'DELETE FROM ' + @StagingDb;
    EXECUTE sp_executesql @SQLString

    SET @SQLString =
        N'BULK INSERT ' + @StagingDb + ' FROM ''' + @LogFile + '''
        WITH
        (
            FIRSTROW = 6,
            MAXERRORS = 500,
            FIELDTERMINATOR = '','',
            ROWTERMINATOR = ''''
        )'
    EXECUTE sp_executesql @SQLString

    EXEC
    ('
        INSERT INTO Users (UserId, ServerId)
        SELECT DISTINCT LOWER(UserId), ' + @ServerId + '
        FROM ' + @StagingDb + '
        WHERE NOT EXISTS
        (
            SELECT UserId
            FROM Users
            WHERE ' + @StagingDb + '.UserId = Users.UserId
        )
    ')

    EXEC
    ('
        INSERT INTO UserActivity
            (Column1, Column2, etc)
        SELECT Column1, Column2
        FROM ' + @StagingDb + '
        INNER JOIN Users
            ON ' + @StagingDb + '.UserId = Users.UserId
        WHERE NOT EXISTS
        (
            SELECT UserId
            FROM UserActivity
            WHERE UserActivity.SessionId = ' + @StagingDb + '.[session-id]
              AND UserActivity.SequenceNumber = ' + @StagingDb + '.[seq-number]
              AND UserActivity.DateTimeStamp = ' + @StagingDb + '.[date-time]
        )
    ')
    END
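
    On the dynamic SQL question: a table name always has to be concatenated, but scalar values don't. A small sketch, assuming @ServerId is an int parameter of the procedure (adjust the type to whatever it really is), of passing it to sp_executesql as a real parameter instead of splicing it into the string:

    DECLARE @sql nvarchar(1000);
    SET @sql = N'INSERT INTO Users (UserId, ServerId)
                 SELECT DISTINCT LOWER(UserId), @ServerId
                 FROM ' + @StagingDb + N'
                 WHERE NOT EXISTS (SELECT 1 FROM Users
                                   WHERE ' + @StagingDb + N'.UserId = Users.UserId)';
    -- The parameter is type-safe and the plan can be reused across @ServerId values.
    EXECUTE sp_executesql @sql, N'@ServerId int', @ServerId = @ServerId;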

  • Oddly enough, if you were to use a temp table instead of a permanent staging table, you could fix 3 problems...

    1. Being able to run parallel jobs concurrently (if you wanted to do that).

    2. Would get rid of most of your dynamic SQL.

    3. Would increase performance if for no other reason than it's faster to create a new table than it is to delete rows from a permanent table.

    I realize you sanitized the code, which means you also took some of the column names out of the table definition for the user activity table... how many columns does it actually have? I'm asking because 8 indexes seems awfully high for something I estimate to have no more than 200 characters of input.
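
    A minimal sketch of the temp-table approach described in the list above, assuming the loosely typed staging columns from the posted schema (the column list is abbreviated and the types are illustrative). Only the file name still needs dynamic SQL, and each session gets its own copy of #Staging, so parallel runs don't collide:

    CREATE TABLE #Staging
    (
        [date-time]  nvarchar(50),
        [session-id] int,
        [seq-number] int,
        UserId       nvarchar(100)
        -- ...remaining loosely typed columns...
    );

    DECLARE @sql nvarchar(1000);
    SET @sql = N'BULK INSERT #Staging FROM ''' + @LogFile + N'''
                WITH (FIRSTROW = 6, MAXERRORS = 500, FIELDTERMINATOR = '','')';
    EXECUTE sp_executesql @sql;   -- the temp table created above is visible to this batch

    -- No DELETE/TRUNCATE step is needed; #Staging disappears when the procedure ends.
    INSERT INTO dbo.Users (UserId, ServerId)
    SELECT DISTINCT LOWER(s.UserId), @ServerId
    FROM #Staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Users u WHERE u.UserId = s.UserId);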

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There are 18 columns. The 8 indexes correspond to potential WHERE clauses. I seem to recall reading that any column that could be used in a WHERE clause (or a JOIN, for that matter) should have an index. What do you mean by "estimate to have no more than 200 characters of input"?

    TEMP tables are an interesting idea. The way the current code is structured, I could prototype that alongside the running code. I've disabled import on half the data while I get my throughput problems figured out.

  • winterminute (9/15/2009)


    There are 18 columns. The 8 indexes correspond to potential WHERE clauses. I seem to recall reading that any column that could be used in a WHERE clause (or a JOIN, for that matter) should have an index. What do you mean by "estimate to have no more than 200 characters of input"?

    TEMP tables are an interesting idea. The way the current code is structured, I could prototype that alongside the running code. I've disabled import on half the data while I get my throughput problems figured out.

    What I meant by "200 characters of input" was that you have 10MB per file and 50K rows per file. If you do that bit of math, there are about 200 characters of input per LINE in the file.

    Your CLUSTERED index appears to be OK for inserts (ordered temporally) if the data is actually inserted in order by DateTimeStamp. A problem with page splits will certainly occur if the DateTimeStamp in any new data is less than the max DateTimeStamp in the UserActivity table, and that can cause quite a delay.
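
    One way to check whether that is actually happening is the standard DMV below (SQL Server 2005 and later); persistently high fragmentation on the clustered index of a table that only ever receives new rows is a strong hint that out-of-order inserts are splitting pages:

    SELECT index_id, avg_fragmentation_in_percent, page_count
    FROM   sys.dm_db_index_physical_stats
           (DB_ID(), OBJECT_ID('dbo.UserActivity'), NULL, NULL, 'LIMITED');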

    The indexes are probably fine based on what you said, although I'd be tempted to combine some of them to lighten the load if possible. They will cause a slowdown because they must also be updated on every insert.
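
    As a purely hypothetical example of what "combining" could look like (these index names and columns are made up, not taken from the real table): if two columns are usually filtered together, one composite index can replace a pair of single-column indexes.

    DROP INDEX IX_UserActivity_ServerId ON dbo.UserActivity;
    DROP INDEX IX_UserActivity_UserId   ON dbo.UserActivity;

    -- One composite index covers WHERE ServerId = ... AND UserId = ...,
    -- and still supports seeks on ServerId alone (the leading column).
    CREATE NONCLUSTERED INDEX IX_UserActivity_ServerId_UserId
        ON dbo.UserActivity (ServerId, UserId);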

    Do you have any triggers on the table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just want to throw a suggestion in here: it may be worth a slight refactor of your import process to use the MS LogParser application (here) for the initial load. It's designed for just this kind of purpose, and if your log files aren't already ordered in clustered index order, using it would allow you to sort them before the insert, perhaps gaining a little speed.

    Downside: you can't use a proc for the insert (tip: tune the batch size parameter up as high as you need to control the rows per transaction). Its pseudo-SQL operations (like the sort) are done in memory before the data is passed to the ODBC layer for the insert, so the speed of this will vary depending on your memory allocation for non-SQL Server processes.

    Regards,

    Jacob

  • I finally had some time to rework my import queries. I changed my SP to use temp tables, so now I can run multiple imports at the same time, which has helped quite a bit, but I think it is still a losing race if I need to add more servers. Log files still take between 60-120 seconds to import. I don't know for sure, but my guess is that when you have a lot of rows (i.e. millions) and half a dozen indexes, every row you insert takes more and more time because the indexes have to be updated, correct? If that's true, then without an aggressive data pruning plan I'm sunk.

    Jeff: I don't have any triggers. What do you mean by "combining indexes"?

    Jacob: I think the bottleneck is going from my temp table to the main table, which LogParser wouldn't help with.

    I got to thinking about one more optimization:

    The log files are constantly being written and only roll over at 10MB. I don't want to wait until a log file is done (we need the data available pretty quickly), so I have logic in my SP to let me process the same log files over and over again. I import the entire log into my temp table and then insert into the main table with a WHERE NOT EXISTS clause on the columns that make up my composite primary key. If I had an external tracking mechanism to figure out which rows had already been imported, I could skip the WHERE NOT EXISTS clause when moving from the TEMP to the MAIN table.

    Is this worth prototyping?
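
    One possible shape for that tracking mechanism, purely as a sketch (the table and column names are invented): keep a high-water mark per log file and only move staging rows past it, so the probe against the big table goes away.

    CREATE TABLE dbo.ImportWatermark
    (
        LogFile      nvarchar(260) NOT NULL PRIMARY KEY,
        LastDateTime datetime      NOT NULL,
        LastSequence int           NOT NULL
    );

    -- After each import, record how far this file has been loaded, then on the
    -- next pass filter the staging rows instead of probing UserActivity:
    --   WHERE s.[date-time] > @LastDateTime
    --      OR (s.[date-time] = @LastDateTime AND s.[seq-number] > @LastSequence)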

  • While trying to learn more about SQL optimization I discovered that SQL can actually show you the execution plan - That's so cool 🙂 So, let's talk real numbers here:

    Looking at the execution plan for my SP's insert, I have 2 big costs, both of which occur in the step that moves the data from the temp table to the main table.

    39% x 1 Primary Key - Clustered Index Insert

    14% x 6 (one for each index)

    Although the Index Insert itself shows 0% cost, the dependent operation, "Table Spool", is 14% for each index.

    I don't know what the second one means, but the first one got me thinking.

    I have a composite primary key made up of:

    DateTime, SessionId, SessionFrame, UserId, Operation

    My thinking is that none of these are unique by themselves. It is very common to have 10-20 rows with the same DateTime, SessionId starts back at zero every time the service restarts so repeats will occur often, SessionFrame gets reset every time a user starts a session, etc. Basically, I need all of them to ensure uniqueness, but I just had an idea.

    1) Remove the existing primary key

    2) Create an auto-increment primary key

    3) Rely on existing import to check for uniqueness

    I'm already essentially doing a uniqueness check with my WHERE NOT EXISTS clause, in which I make sure the columns above don't match rows that already exist in the main table. So, why not get rid of the composite primary key?

    Worth prototyping?

  • You can certainly change the PK to an identity column and, if you can guarantee that duplicates across what is currently the PK will be kept out by the process, then you can probably remove that index altogether, although I'd likely keep it as a catch-all. In fact, you can keep the PK on the original columns and just change it to be a non-clustered index. Use the clustered index on the identity column. That will certainly prevent any page splits on the clustered index.
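
    A sketch of that index change against the posted table, assuming the original column names (the new index and column names are illustrative). Note that dropping and re-creating a clustered index on a large table is a heavy one-time operation, since the non-clustered indexes are rebuilt as well:

    -- Drop the composite clustered PK, add an identity, cluster on the identity,
    -- and re-add the composite key as a non-clustered PK to act as the catch-all.
    ALTER TABLE dbo.UserActivity DROP CONSTRAINT PK_UserActivity;

    ALTER TABLE dbo.UserActivity ADD UserActivityId bigint IDENTITY(1,1) NOT NULL;

    CREATE UNIQUE CLUSTERED INDEX CIX_UserActivity_UserActivityId
        ON dbo.UserActivity (UserActivityId);

    ALTER TABLE dbo.UserActivity ADD CONSTRAINT PK_UserActivity
        PRIMARY KEY NONCLUSTERED
        (DateTimeStamp, SessionId, SequenceNumber, UserId, ActivityType);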

    [font="Arial Black"]I'd also recommend that you shouldn't be checking to see if a row already exists during the final insert. Rather, you should have a marker column in the staging table to identify the rows first... of course, no index required on that, either.[/font]

    The way I do these types of things is I BULK INSERT into a temporary staging table, add indexes to the temp table if and only if I really need them, do any preprocessing/validation including merge row marking, and then do a simple insert/update pair to do the merge based on the flags. The final table has an IDENTITY column as a clustered index to make the inserts fast, and it has a non-clustered PK on the stuff I want to be guaranteed to be unique.
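
    A rough sketch of the marker-column idea against the posted key columns (the IsNewRow column is hypothetical and the final column list is abbreviated): flag the existing rows once, then insert only the rows still flagged as new, so the big table is probed in a single set-based pass.

    -- Assumes #Staging was created with an extra marker column: IsNewRow bit NOT NULL DEFAULT 1.
    UPDATE s
    SET    IsNewRow = 0
    FROM   #Staging AS s
    JOIN   dbo.UserActivity AS ua
           ON  ua.SessionId      = s.[session-id]
           AND ua.SequenceNumber = s.[seq-number]
           AND ua.DateTimeStamp  = s.[date-time];

    -- Only rows still flagged as new go into the final table (column list abbreviated).
    INSERT INTO dbo.UserActivity (DateTimeStamp, SessionId, SequenceNumber /* , ... */)
    SELECT s.[date-time], s.[session-id], s.[seq-number] /* , ... */
    FROM   #Staging AS s
    WHERE  s.IsNewRow = 1;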

    A 10MB file is nothing size-wise... you already have enough performance problems trying to import them. Wait until the log file rolls over and then import it, instead of trying over and over as you say. And, like I said, 10MB files are nothing for size... there may be some other problem that we're simply missing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
