Merge performance on Very Large Tables

  • To all you DB/SQL experts out there - I hope you can shed some light on my problems.

    I'll try to keep it short.

    The DB has a very large table: about 60 million rows, 11 GB of data, 2.9 GB of index.

    Each day, new data has to be merged into the table. I do that in 64 MB batches, which I bulk insert into a temporary table and then merge into the large target table.

    Each of those batches contains roughly half a million rows, of which some will be new to the db (resulting in an insert)

    and some already exist (resulting in an update).

    I do the operation above via a MERGE joined on 2 indexed bigint columns (which are the same in the temporary table as well as in the large target table).

    The issue is simply that the time the MERGE statement takes to complete grows as the DB grows, and I don't mean by a little bit; it grows a LOT.

    Starting with an empty DB, a merge as described above takes about 11 seconds. Now, with the size as described above, it takes over 10 minutes(!).

    I'm aware I can't keep insert times constant, but this is a vast drop. I have already put the table onto its own filegroup spanning 4 disks; no real difference.

    I found out that it seems to be the inserts of new rows within the merge which cause the performance issue.

    When I run the same batch twice, or any other batch which I know I have already processed before, it only takes about 2 minutes (as all records already exist in the DB and therefore only updates need to be performed).

    I also set the database file to autogrow (by 10%) so it doesn't constantly grow in 1 MB steps.

    The recovery model is set to SIMPLE to keep the log file small.
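
    For reference, those two settings were applied roughly along these lines (the database and file names here are placeholders, not the real ones):

    ALTER DATABASE MyUrlDb SET RECOVERY SIMPLE;
    ALTER DATABASE MyUrlDb MODIFY FILE ( NAME = MyUrlDb_Data, FILEGROWTH = 10% );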

    I also checked the query plan and no table scans are done, but the query plan looks very complex (for a fairly simple merge).

    The most expensive operators are these 3: "Table Spool", "RID Lookup" and "Index Seek".

    It's running on a fairly beefy machine (3 GHz, 8 GB RAM).

    Am I doing something fundamentally wrong, or do I simply have to accept that SQL becomes very slow at this size? (I rather believe it's the former.)

    I mean, adding 500k new rows to a 60-million-row table shouldn't take 11 minutes; it's only about 60 MB of data to add.

    Any ideas?

    I look forward to your responses

    ZL

  • I would take a look at your execution plans from the merge statement. Something is causing this to slow down. You're working in a very small database, so you shouldn't be seeing this much of a performance issue. Also, those indexes sound a bit big. Do you have clustered indexes on the table(s) in question?

    It's hard to say more without seeing the structure, the code, and the execution plans for myself.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • zerolan22 (1/14/2011)


    To all you DB/SQL experts out there - I hope you can shed some light on my problems.

    It sounds from your description as if it may be the indexes causing the problem on INSERTs, because they ALL need to be updated during the insert of new rows.

    But, I'm just guessing because we don't have enough information. Take a look at the second link in my signature line below. If you can provide information in that fashion, we'll likely be able to help more. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hey Guys,

    Thank you for replying.

    I shall try to get together all the information you asked for.

    Target (large table) definition: 10 columns, int, int, datetime, datetime, varchar(1024), varchar(512), bigint, bigint, bigint, bigint

    A unique non-clustered index on 2 of the bigint columns (combined into one index, ascending).
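
    In DDL terms, that original index looked roughly like this (the index name is a placeholder):

    CREATE UNIQUE NONCLUSTERED INDEX IX_FullURLs_URLEEx
        ON FullURLs ( URLEExLo ASC, URLEExHi ASC );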

    The input table is a temporary table which is loaded via a bulk insert operation from a 64 MB CSV-style file. This is quite quick (1-3 seconds) and results in about 500,000 rows in the temp table.

    Definition: int, varchar(128), varchar(512), varchar(512), varchar(32), char(32), bigint, bigint

    After the bulk insert, a primary key on the 2 bigints (combined) is created to aid the following merge. That's also done in a second or two.
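
    Roughly, the load step looks like this (the file path, format options and constraint name below are placeholders, not the real values):

    -- load the 64 MB batch file into the temp table (path and options are placeholders)
    BULK INSERT #tmpimport
        FROM 'D:\batches\batch_001.csv'
        WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK );

    -- then key the temp table on the two bigints to aid the merge
    ALTER TABLE #tmpimport
        ADD CONSTRAINT PK_tmpimport PRIMARY KEY ( URLExLo, URLExHi );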

    The following merge statement is then executed to merge the temp table into the large table:

    merge FullURLs as target
    using #tmpimport as source
        on source.URLExLo = target.URLEExLo and source.URLExHi = target.URLEExHi
    when matched then
        update set SeenLast = CURRENT_TIMESTAMP, target.TimesSeen = target.TimesSeen + source.TimesSeen
    when not matched then
        insert ( TimesSeen, SeenFirst, SeenLast, URL, [File], URLEExLo, URLEExHi, DomainExLo, DomainExHi )
        VALUES ( 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, source.FullURL, source.[File], source.URLExLo, source.URLExHi, fnMD5ExLo( fnMD5Ex(Domain) ), fnMD5ExHi( fnMD5Ex(Domain) ) )
    output $action into #tempresult;

    URLExLo and URLExHi are the 2 bigints I mentioned earlier which I index on (combined).

    The execution plan is also attached, showing that the majority of the time (98%) is spent on the Index Seek and RID Lookups. 15 minutes (!!) for the batch I collected the attached exec plan for, which resulted in 0 inserts and 472,000+ row updates within the merge (no inserts because I had already run the same batch earlier).

    So it's slow either way, on inserts and on updates, and it becomes slower and slower the larger the table gets.

    You can skip all the other statements in the exec plan; it's only the merge which takes the whole focus.

    Is that all the information you require to make a better judgement?

    Thanks so much for looking into this with me - I'm very much looking forward to your pointers!

  • I haven't looked at the exec plan, but first off, get a clustered index on the table. Absolutely. Figure out what the right place to put it would be and get it in place. That alone will help a lot.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant, I have been there before.

    I used the 2 bigints as a combined clustered primary key with no performance improvement; rather the opposite, it was even slower on inserts. :S

  • zerolan22 (1/14/2011)


    Grant, I have been there before.

    I used the 2 bigints as a combined clustered primary key with no performance improvement; rather the opposite, it was even slower on inserts. :S

    Well, looking at the execution plan, the primary source of pain is the index seek and RID lookup on the table in question. An appropriate clustered index can eliminate this part of the process. Not to mention that inserts into a clustered index are faster than into a heap (not counting what happens when there are multiple non-clustered indexes associated with the table, but that's not the situation here). Also, data storage in a cluster is much, much more efficient than in a heap. Finally, retrieval from a cluster, a properly configured, appropriate cluster, is much, much faster than retrieval from a heap.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hey guys.

    Thanks for all your replies.

    OK, I restructured the index again and created a clustered primary key on both bigints (combined, ascending) on the temporary as well as the large target table (and dropped the old index):

    alter table FullURLs add constraint PK_EXES PRIMARY KEY CLUSTERED ( URLEExLo ASC, URLEExHi ASC );

    Repeated the batch I used above again, run time: ~9 minutes! (0 inserts, ~400k updates)

    A little faster than before (it wasn't when I tried it earlier, but anyway).

    9 minutes is still way too slow, though, to merge 400k rows into a table (and that's without inserts; with inserts this figure will be a lot higher).

    This figure gets larger with the size of the table.

    I attached the new execution plan. Other than the index, nothing else has changed. I noticed an index scan on the large table; I didn't think that was a good sign?

    ZL

  • I'm not sure why, but I get an error on the execution plan you attached after saving it and trying to open it. Since you're also using MERGE in your code, I suspect that although this is a 2K5 forum, you're actually using 2K8. That would be on my other machine, and I don't currently have access to that machine.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • In the merge operation against MalWare, why are you doing this in the select from the temp table:

    group by (FullURL + [MalwareGroup]), FullURL, [MalwareGroup]

    If you're going to all the trouble to set up a temporary table, why not ensure that it has the data you need prior to accessing it in a JOIN operation?

    Also, since you're using a temp table with hundreds of thousands of rows, I suspect the added cost of putting an index on it won't kill performance any worse than what you've got. Then, this part of the query:

    on source.GroupExLo = target.GroupnameExLo and source.GroupExHi = target.GroupnameExHi and source.URLExLo = target.URLExLo and source.URLExHi = target.URLExHi

    Might perform better.
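
    Something along these lines on the temp table might be worth trying (the index and table names below are placeholders; use whatever your temp table for that merge is actually called):

    -- possible index on the temp table to support the join (names are placeholders)
    CREATE INDEX IX_tmp_MalwareJoin
        ON #yourTempTable ( GroupExLo, GroupExHi, URLExLo, URLExHi );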

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Just for testing purposes, you may want to break out the UPDATE and INSERT into 2 different steps (doing an UPSERT the old-fashioned way). Update what's there and insert what isn't. Look at the query plans, IO and times for each step.
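
    Based on the MERGE you posted, a rough sketch of the split might look like this (same tables and columns as in your earlier post; the scalar function calls are kept as you had them):

    -- 1) update the rows that already exist in the target
    UPDATE t
       SET t.SeenLast  = CURRENT_TIMESTAMP,
           t.TimesSeen = t.TimesSeen + s.TimesSeen
      FROM FullURLs t
      JOIN #tmpimport s
        ON s.URLExLo = t.URLEExLo AND s.URLExHi = t.URLEExHi;

    -- 2) insert the rows that aren't there yet
    INSERT INTO FullURLs ( TimesSeen, SeenFirst, SeenLast, URL, [File], URLEExLo, URLEExHi, DomainExLo, DomainExHi )
    SELECT 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, s.FullURL, s.[File], s.URLExLo, s.URLExHi,
           fnMD5ExLo( fnMD5Ex(s.Domain) ), fnMD5ExHi( fnMD5Ex(s.Domain) )
      FROM #tmpimport s
     WHERE NOT EXISTS ( SELECT 1 FROM FullURLs t
                        WHERE t.URLEExLo = s.URLExLo AND t.URLEExHi = s.URLExHi );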

    Could your inserts be causing page splits? Are you always inserting at the end of the table with some sort of ordering? Check table fragmentation before and after the inserts.
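
    Fragmentation on the big table can be checked with something like this (table name taken from your posts):

    -- check index fragmentation on the large table before and after a batch
    SELECT index_id, index_type_desc, avg_fragmentation_in_percent, page_count
      FROM sys.dm_db_index_physical_stats( DB_ID(), OBJECT_ID('dbo.FullURLs'), NULL, NULL, 'LIMITED' );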

    Todd Fifield

  • Hey Guys,

    Thanks again for your suggestions.

    @Grant, you are right, the query you pointed out is sub-optimal. It didn't matter too much, though, as that table is small and the query you pointed out is not the one which takes 98% of the batch cost :S

    @tfifield - good suggestion. I will split things up into separate inserts/updates, but I "believe" it will be even slower because I still need to query the big target table first in order to figure out which data is already present and which is not. But it's worth a try.

    Re page splits, I had this in mind as well, funnily enough. I tried having the clustered index on an identity ID column (roughly the layout sketched below), which (I believe, correct me if I'm wrong) forces new rows to be inserted at the end of the table. However, this appeared to be slower than having the clustered index on the 2 bigints instead, which most likely results in page splits, as the key values are random and rows can be inserted anywhere in the table.
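
    For clarity, the identity variant I tried looked roughly like this (the column, constraint and index names here are from memory / placeholders):

    -- identity column as the clustered key; the bigint pair kept as a unique non-clustered index
    ALTER TABLE FullURLs ADD ID bigint IDENTITY(1,1) NOT NULL;
    ALTER TABLE FullURLs ADD CONSTRAINT PK_FullURLs_ID PRIMARY KEY CLUSTERED ( ID );
    CREATE UNIQUE NONCLUSTERED INDEX IX_FullURLs_URLEEx ON FullURLs ( URLEExLo, URLEExHi );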

    ZL

  • ZL,

    One of the reasons I suggested splitting up the merge operations was to establish without a doubt what exactly was taking the most time. In my experience, inserting is usually a faster operation than updating, especially when the insert is being done on a table with an identity as the clustering key.

    Another reason to split up the operations is to be able to batch the inserts into smaller batches (something like the loop sketched below). This sometimes gives better performance and gives a little relief to the transaction log.
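
    As a rough sketch of what I mean (the batch size of 50,000 is arbitrary; table and column names are taken from the earlier posts):

    -- insert the new rows in chunks instead of all at once
    DECLARE @rows int;
    SET @rows = 1;
    WHILE @rows > 0
    BEGIN
        INSERT TOP (50000) INTO FullURLs ( TimesSeen, SeenFirst, SeenLast, URL, [File], URLEExLo, URLEExHi, DomainExLo, DomainExHi )
        SELECT 1, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, s.FullURL, s.[File], s.URLExLo, s.URLExHi,
               fnMD5ExLo( fnMD5Ex(s.Domain) ), fnMD5ExHi( fnMD5Ex(s.Domain) )
          FROM #tmpimport s
         WHERE NOT EXISTS ( SELECT 1 FROM FullURLs t
                            WHERE t.URLEExLo = s.URLExLo AND t.URLEExHi = s.URLExHi );
        SET @rows = @@ROWCOUNT;
    END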

    Todd Fifield

  • A few observations:

    I see a call to what I presume is a scalar function in the insert: fnMD5ExLo. Maybe it would be better to do that outside the merge (for instance by pre-computing the values in the temp table, as sketched below); I am sure it has a negative impact on the query performance.
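
    A rough sketch of what I mean, using the column names from your merge (I'm assuming the temp table can simply carry two extra bigint columns):

    -- pre-compute the hash values in the temp table once, before the merge
    ALTER TABLE #tmpimport ADD DomainExLo bigint, DomainExHi bigint;

    UPDATE #tmpimport
       SET DomainExLo = fnMD5ExLo( fnMD5Ex(Domain) ),
           DomainExHi = fnMD5ExHi( fnMD5Ex(Domain) );

    -- the merge's INSERT branch can then use source.DomainExLo / source.DomainExHi directly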

    Also, the query plan suggests an index. Have you tried creating it?

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
        ON [dbo].[#tmpimport] ([MalwareGroup])
        INCLUDE ([FullURL]);

    I do not know if it will work, but it should be interesting...

  • Try letting the Database Tuning Advisor analyze a few of these batch loads and see what it recommends.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
