108,000,000 row table - check for matches

  • Good afternoon,

    Long time reader - first time poster here!

    I am currently working on a 108m-row table where some of the ids (TOID column) have inadvertently been shortened - they should all be 16 chars in length, e.g. 0001000000076579

    0001000000104566

    0001000000159225

    0001000000250870

    0001000000263722

    0001000000303305

    0001000000303309

    but some are 13 chars in length. I think that some are missing '100' at the beginning and some are missing '500'. I have an update table, and I have to find matches between it and the big daddy table based on this id (called TOID).

    I need to find out which rows in the update table don't match rows in the big daddy table. I can join on the matches, no problem, but searching for those matches which are a concatenation of '100'/'500' + TOID makes the query run very, very slowly.

    My code is quite simple:

    UPDATE DUNCANTEMP
    SET DOESEXIST = 1
    WHERE EXISTS (SELECT TOID
                  FROM [OS_MasterMap].dbo.OS_MM_AREA OMA WITH (NOLOCK)
                  WHERE ('500' + OMA.Toid = DUNCANTEMP.TOID
                         OR '100' + OMA.Toid = DUNCANTEMP.TOID)
                    AND LEN(OMA.Toid) = 13)

    Where DUNCANTEMP consists of:

    CREATE TABLE dbo.DUNCANTEMP(
        TOID VARCHAR(50) NULL,
        DOESEXIST INT NULL
    ) ON [PRIMARY]

    The tables are static and in a dev environment.

    And [OS_MasterMap].dbo.OS_MM_AREA has loads of columns, but TOID is the only one we need to look at:

    CREATE TABLE dbo.OS_MM_AREA(
        OBJECTID INT IDENTITY,
        SHAPE GEOMETRY NOT NULL,
        Toid NVARCHAR(20) NOT NULL,
        Version INT NULL,
        VerDate DATETIME NULL,
        FeatCode INT NULL,
        Theme NVARCHAR(80) NULL,
        CalcArea NUMERIC(38, 8) NULL,
        Change NVARCHAR(80) NULL,
        DescGroup NVARCHAR(150) NULL,
        DescTerm NVARCHAR(150) NULL,
        Make NVARCHAR(20) NULL,
        PhysLevel INT NULL,
        PhysPres NVARCHAR(20) NULL,
        Broken SMALLINT NULL,
        LoadDate DATETIME NULL,
        CONSTRAINT PK__OS_MM_AR__F4B70D851D5924B0 PRIMARY KEY (OBJECTID),
        CONSTRAINT UQ__OS_MM_AR__FF378992EC7F35C5 UNIQUE (Toid)
    ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
    GO

    There is a unique constraint on OS_MM_AREA (TOID) and a non-clustered index.

    There is a clustered index on TOID on the DUNCANTEMP table (but not a unique one, as there may be some duplicate TOIDs based on multiple update rows).

    My question is: my code has been running for about 14 hours now. There are 33,934 rows in DUNCANTEMP and 107,913,558 rows in the OS_MM_AREA table - my code will be scanning the big table for matches for each row.

    Can anyone think of a better and quicker way of finding those rows in DUNCANTEMP which don't match any records in OS_MM_AREA, given a direct match or a '100'/'500' + TOID match as well?

    I welcome suggestions, please.

    (If it doesn't complete over the weekend, I will add two more columns to the big daddy table with '500' + TOID and '100' + TOID, index them, and run the code using a join - but I was interested in whether I had fundamentally missed something first.)

    thank you kindly

    D.

    Read this fantastic article carefully. It may help you:

    http://www.sqlservercentral.com/articles/T-SQL/62867/

  • SrcName (9/20/2013)


    Read this fantastic article carefully. It may help you:

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    SrcName - thank you for the link; I have read this before and am familiar with the numbers/tally table. I'm not sure, though, how it helps me here - doesn't the lack of a SARGable WHERE clause cause a scan whatever I do?

    Sorry, I haven't used the tally table enough to see its application here.

    I would be very grateful if you could elaborate please.

    thank you

    D.

  • The thing to do is try to avoid the scan of the big table; give it something simple to look for.

    I don't offer any guarantee, but something like this might work:

    update DUNCANTEMP set DOESEXIST = 1
    where (TOID like '100%' or TOID like '500%')  -- parentheses needed: AND binds more tightly than OR
    and exists (select 1 from OS_MM_AREA OMA with (NOLOCK)
                where OMA.TOID = RIGHT(DUNCANTEMP.TOID, 13))

    That avoids applying a concatenation to OMA.TOID, so it should be able to use the index for the existence test, which avoids scanning the big table. It may of course do a scan on DUNCANTEMP, but that's unavoidable.

    Anyway, for future reference: please remember to post an actual execution plan (if possible; estimated plan otherwise) with queries about improving performance.

    Tom

    Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be anything.

    We have keys that should be 1001231231231231 or 5007897897897897, but in fact are only 1231231231231 and 7897897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.

    What Duncan could do is to insert all 13-character keys into a temp table; that way the big table only needs to be scanned once. That would still take a long time, but it should not take the entire weekend.
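    A minimal sketch of that temp-table idea (untested; the #ShortToids name and its index are invented for illustration):

    -- Stage the 13-character keys once, so the 108M-row table is scanned a single time
    SELECT Toid
    INTO   #ShortToids
    FROM   [OS_MasterMap].dbo.OS_MM_AREA WITH (NOLOCK)
    WHERE  LEN(Toid) = 13;

    CREATE CLUSTERED INDEX IX_ShortToids ON #ShortToids (Toid);

    -- The prefix test runs on the small update table; the staging table is
    -- probed with a plain equality, which the new index can seek on
    UPDATE d
    SET    d.DOESEXIST = 1
    FROM   dbo.DUNCANTEMP AS d
    WHERE  LEFT(d.TOID, 3) IN ('100', '500')
      AND  EXISTS (SELECT 1
                   FROM   #ShortToids AS s
                   WHERE  s.Toid = RIGHT(d.TOID, 13));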

    Erland Sommarskog, SQL Server MVP, www.sommarskog.se

  • Erland Sommarskog (9/21/2013)


    Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be anything.

    We have keys that should be 1001231231231231 or 5007897897897897, but in fact are only 1231231231231 and 7897897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.

    What Duncan could do is to insert all 13-character keys into a temp table; that way the big table only needs to be scanned once. That would still take a long time, but it should not take the entire weekend.

    Thank you all for your input - I have failed to explain the problem correctly: it is the large table which has the mix of 13- and 16-char TOIDs. Those are the ones I am looking to find by adding '100' + TOID and '500' + TOID to my search and seeing if any records in the smaller table match. I will add two more columns to the bigger table, fill them with '100' + TOID and '500' + TOID, create an index, and see if I get any matches.

    I think that any other option will inevitably produce a table scan on the large table for each row of the smaller table, so this is the way I am going now.

    thanks anyway..

    D.

  • duncanburtenshaw (9/23/2013)



    Thank you all for your input - I have failed to explain the problem correctly: it is the large table which has the mix of 13- and 16-char TOIDs. Those are the ones I am looking to find by adding '100' + TOID and '500' + TOID to my search and seeing if any records in the smaller table match. I will add two more columns to the bigger table, fill them with '100' + TOID and '500' + TOID, create an index, and see if I get any matches.

    I think that any other option will inevitably produce a table scan on the large table for each row of the smaller table, so this is the way I am going now.

    thanks anyway..

    D.

    I'd be interested to know how many rows there have to be in the small table before the code I used produces a scan of the large table. In theory the optimiser should only do this when it thinks that the number of index lookups would be sufficiently large for it to be advantageous to use a scan; I find it hard to believe that something as low as roughly 0.03% of the rows implies a scan rather than individual lookups, when what is being looked up is an exact match on the 13-character string.

    Of course it might be better if the 16-character matches were eliminated first and a new test table constructed from what wasn't a 16-character match, with the first three characters discarded (and the row eliminated if they were not one of the two interesting 3-character strings) and provided with a clustered index on the string column. The check for those values is then very obviously a check for matches in the two indexes, with no messing about with appending prefixes. That would certainly take a lot less time than adding two new indexes to the big table and then checking all the possible matches: eliminating the 16-character matches and truncating what is left will be a lot cheaper than creating two new indexes on the big table (about 2,000 times fewer logical writes), and once that is done you have only one index to match against (the one you already have on the big table) instead of three.
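    A hedged sketch of that two-pass idea (untested; the #Unmatched staging table, its columns, and the index name are invented for illustration, and it assumes DOESEXIST is NULL until set):

    -- Pass 1: flag the rows that match on the full 16 characters
    UPDATE d
    SET    d.DOESEXIST = 1
    FROM   dbo.DUNCANTEMP AS d
    WHERE  EXISTS (SELECT 1
                   FROM   [OS_MasterMap].dbo.OS_MM_AREA AS o WITH (NOLOCK)
                   WHERE  o.Toid = d.TOID);

    -- Build a small test table from what is left, with the prefix checked and discarded
    SELECT d.TOID, RIGHT(d.TOID, 13) AS ShortToid
    INTO   #Unmatched
    FROM   dbo.DUNCANTEMP AS d
    WHERE  d.DOESEXIST IS NULL
      AND  LEFT(d.TOID, 3) IN ('100', '500');

    CREATE CLUSTERED INDEX IX_Unmatched ON #Unmatched (ShortToid);

    -- Pass 2: probe the existing index on the big table with the stripped keys
    UPDATE d
    SET    d.DOESEXIST = 1
    FROM   dbo.DUNCANTEMP AS d
    JOIN   #Unmatched AS u ON u.TOID = d.TOID
    WHERE  EXISTS (SELECT 1
                   FROM   [OS_MasterMap].dbo.OS_MM_AREA AS o WITH (NOLOCK)
                   WHERE  o.Toid = u.ShortToid);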

    Tom

  • L' Eomot Inversé (9/23/2013)



    I'd be interested to know how many rows there have to be in the small table before the code I used produces a scan of the large table. In theory the optimiser should only do this when it thinks that the number of index lookups would be sufficiently large for it to be advantageous to use a scan; I find it hard to believe that something as low as roughly 0.03% of the rows implies a scan rather than individual lookups, when what is being looked up is an exact match on the 13-character string.

    Of course it might be better if the 16-character matches were eliminated first and a new test table constructed from what wasn't a 16-character match, with the first three characters discarded (and the row eliminated if they were not one of the two interesting 3-character strings) and provided with a clustered index on the string column. The check for those values is then very obviously a check for matches in the two indexes, with no messing about with appending prefixes. That would certainly take a lot less time than adding two new indexes to the big table and then checking all the possible matches: eliminating the 16-character matches and truncating what is left will be a lot cheaper than creating two new indexes on the big table (about 2,000 times fewer logical writes), and once that is done you have only one index to match against (the one you already have on the big table) instead of three.

    Thank you for your comments. Perhaps I am completely missing the point. Let me reiterate:

    1. I have a large table (let's call it BIGa).

    2. I have a smaller table (let's call it Smallb).

    3. The BIGa table has 108,000,000 rows. Out of these, approx 27,000,000 rows have a TOID (unique number) which is 13 chars in length instead of 16 (the length of all the other rows).

    4. I have a small update table (Smallb) with 370,000-odd rows. None of the rows in Smallb are 13 chars long; they are all the correct length of 16 chars.

    5. I am trying to match any rows from Smallb with a matching row in BIGa based on any TOIDs in BIGa which are 13 chars long, where the row matches on '100' + TOID or '500' + TOID.

    Based on the assumptions above, surely all rows in the big table with LEN(TOID) = 13 (about 27,000,000 rows) will be scanned?

    This was my original question: apart from adding a couple of columns persisting '100' + TOID and '500' + TOID, indexing them, and then looking for a match, was there a better way of seeing if there are any matches between Smallb and BIGa than the code I originally quoted?

    I have, quite obviously, failed completely to explain this problem satisfactorily and for that I apologise.

    I have added the new columns on BIGa now with '500' + TOID and '100' + TOID, and will check for any direct matches after this has completed - I just wanted to know if I was missing anything which would have saved me the time of creating these columns and would allow me to detect possible matches.
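    For reference, a minimal sketch of that workaround (untested; the Toid100/Toid500 column names and the index names are invented for illustration):

    -- Persist the two prefixed variants for the short keys and index them
    ALTER TABLE [OS_MasterMap].dbo.OS_MM_AREA
        ADD Toid100 NVARCHAR(23) NULL,
            Toid500 NVARCHAR(23) NULL;

    UPDATE [OS_MasterMap].dbo.OS_MM_AREA
    SET    Toid100 = '100' + Toid,
           Toid500 = '500' + Toid
    WHERE  LEN(Toid) = 13;

    CREATE NONCLUSTERED INDEX IX_OMA_Toid100 ON [OS_MasterMap].dbo.OS_MM_AREA (Toid100);
    CREATE NONCLUSTERED INDEX IX_OMA_Toid500 ON [OS_MasterMap].dbo.OS_MM_AREA (Toid500);

    -- The existence test then becomes two plain equality probes,
    -- each of which can seek one of the new indexes
    UPDATE d
    SET    d.DOESEXIST = 1
    FROM   dbo.DUNCANTEMP AS d
    WHERE  EXISTS (SELECT 1
                   FROM   [OS_MasterMap].dbo.OS_MM_AREA AS o
                   WHERE  o.Toid100 = d.TOID OR o.Toid500 = d.TOID);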

    thank you for all your suggestions - next time I will attach execution plans as well.

    D.

  • What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?
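    A minimal sketch of that suggestion (untested; the TOID13 column and index names are invented):

    -- Add the stripped 13-character key to the small table and index it
    ALTER TABLE dbo.DUNCANTEMP ADD TOID13 VARCHAR(13) NULL;

    UPDATE dbo.DUNCANTEMP SET TOID13 = RIGHT(TOID, 13);

    CREATE NONCLUSTERED INDEX IX_DUNCANTEMP_TOID13 ON dbo.DUNCANTEMP (TOID13);

    -- The match is now a plain equality probe against the existing
    -- unique index on OS_MM_AREA.Toid
    UPDATE d
    SET    d.DOESEXIST = 1
    FROM   dbo.DUNCANTEMP AS d
    WHERE  EXISTS (SELECT 1
                   FROM   [OS_MasterMap].dbo.OS_MM_AREA AS o WITH (NOLOCK)
                   WHERE  o.Toid = d.TOID13);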

    B

  • bleroy (9/23/2013)


    What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?

    B

    ooooh.. good idea! 🙂

    Exactly what I was looking for.. of course, I could just get 13 chars from the little table and see if there are any matches in the big table based on this.

    can't see the wood for the trees.. I will try this, thank you for the suggestion, seems so obvious now.

    D.

  • bleroy (9/23/2013)


    What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?

    B

    I wish I'd said it in those words instead of adding more detail (like not looking at the trailing 13 characters if the first 3 were not one of the two specified triples, and not including things that were matched on all 16 characters because they would already have been counted as matched). As I said, I was trying to simplify the query; I should have kept the description simple as well.

    Tom

  • L' Eomot Inversé (9/23/2013)



    I wish I'd said it in those words instead of adding more detail (like not looking at the trailing 13 characters if the first 3 were not one of the two specified triples, and not including things that were matched on all 16 characters because they would already have been counted as matched). As I said, I was trying to simplify the query; I should have kept the description simple as well.

    Thank you for your valuable contribution - this forum is fantastic! Best regards, D.

    A small addition in case it was not clear: add that column as a computed column to the table and index it. That saves some space over adding it as a regular column.
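    For instance, as a sketch (same illustrative TOID13 name as above, defined as a computed column instead of a regular one):

    -- The computed value is materialised in the index, not in the base rows
    ALTER TABLE dbo.DUNCANTEMP ADD TOID13 AS RIGHT(TOID, 13);

    CREATE NONCLUSTERED INDEX IX_DUNCANTEMP_TOID13 ON dbo.DUNCANTEMP (TOID13);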

    Erland Sommarskog, SQL Server MVP, www.sommarskog.se
