108,000,000 row table - check for matches

Question

Post reply

108,000,000 row table - check for matches

duncanburtenshaw

Ten Centuries

Points: 1323
More actions
September 20, 2013 at 10:00 am

#395533

Good afternoon,
Long time reader - first time poster here!
I am currently working on a 108m row table where some of the ids (TOID column) have inadvertently been shortened - they should all be 16chars in length, i.e. 0001000000076579
0001000000104566
0001000000159225
0001000000250870
0001000000263722
0001000000303305
0001000000303309
but some are 13 in length. I think that some are missing a 100 at the beginning and some are missing 500. I have an update table (where I have to find matches between this table and the big daddy table based on this id (called TOID))
I need to find out which rows in the update table don't match rows in the big daddy table - I can join on the matches, no problem, but when it comes to searching for those matches which are a concatenation of 100/500<TOID> this makes the query run very very slowly.
my code is quite simply:
UPDATE DUNCANTEMP
SET
DOESEXIST = 1
WHERE
EXISTS (SELECT TOID
FROM
[OS_MasterMap].dbo.OS_MM_AREA OMA WITH (NOLOCK)
WHERE
('500' + OMA.Toid = DUNCANTEMP.TOID
OR
'100' + OMA.Toid = DUNCANTEMP.TOID)
AND
len(OMA.Toid) = 13)
Where DUNCANTEMP consists of:
CREATE TABLE dbo.DUNCANTEMP(
TOID VARCHAR(50) NULL,
DOESEXIST INT NULL
) ON [PRIMARY]
The tables area static and in a dev environment.
And [OS_MasterMap].dbo.OS_MM_AREA has loads of columns, but TOID is the only one we need to look at:
CREATE TABLE dbo.OS_MM_AREA(
OBJECTID INT IDENTITY,
SHAPE GEOMETRY NOT NULL,
Toid NVARCHAR(20) NOT NULL,
Version INT NULL,
VerDate DATETIME NULL,
FeatCode INT NULL,
Theme NVARCHAR(80) NULL,
CalcArea NUMERIC(38, 8) NULL,
Change NVARCHAR(80) NULL,
DescGroup NVARCHAR(150) NULL,
DescTerm NVARCHAR(150) NULL,
Make NVARCHAR(20) NULL,
PhysLevel INT NULL,
PhysPres NVARCHAR(20) NULL,
Broken SMALLINT NULL,
LoadDate DATETIME NULL,
CONSTRAINT PK__OS_MM_AR__F4B70D851D5924B0 PRIMARY KEY (OBJECTID),
CONSTRAINT UQ__OS_MM_AR__FF378992EC7F35C5 UNIQUE (Toid)
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
There is a unique constraint on OS_MM_AREA (TOID) and a non-clustered index.
there is a clustered index on TOID on the DUNCANTEMP table (but not unique as there may be some duplicate TOIDs based on multiple update rows).
My question is: my code has been running for about 14hrs now.. there are 33934 rows in Duncantemp and 107913558 rows in the OS_MM_AREA table - my code will be scanning the big table for matches for each row..
can anyone think of a better and quicker way of finding those rows in DUNCANTEMP which don't match any records in OS_MM_AREA given a direct match or +100/+500<TOID> match as well?
I welcome suggestions please.
(if it doesn't complete over the weekend, I will add 2 more columns to the big daddy table with +500 and +100<TOID>, index them, and run the code using a join.. but I was interested if I had fundamentally missed something first
thank you kindly
D.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

SrcName Default port Points: 1440 More actions · Answer 1

read this fantastic article carefully. This may help you

http://www.sqlservercentral.com/articles/T-SQL/62867/

duncanburtenshaw Ten Centuries Points: 1323 More actions · Answer 2

SrcName (9/20/2013)

read this fantastic article carefully. This may help you

http://www.sqlservercentral.com/articles/T-SQL/62867/%5B/quote%5D

SrcName - thank you for the link, I have read this before and am familiar with the numbers/tally table. I'm not sure though how it helps me here, does not having a SARGABLE where clause cause a scan whatever I do?

Sorry, I haven't used the tally table enough to see its application here.

I would be very grateful if you could elaborate please.

thank you

D.

TomThomson SSC Guru Points: 104773 More actions · Answer 3

The thing to do is try to avoid the scan of the big table; give it something simple to look for.

I don't offer any guarantee, but something like this might work:

update DUNCANTEMP set DOESEXIST = 1

where TOID like '100%' or TOID like '500%

and exists (select 1 from OS_MM_AREA OMA with (NOLOCK)

where OMA.TOID = RIGHT(DUNCANTEMP.TOID,13))

That avoids applying a concatenation to OMA.TOID, so it should be able to use the index for the existence test, which avoids scanning teh big table. It may of course do a scan on DUNCANTEMP, but that's unavoidable.

Anyway, for future reference: please remember to post an actual execution plan (if possible; estimated plan otherwise) with queries about improving performance.

Tom

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 4

Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be any.

We have keys that should be 1001231231231 or 1007897897897, but in fact are only 1231231231 and 7897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.

What Duncan could do is to insert all 13-character keys into a temp table, in this way he only needs to scan the big table once. Which still takes a long time, but it should not take the entire weekend.

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

duncanburtenshaw Ten Centuries Points: 1323 More actions · Answer 5

Erland Sommarskog (9/21/2013)
Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be any.
We have keys that should be 1001231231231 or 1007897897897, but in fact are only 1231231231 and 7897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.
What Duncan could do is to insert all 13-character keys into a temp table, in this way he only needs to scan the big table once. Which still takes a long time, but it should not take the entire weekend.

Thank you all for your inputs - I have failed to explain the problem correctly, it is the large table which has a mix of 13 and 16 char TOIDs, these are the ones I am looking to find by adding '100'+toid and '500'+toid to my search and seeing if any records in the smaller table match - I will add two more columns to the bigger table, add in the '100'+toid and '500'+toid to these, create an index and see if I get any matches

I think that any other option will inevitably produce a table scan on the large table for each row of the smaller table, so this is the way I am going now.

thanks anyway..

D.

TomThomson SSC Guru Points: 104773 More actions · Answer 6

duncanburtenshaw (9/23/2013)
Erland Sommarskog (9/21/2013)
Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be any.
We have keys that should be 1001231231231 or 1007897897897, but in fact are only 1231231231 and 7897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.
What Duncan could do is to insert all 13-character keys into a temp table, in this way he only needs to scan the big table once. Which still takes a long time, but it should not take the entire weekend.
Thank you all for your inputs - I have failed to explain the problem correctly, it is the large table which has a mix of 13 and 16 char TOIDs, these are the ones I am looking to find by adding '100'+toid and '500'+toid to my search and seeing if any records in the smaller table match - I will add two more columns to the bigger table, add in the '100'+toid and '500'+toid to these, create an index and see if I get any matches
I think that any other option will inevitably produce a table scan on the large table for each row of the smaller table, so this is the way I am going now.
thanks anyway..
D.

I'd be interested to know how many rows there have to be in the small table before the code i used produces a scan of the large table. In theory the optimiser should only do this when it thinks that the number of index lokups would be sufficientl large for it to be advantageous touse a scan; I find it hard to believe that something as low as .000324% of the rows implies a scan rather than individual lookups when what is being looked up is an exact match of the 13 character string.

Of course it might be better if the 16 character matches were elimnated first and a new test table constructed from what wasn't a 16-character match, with the first three characters discarded (and the row eliminated if they were not one of the two interesting 3-character strings) and provided with a clustered index on the string column, because then the check for those values is very obviously a check for matches in the two indexes with no mesing about with appending prefixes - that would certainly take a lot less time than adding two new indexes to the big table and then checking all the possible matches, as eliminating the 16 character matches and truncating what is left will be a lot cheaper than creating two new indexes on the big table (about 2000 times fewer logical writes) and once you've done that you have only 1 index (the one you already have on the big table) to match against instead of 3.

Tom

duncanburtenshaw Ten Centuries Points: 1323 More actions · Answer 7

L' Eomot Inversé (9/23/2013)
duncanburtenshaw (9/23/2013)
Erland Sommarskog (9/21/2013)
Before we start talking about sargable arguments, maybe we should see if there is anything to sarg for. From what Duncan says, there does not seem to be any.
We have keys that should be 1001231231231 or 1007897897897, but in fact are only 1231231231 and 7897897897. Unless there is a pattern for the beginning of these short keys, there is nothing to sarg for.
What Duncan could do is to insert all 13-character keys into a temp table, in this way he only needs to scan the big table once. Which still takes a long time, but it should not take the entire weekend.
Thank you all for your inputs - I have failed to explain the problem correctly, it is the large table which has a mix of 13 and 16 char TOIDs, these are the ones I am looking to find by adding '100'+toid and '500'+toid to my search and seeing if any records in the smaller table match - I will add two more columns to the bigger table, add in the '100'+toid and '500'+toid to these, create an index and see if I get any matches
I think that any other option will inevitably produce a table scan on the large table for each row of the smaller table, so this is the way I am going now.
thanks anyway..
D.
I'd be interested to know how many rows there have to be in the small table before the code i used produces a scan of the large table. In theory the optimiser should only do this when it thinks that the number of index lokups would be sufficientl large for it to be advantageous touse a scan; I find it hard to believe that something as low as .000324% of the rows implies a scan rather than individual lookups when what is being looked up is an exact match of the 13 character string.
Of course it might be better if the 16 character matches were elimnated first and a new test table constructed from what wasn't a 16-character match, with the first three characters discarded (and the row eliminated if they were not one of the two interesting 3-character strings) and provided with a clustered index on the string column, because then the check for those values is very obviously a check for matches in the two indexes with no mesing about with appending prefixes - that would certainly take a lot less time than adding two new indexes to the big table and then checking all the possible matches, as eliminating the 16 character matches and truncating what is left will be a lot cheaper than creating two new indexes on the big table (about 2000 times fewer logical writes) and once you've done that you have only 1 index (the one you already have on the big table) to match against instead of 3.

Thank you for your comments. Perhaps I am completely missing the point. Let me reiterate:

1. I have a large table (lets call it BIGa)

2. I have a smaller table, (lets call it Smallb)

3. The BIGa table has 108,000,000 rows. Out of these, approx 27,000,000 rows have a TOID (unique number) which is 13 chars in length instead of 16 (which all of the other rows are).

4. I have a small update table (Smallb) with 370,000 odd rows. None of the rows in Smallb are 13 chars long, they are all correct length of 16 chars.

5. I am trying to match any rows from Smallb with a matching row in BIGa based on any TOIDs in BIGa which are a) 13 chars long, and where the row matches based on '100'+TOID OR '500'+toid

Based on the assumption above, surely all rows in the big table (27,000,000 rows based on len(TOID=13)) will be scanned?

This was my original question - apart from adding a couple of columns persisting the '100'+TOID and '500'+TOID /indexing them, and then looking for a match - was there a better way of seeing if there are any matches between Smallb and BIGa that the code I originally quoted.

I have, quite obviously, failed completely to explain this problem satisfactorily and for that I apologise.

I have added new columns on Biga now with '500' and '100'+TOID, and will check for any direct matches after this has completed - I just wanted to know if I was missing anything which would have saved me the time in creating these columns and would allow me to detect possible matches.

thank you for all your suggestions - next time I will attach execution plans as well.

D.

bleroy Hall of Fame Points: 3052 More actions · Answer 8

What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?

B

duncanburtenshaw Ten Centuries Points: 1323 More actions · Answer 9

bleroy (9/23/2013)
What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?
B

ooooh.. good idea! 🙂

Exactly what I was looking for.. of course, I could just get 13 chars from the little table and see if there are any matches in the big table based on this.

can't see the wood for the trees.. I will try this, thank you for the suggestion, seems so obvious now.

D.

TomThomson SSC Guru Points: 104773 More actions · Answer 10

bleroy (9/23/2013)
What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?
B

I wish I'd said it in those words instead of adding more detail (like not looking at the trailing 13 characters if the first 3 were not one of the two specified triples, and not including things that were matched on all 16 characters because they would already have been counted as matched) - as I said I was trying to simplify he query I should have kept the description simple as well.

Tom

dburtenshaw SSC Veteran Points: 220 More actions · Answer 11

L' Eomot Inversé (9/23/2013)
bleroy (9/23/2013)
What about adding 1 column to the small table that would contain RIGHT(smallTable.toid,13), create an index on that and then do your join?
B
I wish I'd said it in those words instead of adding more detail (like not looking at the trailing 13 characters if the first 3 were not one of the two specified triples, and not including things that were matched on all 16 characters because they would already have been counted as matched) - as I said I was trying to simplify he query I should have kept the description simple as well.

Thank you for your valuable contribution, this forum is fantastic! Best regards, D.

Erland Sommarskog SSC-Insane Points: 24038 More actions · Answer 12

A small addiiton in case it was not clear: add that colimn as a computed column to the able and index it. That saves some space over adding it as a regular column.

[font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]