Combining Two Very Large Tables

Question

Combining Two Very Large Tables

jamoore

SSC-Addicted

Points: 472
More actions
March 19, 2013 at 9:54 am

#291250

I have two tables that each contain approximately 90 million records. The second table represents an "update" of the second table; however, no unique identifier exists within either population to identify which records in the new table are added or changed from the previous table. Each table contains the following fields:
1. First Name
2. Middle Name
3. Last Name
4. Social Security Number
5. Birth Date
7. ID (unique to the table, but not across tables)
The SSN is unique in both tables; however, an SSN may be revised for a distinct record from one table to the next based on updated information, etc. that is received. We currently have the following indexes assigned to each table:
1. Social Security Number (Clustered Index)
2. First Name, Middle Name, Last Name, Birth Date, ID (Non-Clustered Indexes)
I am trying to identify the most efficient way to:
1. Combine the tables, and;
2. De-duplicate records within the combined table based on a combination of different fields (e.g., same SSN; same first name, last name, birth date; etc.)
We've accomplished #1 in the past by partitioning each table into chunks based on SSN and then unioning those chunks. For example:
SELECT Social_Security_Number, Last_Name, First_Name, Middle_Name, Death_Date, Birth_Date
INTO ##tbl_20110901_FullFile_00_04
FROM tbl_20110901_FullFile
WHERE LEFT(Social_Security_Number,2) IN('00','01','02','03','04')
SELECT Social_Security_Number, Last_Name, First_Name, Middle_Name, Death_Date, Birth_Date
INTO ##tbl_20120301_FullFile_00_04
FROM tbl_20120301_FullFile
WHERE LEFT(Social_Security_Number,2) IN('00','01','02','03','04')
SELECT * INTO ##tbl_Combined_00_04 FROM ##tbl_20110901_FullFile_00_04
UNION
SELECT * FROM ##tbl_20120301_FullFile_00_04
Is there a better way to accomplish this?
Thanks,
James

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

ScottPletcher SSC Guru Points: 100949 More actions · Answer 1

Maybe consider records to "match" if the SSN, first name and birth date all match. My reasoning: last name can change (marriage, etc.), and middle name may not be consistently used (sometimes I use my middle name/init, sometimes I don't).

You could do the UPDATE in batches based on the soc sec # ranges, something like below.

But you don't want to use LEFT, since that prevents indexes from being used. That's especially vital here because it's the clustered index. use LIKE '...%' instead:

UPDATE t1

SET

last_name = t2.last_name,

<column> = t2.<column>

--,...

FROM dbo.table1 t1

INNER JOIN dbo.table2 t2 ON

t2.social_security_number = t1.social_security_number AND

t2.first_name = t1.first_name AND

t2.birth_date = t1.birth_date

WHERE

t1.social_security_number LIKE '00%' AND

t2.social_security_number LIKE '00%' AND

(t1.last_name <> t2.last_name OR ISNULL(t1.<column>, <value>) <> ISNULL(t2.<column>, <value> OR ...))

/* then

WHERE

t1.social_security_number LIKE '01%' AND

t2.social_security_number LIKE '01%'

--etc.

*/

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

jamoore SSC-Addicted Points: 472 More actions · Answer 2

Scott -

Thanks for the reply. The issue is that there may be records in t1 that are not in t2, and vice-versa; therefore, I need to consolidate both tables, and then de-duplicate the population for matching records in both. Your query appears to update t1 where the SSN, first name and birth date in t2 are the same, but the last name, etc. are different. I was also wondering if there is a better way to do this than to manually assign batches (e.g., use NTILE(), assign a macro variable, etc.)? I apologize, but I am relatively new to SQL and do not know what options exist.

Thanks,

James

ScottPletcher SSC Guru Points: 100949 More actions · Answer 3

It's easy enough to generate the code for 00 to 99 based on the pattern for 00, so it's not really "manual" coding.

I figured on a (very broad) average of 90M/100 ranges = 900K per range; that's a large number, but not extreme for a table that size.

I thought you were trying to update one of the existing tables to become the "final" copy. So I was going to do separate UPDATE vs INSERT.

If, instead, you plan to create a new table, we can use one full outer join query could get all rows w/o having separate UPDATEs vs INSERTs.

MERGE is a possibility, of course, but after cases where it has not performed well, I'd personally be leery of using it here.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.