Valued Member
      
Hi all, I have a task to clean up data in one of our tables. The column I need to clean up holds business names, but the same business can appear in many different forms. For instance:
Costco
COSTCO
Costco Whls
Costco Wholesale
Costco Whls llc
What is available to me in SQL Server 2008R2 that can help me to accomplish that? How would you approach it?
Thanks,
Mr or Mrs. 500
      
If you're looking for every instance of a particular character string, you can use this:
select * from mytable where bus_name like '%costco%'
Valued Member
      
Thanks for the reply. The table that holds business names has 20M rows. I can't specify a business name because I don't know the names in advance, and otherwise I would have to do it for every business name in the table. In the following example, how would you use "like"?
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc
Mr or Mrs. 500
      
eugene.pipko (12/12/2012): Thanks for reply. The table that holds business names has 20M rows. I can't specify a business name because I don't know it or will have to do it for every business name in the table. In the following example, how would you use "like"?
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc

I guess I'm not quite sure what you are looking for in this example. Are all those business names considered to be the same in this case?
Valued Member
      
I am looking for a way to say:
Based on the list here:
---------------------
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc

These are unique business names:
---------------------
Costco
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc
SSC Eights!
      
Do you have anything else in the row that might help to find duplicates? Address? Phone? DUNS number? Federal tax ID?
What makes a business unique? Costco has a lot of stores. Is each one unique, or should there be only one row for the parent company?
If you have only the name to go on, I'd recommend you hire one of the services that do this for a living to help clean up your data. The rules for this are extremely complex, and most folks who do this won't guarantee that they will ever get to a 100% cleanup. Dun and Bradstreet has a service for this (I don't work for them), and I'm sure there are others.
And then again, I might be wrong ... David Webb
Valued Member
      
David, I don't have any other supporting data, and what I do have is inconsistent. In the Costco example, it should be one parent company, not multiple stores.
Thanks,
SSCertifiable
       
SELECT SOUNDEX('Costco'), SOUNDEX('COSTCO'), SOUNDEX('Costco Whls'), SOUNDEX('Costco Wholesale'), SOUNDEX('Costco Whls llc');
SSChampion
        
Wow, that's going to be tough. The only thing I could think of was a combination of opc.three's example and a join against a list of common suffixes to find potential duplicates, but that of course is going to be an ongoing thing as you dig deeper into the data.
Something like this is what I thought might be a starting point:
With MySampleData(CompanyName) AS
(
    SELECT 'Costco' UNION ALL
    SELECT 'Costco LLC' UNION ALL
    SELECT 'Costco Whls' UNION ALL
    SELECT 'Home Interiors Malaga' UNION ALL
    SELECT 'Home Plumbing' UNION ALL
    SELECT 'Home Property Management' UNION ALL
    SELECT 'Home Realty' UNION ALL
    SELECT 'Home Svc'
),
CommonSuffixes (val) AS
(
    SELECT ' Inc' UNION ALL
    SELECT ' LLC' UNION ALL
    SELECT ' Company' UNION ALL
    SELECT ' Co'
)
SELECT
    ROW_NUMBER() OVER (PARTITION BY SOUNDEX(CompanyName) ORDER BY CompanyName) AS RW,
    SOUNDEX(CompanyName) AS SoundX,
    *
FROM MySampleData
LEFT OUTER JOIN CommonSuffixes
    ON CHARINDEX(CommonSuffixes.val, MySampleData.CompanyName) > 0
--WHERE CommonSuffixes.val IS NOT NULL --(turns the LEFT OUTER into an inner join, i know)
ORDER BY CompanyName, RW
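One possible next step, sketched here as an illustration rather than as part of the query above, is to cut a matching suffix off the end of each name and group on what is left. The sample rows, the suffix list (including ' Whls' and ' Wholesale'), and the names Stripped, BaseName, and Variants are assumptions made up for this sketch. It strips at most one trailing suffix per name, so a value such as 'Costco Whls llc' would need a second pass.

```sql
-- Hypothetical sample data and suffix list; extend both to match the real table.
WITH MySampleData (CompanyName) AS
(
    SELECT 'Costco' UNION ALL
    SELECT 'Costco LLC' UNION ALL
    SELECT 'Costco Whls' UNION ALL
    SELECT 'Home Plumbing' UNION ALL
    SELECT 'Home Realty'
),
CommonSuffixes (val) AS
(
    SELECT ' LLC' UNION ALL
    SELECT ' Inc' UNION ALL
    SELECT ' Whls' UNION ALL
    SELECT ' Wholesale'
),
Stripped AS
(
    SELECT d.CompanyName,
           -- If the name ends with a listed suffix, cut it off; otherwise keep the name as is.
           CASE WHEN s.val IS NOT NULL
                THEN LEFT(d.CompanyName, LEN(d.CompanyName) - LEN(s.val))
                ELSE d.CompanyName
           END AS BaseName
    FROM MySampleData AS d
    LEFT OUTER JOIN CommonSuffixes AS s
        ON d.CompanyName LIKE '%' + s.val   -- matches only when the suffix is at the end
)
-- One row per candidate base name, with the number of raw variants behind it.
SELECT BaseName, COUNT(*) AS Variants
FROM Stripped
GROUP BY BaseName
ORDER BY BaseName;
```

Whether the LIKE match ignores case depends on the column's collation, and every grouping this produces is only a candidate that still needs a human check before any rows are merged.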
Lowell
SSCrazy
      
Without artificial intelligence on par with a human's knowledge of which names refer to the same company and which ones, even very similar ones, do not, it is simply impossible to do what you want in plain code (regardless of programming language). This is a data cleansing exercise, and it will always require some manual intervention. I can only suggest a couple of ways:
Use SSIS: it has a Fuzzy Grouping transformation, which is designed primarily for data cleansing tasks.
Create a database of company name variations.
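The second suggestion, a database of name variations, could look roughly like the sketch below. The table variables @NameVariations and @Businesses, their columns, and the sample rows are all hypothetical; in practice the lookup would be a permanent table that is filled in by hand as new variations turn up.

```sql
-- Hypothetical lookup table: each known variation maps to one canonical name.
DECLARE @NameVariations TABLE
(
    Variation     varchar(100) NOT NULL PRIMARY KEY,
    CanonicalName varchar(100) NOT NULL
);

INSERT INTO @NameVariations (Variation, CanonicalName)
SELECT 'Costco LLC',       'Costco' UNION ALL
SELECT 'Costco Whls',      'Costco' UNION ALL
SELECT 'Costco Wholesale', 'Costco';

-- Hypothetical sample rows standing in for the real business-name table.
DECLARE @Businesses TABLE (BusinessName varchar(100) NOT NULL);

INSERT INTO @Businesses (BusinessName)
SELECT 'Costco' UNION ALL
SELECT 'Costco Whls' UNION ALL
SELECT 'Home Realty';

-- Names with a known variation resolve to the canonical name;
-- everything else passes through unchanged and is flagged for manual review.
SELECT b.BusinessName,
       COALESCE(v.CanonicalName, b.BusinessName) AS CleanName,
       CASE WHEN v.Variation IS NULL THEN 0 ELSE 1 END AS WasMapped
FROM @Businesses AS b
LEFT OUTER JOIN @NameVariations AS v
    ON v.Variation = b.BusinessName;
```

The lookup only ever cleans names someone has already reviewed, which fits the point that manual intervention cannot be avoided; rows with WasMapped = 0 are the ones still waiting for a decision.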