pattern matching?


rightontarget
Hi all,
I have a task to clean up data in one of the tables. The column I need to clean up holds business names, but the same business can appear in many different forms.
For instance:
Costco
COSTCO
Costco Whls
Costco Wholesale
Costco Whls llc

What is available to me in SQL Server 2008R2 that can help me to accomplish that? How would you approach it?

Thanks,
roryp 96873
If you're looking for every instance of a particular character string, you can use this:


select *
from mytable
where bus_name like '%costco%'
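
A note on the query above: whether LIKE matches 'COSTCO' as well as 'Costco' depends on the column's collation. The following is only a minimal sketch, assuming a table variable (@mytable) that stands in for the real table and the placeholder column name bus_name from the post. It forces a case-insensitive comparison for this one predicate:

```sql
-- @mytable is a hypothetical stand-in for the real table; bus_name is the
-- placeholder column name from the post.
DECLARE @mytable TABLE (bus_name varchar(100) NOT NULL);

INSERT INTO @mytable (bus_name)
VALUES ('Costco'), ('COSTCO'), ('Costco Whls'), ('Home Realty');

-- COLLATE applies a case-insensitive collation to this comparison only,
-- regardless of how the column itself is defined.
SELECT bus_name
FROM @mytable
WHERE bus_name COLLATE Latin1_General_CI_AS LIKE '%costco%';
```

Keep in mind that a pattern with a leading wildcard generally cannot use an index seek, so on a large table this kind of search is likely to scan.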


rightontarget
Thanks for the reply.
The table that holds business names has 20M rows.
I can't specify a business name because I don't know the names in advance; otherwise I would have to run the query for every business name in the table.
In the following example, how would you use "like"?
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc
roryp 96873
eugene.pipko (12/12/2012)
Thanks for the reply.
The table that holds business names has 20M rows.
I can't specify a business name because I don't know the names in advance; otherwise I would have to run the query for every business name in the table.
In the following example, how would you use "like"?
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc


I guess I'm not quite sure what you are looking for in this example. Are all those business names considered to be the same in this case?
rightontarget
I am looking for a way to say:

Based on the list here:
---------------------
Costco
Costco LLC
Costco Whls
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc

These are unique business names:
---------------------
Costco
Home Interiors Malaga
Home Plumbing
Home Property Management
Home Realty
Home Svc
David Webb-CDS
Do you have anything else in the row that might help to find duplicates? Address? Phone? DUNS number? Federal tax ID?

What makes a business unique? Costco has a lot of stores. Is each one unique or should there only be one row for the parent company alone?

If you have only the name to go on, I'd recommend you hire one of the services that do this for a living to help clean up your data. The rules for this are extremely complex, and most folks who do this don't guarantee that they will ever get to a 100% cleanup. Dun and Bradstreet has a service for this (I don't work for them) and I'm sure there are others.



And then again, I might be wrong ...
David Webb
rightontarget
David,
I don't have any other supporting data. What I have is inconsistent.
In the Costco example, it should be one parent company, not multiple stores.

Thanks,
Orlando Colamatteo
SELECT  SOUNDEX('Costco'),
SOUNDEX('COSTCO'),
SOUNDEX('Costco Whls'),
SOUNDEX('Costco Wholesale'),
SOUNDEX('Costco Whls llc');
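
The five calls above should all return the same code, because SOUNDEX produces only a four-character code and the leading "Costc" already fills it. The related DIFFERENCE function compares the SOUNDEX codes of two strings and returns 0 to 4, where 4 is the closest match. A minimal sketch, assuming a handful of hypothetical sample names:

```sql
-- The derived table v holds hypothetical sample names for illustration.
SELECT  v.CompanyName,
        SOUNDEX(v.CompanyName)              AS SoundexCode,
        -- 4 = SOUNDEX codes match closely, 0 = little or no similarity
        DIFFERENCE(v.CompanyName, 'Costco') AS SimilarityToCostco
FROM (VALUES ('Costco'),
             ('COSTCO'),
             ('Costco Whls'),
             ('Home Plumbing')) AS v (CompanyName)
ORDER BY v.CompanyName;
```

Because the code is so short, unrelated names can share it, so treat a match as a candidate for review rather than proof of a duplicate.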



__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Lowell
Wow, that's going to be tough.
The only thing I could think of was a combination of opc.three's example and joining it against a list of common suffixes to find potential duplicates, but that of course is going to be an ongoing thing as you dig deeper into the data.

Something like this is what I thought might be a starting point:

With MySampleData(CompanyName)
AS
(
SELECT 'Costco' UNION ALL
SELECT 'Costco LLC' UNION ALL
SELECT 'Costco Whls' UNION ALL
SELECT 'Home Interiors Malaga' UNION ALL
SELECT 'Home Plumbing' UNION ALL
SELECT 'Home Property Management' UNION ALL
SELECT 'Home Realty' UNION ALL
SELECT 'Home Svc'
),
CommonSuffixes (val)
AS
(
SELECT ' Inc' UNION ALL
SELECT ' LLC' UNION ALL
SELECT ' Company' UNION ALL
SELECT ' Co'
)

SELECT
ROW_NUMBER() OVER (PARTITION BY SOUNDEX(CompanyName) ORDER BY CompanyName) AS RW,
SOUNDEX(CompanyName) AS SoundX,
*
FROM MySampleData
LEFT OUTER JOIN CommonSuffixes
ON CHARINDEX(CommonSuffixes.val,MySampleData.CompanyName) > 0
--WHERE CommonSuffixes.val IS NOT NULL --(turns the LEFT OUTER into an inner join, i know)
ORDER BY CompanyName,RW
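
A variation on the query above, offered only as a sketch: instead of flagging a suffix found anywhere in the name, strip it only when it sits at the very end, then group on what is left. The sample rows and the suffix list (including ' Whls' and ' Wholesale') are assumptions for illustration, and the real list would have to grow as new variants turn up.

```sql
-- Hypothetical sample data and suffix list; extend the suffixes as needed.
WITH MySampleData (CompanyName) AS
(
    SELECT 'Costco' UNION ALL
    SELECT 'COSTCO' UNION ALL
    SELECT 'Costco LLC' UNION ALL
    SELECT 'Costco Whls' UNION ALL
    SELECT 'Home Plumbing' UNION ALL
    SELECT 'Home Realty'
),
CommonSuffixes (val) AS
(
    SELECT ' LLC' UNION ALL
    SELECT ' Inc' UNION ALL
    SELECT ' Whls' UNION ALL
    SELECT ' Wholesale'
),
Stripped AS
(
    SELECT d.CompanyName,
           -- Cut the suffix off only when the name ends with it.
           CASE WHEN s.val IS NULL THEN d.CompanyName
                ELSE LEFT(d.CompanyName, LEN(d.CompanyName) - LEN(s.val))
           END AS BaseName
    FROM MySampleData AS d
    LEFT OUTER JOIN CommonSuffixes AS s
        ON d.CompanyName LIKE '%' + s.val
)
SELECT UPPER(BaseName) AS NormalizedName,
       COUNT(*)        AS Variants
FROM Stripped
GROUP BY UPPER(BaseName)
ORDER BY NormalizedName;
```

On these sample rows the four Costco variants should collapse into a single group. A name with stacked suffixes, such as 'Costco Whls llc', would need more than one pass, and the result still needs a manual review.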



Lowell
--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!
Eugene Elutin
Without artificial intelligence on a par with human knowledge of which names refer to the same company and which, even very similar ones, do not, it is simply impossible to do what you want in plain code (regardless of programming language).
This is a data cleansing exercise and it will always require some manual intervention.
I can only suggest a couple of approaches:
Use SSIS; it has a Fuzzy Grouping transformation which is designed primarily for data cleansing tasks.
Create a database of company name variations.
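
The second suggestion could look something like the sketch below: a lookup table that maps each known variant to one canonical name. The table variables, column names, and sample rows are all hypothetical, and in practice the mapping would live in a permanent table that is maintained by hand.

```sql
-- Hypothetical mapping of known variants to a canonical company name.
DECLARE @NameVariations TABLE
(
    VariantName   varchar(100) NOT NULL PRIMARY KEY,
    CanonicalName varchar(100) NOT NULL
);

INSERT INTO @NameVariations (VariantName, CanonicalName)
VALUES ('Costco LLC',       'Costco'),
       ('Costco Whls',      'Costco'),
       ('Costco Wholesale', 'Costco'),
       ('Costco Whls llc',  'Costco');

-- Hypothetical stand-in for the table being cleaned.
DECLARE @Businesses TABLE (bus_name varchar(100) NOT NULL);

INSERT INTO @Businesses (bus_name)
VALUES ('Costco'), ('Costco Whls'), ('Home Realty');

-- Names with no mapping fall back to their original value.
SELECT b.bus_name,
       COALESCE(v.CanonicalName, b.bus_name) AS CleanName
FROM @Businesses AS b
LEFT OUTER JOIN @NameVariations AS v
    ON v.VariantName = b.bus_name;
```

Names that fall through unchanged are the ones to review by hand and, where appropriate, add to the mapping.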

_____________________________________________
"The only true wisdom is in knowing you know nothing"
"O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
(So many miracle inventions provided by MS to us...)
