Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Function for similar phrases Expand / Collapse
Author
Message
Posted Thursday, April 4, 2013 7:26 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Thursday, August 14, 2014 6:01 AM
Points: 885, Visits: 259
Hi All,
Is there any built in function available for checking the similarity of the phrases

eg: US Eastern District Court is similar to Eastern District Court of US.
So if I am comparing both i should get result 1.

I need this for checking the duplicate entries in a table having so much data, so that we can delete the duplicates.

Thanks In Advance...


Thanks
Post #1438793
Posted Thursday, April 4, 2013 7:36 AM
SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 3:40 AM
Points: 5,989, Visits: 12,925
two things come to mind -

full text indexing

and if you have enterprise edition fuzzy grouping in SSIS advanced data flow transformations.


---------------------------------------------------------------------

Post #1438801
Posted Thursday, April 4, 2013 8:03 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 2:32 PM
Points: 13,302, Visits: 12,168
Your question is almost verbatim the MSDN definition of SOUNDEX. http://msdn.microsoft.com/en-us/library/ms187384.aspx

Now with the two phrases you listed the SOUNDEX would not be even close because the order of the words is different. We can leverage the age old DelimitedSplit8K here quite nicely. The idea here is order all words in each phrase alphabetically so we have a consistent order of the words for SOUNDEX.

if OBJECT_ID('tempdb..#List') is not null
drop table #List

create table #List
(
ListID int identity,
Phrase varchar(50)
)

insert #List
select 'US Eastern District Court'
union all
select 'Eastern District Court of US'

;with List as
(
--First we need to split the values on the words
select * from #List
cross apply dbo.DelimitedSplit8K(Phrase, ' ')
)
, stuffedList as
(
--Now we sort the words alphabetically
select ListID, STUFF((select Item + ' '
from List l2
where l2.ListID = l1.ListID
order by Item
for XML PATH('')), 1, 0, '') as FullList
from List l1
group by ListID
)

select *, SOUNDEX(FullList) from stuffedList

Please see the link in my signature about splitting strings. In there you will find the code and the logic for how the DelimitedSplit8K function works.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1438820
Posted Friday, April 5, 2013 3:19 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Thursday, August 14, 2014 6:01 AM
Points: 885, Visits: 259
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?


Thanks
Post #1439134
Posted Friday, April 5, 2013 7:16 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 9:13 PM
Points: 36,995, Visits: 31,514
deepzzzz (4/5/2013)
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?


SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1439199
Posted Friday, April 5, 2013 7:39 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 2:32 PM
Points: 13,302, Visits: 12,168
Jeff Moden (4/5/2013)
deepzzzz (4/5/2013)
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?


SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.


I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1439217
Posted Friday, April 5, 2013 9:36 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Tuesday, August 26, 2014 3:26 AM
Points: 708, Visits: 3,287
Check out the following post that I started last July: http://www.sqlservercentral.com/Forums/Topic1337370-391-1.aspx

Chris@Home kindly provided a really good token matching algortihm and it's ultra fast as it's implemented as iTVF.

SOUNDX is rubbish. Alternative to the Chri's algorithm in the above post . Also Metaphone, Double Metaphone (both free) and Metaphone3 which I purchased for $40 recently and they're all really good especially with US English/British words.


---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Post #1439301
Posted Monday, April 8, 2013 6:20 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 9:13 PM
Points: 36,995, Visits: 31,514
Sean Lange (4/5/2013)

I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.


I'm probably not giving it enough credit for whatever its intended use may have been but I've really been disappointed by it. From BOL...

Vowels are ignored in the comparison. Nonalphabetic characters are used to end the comparison.

In other words, a dash or a space or a digit or just about any punctuation will stop the comparison. That means either it does just the first word or first part of a hyphenated word.

These are probably bad examples of what I don't like about it, but it's the best I can do on short notice.

SELECT DIFFERENCE('Glif', 'Geoffrey');
SELECT DIFFERENCE('Glare', 'Geoffrey');

Because none of the vowels (including the letter "y" in this case), are considered, these two comparisons are both given a "3 out of 4" as a match. That's usually nowhere near close enough for my purposes.

Then take the following example which is just about spot on.

SELECT DIFFERENCE('Geoffrey', 'Mr Geoffrey');

That's only given a "2" because only the "Mr" of the second operand is evaluated because any non-alphabetic character, including a space, will stop the comparison.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1440084
Posted Monday, April 8, 2013 11:58 PM
SSCarpal Tunnel

SSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal TunnelSSCarpal Tunnel

Group: General Forum Members
Last Login: Monday, August 18, 2014 9:16 PM
Points: 4,576, Visits: 8,347
Whatever clever function you may find/create the phrase
'US Eastern District Court'

will still be matching
'US Western District Court'

better than
'United States Eastern District Court'

It's a lot of general knowledge of yours which allows you to match a row to this row but not to that one.
Unless you pass that knowledge to you database (yep, creating tables, storing samples and patterns) you will always program errors.
Post #1440154
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse