SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Function for similar phrases


Function for similar phrases

Author
Message
deepzzzz
deepzzzz
SSC Eights!
SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)

Group: General Forum Members
Points: 961 Visits: 261
Hi All,
Is there any built in function available for checking the similarity of the phrases

eg: US Eastern District Court is similar to Eastern District Court of US.
So if I am comparing both i should get result 1.

I need this for checking the duplicate entries in a table having so much data, so that we can delete the duplicates.

Thanks In Advance...

Thanks
george sibbald
george sibbald
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10476 Visits: 13687
two things come to mind -

full text indexing

and if you have enterprise edition fuzzy grouping in SSIS advanced data flow transformations.

---------------------------------------------------------------------
Sean Lange
Sean Lange
One Orange Chip
One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)

Group: General Forum Members
Points: 26352 Visits: 17556
Your question is almost verbatim the MSDN definition of SOUNDEX. http://msdn.microsoft.com/en-us/library/ms187384.aspx

Now with the two phrases you listed the SOUNDEX would not be even close because the order of the words is different. We can leverage the age old DelimitedSplit8K here quite nicely. The idea here is order all words in each phrase alphabetically so we have a consistent order of the words for SOUNDEX.


if OBJECT_ID('tempdb..#List') is not null
drop table #List

create table #List
(
ListID int identity,
Phrase varchar(50)
)

insert #List
select 'US Eastern District Court'
union all
select 'Eastern District Court of US'

;with List as
(
--First we need to split the values on the words
select * from #List
cross apply dbo.DelimitedSplit8K(Phrase, ' ')
)
, stuffedList as
(
--Now we sort the words alphabetically
select ListID, STUFF((select Item + ' '
from List l2
where l2.ListID = l1.ListID
order by Item
for XML PATH('')), 1, 0, '') as FullList
from List l1
group by ListID
)

select *, SOUNDEX(FullList) from stuffedList



Please see the link in my signature about splitting strings. In there you will find the code and the logic for how the DelimitedSplit8K function works.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
deepzzzz
deepzzzz
SSC Eights!
SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)SSC Eights! (961 reputation)

Group: General Forum Members
Points: 961 Visits: 261
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?

Thanks
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)

Group: General Forum Members
Points: 87472 Visits: 41116
deepzzzz (4/5/2013)
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?


SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Sean Lange
Sean Lange
One Orange Chip
One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)One Orange Chip (26K reputation)

Group: General Forum Members
Points: 26352 Visits: 17556
Jeff Moden (4/5/2013)
deepzzzz (4/5/2013)
Thanks Sean for your post..

For this example it works fine..
Just consider these examples...
'US Eastern District Court'
'Southern District Court of US'
For this also it is giving the same value. Actually both are different ..

Court District Eastern US C630
Court District of Southern US C630

So how can I avoid this?


SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.


I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Modens splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Abu Dina
Abu Dina
Ten Centuries
Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)

Group: General Forum Members
Points: 1257 Visits: 3323
Check out the following post that I started last July: http://www.sqlservercentral.com/Forums/Topic1337370-391-1.aspx

Chris@Home kindly provided a really good token matching algortihm and it's ultra fast as it's implemented as iTVF.

SOUNDX is rubbish. Alternative to the Chri's algorithm in the above post . Also Metaphone, Double Metaphone (both free) and Metaphone3 which I purchased for $40 recently and they're all really good especially with US English/British words.

---------------------------------------------------------


It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
David Edwards - Media lens

Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
Howard Zinn
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)SSC Guru (87K reputation)

Group: General Forum Members
Points: 87472 Visits: 41116
Sean Lange (4/5/2013)

I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.


I'm probably not giving it enough credit for whatever its intended use may have been but I've really been disappointed by it. From BOL...

Vowels are ignored in the comparison. Nonalphabetic characters are used to end the comparison.

In other words, a dash or a space or a digit or just about any punctuation will stop the comparison. That means either it does just the first word or first part of a hyphenated word.

These are probably bad examples of what I don't like about it, but it's the best I can do on short notice.

SELECT DIFFERENCE('Glif', 'Geoffrey');
SELECT DIFFERENCE('Glare', 'Geoffrey');

Because none of the vowels (including the letter "y" in this case), are considered, these two comparisons are both given a "3 out of 4" as a match. That's usually nowhere near close enough for my purposes.

Then take the following example which is just about spot on.

SELECT DIFFERENCE('Geoffrey', 'Mr Geoffrey');

That's only given a "2" because only the "Mr" of the second operand is evaluated because any non-alphabetic character, including a space, will stop the comparison.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Sergiy
Sergiy
SSChampion
SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)SSChampion (10K reputation)

Group: General Forum Members
Points: 10530 Visits: 11964
Whatever clever function you may find/create the phrase
'US Eastern District Court'


will still be matching
'US Western District Court'


better than
'United States Eastern District Court'



It's a lot of general knowledge of yours which allows you to match a row to this row but not to that one.
Unless you pass that knowledge to you database (yep, creating tables, storing samples and patterns) you will always program errors.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search