Function for similar phrases

  • Hi All,

    Is there any built in function available for checking the similarity of the phrases

    eg: US Eastern District Court is similar to Eastern District Court of US.

    So if I am comparing both i should get result 1.

    I need this for checking the duplicate entries in a table having so much data, so that we can delete the duplicates.

    Thanks In Advance...

    Thanks

  • two things come to mind -

    full text indexing

    and if you have enterprise edition fuzzy grouping in SSIS advanced data flow transformations.

    ---------------------------------------------------------------------

  • Your question is almost verbatim the MSDN definition of SOUNDEX. http://msdn.microsoft.com/en-us/library/ms187384.aspx

    Now with the two phrases you listed the SOUNDEX would not be even close because the order of the words is different. We can leverage the age old DelimitedSplit8K here quite nicely. The idea here is order all words in each phrase alphabetically so we have a consistent order of the words for SOUNDEX.

    if OBJECT_ID('tempdb..#List') is not null

    drop table #List

    create table #List

    (

    ListID int identity,

    Phrase varchar(50)

    )

    insert #List

    select 'US Eastern District Court'

    union all

    select 'Eastern District Court of US'

    ;with List as

    (

    --First we need to split the values on the words

    select * from #List

    cross apply dbo.DelimitedSplit8K(Phrase, ' ')

    )

    , stuffedList as

    (

    --Now we sort the words alphabetically

    select ListID, STUFF((select Item + ' '

    from List l2

    where l2.ListID = l1.ListID

    order by Item

    for XML PATH('')), 1, 0, '') as FullList

    from List l1

    group by ListID

    )

    select *, SOUNDEX(FullList) from stuffedList

    Please see the link in my signature about splitting strings. In there you will find the code and the logic for how the DelimitedSplit8K function works.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Thanks Sean for your post..

    For this example it works fine..

    Just consider these examples...

    'US Eastern District Court'

    'Southern District Court of US'

    For this also it is giving the same value. Actually both are different ..

    Court District Eastern US C630

    Court District of Southern US C630

    So how can I avoid this?

    Thanks

  • deepzzzz (4/5/2013)


    Thanks Sean for your post..

    For this example it works fine..

    Just consider these examples...

    'US Eastern District Court'

    'Southern District Court of US'

    For this also it is giving the same value. Actually both are different ..

    Court District Eastern US C630

    Court District of Southern US C630

    So how can I avoid this?

    SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

    If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/5/2013)


    deepzzzz (4/5/2013)


    Thanks Sean for your post..

    For this example it works fine..

    Just consider these examples...

    'US Eastern District Court'

    'Southern District Court of US'

    For this also it is giving the same value. Actually both are different ..

    Court District Eastern US C630

    Court District of Southern US C630

    So how can I avoid this?

    SOUNDEX is actually pretty limited. I don't remember the exact algorithm it uses but I do remember that it was extremely limited. It was something like the first character of the string followed by some sort of very short checksum that didn't even include the whole phrase.

    If this is for a law office, you'd be much better off by buying a couple of products called (IIRC from one of my old jobs) "Attenex" and "Equiveo". They're built just for this type of thing and make simple full text searches look nearly as limited as SOUNDEX.

    I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Check out the following post that I started last July: http://www.sqlservercentral.com/Forums/Topic1337370-391-1.aspx

    Chris@Home kindly provided a really good token matching algortihm and it's ultra fast as it's implemented as iTVF.

    SOUNDX is rubbish. Alternative to the Chri's algorithm in the above post . Also Metaphone, Double Metaphone (both free) and Metaphone3 which I purchased for $40 recently and they're all really good especially with US English/British words.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Sean Lange (4/5/2013)


    I knew it was limited but I didn't realize it is that useless. I have never really had a reason to need it so have never really tested the limitations very much.

    I'm probably not giving it enough credit for whatever its intended use may have been but I've really been disappointed by it. From BOL...

    [font="Arial Black"]Vowels are ignored in the comparison. Nonalphabetic characters are used to end the comparison.[/font]

    In other words, a dash or a space or a digit or just about any punctuation will stop the comparison. That means either it does just the first word or first part of a hyphenated word.

    These are probably bad examples of what I don't like about it, but it's the best I can do on short notice.

    SELECT DIFFERENCE('Glif', 'Geoffrey');

    SELECT DIFFERENCE('Glare', 'Geoffrey');

    Because none of the vowels (including the letter "y" in this case), are considered, these two comparisons are both given a "3 out of 4" as a match. That's usually nowhere near close enough for my purposes.

    Then take the following example which is just about spot on.

    SELECT DIFFERENCE('Geoffrey', 'Mr Geoffrey');

    That's only given a "2" because only the "Mr" of the second operand is evaluated because any non-alphabetic character, including a space, will stop the comparison.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Whatever clever function you may find/create the phrase

    'US Eastern District Court'

    will still be matching

    'US Western District Court'

    better than

    'United States Eastern District Court'

    It's a lot of general knowledge of yours which allows you to match a row to this row but not to that one.

    Unless you pass that knowledge to you database (yep, creating tables, storing samples and patterns) you will always program errors.

    _____________
    Code for TallyGenerator

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply