Finding a substring within a text

  • Hello everyone,

    I have 2 identical tables each containing 3 fields (ID, Date, Text). Each table holds news announcements from different publishers using slightly different presentations of identical announcements. I am trying to identify duplicate anouncements.

    Example:

    Table 1:

    ID: 864

    Date: 3/20/1999

    Text: SomeTown, MA -- YYY, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, today announced the 1234 Family of stand-alone workstations. xxx is a registered trademark of yyy

    Table 2

    ID: 297

    Date: 3/20/1999

    Text: March 3, 1999YYY Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the 1234 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments.

    Important: Only the underlined substring will appear in both entries if the two are in fact duplicate anouncements. The length and contents of the text which is NOT underlined is NOT constant, unpredicatable, and different between records.

    I am using SQL Server 2005 Express on Windows XP

    Any help/suggestion will be greatly appreciated.

  • How do you plan on searching for this? Do you have a list of announcement you want to look for?

    If you don't - then you're going to need to tackle trying to parse certain things to come up with the announcement. If you don't have the text of an announcment you want to look for, then you need to supply the system with some way to identify what is announcement and what is not.

    Even if you did this for several of the "bigger" news feeds you're getting, you'd have a running start at being able to parse out the announcements.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I believe that PATINDEX is the only function that will do as you ask on TEXT datatypes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ...or... (come on Jeff, you KNOW what I am about to say)....

    Regular expression CLR functions would do that as well.... But you still need a search string to start from....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • You will never get an exact way to sort out the duplicates but I would approach it as follows:

    1. Write a Longest Common Substring function. (Google for algorithm outlines) This will be slow if written in TSQL so you may want to look at using C#. (You could also try a Maximum Consecutive Word match function.)

    2. Join the tables on the dates where the dates are within a few days of each other and the length of the Longest Common Substring is, say, greater than 15. (When calling the Longest Common Substring Function I would be inclined to pass only the first 255 characters of the text strings in order to speed things up.)

  • My two tables contain thousands of announcements and I need to run a query/procedure that will match entries on the Date fields between the two tables and then compare announcemenet swith the same date to SOMEHOW PROGRAMMATICALLY decide if they are identical.

  • I haven't tried this but, just found it...http://www.data-for-all.com/blog/index.php?p=119

    (from Ken's suggestion)

    ______________________________________________________________________

    Personal Motto: Why push the envelope when you can just open it?

    If you follow the direction given HERE[/url] you'll likely increase the number and quality of responses you get to your question.

    Jason L. Selburg
  • The Longest common substring is exactly what I am looking for.

    Look as if I am not the first in need of such a function.

    Thanks to all for your help

  • Also asked and answered here

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=92214


    N 56°04'39.16"
    E 12°55'05.25"

  • Here is a link to a C# algorithm (source and explaination) for determining a Longest Common Substring.

    http://www.codeproject.com/cs/algorithms/lcs.asp

  • Matt Miller (11/7/2007)


    ...or... (come on Jeff, you KNOW what I am about to say)....

    Regular expression CLR functions would do that as well.... But you still need a search string to start from....

    Heh... good call, Matt... I was kind of "fuzzy" about when you'd jump in on this one with that particular suggestion...

    Hey, since I'm a bit ignorant about RegEx except for the simplest of functions, whould you mind showing me how to do such a "fuzzy" lookup using RegEx? Thanks, Matt...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Well - it's certainly not so much fuzzy as it is flexible.It's more like Patindex than it is about longest common search, just with quite a few more options.

    I was getting ready to start tackling some way to "parse out" the article part from the headers. At least get the heading stuff out of there. Otherwise - I hadn't gotten much further than that with THIS problem. So I was counting on finding an ACTUAL pattern (or set of patterns if need be).

    I'm still not quite sure how the longest common substring is going to help the OP (a lot of the articles are shorter than the disclaimers, so how to "remove" those so they don't screw up the identification is something that mystifies me so far). For example, a lot of MS marketing-happy stuff might have any and all of these in the footer:

    Windows (TM) is a Trademark of the Microsoft Corporation

    SQL Server 2005 (c) is a Trademark of the Microsoft Corporation

    Which would make a LOT of the press releases be evaluated to be the SAME press release unless there is a way to "remove them".

    ------------------

    Now- if you want to get into what you might use RegEx for, or how it expands on what Patindex, replace, charindex, and does a lot of pattern matching/manipulation PatIndex can't touch, we can certainly do that. We should probably start another thread though for that if you want to pursue.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Heh... no, that's ok, Matt... there's some pretty good short books on RegEx... it's about time I picked one up. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 13 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply