Help searching a string within a set number of words apart

  • Hi,

    I'd be very grateful if someone could help me with this; I do hope someone can 🙂 I'll try to provide as much info as possible so as to avoid confusion about what my question is.

    Client requirement: Client has a table that holds articles in one column, think of them as digital newspaper articles. They have a search webpage to find articles that match the user's keywords. Along with this search functionality the user can specify the proximity that the keywords must be from each other to qualify. Basically how many words (not characters) one word must be from the other to return true.

    Search Example: If the user specifies 5 words apart and enters the search term as 'search word', then the query should return all rows where an article contains the words 'search' and 'word' and they are no more than 5 words apart.

    Result example

    True because only 2 words apart from each other: 'This is a search that has word in it'

    False because 7 words apart from each other: 'The search will not come out true because the word is more than 5 words apart.'

    Added complexity 1: This seems reasonable enough however there is the added complexity that you could have more than two search keywords, in which case they must all be within the specified word proximity.

    Added complexity 2: The whole article must be searched, not just the first occurrence of a keyword.

    i.e. if we use the example:

    Search keyword: 'search word' within 5 words...

    Article: 'The search will not come out true because the word is more than 5 words apart. But this is a search that has word in it.'

    The above is true because the second sentence contains the keywords within 5 words proximity even if the first sentence did not.

    I know that Full-Text Search has the NEAR operator; however, in SQL Server 2008 you cannot set a limit on how far apart the words may be. I know that SQL Server 2012 allows this, but because I am running on 2008 I need to find a way to do this programmatically.

    Please oh please can someone help me. Right now I am stuck and do not know how to do this.

    Many thanks in advance, 🙂

    Lewis
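
One way to pin down the "words apart" measure used in the examples above is to count the words sitting between the two keywords and take the smallest count over every pair of occurrences in the article. The sketch below uses Python purely as compact illustration (the thread itself targets T-SQL and CLR). The function name `min_words_apart` and the `\w+` tokenisation are assumptions, not something the original poster specified.

```python
import re

def min_words_apart(text, first, second):
    """Return the smallest number of words found between any occurrence of
    `first` and any occurrence of `second`, or None if either is missing.
    Assumes the two keywords are different words."""
    tokens = re.findall(r"\w+", text.lower())
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    if not first_pos or not second_pos:
        return None
    # Subtracting 1 turns a position difference into a count of words in between
    return min(abs(a - b) - 1 for a in first_pos for b in second_pos)

close = "This is a search that has word in it"
far = "The search will not come out true because the word is more than 5 words apart."

print(min_words_apart(close, "search", "word"))  # 2
print(min_words_apart(far, "search", "word"))    # 7
```

Under this reading an article qualifies when the returned value is not None and is no greater than the user's limit. Because every pair of occurrences is compared, a later sentence can satisfy the limit even when the first one does not.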

  • I think I would look to write a full-text query to get results that might match (i.e. the text has all the words specified, perhaps using NEAR, perhaps not) and then pass that through a function (probably a streaming CLR one) to further filter the results to those that have all the words within x words of each other. Something like that.
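
The two-stage shape suggested above can be sketched as follows. This is only an illustration in Python, not the full-text query or CLR function itself: `prefilter` stands in for the full-text query, `is_close` stands in for the streaming CLR filter, and both names (along with the sample article ids) are hypothetical.

```python
import re

def prefilter(articles, keywords):
    """Stage 1: keep only articles that contain every keyword.
    This stands in for the cheap full-text query."""
    wanted = {k.lower() for k in keywords}
    for article_id, text in articles:
        if wanted <= set(re.findall(r"\w+", text.lower())):
            yield article_id, text

def search(articles, keywords, is_close):
    """Stage 2: run the costlier proximity predicate on the survivors only."""
    return [article_id
            for article_id, text in prefilter(articles, keywords)
            if is_close(text)]

articles = [
    (1, "This is a search that has word in it"),
    (2, "The search will not come out true because the word is more than 5 words apart."),
    (3, "No keyword of interest appears here"),
]

# Hypothetical predicate for the two-keyword case: 'search', then at most
# five other words, then 'word'.
near = re.compile(r"\bsearch\W+(?:\w+\W+){0,5}?word\b", re.IGNORECASE)
checked = []

def is_close(text):
    checked.append(text)  # record which articles reached stage 2
    return near.search(text) is not None

matches = search(articles, ["search", "word"], is_close)
print(matches, len(checked))
```

The point of the split is that the expensive check only ever sees rows that already contain all the keywords; here the third article never reaches `is_close`.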

  • Hi,

    Thanks Paul.

    Without sounding like an SOB, that's the problem I am having: finding all those that have words within a set distance from each other. The pre-filter is easily done; it's that second stage I cannot get my head around.

    Lewis

  • The only way I can think of to do this is to use the filtered list as suggested; then, in your CLR function or application, loop through each string and count the number of spaces between the keywords to determine the number of words between them.
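
A minimal sketch of that counting idea, written in Python for brevity rather than as CLR code. The function name `words_between` is hypothetical, and the sketch only looks at the first occurrence of the first keyword and the next occurrence of the second one after it. It splits on whitespace instead of counting raw spaces, which is a small substitution so that double spaces do not inflate the count.

```python
def words_between(text, first, second):
    """Count the words between the first occurrence of `first` and the next
    occurrence of `second` after it; return None if either is not found.
    Plain substring search is used, so 'word' would also match inside 'words'."""
    start = text.find(first)
    if start == -1:
        return None
    gap_start = start + len(first)
    gap_end = text.find(second, gap_start)
    if gap_end == -1:
        return None
    # split() with no arguments collapses runs of whitespace
    return len(text[gap_start:gap_end].split())

print(words_between("This is a search that has word in it", "search", "word"))  # 2
```

Handling every occurrence, either keyword order, and whole-word matching would need more work than this, which is part of why the regex and word-position approaches discussed in the thread are attractive.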

  • Regular expressions! They are perfect for things like this. You can't use them directly in SQL of course, but they work fine in the application or CLR. For example: http://www.regular-expressions.info/near.html
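
As a rough illustration of the kind of pattern the linked page describes, the sketch below builds a "near" regex for two keywords in either order. It is written in Python, but the pattern uses only constructs (`\b`, `\w`, `\W`, non-capturing groups, lazy bounded repetition) that .NET's `Regex` class also supports. The helper name `near_pattern` is hypothetical, and the keywords are assumed to be plain words.

```python
import re

def near_pattern(first, second, max_apart):
    """Build a regex matching `first` and `second`, in either order, with at
    most `max_apart` other words between them."""
    a, b = re.escape(first), re.escape(second)
    # One separator, then up to max_apart "word + separator" groups
    gap = r"\W+(?:\w+\W+){0,%d}?" % max_apart
    return re.compile(r"\b(?:%s%s%s|%s%s%s)\b" % (a, gap, b, b, gap, a),
                      re.IGNORECASE)

pattern = near_pattern("search", "word", 5)
print(bool(pattern.search("This is a search that has word in it")))  # True
print(bool(pattern.search(
    "The search will not come out true because the word is more than 5 words apart.")))  # False
```

Because the regex engine scans the whole text, a later pair of occurrences can match even when the first pair is too far apart. Extending this to three or more keywords means an alternation for every ordering, which grows quickly.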

  • lewisdow123 (11/8/2011)


    Without sounding like an SOB, that's the problem I am having: finding all those that have words within a set distance from each other. The pre-filter is easily done; it's that second stage I cannot get my head around.

    Hi Lewis,

    Well the first part of the process is relatively easy:

    DECLARE @strings TABLE
    (
        id     INTEGER IDENTITY PRIMARY KEY,
        string VARCHAR(4000) NOT NULL
    )

    INSERT @strings (string)
    VALUES
        ('This is a search that has word in it'),
        ('The search will not come out true because the word is more than 5 words apart.')

    DECLARE @terms TABLE
    (
        word VARCHAR(50) NOT NULL
    )

    INSERT @terms (word)
    VALUES ('search'), ('word'), ('is')

    DECLARE @words TABLE
    (
        string_id INTEGER NOT NULL,
        position  INTEGER NOT NULL,
        word      VARCHAR(50) NOT NULL,
        PRIMARY KEY (string_id, position)
    )

    -- Split input strings into words.
    -- spt_values (type 'P') is used as a numbers table; it holds only a limited
    -- range of numbers, so very long strings are not fully split.
    INSERT @words (string_id, position, word)
    SELECT
        s2.id,
        ROW_NUMBER() OVER (PARTITION BY s2.id ORDER BY s.number),
        f2.word
    FROM @strings AS s2
    JOIN master.dbo.spt_values AS s ON s.number BETWEEN 1 AND DATALENGTH(s2.string)
    -- Pad with a space on each side so every word is delimited by spaces
    CROSS APPLY (SELECT SPACE(1) + s2.string + SPACE(1)) AS f (wrapped)
    -- A word starts just after a space and runs up to, but not including, the next space
    CROSS APPLY (SELECT SUBSTRING(f.wrapped, s.number + 1, CHARINDEX(SPACE(1), f.wrapped, s.number + 1) - s.number - 1)) AS f2 (word)
    WHERE
        s.[type] = N'P'
        AND SUBSTRING(f.wrapped, s.number, 1) = SPACE(1)
    ORDER BY
        s.number

    -- Remove words we're not interested in
    DELETE @words
    WHERE
        NOT EXISTS
        (SELECT 1 FROM @terms AS t WHERE t.word = [@words].word)

    -- Results: the remaining rows give the position of each search term in each string
    SELECT
        *
    FROM @words AS w

    Working out a robust and efficient algorithm to check whether the 'all words within x' condition is met, where more than one instance of each word occurs, is less trivial. Perhaps start with a brute-force iterative method (per string, construct all permutations, select those that match the conditions).
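
One possible alternative to brute-force enumeration is a sliding window over the keyword positions, which handles any number of keywords and repeated occurrences in a single pass. The sketch below is in Python for brevity and rests on an assumption the thread does not settle: "all within x words" is taken to mean that, in some stretch of the article containing every keyword, at most x words lie between the outermost two keywords. The function name `all_within` is hypothetical.

```python
import re
from collections import Counter

def all_within(text, keywords, max_apart):
    """Return True if some window of the text contains every keyword with at
    most `max_apart` words between the first and last keyword in that window."""
    wanted = {k.lower() for k in keywords}
    # (position, word) for each keyword occurrence, like the @words table
    # after the non-matching words have been deleted
    hits = [(pos, tok)
            for pos, tok in enumerate(re.findall(r"\w+", text.lower()))
            if tok in wanted]
    counts = Counter()
    left = 0
    for right_pos, right_tok in hits:
        counts[right_tok] += 1
        # Drop leading hits that are duplicated later in the window, leaving
        # the tightest window that ends at the current hit
        while counts[hits[left][1]] > 1:
            counts[hits[left][1]] -= 1
            left += 1
        if len(counts) == len(wanted) and right_pos - hits[left][0] - 1 <= max_apart:
            return True
    return False

print(all_within("This is a search that has word in it", ["search", "word", "is"], 5))  # True
```

For two keywords this reduces to the "no more than x words between them" rule from the original question. If the requirement is instead that each pair of keywords be within x words of each other, the final comparison would need to change.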

  • Thanks for your input folks...

    I'm heading down the regex path first, using the link provided and http://www.sqlteam.com/article/regular-expressions-in-t-sql

    If this fails then I'll start with Paul's suggestion.

    Again, thank you guys 🙂

  • lewisdow123 (11/8/2011)


    I'm heading down the regex path first, using the link provided and http://www.sqlteam.com/article/regular-expressions-in-t-sql

    Yes that's the CLR route I would try too. The T-SQL code I posted was more to illustrate a concept, as I'm sure you guessed anyway...:-)

  • The complication is in allowing more than 2 words.

    The following checks for 2 words that are 5 or fewer words apart.

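    using System;
    using Microsoft.SqlServer.Server;
    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;

    namespace searchFeatures
    {
        public class UserDefinedFunctions
        {
            // Returns txtIn when word1 is followed by word2 with at most five
            // other words in between; otherwise returns NULL.
            [SqlFunction]
            public static SqlString SearchFeatures(SqlString txtIn, SqlString word1, SqlString word2)
            {
                try
                {
                    // Regex.Escape stops characters in the search terms being read as regex syntax.
                    // {0,5} (rather than {1,5}) also accepts words that are directly adjacent.
                    // The pattern changes with every call, so RegexOptions.Compiled is not used.
                    var regex = new Regex(@"\b" + Regex.Escape(word1.Value) + @"\W+(?:\w+\W+){0,5}?" + Regex.Escape(word2.Value) + @"\b");
                    return regex.IsMatch(txtIn.Value) ? txtIn.Value : SqlString.Null;
                }
                catch (Exception)
                {
                    // Covers NULL inputs (reading .Value throws) as well as regex errors
                    return SqlString.Null;
                }
            }
        }
    }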

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[SearchFeatures]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))
        DROP FUNCTION [dbo].[SearchFeatures]
    GO

    IF EXISTS (SELECT * FROM sys.assemblies asms WHERE asms.name = N'searchFeatures' and is_user_defined = 1)
        DROP ASSEMBLY [searchFeatures]
    GO

    CREATE ASSEMBLY [searchFeatures]
    AUTHORIZATION [dbo]
    FROM 
    WITH PERMISSION_SET = SAFE
    GO

    CREATE FUNCTION [dbo].[SearchFeatures](@txtIn [nvarchar](4000), @word1 [nvarchar](4000), @word2 [nvarchar](4000))
    RETURNS [nvarchar](4000) WITH EXECUTE AS CALLER
    AS
    EXTERNAL NAME [searchFeatures].[searchFeatures.UserDefinedFunctions].[SearchFeatures]
    GO

    DECLARE @strings TABLE (
        id INTEGER IDENTITY PRIMARY KEY
        ,string VARCHAR(4000) NOT NULL
    )

    INSERT @strings (string)
    VALUES ('This is a search that has word in it')
        ,('The search will not come out true because the word is more than 5 words apart.')

    SELECT id, string
    FROM @strings
    WHERE dbo.SearchFeatures(string,'search','word') IS NOT NULL



  • Wow very nice!!!!

    Top answer by far. I've done the same thing but without .NET; instead I used the first function in this link:

    http://www.sqlteam.com/article/regular-expressions-in-t-sql

    Thanks to all who have helped me out, you're all very kind
