Search Engine on Multiple Words

  • Hi all,

    I'm currently putting together a simple search engine and having a little problem searching using multiple keywords!

    I have three tables:

    tblPage

       page_id

       url

    tblWord

       word_id

       word

    tblOccurrence

       occurrence_id

       page_id

       word_id

    I have an ASP page that reads a URL, gathers all of the words from the page then populates the tblWord table with the word it has found and then the tblOccurrence table with the page and word id's it has found.

    The search ASP file then queries these tables returning the url in the order of occurrence.

    The query I have is:

    SELECT tblPage.url AS url,

         COUNT(*) AS occurrences 

         FROM tblPage p, tblWord w, tblOccurrence o

            WHERE (p.page_id = o.page_id) AND

                      (w.word_id = o.word_id) AND

                      (w.word='<%= sKeyword %>')

                      GROUP BY p.url

                      ORDER BY occurrences DESC

    This works great for one keyword, however, I need to adapt this query to cater for multiple keywords!

    Any ideas????

    I could do a logical OR, which works but does not return a page if two different words appear on a page!

    Many thanks...

    Mike

  • you need to seperate words in a where clause

    where w.word like'<%= sKeyword %> and w.word like'<%= sKeyword %>

    for sepeate words

    for example vb,asp

    where w.word like '<%= vb %> and/or w.word like '<%= asp %>

     

    <a href="http://www.websolsoftware.com"> For IT jobs click here</a>

    *Sukhoi*[font="Arial Narrow"][/font]

  • Hi,

    Many thanks for replying!

    I thought that should work, however, I'd already tried this and it returns no records!

    The logicl OR is fine, i.e. give me a page if the word 'London' OR 'Race' appears within the page.

    If I have in:

    tblPage

    page_id = 1

    url = 'http://www.londonrace.com'

    tblWord

    word_id = 1

    word = 'london'

    word_id = 2

    word = 'race'

    tblOccurrence

    occurrence_id = 1

    page_id = 1

    word_id = 1

    occurrence_id = 2

    page_id = 1

    word_id = 2

    With an AND should it still return a valid record?

    Thanks.

    Mike

  • what is the db are you using, and show me the exact query you tried, you need to use like instead of "="

    <a href="http://www.websolsoftware.com"> For IT jobs click here</a>

    *Sukhoi*[font="Arial Narrow"][/font]

  • It is SQL Server 2005.

    I'm thinking that the way I'm storing the keywords is probably not the best way to achieve this!

    Even if you do a basic search on the tblWord with an AND you will not get the result as it is never true:

    NO - select

    * from tblWord where word = 'race' and word = 'racer'

    OK - select * from tblWord where word = 'race' or word = 'racer'

    OK - select * from tblWord where word like 'race%' or word like 'looks%'

    I need something like the mySQL TSQL command REGEXP!

    !!!!!!!

     

  • Have you considered using the fulltext search functionality for this?

    With fulltext you probably won't even need the keywords tables. Put a full text index on the documents (assuming they're stored in a database) and you can use the CONTAINS keyword pull out actaul or near matches.

     

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • create procedure usp_SimpleSearchEngine
    (
    @xmlKeywords nvarchar(max)
    )
    as
    
    --<root>
    --<item word=""/>
    --</root>
    
    begin
    declare @hdoc INT
    declare @xml nvarchar(max)
    declare @Keywords table(word nvarchar(200))
    
    --parse the xml and save them in the variable table
    exec sp_xml_preparedocument @hdoc output, @xmlKeywords
    
    insert into @Keywords(word)
    select  tmp.word from openxml (@hdoc, '/root/item',2) with(word nvarchar(128) '@word')tmp
    
    exec sp_xml_removedocument @hdoc
    
    --select all the records that matches our criteria
    selecttblPage.url as 'url'
    , count(*) as 'occurrences'
            fromtblPage p
    jointblOccurrence o on((o.page_id = p.page_id))
    jointblWord w on (w.word_id = o.word_id)
            whereexists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\')
    order by p.url asc
    , occurrences DESC
    end
    go
    
    /*usage
    exec usp_SimpleSearchEngine
    @xmlKeywords=N'<root>
                               <item word="race"/><item word="racer"/>
    </root>'
    */
    
    ------------------------------------------------------------------------------------
    
    or just simply use this simple sproc to do the job..
    
    its only logic is that it accept a list of keywords,
    store them in a variable table(don't use temp table)
    and check for a match if any of them exists in the database
    (by using the EXISTS keyword in the WHERE clause)
    
    
    if there is any question, suggestion, or anything you may want to ask, send me an email .
    it will be highly appreciated.. 
    
    
  • Hi Gail & Christopher,

    Many thanks for your replies, greatly appreciated!

    The production server I have is a remote SQL Server 7 that I do not have a major control over (ISP offered service)!

    The SQL Server 2005, is on my laptop and used for development.

    Not played around with the FULL TEXT within SQL so need to have a read up on this, the procedure above looks interesting, I will give it a go and let you know how I get on.

    Many many thanks for all of your help.

    Mike

     

  • Hi Christopher,

    When I added your stored procedure, I get the following error:

    Msg 4104, Level 16, State 1, Procedure usp_SimpleSearchEngine, Line 19

    The multi-part identifier "tblPage.url" could not be bound.

    Have I missed something?? Thanks Mike

    use

    SearchEngine

    go

    create

    procedure usp_SimpleSearchEngine

    (

    @xmlKeywords

    nvarchar(max)

    )

    as

    --<root>

    -- <item word=""/>

    --</root>

    begin

    declare @hdoc INT

    declare @xml nvarchar(max)

    declare @Keywords table(word nvarchar(200))

    -- parse the xml and save them in the variable table

    exec sp_xml_preparedocument @hdoc output, @xmlKeywords

    insert into @Keywords(word)

    select tmp.word from openxml (@hdoc, '/root/item',2) with(word nvarchar(128) '@word')tmp

    exec sp_xml_removedocument @hdoc

    -- select all the records that matches our criteria

    select tblPage.url as url, count(*) as 'occurrences'

    from tblPage p

    join tblOccurrence o on (o.page_id = p.page_id)

    join tblWord w on (w.word_id = o.word_id)

    where exists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\')

    order by p.url asc, occurrences DESC

    end

    go

     

  • ----------------------------------------------------------
    -- select all the records that matches our criteria
    selectp.url as 'url'
    , count(*) as 'occurrences'
            fromtblPage p
    jointblOccurrence o on((o.page_id = p.page_id))
    jointblWord w on (w.word_id = o.word_id)
            whereexists(select 1 from @Keywords k where w.word like '%' + k.word + '%' ESCAPE '\')
    order by p.url asc
    , occurrences DESC
    ----------------------------------------------------------
    
    i misspelled the Alias on the tblPage table..
    
    just replace the query with the one above..
    
    sorry bout that mate.. just pure negligence... hehehhe 
    
  • I had changed the alias name to p.url, except I get the following error, so I poped it back !

    Msg 8120, Level 16, State 1, Procedure usp_SimpleSearchEngine, Line 21

    Column 'tblPage.url' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

     

  • May of sorted it !!!!

    I changed to group by p.url order by occurrances DESC

    will test

  • Hi again,

    I managed to get it to work and execute fine now, however, don't you just hate it when somebody says that!

    exec usp_SimpleSearchEngine @xmlKeywords

    ='<root><item word="mjd1"/><item word="mjd2"/></root>'

    go

    SELECT

    url, COUNT(*) AS occurrences FROM tblOccurrence o1

    INNER JOIN tblWord w1 ON w1.word_id = o1.word_id

    INNER JOIN tblPage p ON p.page_id = o1.page_id

    WHERE (w1.word like '%mjd1%') or (w1.word like '%mjd2%')

    GROUP BY p.url

    ORDER BY occurrences DESC

    In these example I have two words that appear on one page 'mjd1' and 'mjd2', I do not want a page to be returned if just one word is present as it is a logical AND that is needed.

    The results both look as if they are logically OR'ing.

    Hope I have explained myself ok??????

    Can the stored procedure by modified?

    Thanks.

    Mike

  • Try adding the following after the group by, before the order by. The filter there will depend on the no on keywords. No promices, I'm guessing here.

    HAVING COUNT(distinct w1.word) > 1 -- the count to look for depends on the number of keywords specified.

    Also, a word of warning. Get more than a couple thousand words and your search is going to run like molasses. Because of the wildcard on the word, SQL will not be able to use any index on the word table. With small amounts of data you won't notice a difference. Once the tables get large, you may end up with a rather slow running query.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks tried that, still looks like a logical OR!

    Beginning to think I'm going about this the wrong way!

    Word table is approx 6000, occurrence is approx 35,000

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply