Who wants a challenging "Group By" problem?

  • Ok, I am using Full Text Search to build a search engine of sorts for example. I want to return 10 results and of those 10 results I can't have any duplicate Homepages. I want the url though that has the highest Ranking score based on the FTS Results. So for example if I had dog.com and dog.com/test.html show up and dog.com had a high ranking I want to exclude dog.com/test.html from my results. Also instead of now only having 9 results this should still return 10. My code works to exclude duplicate Homepages however it is based on the Min Primary Key (ID) which could result badly if dog.com/test.html had a smaller key then it would exclude dog.com. Here is a small sample version of how my FTS Procedure works. I am usually pretty good at coming up with complex SQL Statements, however I am stumped on this one so please help me out! Thank You in advance.

    @search nvarchar(1000),

    @numberRows int

    AS

    SET NOCOUNT ON

    DECLARE @ORSearch nvarchar(1000);

    DECLARE @NEARSearch nvarchar(1000);

    DECLARE @URLSearch nvarchar(1000);

    SET @ORSearch = REPLACE(@Search, ' ', ' OR ');

    SET @NEARSearch = REPLACE(@Search, ' ', ' ~ ');

    SET @URLSearch = ('http://' + REPLACE(@Search, ' ', '') + '.com');

    SELECT TOP (@numberRows) SRCH.ID, SRCH.PARAGRAPH, SRCH.TITLE, SRCH.URL FROM SEARCH_RESULTS AS SRCH

    INNER JOIN HOMEPAGES ON SRCH.HOMEPAGE_ID = HOMEPAGES.ID

    FULL JOIN CONTAINSTABLE (SEARCH_RESULTS, URL,

    @URLSearch)

    AS URLEXACT_TBL

    ON SRCH.ID = URLEXACT_TBL.

    INNER JOIN CONTAINSTABLE (SEARCH_RESULTS, TITLE,

    @ORSearch)

    AS TITLE_TBL

    ON SRCH.ID = TITLE_TBL.

    INNER JOIN CONTAINSTABLE (SEARCH_RESULTS, PARAGRAPH,

    @NEARSearch)

    AS PARAALL_TBL

    ON SRCH.ID = PARAALL_TBL.

    WHERE SRCH.ID IN (SELECT MIN(SRCH_MIN_ID.ID)

    FROM SEARCH_RESULTS AS SRCH_MIN_ID

    WHERE SRCH.HOMEPAGE_ID = SRCH_MIN_ID.HOMEPAGE_ID

    )

    ORDER BY

    ((

    TITLE_TBL.RANK

    + PARAALL_TBL.RANK

    )) DESC

  • Sonds like a similar requirement like you stated before . Why the duplicate thread?

    In order to help you any further, please poste table def as well as sample data for the underlying table together with your expected result.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Duplicate post because I didn't like how I explained my question in the last post and I didn't show my procedure. I know I could have added to it but then it wouldn't be at the top so I felt in my best interest would be to start fresh. I have a table SEARCH_RESULTS is has a column called HOMEPAGE_ID (INT) which is where if you have dog.com and dog.com/test.html the HOMEPAGE_ID would be the same for both.

    Here is an example when I take out my "current" WHERE CLAUSE which removes duplicate homepages based on LOWEST (ID COLUMN (PKEY)) value.

    Here is the results when I put my WHERE CLAUSE back in which works for this scenario because dog.com had the lowest ID out of all the dog.com URLs.

    However the problem still exists because what if I scanned dog.com after i scanned dog.com/new? Dog.com would have a higher ID value and would be excluded from my results even though the ranking for dog.com was clearly higher. So that is the reasoning to why I want to remove duplicate homepages that have a lower ranking.

  • I'm sorry. I didn't point you to the first article referenced in my signature describing how to post ready to use sample data.

    Most of us would like to test their solutions before posting.

    So, please take the time to read and follow the advice given in the article.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM Thank you for the advice it appears to work and adds minimal overhead which is nice. I didn't notice you posted on the last one until after I posted this new topic. I don't ever do forums so I am a newbie at this. I do have one last question if you know of the top of your head. If you notice the "SET @URLSearch = ('http://' + REPLACE(@Search, ' ', '') + '.com');" part...I want to find/rank http://dog.com and not http://dog.com/test.html. FTS gives the same ranking to both sites is there a way I can include for example that exact phrase including the exact length in size? This would eliminate http://dog.com/test.html. If you think it would be best to start a new topic just let me know, and again thanks for your help to my previous question!

  • It depends. If you always only search for the main site, then you could use

    CHARINDEX('/',STUFF(Url,1,7,''))

    A return value of Zero would indicate it's a main site, otherwise it would return the position of the next slash.

    Another option would be to sort by LEN(Url).



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • This sounds like a problem that could perhaps be better solved by ranking rather than by grouping. The example below is abstract and doesn't match your table structure, but you get the idea. Instead of ordering the rank by site visits, you could order by length of url or basically anything.

    select top 10 domain_name from

    (

    select domain_name,

    rank over (partition by domain_name order by site_visit_count desc) as domain_rank

    from sites

    ) as sites

    where domain_rank = 1

    order by domain_name;

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I figured out the ranking how I wanted. I am currently not able to remote into my sql server to test the performance overhead...but this is what I came up with (example):

    ORDER BY

    (

    TITLE_TBL.RANK +

    CASE SEARCH_RESULTS.URL

    WHEN @URLSearch THEN 1000 ELSE 0

    ) DESC

    Everyone has been very helpful and I am glad I have all my remaining SQL issues worked out. Thank You!

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply