Issue: Need to scan whole table to find duplicate news all the time

  • Hi

    I have a news reader application that reads 20,000-30,000 news items from RSS feeds. Every time, I have to read all the RSS items and check each one against the news table in the database. This is very time-consuming and degrades database performance.

    Please, tell me an idea or solution to solve this.

    Thanks.

    My first option would be to change the process that reads the RSS feed so it doesn't read duplicates. I know nothing about RSS readers and RSS feeds, but I assume the messages in the feed have some sort of ID or timestamp, so the reader should only request messages added after it last read the feed. That would not only solve your database issues, it would also save you a lot of bandwidth.

    If you need to do this in the DB, then determine what column (or combination of columns) to check to verify whether a post is a duplicate. Create an index on those columns, both on the tables with new messages to be imported and on the table with messages previously imported. That should make it much faster to check whether there are duplicates.

    (You will hopefully still get scans on both tables, but they should each scan just once and then be merge joined; now you either have a scan in the lower input of a nested loop join or a hash join).
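    A minimal sketch of that approach, assuming a hypothetical staging table dbo.NewsStaging for newly fetched items and a Link column that identifies a post (all names here are illustrative, not from the original post):

    ```sql
    -- Hypothetical table/column names for illustration only.
    -- Index the duplicate-check column on both tables.
    CREATE INDEX IX_News_Link        ON dbo.News(Link);
    CREATE INDEX IX_NewsStaging_Link ON dbo.NewsStaging(Link);

    -- Insert only staged rows that are not already in dbo.News.
    INSERT INTO dbo.News (Link, NewsTitle, NewsDate)
    SELECT s.Link, s.NewsTitle, s.NewsDate
    FROM dbo.NewsStaging AS s
    WHERE NOT EXISTS (SELECT * FROM dbo.News AS n WHERE n.Link = s.Link);
    ```

    With both sides indexed on Link, the optimizer can satisfy the NOT EXISTS with index seeks or a single merge join instead of repeated table scans.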

    If there are a lot of columns or if they are long texts, then instead of indexing the columns, I recommend adding a computed column with a checksum value (look at the CHECKSUM and BINARY_CHECKSUM functions in SQL Server) in both tables, and indexing that. Compare the checksums to see whether a post could be a duplicate, but also compare the full set of columns, because there can be false positives when matching on CHECKSUM.
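    A sketch of the checksum idea, assuming hypothetical NewsTitle/NewsAbstract columns and variables (not taken from the original post):

    ```sql
    -- Add a computed checksum column and index it; CHECKSUM is
    -- deterministic, so the computed column is indexable.
    ALTER TABLE dbo.News
        ADD TitleAbstractChecksum AS CHECKSUM(NewsTitle, NewsAbstract);

    CREATE INDEX IX_News_Checksum ON dbo.News(TitleAbstractChecksum);

    -- Seek on the checksum first, then confirm on the full columns,
    -- because CHECKSUM can produce false positives.
    SELECT n.NewID
    FROM dbo.News AS n
    WHERE n.TitleAbstractChecksum = CHECKSUM(@Title, @Abstract)
      AND n.NewsTitle    = @Title
      AND n.NewsAbstract = @Abstract;
    ```

    The index seek on the narrow integer checksum does the heavy lifting; the full-column comparison only runs on the handful of candidate rows the seek returns.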


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Like Hugo wrote. I would create a new column: when a row has been read, update that column with a timestamp. If it is NULL, the row still needs to be read; if it is not NULL, it will not be read again.

    So you would have something like this: SELECT * FROM TABLE WHERE NewColumnTIMESTAMP IS NULL
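    A sketch of that marker-column idea, with assumed table/column names (ProcessedAt and NewID are illustrative, not from the original post):

    ```sql
    -- Nullable marker column: NULL = not yet processed.
    ALTER TABLE dbo.News ADD ProcessedAt DATETIME2 NULL;

    -- Rows that still need to be processed:
    SELECT NewID, NewsTitle
    FROM dbo.News
    WHERE ProcessedAt IS NULL;

    -- Mark a row as processed once it has been handled:
    UPDATE dbo.News
    SET ProcessedAt = SYSDATETIME()
    WHERE NewID = @NewID;
    ```

    A filtered index such as CREATE INDEX IX_News_Unprocessed ON dbo.News(NewID) WHERE ProcessedAt IS NULL would keep the "still to process" lookup cheap even as the table grows.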

  • Hopefully there's a <lastBuildDate> (if the old lastBuildDate equals the new lastBuildDate, skip the whole feed) and an <item> <pubDate> (only process new pubDates), but according to the RSS 2.0 specification the dates are optional, so there's no guarantee...

    The only required items are <item> and <link>, which, as previously stated, you could hash.
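    A sketch of hashing the <link> value, assuming it is stored in a hypothetical Link column (HASHBYTES is collision-resistant, unlike CHECKSUM, so a match is effectively conclusive):

    ```sql
    -- Computed hash of the link; the CAST narrows HASHBYTES's
    -- varbinary(8000) return type so the column can be indexed.
    ALTER TABLE dbo.News
        ADD LinkHash AS CAST(HASHBYTES('SHA2_256', Link) AS BINARY(32));

    CREATE INDEX IX_News_LinkHash ON dbo.News(LinkHash);

    -- Duplicate check via index seek on the hash:
    SELECT NewID
    FROM dbo.News
    WHERE LinkHash = HASHBYTES('SHA2_256', @Link);
    ```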

  • Do some testing. I have no idea what your table is, but I did this:

    CREATE TABLE News
    (NewID INT IDENTITY(1,1)
    , NewsTitle VARCHAR(250)
    , NewsAbstract VARCHAR(6000)
    , NewsDate DATETIME2
    )
    GO

    I loaded 50,000 random rows in here.

    Then this

    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    GO

    -- turn on execution plan
    SELECT
      NewID
    , NewsTitle
    , NewsAbstract
    , NewsDate
    FROM dbo.News
    WHERE NewsTitle = 'Monsipantor';
    GO

    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF

    Run it a few times; I'm consistently getting about 23,000 logical reads with a table scan of the heap.

    I then added this:

    CREATE INDEX News_IDX_NewsTitle ON dbo.News(NewsTitle)

    GO

    Rerun the query. It's now about 20-25% of the execution time and 4 logical reads.

    50,000 rows isn't a lot to SQL Server when you have indexing.
