PDF IFilter indexing problems with large files

  •  

    I am having a problem indexing large PDF files. I have two large PDFs, both around 23 MB and both around 2000 pages. When the gatherer tries to index them it fails and retries. It appears to be a 30-second timeout failure, as CPU usage drops after 30 seconds and then ramps up again.

    It retries repeatedly without moving on to other documents, effectively getting stuck. It does not log an error in the Windows Event log or the SQL error log. It logs the following in the gatherer log:

    09/03/2005 14:38:24 Add The gatherer has started

    09/03/2005 14:38:26 Add The initialization has completed

    09/03/2005 16:10:36 Add The gatherer has started

    09/03/2005 16:10:40 Add The recovery has completed

    09/03/2005 16:45:06 MSSQL75://SQLServer/76cba758/F87750AC4AACBF4BA9F2816993FBE5EA Add Error fetching URL, (80041201 - The object was not found. )

    However, this is only logged after the PDF files are deleted from the document library. Nothing is logged before.
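For anyone who wants to pull the error entries out of a gatherer log like the one above, here is a minimal Python sketch. The line layout (date, time, an optional URL, the word "Add", then the message) is only inferred from the sample lines in this post, and the names `parse_gatherer_line` and `error_entries` are made up for illustration; they are not part of any Microsoft tool.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed layout, inferred only from the sample lines in this thread:
#   <date> <time> [<url>] Add <message>
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?:(?P<url>\S+://\S+) )?Add (?P<message>.*)$"
)


class GathererEntry(NamedTuple):
    date: str            # kept as text; day/month order is not known
    time: str
    url: Optional[str]   # None for service-level messages
    message: str


def parse_gatherer_line(line: str) -> Optional[GathererEntry]:
    """Parse one log line, or return None if it does not fit the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return GathererEntry(m["date"], m["time"], m["url"], m["message"].strip())


def error_entries(lines: Iterable[str]) -> List[GathererEntry]:
    """Return entries that carry a URL and mention an error."""
    parsed = (parse_gatherer_line(line) for line in lines)
    return [e for e in parsed if e and e.url and "Error" in e.message]
```

Feeding it the log lines quoted above should return only the "Error fetching URL" entry, with its MSSQL75:// URL in the `url` field.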

    Other documents in the database get indexed properly (if they were indexed before these PDFs) and the full-text catalogs are searchable. If the two large PDFs are removed then the indexing completes successfully. Other PDFs in the database are indexable and searchable.

    I am using SQL 2000, SP3, Adobe IFilter 6.0, Windows 2003. The database is a Windows Sharepoint Services content database.

    Any ideas?

  • Hi,

    SQL Server 2000 (and possibly SQL Server 2005) limits the Full Text Indexing of large files to (see BOL title ? , cannot find this reference at this time, will post this later). However, depending upon your server's resources, i.e., the amount of RAM, you can tune FTI for large files via the MaxPropStoreCachedSize registry setting; see KB article:

    303459 "INF: How to Improve the Performance of FTS Queries for Large Tables"

    http://support.microsoft.com/?id=303459

    More to follow...

    John

    SQL Full Text Search Blog

    http://spaces.msn.com/members/jtkane/


    John T. Kane

  • I don't have that registry key! I have tried creating it and setting it to 256, 512 and 2560 but it makes no difference. The total size of all my catalogs is smaller than 256 MB anyway. The directory, all files and sub-directories under:

    C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA are only 30 MB in total.

    Do you know of any more logs that can be checked other than the windows event log, the gatherer log and the SQL error log? Does MS Search have any kind of admin tool / monitoring tool (other than Performance Monitor/sysmon)?

    Thanks

  • I have fixed this myself. This fix is SharePoint specific, so a lot of what follows will be irrelevant to general full-text users (but there is a lot of general stuff, so if you don’t care about SharePoint skip the 2nd paragraph). That said, I don’t even know if my problem exists outside of SharePoint.

     

    On further investigation I discovered the problem both does and doesn’t exist in SharePoint Portal (SPS)! Essentially I discovered that when a small web farm is created (one back-end SQL server and one front-end web server which is also the search and index server), the front-end index server will fail to index the document. It says it has indexed it partially (but I couldn’t get it to show up in any searches). More importantly, the indexing process actually finishes, with errors. This is significant because in SharePoint Services (WSS) the indexing process never finishes; it repeatedly tries again. The reason I say the problem also does exist in SPS is that I discovered full-text indexing is turned on in the SPS site database, and this fails in the same way as WSS. Essentially what is happening is that the document is being indexed in two places: the front-end index/search server AND the SQL back-end database! If you open WSS Central Administration in the farm and turn off searching at the WSS level, the full-text catalogs are deleted in SQL for the SPS site database. You can still search WSS sites from Portal but not from within WSS. This means that documents stored in the Portal areas are indexed through full-text indexing in the back-end SQL database as well as in the index catalogs on the front-end web servers.

     

    It is going to be difficult to prevent this problem from occurring or to detect automatically when it has occurred. Possible ways to prevent it are to limit the size of files that users can upload, or not to index PDFs at all. To spot when it is happening ‘in the wild’ you can monitor the CPU usage of the mssdmn.exe process. This is the process that performs filtering through the IFilters. If it is ramped up all the time, or repeatedly ramping up and down, then it’s likely you have hit this problem. Another way is to check the full-text catalog status in Enterprise Manager or Query Analyzer. If it is ‘notifications processing’ or ‘change tracking’ for a significant length of time then it is likely you have hit this problem. Another way to check is to look in the temp directory used by the indexer, usually:

     

    C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA

     

    If this directory has large PDF files with recent creation dates (last couple of minutes) then you are likely to be experiencing the problem.  Another way of checking is to use PerfMon:

    1.         Select the Performance Object: Microsoft Gatherer Projects.

    2.         Select the Retries counter

    3.         Select all the instances (if you have more than one). These instances can be matched back to SQL databases: the number at the end, e.g. SQLServ~1c SQL00009~1c, can be matched to database ID 00009 by using Query Analyzer (SELECT DB_ID() returns the ID of the current database, and SELECT DB_NAME(9) returns the name of the database with ID 9).

    4.         These counters should probably be at 0.  If they are incrementing at the rate of 1 or 2 per minute – you are probably experiencing the problem.
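As a rough illustration of the PerfMon check above, here is a minimal Python sketch that takes Retries samples you have already collected (for example from an exported counter log) and decides whether the rate looks suspicious. It assumes the Retries counter is cumulative, that the database ID is the five digits following "SQL" in the instance name (based only on the single example above), and that one retry per minute is a sensible threshold. All function names are hypothetical.

```python
import re
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, int]  # (time in seconds, cumulative Retries value)


def database_id_from_instance(instance: str) -> Optional[int]:
    """Extract the database ID from a gatherer project instance name.

    Assumption: the ID is the five digits right after 'SQL'.
    """
    m = re.search(r"SQL(\d{5})", instance)
    return int(m.group(1)) if m else None


def retries_per_minute(samples: Sequence[Sample]) -> float:
    """Average retry rate between the first and last sample (oldest first)."""
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    return (c1 - c0) * 60.0 / elapsed


def looks_stuck(samples: Sequence[Sample], threshold_per_minute: float = 1.0) -> bool:
    """True if retries are climbing at or above the threshold rate."""
    return retries_per_minute(samples) >= threshold_per_minute
```

For example, five retries over five minutes gives a rate of one per minute and would be flagged, while a counter that stays flat would not.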

     

    Obviously none of these are satisfactory!
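The temp-directory check described earlier could be scripted along these lines. This is only a sketch: it assumes the temporary copies keep a .pdf extension, and the 10 MB size and 5 minute age limits are arbitrary starting values, not anything documented. The function name `recent_large_pdfs` is made up for illustration.

```python
import os
import time
from typing import List, Optional


def recent_large_pdfs(
    directory: str,
    min_bytes: int = 10 * 1024 * 1024,   # assumed "large" threshold
    max_age_seconds: float = 300.0,      # assumed "recent" threshold
    now: Optional[float] = None,
) -> List[str]:
    """List .pdf files under directory that are both large and recently created."""
    now = time.time() if now is None else now
    hits = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished between listing and stat
            # On Windows st_ctime is the creation time; on Unix it is the
            # last metadata change, so treat the result as approximate there.
            if st.st_size >= min_bytes and now - st.st_ctime <= max_age_seconds:
                hits.append(path)
    return sorted(hits)
```

Running it against the FTDATA directory every minute or so and alerting when the same large file keeps reappearing would be one way to turn the manual check into a monitor.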

     

    Resolution to large PDF problem:

    1.         Open WSS Central Administration

    2.         Under Component Configuration click Configure Data Retrieval service Settings

    3.         Under Data Source Time Out set the Request time-out to a number larger than 30 (e.g. 120)

    4.         Sometimes this fixes the problem straight away.  Sometimes you have to rebuild the catalog by going into WSS central admin, clicking configure full text search, then clicking OK.

     

    I was unable to locate the registry setting that is changed. It might not be a registry setting at all, as I used software to compare the registry before and after the change on both the front-end and back-end servers. General full-text users are therefore on their own from here (but as I said earlier, I don’t even know if my problem exists outside of SharePoint).

     
