I have fixed this myself. The fix is SharePoint specific, so a lot of what follows will be irrelevant to general full-text users (but there is also a lot of general material, so if you don’t care about SharePoint, skip the 2nd paragraph). That said, I don’t even know if my problem exists outside of SharePoint.
On further investigation I discovered that the problem both does and doesn’t exist in SharePoint Portal Server (SPS)! When a small web farm is created (1 back-end SQL server and 1 front-end web server, which is also the search and index server), the front-end index server fails to index the document. It says it has indexed it partially (but I couldn’t get it to show up in any searches). More importantly, the indexing process actually finishes, with errors. This is significant because in Windows SharePoint Services (WSS) the indexing process never finishes: it repeatedly tries again. The reason I say the problem also does exist in SPS is that full-text indexing is turned on in the SPS site database, and this fails in the same way as WSS. Essentially, the document is being indexed in 2 places: the front-end index/search server AND the SQL back-end database! This means that documents stored in the Portal areas are indexed through full-text indexing in the back-end SQL database as well as in the index catalogues on the front-end web servers. If you open WSS Central Administration in the farm and turn off searching at the WSS level, the full-text catalogues are deleted in SQL for the SPS site database. You can then still search WSS sites from Portal, but not from within WSS.
It is going to be difficult to prevent this problem from occurring, or to detect automatically when it has occurred. Possible ways to prevent it are to limit the size of files that users can upload, or not to index PDFs at all. To spot it ‘in the wild’ you can monitor the CPU usage of the mssdmn.exe process. This is the process that performs filtering through the IFilters. If it is ramped up all the time, or repeatedly ramping up and down, then it’s likely you have hit this problem. Another way is to check the status of the full-text catalogs in Enterprise Manager or Query Analyzer. If the status is ‘notifications processing’ or ‘change tracking’ for a significant length of time, then it is likely you have hit this problem. Another way to check is to look in the temp directory used by the indexer, usually:
C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA
If this directory contains large PDF files with recent creation dates (within the last couple of minutes), then you are likely to be experiencing the problem. Another way of checking is to use PerfMon:
1. Select the Performance Object: Microsoft Gatherer Projects.
2. Select the Retries counter.
3. Select all the instances (if you have more than one). These instances can be matched back to SQL databases: the number at the end, e.g. SQLServ~1c SQL00009~1c, can be matched to database ID 9 by using Query Analyzer (SELECT DB_ID() tells you the ID of the current database).
4. These counters should probably be at 0. If they are incrementing at a rate of 1 or 2 per minute, you are probably experiencing the problem.
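The PerfMon check above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the instance name format is taken from the single example above, and the samples are assumed to have been exported from PerfMon by hand as (seconds, Retries value) pairs. It does not read the counters itself.

```python
import re


def database_id_from_instance(instance_name):
    """Extract the numeric database ID from a gatherer instance name.

    Assumes the name contains 'SQL' followed directly by digits, as in
    the example 'SQLServ~1c SQL00009~1c'. Returns None if no such part
    is found.
    """
    match = re.search(r"SQL(\d+)", instance_name)
    return int(match.group(1)) if match else None


def retries_per_minute(samples):
    """Average growth of the Retries counter per minute.

    samples is a list of (seconds, counter_value) pairs ordered by time.
    Returns 0.0 when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (v1 - v0) * 60.0 / (t1 - t0)


def looks_stuck(samples, threshold_per_minute=1.0):
    """True if the counter grows at or above the threshold (1 per minute here)."""
    return retries_per_minute(samples) >= threshold_per_minute
```

The ID returned by database_id_from_instance can then be compared with the result of SELECT DB_ID() run in each candidate database.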
Obviously none of these are satisfactory!
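Of the checks above, the temp directory one is probably the easiest to automate. A minimal sketch, assuming the temp files keep a .pdf extension, that "large" means 10 MB or more, and that "recent" means created within the last 2 minutes (all three are my guesses, so adjust them to your environment; the function name is hypothetical):

```python
import os
import time

# Default indexer temp directory mentioned above; yours may differ.
FTDATA_DIR = r"C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA"


def find_recent_large_pdfs(directory, min_bytes=10 * 1024 * 1024,
                           max_age_seconds=120, now=None):
    """Return (name, size_in_bytes, age_in_seconds) for suspicious PDFs.

    A file is reported if it ends in .pdf, is at least min_bytes in size
    and was created no more than max_age_seconds ago. On Windows,
    st_ctime is the file creation time.
    """
    now = time.time() if now is None else now
    hits = []
    for entry in os.scandir(directory):
        if not entry.is_file() or not entry.name.lower().endswith(".pdf"):
            continue
        info = entry.stat()
        age = now - info.st_ctime
        if info.st_size >= min_bytes and age <= max_age_seconds:
            hits.append((entry.name, info.st_size, age))
    return hits
```

Calling find_recent_large_pdfs(FTDATA_DIR) every minute or so from a scheduled task, and raising an alert when it keeps returning results, would be one way to use it.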
Resolution to large PDF problem:
1. Open WSS Central Administration
2. Under Component Configuration, click Configure Data Retrieval Service Settings.
3. Under Data Source Time-out, set the Request time-out to a number larger than 30 (e.g. 120).
4. Sometimes this fixes the problem straight away. Sometimes you have to rebuild the catalog by going into WSS Central Administration, clicking Configure Full-Text Search, then clicking OK.
I was unable to locate the registry setting that is changed. It might not be a registry setting at all, since I used software to compare the registry before and after the change on both the front-end and back-end servers. General full-text users are therefore on their own from here (but as I said earlier, I don’t even know if my problem exists outside of SharePoint).