October 27, 2009 at 11:06 am
Hi all.
We've got a database that currently contains the contents of around 700,000 scanned and OCRed documents in PDF format, around 350GB in total. The PDF files are NOT stored in the database; they reside on a file server, but as part of the OCR process we insert the text contents into our database. We did this because we needed our online searches to perform full-text queries against the contents of the files, and doing that indexing in SQL Server was the best solution at the time. To give you an idea of the rate of change these documents go through, every day we add several thousand documents to the system and remove a somewhat smaller number.
However, times have changed. SQL 2008 has introduced a horribly slow full-text system that is simply unable to perform adequately with the loads we put on it. To be honest, it makes the system completely unusable. Our research has found no resolution to this problem, so we're looking for an alternative.
One of the options we're considering is to go with a file-based indexing system that would crawl the share containing all of these documents and index the PDF files themselves, which we could then query from a CLR stored procedure to marry the results with our database information. The only solutions we've found so far are the Google appliance and Microsoft Search Server (which ironically uses a SQL 2005 full-text index and will very explicitly NOT work with SQL 2008!). While both could do the job, both also have drawbacks that we're not thrilled with. I'm wondering if anyone else out there has any experience with other systems that can handle the quantity of documents we need indexed and searched. I'm having trouble even finding other options for this type of system.
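For anyone picturing the crawl-then-join idea, here is a minimal sketch in Python, not a production design. It assumes the OCR text is available as plain .txt files next to the PDFs, and it uses an in-memory inverted index and a hypothetical `documents` table with `doc_id`, `title`, and `file_path` columns (those names are made up for illustration). A real deployment would use a proper search engine and would page through large hit sets instead of passing them all as query parameters.

```python
import os
import re
import sqlite3
from collections import defaultdict

# Tokens are runs of lowercase letters and digits; a deliberately simple analyzer.
TOKEN = re.compile(r"[a-z0-9]+")


def build_index(root):
    """Walk root and map each token to the set of .txt file paths containing it."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".txt"):
                continue  # assumption: OCR text lives in .txt files
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                for token in TOKEN.findall(fh.read().lower()):
                    index[token].add(path)
    return index


def search(index, query):
    """Return the paths that contain every token in query (AND semantics)."""
    tokens = TOKEN.findall(query.lower())
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits


def join_with_db(conn, paths):
    """Fetch database rows for the matched paths.

    The 'documents' table and its columns are hypothetical names.
    """
    if not paths:
        return []
    marks = ",".join("?" for _ in paths)
    sql = (
        "SELECT doc_id, title FROM documents "
        f"WHERE file_path IN ({marks}) ORDER BY doc_id"
    )
    return conn.execute(sql, sorted(paths)).fetchall()
```

The index lookup and the database lookup stay separate on purpose: the search side only returns file paths, and the database side only resolves paths to rows, so either half could be swapped out independently.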
I'd be interested in any hardware- or software-based solutions that might be of use in this situation, although my preference would be towards software as we're largely a virtualized environment with replication to an offsite datacenter.
I realize this isn't a SQL Server question per se, but I figure there have to be some SQL admins out there that are tying into file-based indexes to produce results in conjunction with a database. Thanks for any input!
October 27, 2009 at 12:04 pm
There is a project called Lucene, which is part of the Apache project. A developer created a .NET version that ASP.NET developers have used. He has since moved on, but others continued development, so check it out to see if you can use it. The Apache project recently approved the project to continue. On a side note, your load could be the main problem.
http://incubator.apache.org/lucene.net/
Kind regards,
Gift Peddie