Search Filesystem

  • What would be a good aproach for finding files containing a matching text string?

    Given a particular directory loaction, with recursive action on sub folders.

    There are around 60 subdirectories.

    A total of around 1200 files

    Each file averaging 100k

    About 95% are word documents, plus some XLS, some MSG(email).

    Likely performance indicators would be helpful..

    Thanks ...

  • My first thought is that I would not do this in SQL. I might use SQL to store saved terms. Or even pull the files into a table with a full-text index.

    I guess I would need to know more about what you were trying to accomplish before I give greater detail.

    Do these files change often?

    CEWII

  • A bit more info:

    The Word documents are a type of merge template:

    They might look like this in the word doc:

    -------------------------------------------------

    Client Names(s):Client TypeMatterMatter Description

    [&client.name][?.clienttype][*Matter.EntityRef]. [Matter.Number][Matter.Description]

    Client Address:

    [~Matter.CorrAddress1]

    [~Matter.CorrAddress2]

    [~Matter.CorrAddress3]

    [~Matter.CorrAddress4]

    [~Matter.CorrPostcode]

    [SQL: EXEC up_DoSomeStuff '[m.entityref]', [m.number]]

    -------------------------------------------------

    So a real mix of document letter text, merge codes and some SQL server statements.

    I am likely for example to want to list all docs containing 'up_DoSomeStuff' as shown in last line of doc snippet above.

  • Also .. of the 1200 files I would estimate 50 or so may be modied in any day.

  • I would take something like:

    FSSearchIndex Framework - A managed File System search and indexing framework

    http://fssearchindexfx.codeplex.com/

    And use it to find the text and then write the path and filename to the database.

    But I would do it from an outside app not within SQL.

    CEWII

  • I have stored procedure that searches functions, table column names, stored procs, triggers and 'some' text coulmns so far, and this works very well.

    If there was a way to tag the filesystem search onto that it would be ideal?

  • You *might* be able to do this with SQLCLR however, crossing the boundary from SQL into the Filesystem should not be considered lightly. You have a lot more potential conditions that need to be checked for and errors to catch. I have dealt with this in the past and it was painful.

    You *might* be able to execute a remote query using a linked server against the Content Indexer service, but I'm not sure if that is sufficient, I haven't done that myself since 2002, so I am rusty.. At least then you would be dealing with data and not files.

    Short of pulling all the files into a table and capturing the path they came from those are about the only things I can give.

    The transition from SQL to the filesystem is not seemless by any stretch, be careful.

    CEWII

  • Elliott W (2/16/2010)


    Short of pulling all the files into a table and capturing the path they came from those are about the only things I can give.

    Hmmm ... I just checked the files total 60MB.

    Perphaps I would only need to pull those where the modified date has changed... after first search..

  • I am 2005, so VARBINARY(MAX) I guess ... Just having a look to see if I can find code to load some files & give it go ..

  • If that is your path then it would likely make sense to make the column varbinary(max) and put a full-text index on it. For a full-text index you need to have a single key column, so keep that in mind during table design.

    CEWII

  • I think there is a BLOB source on Codeplex that would probably be better than that, but I don't have its URL right now..

    CEWII

  • Interesting start:

    LOAD

    --------

    That loaded 678 doc files totalling 30MB in 32 seconds.

    On my laptop - local SQL 2005, 2GB RAM (71% in use)

    SEARCH

    ---------

    SELECT * FROM dbo.myblob WHERE blob LIKE '%sql%'

    returned 219 rows in 5 seconds

    Code:

    USE master

    go

    sp_configure 'xp_cmdshell', 1

    go

    RECONFIGURE WITH OVERRIDE

    go

    SET quoted_identifier OFF

    go

    USE [master]

    go

    /****** object: database [test] script date: 09/10/2006 22:07:03 ******/

    IF EXISTS ( SELECT name

    FROM sys.databases

    WHERE name = N'test' )

    DROP DATABASE [test]

    go

    CREATE DATABASE [test]

    go

    USE [test]

    /****** object: table [dbo].[myimage] script date: 09/10/2006 21:55:46 ******/

    IF EXISTS ( SELECT *

    FROM sys.objects

    WHERE object_id = OBJECT_ID(N'[myblob]')

    AND type IN ( N'u' ) )

    DROP TABLE [myblob]

    go

    CREATE TABLE [myblob]

    (

    [id] INT IDENTITY(1, 1) ,

    [image name] VARCHAR(100) ,

    [blob] VARBINARY(MAX)

    )

    Go

    CREATE PROCEDURE usp_uploadfiles

    @databasename VARCHAR(128) ,

    @schemaname VARCHAR(128) ,

    @tablename VARCHAR(128) ,

    @FileNameColumn VARCHAR(128) ,

    @blobcolumn VARCHAR(128) ,

    @path VARCHAR(500) ,

    @filetype VARCHAR(10) ,

    @printorexec VARCHAR(5) = 'print'

    AS

    SET nocount ON

    DECLARE @dircommand VARCHAR(1500)

    DECLARE @insertquery VARCHAR(2000)

    DECLARE @updatequery VARCHAR(2000)

    DECLARE @count INT

    DECLARE @maxcount INT

    DECLARE @filename VARCHAR(500)

    SET @count = 1

    SET @dircommand = 'dir /b ' + @path + @filetype

    CREATE TABLE #dir ( name VARCHAR(1500) )

    INSERT #dir

    ( name

    )

    EXEC master..xp_cmdshell @dircommand

    DELETE FROM #dir

    WHERE name IS NULL

    CREATE TABLE #dir2

    (

    id INT IDENTITY(1, 1) ,

    name VARCHAR(1500)

    )

    INSERT INTO #dir2

    SELECT name

    FROM #dir

    --select * from #dir2

    SET @maxcount = IDENT_CURRENT('#dir2')

    WHILE @count <= @maxcount

    BEGIN

    SET @filename = ( SELECT name

    FROM #dir2

    WHERE id = @count

    )

    SET @insertquery = 'Insert into [' + @databasename + '].['

    + @schemaname + '].[' + @tablename + ']

    ([' + @filenamecolumn + ']) values ("' + @filename + '")'

    SET @updatequery = 'update [' + @databasename + '].['

    + @schemaname + '].[' + @tablename + ']

    set [' + @blobcolumn + '] =

    (SELECT * FROM OPENROWSET(BULK "' + @path + @filename

    + '", SINGLE_BLOB)AS x )

    WHERE [' + @filenamecolumn + ']="' + @filename + '"'

    IF @printorexec = 'print'

    BEGIN

    PRINT @insertquery

    PRINT @updatequery

    END

    IF @printorexec = 'exec'

    BEGIN

    EXEC (@insertquery)

    EXEC (@updatequery)

    END

    SET @count = @count + 1

    END

    go

    Exec master..usp_uploadfiles

    @databasename ='test',

    @schemaname ='dbo',

    @tablename ='myblob',

    @FileNameColumn ='Image Name',

    @blobcolumn = 'blob',

    @path = 'c:\partner\templates\',

    @filetype ='*.doc',

    @printorexec ='exec'

    --SELECT * FROM dbo.myblob

    --SELECT * FROM dbo.myblob WHERE blob LIKE '%sql%'

  • I agree.

    CEWII

  • Oh great - I re-wrote all that and now supports -

    searching sub directories

    only loading to blob if modified date changed

    combined with searching sql:-

    stored procs, functions, triggers, and special columns etc

    displays a snippet of the file/code where text was found

    Works nicely will post up when polished.

    Cheers

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply