February 16, 2010 at 10:02 am
What would be a good aproach for finding files containing a matching text string?
Given a particular directory loaction, with recursive action on sub folders.
There are around 60 subdirectories.
A total of around 1200 files
Each file averaging 100k
About 95% are word documents, plus some XLS, some MSG(email).
Likely performance indicators would be helpful..
Thanks ...
C# Gnu
____________________________________________________
February 16, 2010 at 10:08 am
My first thought is that I would not do this in SQL. I might use SQL to store saved terms. Or even pull the files into a table with a full-text index.
I guess I would need to know more about what you were trying to accomplish before I give greater detail.
Do these files change often?
CEWII
February 16, 2010 at 10:19 am
A bit more info:
The Word documents are a type of merge template:
They might look like this in the word doc:
-------------------------------------------------
Client Names(s):Client TypeMatterMatter Description
[&client.name][?.clienttype][*Matter.EntityRef]. [Matter.Number][Matter.Description]
Client Address:
[~Matter.CorrAddress1]
[~Matter.CorrAddress2]
[~Matter.CorrAddress3]
[~Matter.CorrAddress4]
[~Matter.CorrPostcode]
[SQL: EXEC up_DoSomeStuff '[m.entityref]', [m.number]]
-------------------------------------------------
So a real mix of document letter text, merge codes and some SQL server statements.
I am likely for example to want to list all docs containing 'up_DoSomeStuff' as shown in last line of doc snippet above.
C# Gnu
____________________________________________________
February 16, 2010 at 10:22 am
Also .. of the 1200 files I would estimate 50 or so may be modied in any day.
C# Gnu
____________________________________________________
February 16, 2010 at 10:27 am
I would take something like:
FSSearchIndex Framework - A managed File System search and indexing framework
http://fssearchindexfx.codeplex.com/
And use it to find the text and then write the path and filename to the database.
But I would do it from an outside app not within SQL.
CEWII
February 16, 2010 at 10:32 am
I have stored procedure that searches functions, table column names, stored procs, triggers and 'some' text coulmns so far, and this works very well.
If there was a way to tag the filesystem search onto that it would be ideal?
C# Gnu
____________________________________________________
February 16, 2010 at 10:58 am
You *might* be able to do this with SQLCLR however, crossing the boundary from SQL into the Filesystem should not be considered lightly. You have a lot more potential conditions that need to be checked for and errors to catch. I have dealt with this in the past and it was painful.
You *might* be able to execute a remote query using a linked server against the Content Indexer service, but I'm not sure if that is sufficient, I haven't done that myself since 2002, so I am rusty.. At least then you would be dealing with data and not files.
Short of pulling all the files into a table and capturing the path they came from those are about the only things I can give.
The transition from SQL to the filesystem is not seemless by any stretch, be careful.
CEWII
February 16, 2010 at 11:21 am
Elliott W (2/16/2010)
Short of pulling all the files into a table and capturing the path they came from those are about the only things I can give.
Hmmm ... I just checked the files total 60MB.
Perphaps I would only need to pull those where the modified date has changed... after first search..
C# Gnu
____________________________________________________
February 16, 2010 at 11:36 am
I am 2005, so VARBINARY(MAX) I guess ... Just having a look to see if I can find code to load some files & give it go ..
C# Gnu
____________________________________________________
February 16, 2010 at 11:38 am
If that is your path then it would likely make sense to make the column varbinary(max) and put a full-text index on it. For a full-text index you need to have a single key column, so keep that in mind during table design.
CEWII
February 16, 2010 at 11:46 am
Ok I found this source - pretty close - will give it a whirl
C# Gnu
____________________________________________________
February 16, 2010 at 11:48 am
I think there is a BLOB source on Codeplex that would probably be better than that, but I don't have its URL right now..
CEWII
February 16, 2010 at 12:00 pm
Interesting start:
LOAD
--------
That loaded 678 doc files totalling 30MB in 32 seconds.
On my laptop - local SQL 2005, 2GB RAM (71% in use)
SEARCH
---------
SELECT * FROM dbo.myblob WHERE blob LIKE '%sql%'
returned 219 rows in 5 seconds
Code:
USE master
go
sp_configure 'xp_cmdshell', 1
go
RECONFIGURE WITH OVERRIDE
go
SET quoted_identifier OFF
go
USE [master]
go
/****** object: database [test] script date: 09/10/2006 22:07:03 ******/
IF EXISTS ( SELECT name
FROM sys.databases
WHERE name = N'test' )
DROP DATABASE [test]
go
CREATE DATABASE [test]
go
USE [test]
/****** object: table [dbo].[myimage] script date: 09/10/2006 21:55:46 ******/
IF EXISTS ( SELECT *
FROM sys.objects
WHERE object_id = OBJECT_ID(N'[myblob]')
AND type IN ( N'u' ) )
DROP TABLE [myblob]
go
CREATE TABLE [myblob]
(
[id] INT IDENTITY(1, 1) ,
[image name] VARCHAR(100) ,
[blob] VARBINARY(MAX)
)
Go
CREATE PROCEDURE usp_uploadfiles
@databasename VARCHAR(128) ,
@schemaname VARCHAR(128) ,
@tablename VARCHAR(128) ,
@FileNameColumn VARCHAR(128) ,
@blobcolumn VARCHAR(128) ,
@path VARCHAR(500) ,
@filetype VARCHAR(10) ,
@printorexec VARCHAR(5) = 'print'
AS
SET nocount ON
DECLARE @dircommand VARCHAR(1500)
DECLARE @insertquery VARCHAR(2000)
DECLARE @updatequery VARCHAR(2000)
DECLARE @count INT
DECLARE @maxcount INT
DECLARE @filename VARCHAR(500)
SET @count = 1
SET @dircommand = 'dir /b ' + @path + @filetype
CREATE TABLE #dir ( name VARCHAR(1500) )
INSERT #dir
( name
)
EXEC master..xp_cmdshell @dircommand
DELETE FROM #dir
WHERE name IS NULL
CREATE TABLE #dir2
(
id INT IDENTITY(1, 1) ,
name VARCHAR(1500)
)
INSERT INTO #dir2
SELECT name
FROM #dir
--select * from #dir2
SET @maxcount = IDENT_CURRENT('#dir2')
WHILE @count <= @maxcount
BEGIN
SET @filename = ( SELECT name
FROM #dir2
WHERE id = @count
)
SET @insertquery = 'Insert into [' + @databasename + '].['
+ @schemaname + '].[' + @tablename + ']
([' + @filenamecolumn + ']) values ("' + @filename + '")'
SET @updatequery = 'update [' + @databasename + '].['
+ @schemaname + '].[' + @tablename + ']
set [' + @blobcolumn + '] =
(SELECT * FROM OPENROWSET(BULK "' + @path + @filename
+ '", SINGLE_BLOB)AS x )
WHERE [' + @filenamecolumn + ']="' + @filename + '"'
IF @printorexec = 'print'
BEGIN
PRINT @insertquery
PRINT @updatequery
END
IF @printorexec = 'exec'
BEGIN
EXEC (@insertquery)
EXEC (@updatequery)
END
SET @count = @count + 1
END
go
Exec master..usp_uploadfiles
@databasename ='test',
@schemaname ='dbo',
@tablename ='myblob',
@FileNameColumn ='Image Name',
@blobcolumn = 'blob',
@path = 'c:\partner\templates\',
@filetype ='*.doc',
@printorexec ='exec'
--SELECT * FROM dbo.myblob
--SELECT * FROM dbo.myblob WHERE blob LIKE '%sql%'
C# Gnu
____________________________________________________
February 16, 2010 at 12:29 pm
I agree.
CEWII
February 17, 2010 at 7:56 am
Oh great - I re-wrote all that and now supports -
searching sub directories
only loading to blob if modified date changed
combined with searching sql:-
stored procs, functions, triggers, and special columns etc
displays a snippet of the file/code where text was found
Works nicely will post up when polished.
Cheers
C# Gnu
____________________________________________________
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply