May 18, 2015 at 3:45 pm
Hi
I have 14 Windows folders containing a mix of Word and PDF documents. Each folder contains up to 500,000 files and these documents are the source for a document management system.
I need to create an audit table which can take the file names and date modified for every document in each folder but I want to avoid having to do a DOS command like dir *.* > filenames.txt then importing as a text file 14 times. Is there a way of automating this in T-SQL?
Each Windows folder is named by year e.g. 2002Docs, 2003Docs, 2004Docs etc.
Documents within the folders are named like this - 20020401_doc1.doc, 20020401_doc2, 20020401_doc678.pdf etc.
Any pointers would be appreciated thanks.
May 20, 2015 at 2:25 am
You could try something like the following, though you'll still have some work to do to split out the datestamp, filesize and filename.
IF OBJECT_ID('Folders','U') IS NOT NULL
DROP TABLE Folders;
GO
CREATE TABLE Folders (FolderID INT IDENTITY(1,1), FolderName NVARCHAR(4000), IsActive BIT);
GO
INSERT Folders VALUES ('2002Docs', 1), ('2003Docs', 1), ('2004Docs', 1);
GO
IF OBJECT_ID('DocumentMetaData','U') IS NOT NULL
DROP TABLE DocumentMetaData;
GO
CREATE TABLE DocumentMetaData (FolderID INT, RawFileMetaData NVARCHAR(4000), Filename NVARCHAR(4000), DateModified DATETIME );
GO
DECLARE
@Command NVARCHAR(4000),
@FolderID INT,
@FolderName NVARCHAR(4000),
@FolderYear CHAR(4),
@RootPath NVARCHAR(4000) = 'C:\temp\';
IF OBJECT_ID('tempdb..#Files') IS NOT NULL
DROP TABLE #Files;
CREATE TABLE #Files (Filename NVARCHAR(4000));
DECLARE FolderCursor CURSOR FOR SELECT FolderID, FolderName FROM dbo.Folders WHERE IsActive = 1
OPEN FolderCursor;
WHILE 1=1
BEGIN
FETCH NEXT FROM FolderCursor INTO @FolderID, @FolderName;
IF @@FETCH_STATUS <> 0
BREAK;
SET @FolderYear = LEFT(@FolderName,4);
SET @Command = 'dir ' + @RootPath + @FolderName + '\' + @FolderYear + '*.*';
PRINT @Command;
INSERT #Files
EXEC xp_cmdshell @Command;
--SELECT '#Files', * FROM #Files
INSERT dbo.DocumentMetaData ( FolderID, RawFileMetaData )
SELECT @FolderID, Filename FROM #Files WHERE Filename LIKE '%'+@FolderYear + '%' AND Filename NOT LIKE '%Directory of%';
END
CLOSE FolderCursor;
DEALLOCATE FolderCursor;
SELECT f.FolderName, dmd.* FROM dbo.DocumentMetaData dmd JOIN dbo.Folders f ON dmd.FolderID = f.FolderID;
The database server needs to have been configured to enable xp_cmdshell to work. E.g.
EXEC sp_configure 'allow updates', 0;
GO
RECONFIGURE;
GO
EXEC sp_configure 'xp_cmdshell', 1;
GO
RECONFIGURE;
GO
Forum Etiquette: How to post data/code on a forum to get the best helpMake sure that you include code in the appropriate IFCode tags, e.g. <your code here>
. You can find the IFCode tags under the INSERT options when you are writing a post.
May 20, 2015 at 12:07 pm
Awesome!
This looks really promising. I'll have a good test and feedback.
Many thanks!
May 20, 2015 at 9:27 pm
lenrigby (5/18/2015)
HiI have 14 Windows folders containing a mix of Word and PDF documents. Each folder contains up to 500,000 files and these documents are the source for a document management system.
I need to create an audit table which can take the file names and date modified for every document in each folder but I want to avoid having to do a DOS command like dir *.* > filenames.txt then importing as a text file 14 times. Is there a way of automating this in T-SQL?
Each Windows folder is named by year e.g. 2002Docs, 2003Docs, 2004Docs etc.
Documents within the folders are named like this - 20020401_doc1.doc, 20020401_doc2, 20020401_doc678.pdf etc.
Any pointers would be appreciated thanks.
This will work. You just need to change the parent directory and also note that it recurses through ALL the file names and subdirectores. It also parses out the full file path and the modified date through the magic of computed columns.
--===== Drop temp tables if they exist to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#MyDir','U') IS NOT NULL
DROP TABLE #MyDir
;
--===== Create the temp table that not only collects the file info,
-- but parses it, as well.
CREATE TABLE #MyDir
(
CmdOutput VARCHAR(500)
,ModifiedOn AS CAST(SUBSTRING(CmdOutput,1,CHARINDEX('"',CmdOutput)-1) AS DATETIME)
,FullPath AS CAST(REPLACE(SUBSTRING(CmdOutput,CHARINDEX('"',CmdOutput)+1,8000),'"','') AS VARCHAR(500))
)
;
--===== Get our files and dates for the given path (/P) and all sub-directories.
INSERT INTO #MyDir
(CmdOutput)
EXEC xp_CmdShell 'FORFILES /P "C:\Temp" /S /C "CMD /C IF @IsDir==FALSE ECHO @FDate @FTime @Path"'
;
--===== Let's see what we have
SELECT FullPath, ModifiedOn
FROM #MyDir
WHERE CmdOutput > ''
;
It will NOT be as fast as a DIR command, though.
--Jeff Moden
Change is inevitable... Change for the better is not.
May 21, 2015 at 5:55 am
Thanks Jeff
I had a feeling that you may have a non-cursor based solution! I've set it running and will feedback.
Many thanks for your help.
May 21, 2015 at 7:43 am
lenrigby (5/21/2015)
Thanks JeffI had a feeling that you may have a non-cursor based solution! I've set it running and will feedback.
Many thanks for your help.
IF it turns out to be too bloody slow, post back. A good "DIR" along with a Quirky Update data smear on a Temp Table will be MUCH quicker but I didn't have the time to try and set that up last night.
To wit and in embarrassing hind site, colin.frame's cursor rendition is a well written cursor that's used correctly to create dynamic SQL for each "set" of files in each folder. It uses DIR and stands a pretty good chance of being faster than most people would expect. I probably shouldn't have posted my solution. It's a lot simpler but that ECHO command is going to make it slow, especially on 500,000 files per directory. The only thing that colin.frame's doesn't do (and he said that) is the actual split of what is returned and that can be remedied fairly easily.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 4, 2015 at 2:20 pm
June 4, 2015 at 3:47 pm
Thanks all for replying.
I used Colin's code to give me what I needed and it ran on average 3 mins duration per folder of 500k files. This was before I received Jeff's non-cursor suggestion which alas I didn't try to compare.
June 4, 2015 at 4:04 pm
lenrigby (6/4/2015)
Thanks all for replying.I used Colin's code to give me what I needed and it ran on average 3 mins duration per folder of 500k files. This was before I received Jeff's non-cursor suggestion which alas I didn't try to compare.
3 Minutes for half a million files is pretty good. I'd be very tempted to not change that.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 14, 2015 at 10:06 am
As a follow on to this task which is working fine, I need to find and copy documents meeting certain crtiteria from my Windows folders. I have 30 windows folders containing an average of 200,000 documents per folder and I need to locate and copy around 1000 restricted documents from the rest based on a unique identifier.
I have a lookup table that has the filename and a folder name e.g.
filename foldername
doc123456 D:\2002_docs
doc234567 D:\2002_docs
doc345678 D:\2002_docs
doc456789 D:\2003_docs
doc567890 D:\2003_docs
I need to run a routine for each of the foldernames to first locate the filename in my lookup table then copy to another folder location which will ultimately contain all the restricted documents.
Is this achievable using SQL and/or a cursor?
July 14, 2015 at 12:15 pm
That should be relatively simple in SSIS, run a query that generates all the copy/move commands and exports them to the batch file, then run the batch file 🙂
July 15, 2015 at 11:51 am
Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2015 at 2:28 pm
Jeff Moden (7/15/2015)
Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.
Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?
I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.
I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.
So your comment about "so much time and effort" worked in the opposite direction for me.
I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!
Rich
July 15, 2015 at 2:49 pm
rmechaber (7/15/2015)
Jeff Moden (7/15/2015)
Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.
Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?
I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.
I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.
So your comment about "so much time and effort" worked in the opposite direction for me.
I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!
Rich
Understood on the xp_CmdShell thing. And I do like the idea that it returns the full gambit of information and, to be honest, I wish that MS would build functionality like this into SQL Server.
I'm just curious why a company would ban the use of xp_CmdShell but allow a CLR set like this because xp_CmdShell requires SA privs (unless someone was dumb enough to grant direct execution privs to individuals, very bad thing, indeed!). If you allow lesser privs on this CLR set, then any hacker that gets in with less than SA privs can copy files as part of their payload or file insertion, freely explore your directories to find a decent payload, move files for a bit of malicious fun, or just flat out delete your files to cause you damage.
If you can guarantee that only sysadmins can use this CLR set, then it might not be such a bad thing.
--Jeff Moden
Change is inevitable... Change for the better is not.
July 15, 2015 at 3:17 pm
Jeff Moden (7/15/2015)
rmechaber (7/15/2015)
Jeff Moden (7/15/2015)
Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.
Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?
I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.
I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.
So your comment about "so much time and effort" worked in the opposite direction for me.
I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!
Rich
Understood on the xp_CmdShell thing. And I do like the idea that it returns the full gambit of information and, to be honest, I wish that MS would build functionality like this into SQL Server.
I'm just curious why a company would ban the use of xp_CmdShell but allow a CLR set like this because xp_CmdShell requires SA privs (unless someone was dumb enough to grant direct execution privs to individuals, very bad thing, indeed!). If you allow lesser privs on this CLR set, then any hacker that gets in with less than SA privs can copy files as part of their payload or file insertion, freely explore your directories to find a decent payload, move files for a bit of malicious fun, or just flat out delete your files to cause you damage.
If you can guarantee that only sysadmins can use this CLR set, then it might not be such a bad thing.
Thanks Jeff. I think you might have misunderstood one aspect of CLR and security, though, or maybe I misunderstood you: CLR gets loaded into a database as an "assembly", but regular users typically aren't given any access to assemblies (you need CREATE ASSEMBLY permission).
SQL stored procedures, functions and TVFs are created from assemblies, and you grant users EXECUTE/SELECT permissions on those objects just the way you would any other proc/function/TVF.
From the user's perspective, then, it doesn't matter if the TVF gets created from T-SQL (with or without xp_cmdshell) or a CLR assembly: if the user hasn't been granted SELECT permissions on the TVF, they get nowhere.
Rich
Viewing 15 posts - 1 through 14 (of 14 total)
You must be logged in to reply to this topic. Login to reply