Populating SQL tables with entries from Windows folders

  • Hi

    I have 14 Windows folders containing a mix of Word and PDF documents. Each folder contains up to 500,000 files and these documents are the source for a document management system.

    I need to create an audit table which can take the file names and date modified for every document in each folder but I want to avoid having to do a DOS command like dir *.* > filenames.txt then importing as a text file 14 times. Is there a way of automating this in T-SQL?

    Each Windows folder is named by year e.g. 2002Docs, 2003Docs, 2004Docs etc.

    Documents within the folders are named like this - 20020401_doc1.doc, 20020401_doc2, 20020401_doc678.pdf etc.

    Any pointers would be appreciated thanks.

  • You could try something like the following, though you'll still have some work to do to split out the datestamp, filesize and filename.

    IF OBJECT_ID('Folders','U') IS NOT NULL

    DROP TABLE Folders;

    GO

    CREATE TABLE Folders (FolderID INT IDENTITY(1,1), FolderName NVARCHAR(4000), IsActive BIT);

    GO

    INSERT Folders VALUES ('2002Docs', 1), ('2003Docs', 1), ('2004Docs', 1);

    GO

    IF OBJECT_ID('DocumentMetaData','U') IS NOT NULL

    DROP TABLE DocumentMetaData;

    GO

    CREATE TABLE DocumentMetaData (FolderID INT, RawFileMetaData NVARCHAR(4000), Filename NVARCHAR(4000), DateModified DATETIME );

    GO

    DECLARE

    @Command NVARCHAR(4000),

    @FolderID INT,

    @FolderName NVARCHAR(4000),

    @FolderYear CHAR(4),

    @RootPath NVARCHAR(4000) = 'C:\temp\';

    IF OBJECT_ID('tempdb..#Files') IS NOT NULL

    DROP TABLE #Files;

    CREATE TABLE #Files (Filename NVARCHAR(4000));

    DECLARE FolderCursor CURSOR FOR SELECT FolderID, FolderName FROM dbo.Folders WHERE IsActive = 1

    OPEN FolderCursor;

    WHILE 1=1

    BEGIN

    FETCH NEXT FROM FolderCursor INTO @FolderID, @FolderName;

    IF @@FETCH_STATUS <> 0

    BREAK;

    SET @FolderYear = LEFT(@FolderName,4);

    SET @Command = 'dir ' + @RootPath + @FolderName + '\' + @FolderYear + '*.*';

    PRINT @Command;

    INSERT #Files

    EXEC xp_cmdshell @Command;

    --SELECT '#Files', * FROM #Files

    INSERT dbo.DocumentMetaData ( FolderID, RawFileMetaData )

    SELECT @FolderID, Filename FROM #Files WHERE Filename LIKE '%'+@FolderYear + '%' AND Filename NOT LIKE '%Directory of%';

    END

    CLOSE FolderCursor;

    DEALLOCATE FolderCursor;

    SELECT f.FolderName, dmd.* FROM dbo.DocumentMetaData dmd JOIN dbo.Folders f ON dmd.FolderID = f.FolderID;

    The database server needs to have been configured to enable xp_cmdshell to work. E.g.

    EXEC sp_configure 'allow updates', 0;

    GO

    RECONFIGURE;

    GO

    EXEC sp_configure 'xp_cmdshell', 1;

    GO

    RECONFIGURE;

    GO

    Forum Etiquette: How to post data/code on a forum to get the best helpMake sure that you include code in the appropriate IFCode tags, e.g. <your code here>. You can find the IFCode tags under the INSERT options when you are writing a post.

  • Awesome!

    This looks really promising. I'll have a good test and feedback.

    Many thanks!

  • lenrigby (5/18/2015)


    Hi

    I have 14 Windows folders containing a mix of Word and PDF documents. Each folder contains up to 500,000 files and these documents are the source for a document management system.

    I need to create an audit table which can take the file names and date modified for every document in each folder but I want to avoid having to do a DOS command like dir *.* > filenames.txt then importing as a text file 14 times. Is there a way of automating this in T-SQL?

    Each Windows folder is named by year e.g. 2002Docs, 2003Docs, 2004Docs etc.

    Documents within the folders are named like this - 20020401_doc1.doc, 20020401_doc2, 20020401_doc678.pdf etc.

    Any pointers would be appreciated thanks.

    This will work. You just need to change the parent directory and also note that it recurses through ALL the file names and subdirectores. It also parses out the full file path and the modified date through the magic of computed columns.

    --===== Drop temp tables if they exist to make reruns in SSMS easier.

    IF OBJECT_ID('tempdb..#MyDir','U') IS NOT NULL

    DROP TABLE #MyDir

    ;

    --===== Create the temp table that not only collects the file info,

    -- but parses it, as well.

    CREATE TABLE #MyDir

    (

    CmdOutput VARCHAR(500)

    ,ModifiedOn AS CAST(SUBSTRING(CmdOutput,1,CHARINDEX('"',CmdOutput)-1) AS DATETIME)

    ,FullPath AS CAST(REPLACE(SUBSTRING(CmdOutput,CHARINDEX('"',CmdOutput)+1,8000),'"','') AS VARCHAR(500))

    )

    ;

    --===== Get our files and dates for the given path (/P) and all sub-directories.

    INSERT INTO #MyDir

    (CmdOutput)

    EXEC xp_CmdShell 'FORFILES /P "C:\Temp" /S /C "CMD /C IF @IsDir==FALSE ECHO @FDate @FTime @Path"'

    ;

    --===== Let's see what we have

    SELECT FullPath, ModifiedOn

    FROM #MyDir

    WHERE CmdOutput > ''

    ;

    It will NOT be as fast as a DIR command, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks Jeff

    I had a feeling that you may have a non-cursor based solution! I've set it running and will feedback.

    Many thanks for your help.

  • lenrigby (5/21/2015)


    Thanks Jeff

    I had a feeling that you may have a non-cursor based solution! I've set it running and will feedback.

    Many thanks for your help.

    IF it turns out to be too bloody slow, post back. A good "DIR" along with a Quirky Update data smear on a Temp Table will be MUCH quicker but I didn't have the time to try and set that up last night.

    To wit and in embarrassing hind site, colin.frame's cursor rendition is a well written cursor that's used correctly to create dynamic SQL for each "set" of files in each folder. It uses DIR and stands a pretty good chance of being faster than most people would expect. I probably shouldn't have posted my solution. It's a lot simpler but that ECHO command is going to make it slow, especially on 500,000 files per directory. The only thing that colin.frame's doesn't do (and he said that) is the actual split of what is returned and that can be remedied fairly easily.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you can CLR, this works well: https://nclsqlclrfile.codeplex.com/[/url]

    Rich

  • Thanks all for replying.

    I used Colin's code to give me what I needed and it ran on average 3 mins duration per folder of 500k files. This was before I received Jeff's non-cursor suggestion which alas I didn't try to compare.

  • lenrigby (6/4/2015)


    Thanks all for replying.

    I used Colin's code to give me what I needed and it ran on average 3 mins duration per folder of 500k files. This was before I received Jeff's non-cursor suggestion which alas I didn't try to compare.

    3 Minutes for half a million files is pretty good. I'd be very tempted to not change that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As a follow on to this task which is working fine, I need to find and copy documents meeting certain crtiteria from my Windows folders. I have 30 windows folders containing an average of 200,000 documents per folder and I need to locate and copy around 1000 restricted documents from the rest based on a unique identifier.

    I have a lookup table that has the filename and a folder name e.g.

    filename foldername

    doc123456 D:\2002_docs

    doc234567 D:\2002_docs

    doc345678 D:\2002_docs

    doc456789 D:\2003_docs

    doc567890 D:\2003_docs

    I need to run a routine for each of the foldernames to first locate the filename in my lookup table then copy to another folder location which will ultimately contain all the restricted documents.

    Is this achievable using SQL and/or a cursor?

  • That should be relatively simple in SSIS, run a query that generates all the copy/move commands and exports them to the batch file, then run the batch file 🙂

  • rmechaber (6/4/2015)


    If you can CLR, this works well: https://nclsqlclrfile.codeplex.com/[/url]

    Rich

    Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/15/2015)


    rmechaber (6/4/2015)


    If you can CLR, this works well: https://nclsqlclrfile.codeplex.com/[/url]

    Rich

    Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.

    Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?

    I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.

    I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.

    So your comment about "so much time and effort" worked in the opposite direction for me.

    I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!

    Rich

  • rmechaber (7/15/2015)


    Jeff Moden (7/15/2015)


    rmechaber (6/4/2015)


    If you can CLR, this works well: https://nclsqlclrfile.codeplex.com/[/url]

    Rich

    Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.

    Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?

    I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.

    I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.

    So your comment about "so much time and effort" worked in the opposite direction for me.

    I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!

    Rich

    Understood on the xp_CmdShell thing. And I do like the idea that it returns the full gambit of information and, to be honest, I wish that MS would build functionality like this into SQL Server.

    I'm just curious why a company would ban the use of xp_CmdShell but allow a CLR set like this because xp_CmdShell requires SA privs (unless someone was dumb enough to grant direct execution privs to individuals, very bad thing, indeed!). If you allow lesser privs on this CLR set, then any hacker that gets in with less than SA privs can copy files as part of their payload or file insertion, freely explore your directories to find a decent payload, move files for a bit of malicious fun, or just flat out delete your files to cause you damage.

    If you can guarantee that only sysadmins can use this CLR set, then it might not be such a bad thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/15/2015)


    rmechaber (7/15/2015)


    Jeff Moden (7/15/2015)


    rmechaber (6/4/2015)


    If you can CLR, this works well: https://nclsqlclrfile.codeplex.com/[/url]

    Rich

    Everytime I see that particular post, I can't help but ask why someone would spend so much time and effort duplicating what DOS already does so well.

    Really? I know you are an advocate for xp_cmdshell, Jeff, but what if that's not an option?

    I agree with you that CLR is a bit of work to set up, but once it's completed you get a TVF you can query and join to. Plus, CLR can return so many more file attributes than DOS -- the archive bit (you'd need the DOS ATTRIB command to capture); full path (need xp_dirtree), etc.

    I spent quite a long time a couple years ago trying to get xp_cmdshell working using information you'd posted here on SSC on a forum answer, but I never succeeded. You were even nice enough to post back a couple of times to help me troubleshoot, but all I ever got was errors.

    So your comment about "so much time and effort" worked in the opposite direction for me.

    I'm not a CLR evangelist, but this seems like a good use of the tool. YMMV and "it depends" are in full force!

    Rich

    Understood on the xp_CmdShell thing. And I do like the idea that it returns the full gambit of information and, to be honest, I wish that MS would build functionality like this into SQL Server.

    I'm just curious why a company would ban the use of xp_CmdShell but allow a CLR set like this because xp_CmdShell requires SA privs (unless someone was dumb enough to grant direct execution privs to individuals, very bad thing, indeed!). If you allow lesser privs on this CLR set, then any hacker that gets in with less than SA privs can copy files as part of their payload or file insertion, freely explore your directories to find a decent payload, move files for a bit of malicious fun, or just flat out delete your files to cause you damage.

    If you can guarantee that only sysadmins can use this CLR set, then it might not be such a bad thing.

    Thanks Jeff. I think you might have misunderstood one aspect of CLR and security, though, or maybe I misunderstood you: CLR gets loaded into a database as an "assembly", but regular users typically aren't given any access to assemblies (you need CREATE ASSEMBLY permission).

    SQL stored procedures, functions and TVFs are created from assemblies, and you grant users EXECUTE/SELECT permissions on those objects just the way you would any other proc/function/TVF.

    From the user's perspective, then, it doesn't matter if the TVF gets created from T-SQL (with or without xp_cmdshell) or a CLR assembly: if the user hasn't been granted SELECT permissions on the TVF, they get nowhere.

    Rich

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply