Load a folder full of CSV files

  • Not yet... It's 1:35 AM and I just got done with a work-related crunch.  I have to be awake, "lively and aware", for scrum at 9 and so I'm going to bed. 😀  Apologies for tonight.

    My suggestions would be to look into the 3-parameter convention of the undocumented xp_DirTree command in T-SQL, how to sequester errors in a separate file without halting the load of a file in BULK INSERT, and how to create a BCP format file to make loading your double-quoted data easier.  There might be something else I'm forgetting to suggest but I just drooled on myself so it's time for me to quit for the day.
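    For anyone following along, here's a rough sketch of what those suggestions look like in T-SQL.  The folder, file, table name, and error limit here are placeholders for illustration, not anything from this thread...

    ```sql
    --===== The undocumented xp_DirTree takes 3 parameters: (path, depth, include-files flag).
         -- Depth = 1 stays in the one folder; the trailing 1 returns files as well as folders.
    EXEC master.sys.xp_DirTree 'D:\SomeFolder', 1, 1;

    --===== BULK INSERT can sequester bad rows instead of halting the whole load.
         -- MAXERRORS keeps it loading and ERRORFILE captures the rejected rows in a separate file.
    BULK INSERT dbo.SomeStagingTable
      FROM 'D:\SomeFolder\SomeFile.csv'
      WITH (
             FIRSTROW  = 2
            ,MAXERRORS = 10
            ,ERRORFILE = 'D:\SomeFolder\SomeFile.err'
            ,TABLOCK
           );
    ```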


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok... I created a file that has one row for every second of the year 2020 (31,622,400 rows plus the header row) and weighs in at 956MB (almost a gigabyte).  I created a "staging" table to load the file into and set things up so that it checks every row as "unique" and checks every row for datatype with a Clustered PK.  I also created a BCP format file to direct the shredding so we don't need to fart around with the double quotes once the data is loaded into the 2 column staging table.  It will also tolerate up to 10 data faults by default. We can make it do more if you need to.

    The import to the staging table with all that data checking takes just a little over 34 seconds on my laptop.

    Is something like that worth your time to continue developing or not?  I have no idea how long it would take PowerShell to do the same thing.  I also have no clue how long it would take SSIS (although, done correctly, it can operate at roughly the same speed for the import part because it'll also use the "bulk loader").

    Lemme know and I'll hammer out the rest... it's not difficult; I just don't want to waste any time if the performance of my test isn't good enough.


    --Jeff Moden



  • That sounds like something to keep developing... sounds like a good solution.

    Many Thanks!!!

  • Ok... We need some details...

    1. Since the file names do NOT have any temporal indications in the names, what do you want done with a file after it's loaded?  For example, move it someplace else so the machine and line can dump another one to where we're reading from?
    2. How many rows are you expecting in the files?  I need this info for testing.
    3. All good rows from a file can be loaded even though there are bad rows.

      1. How many bad rows do you want to allow before the whole jam is rolled back?  And, yes, the answer can be to not ever do a rollback and to always bring in the good rows regardless of the number of bad rows.
      2. What do you want to do with the bad rows?  For example, simply ignore them or sequester them for possible correction or ???

    4. You posted your table columns as a list.  Please post the actual CREATE TABLE statement for the table being sure to include all constraints and indexes and any other columns that may be present even if you didn't list them before.
    5. What version (2016, 2017, 2019 or ???) and Edition (Standard or Enterprise or ???) are you working with for this?
    6. Are there any restrictions on what you can and cannot use in T-SQL?  For example, some DBAs do not allow functions of any kind.  Some do not allow the use of any undocumented functionality.  Some allow only ANSI compatible code (which kills just about everything).  Some will not allow the creation of Temp Tables or even Table Variables.   Do you have any such restrictions that you have to follow?
    7. If we do this all in T-SQL and it runs fast enough, will you actually use it?  I hate "firedrills". 😉

    --Jeff Moden



    1. Move successful file loads to a completed dir.
    2. 40-50 thousand records per file.
    3. Bad rows maybe logged to an exception table with the file name that they came from.

    4.
    CREATE TABLE [dbo].[LineSpeed](
         [Date_Recorded] [datetime] NULL
        ,[Speed] [int] NULL
        ,[Oven_Line] [nvarchar](35) NULL
    ) ON [PRIMARY]
    GO

    5. SQL2012 STD.

    6. No.

    7. Sure will use it.

    Thanks.

  • I was thinking about the table and I would like to make separate columns for Oven and Line...

  • Bruin wrote:

    I was thinking on the table and I would like to make separate columns for Oven and Line..

    Funny thing there... I figured that early on (part of "column normalization") and was already set up to do just exactly that and make that same recommendation to you.  Thanks for the feedback.  I'll try to finish it tonight.

    --Jeff Moden



  • Hang in there.  I made some good progress (fully testing along the way) but I had another late work night and it's time for bed (01:49 AM).  I'll work on this more tomorrow night.  It'll be worth the wait.

    --Jeff Moden



  • Still on the radar?


    Thanks.

  • Did you have any free time to complete it?


    Thanks,

  • Sorry. I've not had much time to work on this. Working on it now. Having some fairly good success. I'm also making sure that it's fully documented so that you can figure out what it does.

    --Jeff Moden



  • Ok.  Mostly done.  We still need to add some "pretty" error handling and the file move-after-import thing, but this is working quite nicely and it tolerates errors along with a little self-reporting on the error(s).

    First, you need a BCP format file.  Save it where SQL Server can get at it.  I normally recommend saving it where you keep the files it represents that you're going to import.  This isn't strictly needed but it makes things nasty fast with BULK INSERT.  Here's the contents of the file...

    9.0
    3
    1       SQLCHAR       0       1       ""          0     LeadDQuote     ""
    2       SQLCHAR       0       100     "\",\""     1     stamp_time     ""
    3       SQLCHAR       0       100     "\"\r\n"    2     wirespeed      ""

    Whatever you name it and wherever you save it, you need to change the related string in the code below.  It should be obvious but here's the line that you need to change...

            ,@FileSourceBCPFMT  VARCHAR(500) = 'D:\Temp\OvenLine_Staging\OvenLine.BCPFMT'

    You also need to make a change in the code to point to the full path (which may be a UNC but, either way, SQL Server needs READ/WRITE privs to the path).  Here's the line in the code you need to change...

    DECLARE  @FileSourceDirCmd  VARCHAR(500) = 'DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b'

    Without much effort, we could parameterize those two things as inputs to a stored procedure if you want to convert the code below to a stored procedure.
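    To make that concrete, here's a hypothetical shape for such a stored procedure.  The procedure name is made up, and the body would be the script below with those two DECLAREs turned into parameters...

    ```sql
    CREATE PROCEDURE dbo.ImportOvenLineFiles --Hypothetical name
             @FileSourceDirCmd  VARCHAR(500) --The DIR command string
            ,@FileSourceBCPFMT  VARCHAR(500) --Full path to the BCP format file
    AS
    BEGIN
        SET NOCOUNT ON;
        --The rest of the script below goes here, minus the DECLAREs
        --for these two values, which now arrive as parameters.
    END
    ;
    ```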

    This is the "LineSpeed" table I created as the ultimate target table.  The code imports to a staging table first, just to be on the safe side and to allow for any additional checks you might want to make on the imported data before inserting it into the final table.

    --   DROP TABLE IF EXISTS dbo.LineSpeed;
    CREATE TABLE dbo.LineSpeed
    (
    Date_Recorded DATETIME NOT NULL
    ,Speed INT NOT NULL
    ,Oven NVARCHAR(35) NOT NULL
    ,Line INT NOT NULL
    ,CONSTRAINT PK_LineSpeed
    PRIMARY KEY CLUSTERED (Date_Recorded,Oven,Line)
    --WITH (FILLFACTOR = 71) --Optional FillFactor to prevent page splits as the table gets bigger.
    )
    ;

    Speaking of "the code", here it is below.  You'll need sysadmin privs to run it because of the use of xp_CmdShell.  If you decide to use this, I can show you how to set up the stored procedure so that folks with lesser privs can execute the code without needing such high-deity privs and without actually giving them any privs to use xp_CmdShell directly.  It's a very safe method provided that we make any parameterization of paths "injection proof", which is also not difficult to do.  Note that if you absolutely cannot use xp_CmdShell because of folks that don't actually know how to use it safely, we can do some workarounds.  We just won't be able to do file moves (when it's time) through the stored procedure.

    And yeah, it does actually work.  I'll show you the run results right after this...

    --=====================================================================================================================
    -- Presets
    --=====================================================================================================================
    --===== Environment
    SET NOCOUNT ON
    ;
    --===== Configuration.
    -- Set the string constants to the correct paths, which may be UNCs.
    -- Note that these could be reworked a tiny bit to become parameters for a stored procedure.
    DECLARE @FileSourceDirCmd VARCHAR(500) = 'DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b'
    ,@FileSourceBCPFMT VARCHAR(500) = 'D:\Temp\OvenLine_Staging\OvenLine.BCPFMT'
    ;
    --===== Constants
    DECLARE @DBar CHAR(119) = REPLICATE('=',119)
    ,@DisplaySQL TINYINT = 0 --0 means don't display, anything else means display.
    ,@DLine CHAR(119) = REPLICATE('-',119)
    ;
    --===== Variables
    DECLARE
    @Dummy INT
    ,@AdvOptStatus TINYINT
    ,@BulkCmd VARCHAR(1000)
    ,@CmdShellStatus TINYINT
    ,@Duration CHAR(12)
    ,@FileCount INT
    ,@FileCounter INT = 1
    ,@FullFileName VARCHAR(500)
    ,@FullFilePath VARCHAR(500)
    ,@Line INT
    ,@Oven VARCHAR(20)
    ,@RowCount INT
    ,@StartDT CHAR(24) = CONVERT(CHAR(24),GETDATE(),113)
    ;
    --=====================================================================================================================
    -- Mark the run start
    --=====================================================================================================================
    RAISERROR('%s',0,0,@DBar);
    RAISERROR('Run Start Time = %s',0,0,@StartDT);
    RAISERROR('@FileSourceDirCmd = %s',0,0,@FileSourceDirCmd);
    RAISERROR('@FileSourceBCPFMT = %s',0,0,@FileSourceBCPFMT);
    RAISERROR('%s',0,0,@DBar);

    --=====================================================================================================================
    -- Temp Tables
    --=====================================================================================================================
    --===== Drop any Temp Tables that exist to make reruns in SSMS easier.
    IF OBJECT_ID('tempdb..#FilesToLoad') IS NOT NULL DROP TABLE #FilesToLoad;
    IF OBJECT_ID('tempdb..#Staging' ) IS NOT NULL DROP TABLE #Staging;

    --===== Create any Temp Tables that need to be created first.
    -- This table is used to control the imports
    CREATE TABLE #FilesToLoad
    (
    FileNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
    ,FullFilePath VARCHAR(500)
    ,FullFileName AS SUBSTRING(FullFilePath
    ,LEN(FullFilePath)-CHARINDEX('\',REVERSE(FullFilePath))+2
    ,500) PERSISTED
    )
    ;
    -- This table is what we initially load a file into for a bit of "self error checking".
    -- If there's a datatype violation when we insert, it'll cause an error without us having to check manually.
    -- It's very fast and it also makes sure that we don't have any duplicate per-second time stamps.
    CREATE TABLE #Staging
    (
    stamp_time DATETIME2(0) NOT NULL PRIMARY KEY CLUSTERED
    ,wirespeed SMALLINT NOT NULL
    )
    ;
    --=====================================================================================================================
    -- Setup advanced options and xp_cmdshell
    --=====================================================================================================================
    --===== Remember the current status of 'Show Advanced Options'
    SELECT @AdvOptStatus = CONVERT(INT,ISNULL(value,value_in_use))
    FROM sys.configurations
    WHERE name = 'show advanced options'
    ;
    --===== Remember the current status of xp_CmdShell
    SELECT @CmdShellStatus = CONVERT(INT,ISNULL(value,value_in_use))
    FROM sys.configurations
    WHERE name = 'xp_cmdshell'
    ;
    --===== If xp_CmdShell is disabled, temporarily enable it.
    IF @CmdShellStatus = 0
    BEGIN
    EXEC sp_configure 'show advanced options',1; RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell' ,1; RECONFIGURE;
    END
    ;
    --===== Print a display separator
    RAISERROR('%s',0,0,@DLine);
    --=====================================================================================================================
    -- Get the filepath information we need (ie. Load the file names we want to import)
    --=====================================================================================================================
    --===== Get the full path information for the files we want to load.
    INSERT INTO #FilesToLoad
    (FullFilePath)
    EXEC xp_CmdShell @FileSourceDirCmd
    ;
    --===== Remember how many files we need to import
    SELECT @FileCount = MAX(FileNum)
    FROM #FilesToLoad
    WHERE FullFilePath > '' --Eliminates nulls at the end and any blanks (don't expect any blanks, though).
    ;
    --===== Display the file information that we'll be loading from
    RAISERROR('Loading the following files (see the grid for the list)...',0,0);
    SELECT *
    FROM #FilesToLoad
    WHERE FileNum <= @FileCount
    ;
    --===== Print a display separator
    RAISERROR('%s',0,0,@DLine)
    ;
    --=====================================================================================================================
    -- Load each file that appears in the #FilesToLoad table
    --=====================================================================================================================
    WHILE @FileCounter <= @FileCount
    BEGIN
    --===== Start the timer
    SELECT @StartDT = CONVERT(CHAR(24),GETDATE(),113) --dd mon yyyy hh:mi:ss:mmm (24h)
    ;
    --===== Print the start date time for this file load.
    RAISERROR('StartDT: %s',0,0,@StartDT)
    ;
    --====== Make sure the staging table is empty.
    TRUNCATE TABLE #Staging
    ;
    --===== Create the BULK INSERT command for the current file.
    -- Note that this is NOT indented just to make the output to the screen pretty.
    -- Also note that an unexpected end-of-file error may still cause an abort.
    -- We need to work on that but other errors provide an alert and the file still loads
    -- except for the bad line and, possibly, the line after the bad line.
    SELECT @FullFilePath = FullFilePath
    ,@FullFileName = FullFileName
    ,@BulkCmd = REPLACE(REPLACE(REPLACE(REPLACE('
    BULK INSERT #Staging
    FROM <<qFullFilePath>>
    WITH (
    CODEPAGE = "RAW"
    ,FIRSTROW = 2
    ,FORMATFILE = <<q@FileSourceBCPFMT>>
    ,MAXERRORS = 2000000000 --Basically, keep loading no matter how many errors we get.
    ,TABLOCK
    );'
    -- Other end of REPLACES
    ,'<<FullFilePath>>' ,FullFilePath)
    ,'<<qFullFilePath>>' ,QUOTENAME(FullFilePath,''''))
    ,'<<q@FileSourceBCPFMT>>',QUOTENAME(@FileSourceBCPFMT,''''))
    ,'"' ,'''') --Replaces double quotes with single quotes.
    FROM #FilesToLoad
    WHERE FileNum = @FileCounter
    ;
    --===== Do the import (load) to the staging table.
    -- Note that any errors in this area are displayed but they don't stop the import of the file.
    RAISERROR('Working on file %u of %u files: %s...',0,0,@FileCounter,@FileCount,@FullFilePath) WITH NOWAIT;
    IF @DisplaySQL <> 0 RAISERROR('%s',0,0,@BulkCmd) WITH NOWAIT;
    EXEC(@BulkCmd)
    ;
    --===== Display how many rows we loaded into the staging table.
    -- Don't put anything between this SELECT and the EXEC(@BulkCmd) above.
    SELECT @RowCount = @@ROWCOUNT;
    RAISERROR('%u Rows loaded from %s into #Staging.',0,0,@RowCount,@FullFilePath)
    ;
    --===== Get the Oven and Line info from the file name.
    SELECT @Oven = PARSENAME(ol.OvenLine,3)
    ,@Line = PARSENAME(ol.OvenLine,2)
    FROM (SELECT REPLACE(@FullFileName,'-Line','.'))ol(OvenLine)
    ;
    --===== Copy the cleaned staging table data into the final LineSpeed table
    -- including the Oven and Line notation.
    INSERT INTO dbo.LineSpeed WITH(TABLOCK) --Tablock is to take advantage of any "Minimal Logging" available
    (Date_Recorded,Speed,Oven,Line)
    SELECT Date_Recorded = stamp_time
    ,Speed = wirespeed
    ,Oven = @Oven
    ,Line = @Line
    FROM #Staging
    ORDER BY Date_Recorded, Oven, Line -- To take advantage of any "Minimal Logging" Available
    OPTION (RECOMPILE,MAXDOP 1) -- To take advantage of any "Minimal Logging" Available
    ;
    --===== Display how many rows we loaded from the staging table to the Linespeed table.
    -- Don't put anything between this SELECT and the INSERT/SELECT above.
    -- We'll also display the duration in hh:mi:ss:mmm (24h).
    SELECT @RowCount = @@ROWCOUNT
    ,@Duration = CONVERT(CHAR(12),GETDATE()-@StartDT,114) --hh:mi:ss:mmm (24h)
    ;
    RAISERROR('%u Rows loaded from #Staging into dbo.LineFeed.',0,0,@RowCount);
    RAISERROR('Duration: %s',0,0,@Duration);
    RAISERROR('%s',0,0,@DLine)
    ;
    --===== Bump the loop counter
    SELECT @FileCounter += 1
    ;
    END
    ;
    --=====================================================================================================================
    -- Change the advanced options and xp_CmdShell back to their original state.
    --=====================================================================================================================
    --===== Set both options back to their original condition just to keep everyone happy.
    EXEC sp_configure 'xp_cmdshell' ,@CmdShellStatus; RECONFIGURE;
    EXEC sp_configure 'show advanced options',@AdvOptStatus ; RECONFIGURE;

    --===== Display the "Run Complete" message
    RAISERROR('%s',0,0,@DBAR);
    RAISERROR('RUN COMPLETE.',0,0);
    GO

    And here are the run results.  I made a much larger first file just to demonstrate how fast all this is (more than 31 million rows womb-to-tomb in just over a minute).  The second file has two intentional errors in it just to show what happens and that the load continues despite the errors.  The third file is a normal good file.

    Both the second and third file each have 50,000 rows and each loads in less than 200 milliseconds womb-to-tomb unless there's an error in the file.  Errors will, of course, take a little longer but it's still nasty fast.  You should have no problems even if you have hundreds of such files unless you have hundreds of errors in the files.

    =======================================================================================================================
    Run Start Time = 14 Jun 2021 02:00:55:277
    @FileSourceDirCmd = DIR "D:\Temp\OvenLine_Staging\????-line*.csv" /s /a-d /b
    @FileSourceBCPFMT = D:\Temp\OvenLine_Staging\OvenLine.BCPFMT
    =======================================================================================================================
    -----------------------------------------------------------------------------------------------------------------------
    Loading the following files (see the grid for the list)...
    -----------------------------------------------------------------------------------------------------------------------
    StartDT: 14 Jun 2021 02:00:55:317
    Working on file 1 of 3 files: D:\Temp\OvenLine_Staging\f300-line1.csv...
    31622400 Rows loaded from D:\Temp\OvenLine_Staging\f300-line1.csv into #Staging.
    31622400 Rows loaded from #Staging into dbo.LineFeed.
    Duration: 00:01:07:787
    -----------------------------------------------------------------------------------------------------------------------
    StartDT: 14 Jun 2021 02:02:03:103
    Working on file 2 of 3 files: D:\Temp\OvenLine_Staging\f300-line2.csv...
    Msg 4864, Level 16, State 1, Line 16
    Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 15, column 2 (stamp_time).
    Msg 4864, Level 16, State 1, Line 16
    Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 25, column 2 (stamp_time).
    49997 Rows loaded from D:\Temp\OvenLine_Staging\f300-line2.csv into #Staging.
    49997 Rows loaded from #Staging into dbo.LineFeed.
    Duration: 00:00:00:597
    -----------------------------------------------------------------------------------------------------------------------
    StartDT: 14 Jun 2021 02:02:03:700
    Working on file 3 of 3 files: D:\Temp\OvenLine_Staging\f300-line3.csv...
    50000 Rows loaded from D:\Temp\OvenLine_Staging\f300-line3.csv into #Staging.
    50000 Rows loaded from #Staging into dbo.LineFeed.
    Duration: 00:00:00:190
    -----------------------------------------------------------------------------------------------------------------------
    Configuration option 'xp_cmdshell' changed from 1 to 1. Run the RECONFIGURE statement to install.
    Configuration option 'show advanced options' changed from 1 to 1. Run the RECONFIGURE statement to install.
    =======================================================================================================================
    RUN COMPLETE.

    Notice that the code remembers whether advanced options and xp_CmdShell were enabled, then enables them, uses them, and changes them back to their original condition at the end of the run.

    If you have any questions, please read the comments in the code first.  If you can't science out an answer for your question after that, then by all means, fire a flare on this thread and we'll see what we can do.

    And, like I said, we still have to do the file moves but I figured this would be a really good start and you can let me know if you want to continue in this direction.  Including creating the test data and writing all of the code and testing the bejeezus out of it, and writing it all up for good understanding, this only took about 5 hours to do.  Changes will be fairly simple now that the code has a keel.
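    Since the file moves are still to come, here's a rough sketch of one likely approach, assuming xp_CmdShell stays available.  The folder and file names are placeholders...

    ```sql
    --===== Hypothetical sketch: after a successful load, move the file to a
         -- "Completed" folder using xp_CmdShell and the Windows MOVE command.
    DECLARE  @FullFilePath VARCHAR(500)  = 'D:\Temp\OvenLine_Staging\f300-line1.csv' --Placeholder
            ,@CompletedDir VARCHAR(500)  = 'D:\Temp\OvenLine_Completed\'             --Placeholder
            ,@MoveCmd      VARCHAR(1100)
    ;
     SELECT @MoveCmd = 'MOVE /Y "' + @FullFilePath + '" "' + @CompletedDir + '"';
       EXEC xp_CmdShell @MoveCmd, NO_OUTPUT;
    ```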

    --Jeff Moden



  • Awesome!!!!...

    I'll get some time today to run it through its paces...

    Many Thanks

  • That was a super fast load process... Just wanted to say thanks again... very impressive and well documented SP.

    Just a side question: what process do you choose to load xls files to SQL?

    Thanks

  • Bruin wrote:

    That was a super fast load process... Just wanted to say thanks again... very impressive and well documented SP.

    Just a side question: what process do you choose to load xls files to SQL? Thanks

    It would be quite similar except I use the ACE drivers with OPENROWSET.  Some folks use a linked server for each file but I've found that to be a fair bit slower.

    There's also the subject of what form the source spreadsheet is in.  I have some code that auto-magically "figures that out" and unpivots the typical "horizontal spreadsheet" into EAV tables without having to know the names of the columns, etc.  I wouldn't post anything on that, though.  Instead, I'd likely give you the PowerPoint presentation I made on the subject of "Excel Hell" and the code and say "have at it".
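    For the curious, the OPENROWSET/ACE approach looks roughly like this.  The file path and sheet name are placeholders, 'Ad Hoc Distributed Queries' must be enabled, and the ACE OLE DB provider must be installed on the server...

    ```sql
    --===== Hypothetical sketch of reading a spreadsheet with the ACE OLE DB provider.
     SELECT *
       FROM OPENROWSET(
             'Microsoft.ACE.OLEDB.12.0'
            ,'Excel 12.0;Database=D:\Temp\SomeWorkbook.xlsx;HDR=YES' --Placeholder path
            ,'SELECT * FROM [Sheet1$]'                               --Placeholder sheet name
            );
    ```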


    --Jeff Moden


