Importing Tricky CSV Files with PowerShell

  • Thanks.

  • Stewart "Arturius" Campbell - Tuesday, July 3, 2018 6:24 AM

    ssc 21649 - Tuesday, July 3, 2018 5:28 AM

    This is cool!

    If a quote-delimited string looks like this:

    "1313 Mockingbird Lane, Detroit, MI 48690"


    (two commas), does this logic still work?
    Same question for quote-delimited strings like this:

    "This is a ""test"" of parsing"

    It does not matter how many commas are encountered within the quotes; the logic will still handle them correctly.
    Even double-delimiting (embedded, doubled quotes) will work correctly.

    Well, until someone decides to put "," inside one of the fields. It's helpful, but ultimately, if you have to do this kind of preprocessing, chances are you've already lost.

  • Very nice and useful script. A couple of questions:
    1) How could it show the lines imported and the records loaded to the table, for verification?
    2) How about records that fail - can it detect the line number in the data file and the reason for the failure?

     Thanks again

  • Jerry Kelk - Tuesday, July 3, 2018 2:43 AM

    This has been a personal annoyance for years.
    Surely, the import engine should properly respect the double quotes (or whatever specified character) as string delimiters, if so directed.  We would not then have to hack at the user's text by removing commas or any such workarounds.

    It would definitely be nice if MS would recognize the importance of CSV with respect to BCP and Bulk Insert.  Instead, MS publishes that CSV isn't actually supported by either.  I believe there's a "connect" item (I can't remember the touchy-feely name they now call it) on the subject so we might actually see such support sometime around 2035. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Stewart "Arturius" Campbell - Monday, July 2, 2018 11:04 PM

    Comments posted to this topic are about the item Importing Tricky CSV Files with PowerShell

    Thanks for the code and for taking the time to share it in an article, Stewart.

    One of the biggest problems I've seen occurs when people try to import CSVs that were created by spreadsheets.  As you already know, BCP and BULK INSERT are NOT sensitive to "lines" of data in the file.  They're sensitive to the type and number of delimiters.  The delimiters MUST be consistent throughout the file, including the "column headings" in the file.  This kills the ability to easily do an import using BCP or BULK INSERT, because Excel (and many other products) will "text qualify" (with double quotes) only those elements that contain delimiters (such as commas) as part of the intended text in the element.

    There is a tool in PowerShell that can easily handle this and, under the covers, I'm sure it operates pretty much as your good code does, but at machine-language speeds instead of the interpretive speed of a loop in PowerShell.  I don't have quantitative performance measurements on it (though I will make the effort to provide them in the near future), but my anecdotal, qualitative observation is that it's quite fast.

    The method involves Import-Csv followed by an immediate Export-Csv.  What it does is text qualify (with double quotes) all elements based on the desired delimiter.  It does not add extra qualifiers to any element (or column of elements) that has already been text qualified with quotes.  The other cool part is that it also text qualifies a delimited header row if one exists.
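
    A minimal sketch of that round trip (the file names here are just placeholders): Export-Csv in Windows PowerShell text qualifies every field by default (later, cross-platform versions add a -UseQuotes parameter to control this), so the rewritten file comes out fully and consistently quoted.

    # Sketch only: round-trip the file so every element, including the header
    # row, ends up text qualified with double quotes.
    Import-Csv -Path 'C:\Temp\Original.csv' |
        Export-Csv -Path 'C:\Temp\Consistent.csv' -NoTypeInformation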

    After doing the Import-Csv/Export-Csv thing, the file is 100% consistent for delimiters (",").  The only "rub" is that the first double quote on each line needs to be ignored, and the last delimiter (what people refer to as the "row terminator") must include the trailing delimiter (the double quote, in this case) for the last element on the line.  That means that you must use a BCP format file, which isn't so bad because that also helps performance a bit.

    The put-off for some is that Microsoft does not actually support CSV imports or exports using BCP or BULK INSERT.  I suspect that may be a part of the reason why they don't document how to use double quotes as a part of a delimiter, even though it has worked in all known versions of both BCP and BULK INSERT (use \" for those that want to know, especially in a BCP format file).  Because it's not supported, people fear that MS may someday change it.  While that is a possibility, I just can't see MS doing that, because I'm not sure that even the current generation of people who support the code are aware that the capability exists, and removing it would actually require a code change.
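
    For anyone who hasn't built one of these, here is a rough, untested sketch of what such a non-XML format file might look like for a three-column file produced by the round trip above (the version number, lengths, and column names are all placeholders).  The idea is that a dummy first field consumes the leading double quote on each line (server column 0 means "skip"), the middle fields use the quote-comma-quote string as their terminator, and the last field's terminator includes the closing quote plus the CRLF:

    14.0
    4
    1   SQLCHAR   0   0     "\""        0   LeadingQuote   ""
    2   SQLCHAR   0   200   "\",\""     1   SomeColumn1    SQL_Latin1_General_CP1_CI_AS
    3   SQLCHAR   0   200   "\",\""     2   SomeColumn2    SQL_Latin1_General_CP1_CI_AS
    4   SQLCHAR   0   200   "\"\r\n"    3   SomeColumn3    SQL_Latin1_General_CP1_CI_AS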

    If they ever do remove the capability, there will be a very angry old man in full postal garb making an in person trip to Redmond with a serious 4 banded pork chop launcher and a freezer full of frozen "bone in" pork chops.  😉

    Thank you again for this article.  I wish more people would publish about how to handle CSVs, TSVs, and spreadsheets.

    I also agree with ZZMartin about having to do this type of thing to begin with.  Data is important and, if at all possible, the receivers of the data should make it painfully clear to the providers that the data is important and must be both easy to use and consistent in form at all times.  The providers really shouldn't have to be told this, either.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I have been doing something similar for the last few months.  Thank you for a clear and easily understood script.
    John Kingery

  • Jeff Moden - Tuesday, July 3, 2018 8:25 AM

    Jerry Kelk - Tuesday, July 3, 2018 2:43 AM

    This has been a personal annoyance for years.
    Surely, the import engine should properly respect the double quotes (or whatever specified character) as string delimiters, if so directed.  We would not then have to hack at the user's text by removing commas or any such workarounds.

    It would definitely be nice if MS would recognize the importance of CSV with respect to BCP and Bulk Insert.  Instead, MS publishes that CSV isn't actually supported by either.  I believe there's a "connect" item (I can't remember the touchy-feely name they now call it) on the subject so we might actually see such support sometime around 2035. 😉

    I shall be long retired and nearing a wooden box by then. Enjoy.

  • Nice piece Stewart, thanks for this!
    😎

    I'll compare this with the methods I'm using and ping back with my findings.

  • A non-visual terminator could be a good solution:  \0 (non-visible null aka CHAR(0) ) could work... but certain systems may treat the null as the end of a string or file, which could be awkward (if you need to work with the CSV file in other systems).

    \t (tab) might work if you don't have embedded tabs.

    But it looks like our choice of terminators is limited:  https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-2017

    We cannot use arbitrary non-printable characters such as CHAR(28)  which is traditionally FS or File separator ...which is too bad.

    But we can use a "string of up to 10 printable characters, including some or all of the terminators listed earlier".

    I haven't used it, but something like this .NET package could also be helpful for validating and pre-processing the CSV:  https://joshclose.github.io/CsvHelper
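
    If you'd rather avoid an external package, a similar validation pass is possible with the .NET Framework's built-in Microsoft.VisualBasic.FileIO.TextFieldParser (not CsvHelper; the path and delimiter below are just assumptions).  It understands quoted fields and reports the line number of anything malformed - a rough sketch:

    # Sketch only: pre-validate a quoted CSV and report malformed lines.
    Add-Type -AssemblyName Microsoft.VisualBasic

    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser('C:\Temp\Original.csv')  # placeholder path
    $parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
    $parser.SetDelimiters(',')
    $parser.HasFieldsEnclosedInQuotes = $true

    while (-not $parser.EndOfData)
    {
        try
        {
            $fields = $parser.ReadFields()  # honours embedded commas and doubled quotes
            # field-count or length checks could go here
        }
        catch [Microsoft.VisualBasic.FileIO.MalformedLineException]
        {
            Write-Warning "Bad line $($parser.ErrorLineNumber): $($parser.ErrorLine)"
        }
    }
    $parser.Close()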

  • David Rueter - Tuesday, July 3, 2018 1:06 PM

    A non-visual terminator could be a good solution:  \0 (non-visible null aka CHAR(0) ) could work... but certain systems may treat the null as the end of a string or file, which could be awkward (if you need to work with the CSV file in other systems).

    \t (tab) might work if you don't have embedded tabs.

    But it looks like our choice of terminators is limited:  https://docs.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server?view=sql-server-2017

    We cannot use arbitrary non-printable characters such as CHAR(28)  which is traditionally FS or File separator ...which is too bad.

    But we can use a "string of up to 10 printable characters, including some or all of the terminators listed earlier".

    I haven't used it, but something like this .NET package could also be helpful for validating and pre-processing the CSV:  https://joshclose.github.io/CsvHelper

    Hmm... I've used some pretty odd terminators in BCP; right now we have a job using 0xDCBF as the column separator.

  • Bruin - Tuesday, July 3, 2018 8:22 AM

    Very nice and useful script. A couple of questions:
    1) How could it show the lines imported and the records loaded to the table, for verification?
    2) How about records that fail - can it detect the line number in the data file and the reason for the failure?

     Thanks again

    Any thoughts on these issues when loading CSV files?

     Thanks

  • The PowerShell built-in cmdlet Import-Csv handles all the nonsense of commas within the quoted strings very nicely for us.
    Try this for size.

    • Import-Csv returns PowerShell objects, one for each row.
    • The properties of these objects are taken from the header row of the CSV - this allows us to set up the columns nicely when processing the first input row. object.PSObject.Properties.Name for any PowerShell object returns an array of the property names of that object, which in this case we iterate to set up the datatable's columns.
    • For the subsequent rows, we get the data with object.PSObject.Properties.Value, returning an array of the values which we add as a datatable row. 

    It should also be noted that, in the script's current form, when the columns are added to the datatable the default data type of every column is .NET String (i.e. nvarchar), so the table being inserted into must have column types that will CAST implicitly from nvarchar or you'll get errors. Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table.
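
    As a rough, untested illustration of one way to do that (separate from the script below; it assumes $ConnectionString and $Table hold the same values the script already receives, and that the CSV's columns are in the same order as the table's), the datatable's columns and types could be pre-created straight from the destination table's schema:

    # Sketch only: build the datatable's columns (with proper .NET types) from
    # the destination table instead of defaulting every column to String.
    $connection = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
    $command    = $connection.CreateCommand()
    $command.CommandText = "SELECT TOP (0) * FROM $Table"   # assumes $Table is a trusted object name
    $adapter    = New-Object System.Data.SqlClient.SqlDataAdapter($command)
    $datatable  = New-Object System.Data.DataTable
    [void]$adapter.FillSchema($datatable, [System.Data.SchemaType]::Source)  # opens/closes the connection itself
    $adapter.Dispose()
    $connection.Dispose()
    # Rows.Add() will then attempt to convert the incoming string values to each
    # column's type, and throw a useful error when a value won't convert.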

    Would be interested to see how this performs using the author's original input.


    [void][Reflection.Assembly]::LoadWithPartialName("System.Data")
    [void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
    $Date = $args[0]
    $ConnectionString = $args[1]
    $Directory = $args[2]
    $Table = $args[3]

    $file = "$Directory\$Table.csv" #This could be anything, the file just happens to be the same as the table name in this instance

    $batchsize = 75000 #This could, theoretically, be any number - I found that this value has the least drain on CPU & resources
    $i = 0

    try
    {
      $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($Connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
      $bulkcopy.DestinationTableName = $Table
      $bulkcopy.bulkcopyTimeout = 0
      $bulkcopy.batchsize = $batchsize
      $datatable = New-Object System.Data.DataTable

      # Let Import-Csv deal with delimiter nonsense!
      Import-Csv -Path $file |
       ForEach-Object {

       if ($i -eq 0)
       {
            # Add columns
            $_.PSObject.Properties.Name |
            ForEach-Object {

                 [void]$datatable.Columns.Add($_)
            }
       }

       [void]$datatable.Rows.Add($_.PSObject.Properties.Value)

       $i++

       if (($i % $batchsize) -eq 0)
       {
             # Write batch
             $bulkcopy.WriteToServer($datatable)
             $datatable.Clear()
       }
      }

      if ($datatable.Rows.Count -gt 0)
      {
           # Write remaining rows
           $bulkcopy.WriteToServer($datatable)
           $datatable.Clear()
      }
    }
    catch
    {
      # A good script will handle exceptions here,
      # e.g. failure to connect to SQL server, incorrect/un-castable datatype for column etc.
      # for now, just re-throw
      throw
    }
    finally
    {
      # Clean up in finally block - ensures resources are released even in event of errors.
      ($bulkcopy, $datatable) |
      Where-Object { $_ -ne $null  } |
      ForEach-Object {
      
       $_.Dispose()
      }
    }

    Update - Only just noticed that this is a variation on an earlier post by layton 35542; however, this version does not pre-load the entire CSV file, which would be problematic for HUGE imports.

  • "Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table."

    Can you drop an example of how this is done?

    Another issue I always run into is field lengths in the CSV versus the SQL definition. How can you
    1) detect the mismatch and write it to an error log, or
    2) truncate on insert and log the offending data row?

    I was testing using your example and ran into this condition:

    Exception calling "WriteToServer" with "1" argument(s): "Received an invalid column length from the bcp client

    Thanks - very interested in how people handle these conditions without affecting the performance of a large import.

  • Hi Bruin
    If you and others are interested, I'll develop a full-fat version of this and post it later.

  • Sounds good !!!!

    Thanks.

