Importing Tricky CSV Files with PowerShell

  • Bruin - Thursday, July 5, 2018 4:05 AM

    "Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table."

    Can you drop an example of how this is done?

    Another issue I always run into is field lengths in the CSV versus the SQL definition. How can you
    1) detect a mismatch and write it to an error log, or
    2) truncate on insert and log the offending data row?

    I was testing using your example and ran into this condition

    Exception calling "WriteToServer" with "1" argument(s): "Received an invalid column length from the bcp client

    Thanks. Very interested in how people handle these conditions without affecting the performance of a large import.

    I tell BCP or BULK INSERT to use the built-in features to sequester bad rows in a separate file.  The bad part is that the built-in error indicators that get stored are rather cryptic and, even if you understand them, you come to find out that the stored error messages are rather useless for troubleshooting.  They are nowhere near as helpful as when you run things manually, which tells you the row, the column number, the column name, and what the error is.

    My workaround is to have the machine "pretend" that a human is running it.  I use xp_CmdShell to call SQLCMD and capture the output from the xp_CmdShell call.  It not only identifies the errors in a human-readable form that's easy to find and "parse", but it also provides confirmation of the row counts, number of errors, and duration.
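    A rough sketch of that capture pattern, purely for illustration - this is not Jeff's actual code, and the server, database, and script path are made up. Jeff drives this from T-SQL directly; it is wrapped in PowerShell here only to keep the thread's examples in one language. The useful part is that xp_CmdShell returns SQLCMD's console output as a single-column result set, so the human-readable error text can be read line by line:

    $conn = New-Object System.Data.SqlClient.SqlConnection('Server=.;Database=MyDb;Integrated Security=SSPI')
    $conn.Open()
    try
    {
      $cmd = $conn.CreateCommand()
      # Hypothetical SQLCMD call; -b makes SQLCMD stop the batch on error so failures are obvious
      $cmd.CommandText = "EXEC xp_cmdshell 'sqlcmd -S . -d MyDb -i C:\Import\LoadStaging.sql -b'"
      $rdr = $cmd.ExecuteReader()
      while ($rdr.Read())
      {
        # Each row is one line of SQLCMD output - row counts, durations, and error text
        if (-not $rdr.IsDBNull(0)) { $rdr[0] }
      }
    }
    finally
    {
      ($rdr, $cmd, $conn) | Where-Object { $_ } | ForEach-Object { $_.Dispose() }
    }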

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, July 5, 2018 7:34 AM

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    I agree, if a single entity owns both the database and the server.  The risk comes into play with a scenario like:  you have multiple customers in separate databases on one server.  In that case you may want a customer to have control over their database, but not over the server (such that they could compromise security or stability of other databases).  In such a case, granting the customer xp_CmdShell does constitute a security risk.  Same thing is true for TRUSTWORTHY databases, "unsafe" CLR assemblies, etc.

    This would be a topic worthy of its own discussion / article / etc.

  • @bruin - here you go. It's somewhat longer now, but it does type checking and (optionally, via a command-line switch) char column width checking with truncation.

    param
    (
      [string]$Path,
      [string]$ConnectionString,
      [string]$Table,
      [int]$BatchSize = 75000,
      [switch]$StopOnError,
      [switch]$TruncateLongValues
    )

    $ErrorActionPreference = 'Stop'

    $columnLengthSql = @"
      select
      c.[max_length]
      from sys.columns c
      inner join sys.objects o
      on c.object_id = o.object_id and o.type = 'U'
      inner join sys.types t on c.user_type_id = t.user_type_id -- join on user_type_id to avoid duplicate rows for types sharing a system_type_id
      where o.Name = N'{0}'
      order by c.column_id

    "@
    try
    {
      # Connect to SQL Server
      $conn = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
      $conn.Open()

      # Read the column widths, which we will use to check string (char type) columns
      $columnWidths = Invoke-Command -NoNewScope -ScriptBlock {

       try
       {
        $cmd = $conn.CreateCommand()
        $cmd.CommandText = [string]::Format($columnLengthSql, $Table)
        $cmd.CommandType = 'Text'
        $rdr = $cmd.ExecuteReader()

        while ($rdr.Read())
        {
          # Emit width into array
          [int]$rdr[0]
        }
       }
       finally
       {
        ($rdr, $cmd) |
          ForEach-Object {
          $_.Dispose()
        }
       }
      }

      $bcpOption = ([System.Data.SqlClient.SqlBulkCopyOptions]::TableLock -bor [System.Data.SqlClient.SqlBulkCopyOptions]::UseInternalTransaction)
      $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($conn, $bcpOption, $null)
      $bulkcopy.DestinationTableName = $Table
      $bulkcopy.BulkCopyTimeout = 0
      $bulkcopy.BatchSize = $BatchSize
      $datatable = New-Object System.Data.DataTable

      # Get the column data for the given table
      # Sneakily selecting 1 row from the table puts the schema into the datatable
      try
      {
       $sql = 'select top 1 * from [' + $Table + ']'
       $dad = New-Object System.Data.SqlClient.SqlDataAdapter($sql, $conn)
       [void]$dad.Fill($datatable)  # discard the row count Fill returns so it doesn't pollute output
      }
      finally
      {
       $dad.Dispose() 
      }

      # If we read a row, clear it.
      $datatable.Clear()

      # Init row counter
      $i = 0
     
      # Let Import-Csv deal with delimiter nonsense!
      Import-Csv -Path $Path |
       ForEach-Object {

       if ($TruncateLongValues)
       {
        # Check column widths - this will slow things down somewhat
        for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
        {
          if ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and $_.PSObject.Properties.Value[$col].ToString().Length -gt $columnWidths[$col])
          {
           Write-Warning "Row $($i + 1), Col $($col + 1): Value truncated"
           $_."$($_.PSObject.Properties.Name[$col])" = $_.PSObject.Properties.Value[$col].ToString().Substring(0, $columnWidths[$col])
          }
        }
       }

       try
       {
        [void]$datatable.Rows.Add($_.PSObject.Properties.Value)
       }
       catch
       {
        # Column datatype mismatch
        if ($StopOnError)
        {
          # Stop immediately
          throw
        }

        # Warn the user a row didn't go in and continue
        Write-Warning "Row $($i + 1): $($_.Exception.Message)"
       }

       if ((++$i % $BatchSize) -eq 0)
       {
        # Write batch
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
       }
      }

      if ($datatable.Rows.Count -gt 0)
      {
       # Write remaining rows
       $bulkcopy.WriteToServer($datatable)
       $datatable.Clear()
      }
    }
    catch
    {
      # A good script will handle exceptions here,
      # e.g. failure to connect to SQL server, incorrect/un-castable datatype for column etc.
      # for now, just re-throw
      throw
    }
    finally
    {
      # Clean up in finally block - ensures resources are released even in event of errors.
      ($bulkcopy, $datatable, $conn) |
       Where-Object { $_ -ne $null } |
       ForEach-Object {
     
       # Cast to IDisposable: SqlBulkCopy implements Dispose() explicitly
       ([System.IDisposable]$_).Dispose()
      }
    }

  • Very cool.. I'll run some tests and report back results.

     Thanks again.

  • Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

  • David Rueter - Thursday, July 5, 2018 11:23 AM

    Jeff Moden - Thursday, July 5, 2018 7:34 AM

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    I agree, if a single entity owns both the database and the server.  The risk comes into play with a scenario like:  you have multiple customers in separate databases on one server.  In that case you may want a customer to have control over their database, but not over the server (such that they could compromise security or stability of other databases).  In such a case, granting the customer xp_CmdShell does constitute a security risk.  Same thing is true for TRUSTWORTHY databases, "unsafe" CLR assemblies, etc.

    This would be a topic worthy of its own discussion / article / etc.

    That's what I mean.  People don't know how to use xp_CmdShell correctly.  You NEVER grant a customer privs to execute xp_CmdShell directly nor provide them with the ability to create/modify a proc that does (and, yes... that does mean who or what you make as the owner of the database).  You only grant such privs to stored procedures that they can use but cannot change (which is sometimes a bit of a trick).  And, yes... I totally agree that the use of TRUSTWORTHY in such a case is a bad thing to do because it is a mostly hidden security risk that people frequently miss.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bruin - Thursday, July 5, 2018 7:46 PM

    Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

    Yes... stop trying to insert into the final table.  I ALWAYS use a staging table so I can pre-validate and delouse the data I've imported.  Then I'll move the validated data to the final table.  You would do such a thing just before you do the move.
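    A minimal sketch of that staging-to-final move, run from the same PowerShell script once the bulk copy has finished. Table and column names are hypothetical; the point is that the defaulting (N/A for an empty value) happens in SQL, after the import, rather than during the bulk copy:

    # Hypothetical: move validated rows from staging to final, defaulting empties to N/A
    $moveSql = "
      INSERT INTO dbo.FinalTable (CustomerName, Amount)
      SELECT
        ISNULL(NULLIF(LTRIM(RTRIM(CustomerName)), ''), N'N/A'),  -- empty CSV field becomes N/A
        Amount
      FROM dbo.StagingTable;
      TRUNCATE TABLE dbo.StagingTable;"

    $cmd = $conn.CreateCommand()   # $conn is the open SqlConnection from the import script
    $cmd.CommandText = $moveSql
    [void]$cmd.ExecuteNonQuery()
    $cmd.Dispose()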

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bruin - Thursday, July 5, 2018 7:46 PM

    Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

    If the DB column is NOT NULL and the CSV field is empty, then it probably should error. You'd need more logic (slowing down the import even more) to check for empty values and then supply a suitable default value for the column's data type, which would vary from case to case.
    I'd agree with Jeff's point above: bulk import into a staging table that has fewer constraints, then massage the data into the final table using SQL, i.e. a stored proc designed for the task.
    That way you're also not taking out a table lock on your production table.
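    If you would rather patch it inside the PowerShell loop instead, a hedged sketch (assuming 'N/A' is an acceptable default for every string column) is to add something like this just before the $datatable.Rows.Add call, alongside the existing truncation check:

    # Hypothetical addition inside the ForEach-Object block:
    # replace empty values with 'N/A' on string columns so NOT NULL inserts succeed
    for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
    {
      if ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and [string]::IsNullOrWhiteSpace($_.PSObject.Properties.Value[$col]))
      {
        $_."$($_.PSObject.Properties.Name[$col])" = 'N/A'
      }
    }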

  • Agreed.. a staging table is the preferred method before loading the live (Prod) table.

    What's the best solution for files with no headers?

    Thanks again for the script and comments.. just getting started with PS, and it's very cool what you created to load CSV to SQL.

  • We need to give Import-Csv the -Header argument with dummy column names. The actual column names in the datatable are determined from the table schema, so it doesn't matter what the headers are in the CSV.

    Because -Header should only be supplied for headerless files, we pass it to Import-Csv via argument splatting. When -Header is present, Import-Csv assumes all input rows are data.

    Add a new argument to the script to indicate you want to load a file without headers:

    [switch]$NoHeaders

    Since we know how many columns to expect because we've read the column widths, we can do this (after the bit that gets the column widths)

    $headerArgument = @{}
    if ($NoHeaders)
    {
        $headings = Invoke-Command -NoNewScope -ScriptBlock {
            for($i = 0; $i -lt $columnWidths.Length; $i++)
            {
                  "Dummy$($i)"
            }
        }
        $headerArgument = @{ Header = $headings }
    }

    And now change the Import-Csv line to read

         Import-Csv -Path $Path @headerArgument |

    https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/import-csv?view=powershell-3.0
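    With those changes in place, a headerless file would be loaded with something like this (the script and file names below are made up):

    .\Import-CsvToTable.ps1 -Path .\customers_noheader.csv -ConnectionString 'Server=.;Database=MyDb;Integrated Security=SSPI' -Table CustomerStaging -NoHeaders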

  • many thanks!!!

     From this reply

    "The actual column names in the datatable are now determined from the table schema so it doesn't matter what the headers are in the CSV"

    It really doesn't need headers?

    Thanks.

  • Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

  • Brian F - Tuesday, July 3, 2018 7:04 AM

    I used a VB script inside an SSIS package to accomplish the same functionality. This is a smart way to take care of this issue.

    With the data I needed to import, the pipe ended up not being a good delimiter. So, especially in data fields full of user input, it's worth keeping in mind different alternatives. My solution was using a triple semicolon, which you should be hard pressed to find in normal user input, but I'd be interested to hear if anyone else had similar issues with different workaround delimiters.

    What about input files where the absence of a value is indicated by an empty string?  We get raw files that look like this:
    "Some String";value1;value2;;;;;value7;value8

  • Bruin - Friday, July 6, 2018 12:04 PM

    Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

    To this, and the other questions...

    • The order of the columns in the CSV must match that of the table schema.
    • The number of columns in the CSV must also match the table schema.
    • If, as previously discussed, you use a staging table for the import, then the two constraints above are satisfied.
    • If the CSV has no headers and you modify my script as detailed above, it should still work fine. The datatable columns are populated by the select top 1 * on the table you're importing to.
    • Note also that if you want a delimiter other than a comma, Import-Csv has a -Delimiter argument for that; see the sketch just below for wiring it into the script.
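    A small, untested sketch of exposing that from the script itself: add a parameter and splat it into Import-Csv alongside the optional header argument (the parameter name is my own choice):

    # Hypothetical addition to the param() block
    [char]$Delimiter = ','

    # ...and where the CSV is read, pass it through with the other arguments
    $csvArgs = @{ Path = $Path; Delimiter = $Delimiter }
    Import-Csv @csvArgs @headerArgument |
      ForEach-Object {
        # existing row-processing body continues unchanged
      }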

  • Bruin - Friday, July 6, 2018 12:04 PM

    Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

    Yes, but there is a workaround.  Import into a view that does the mapping between the table and the data being imported.
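    Jeff's view trick is the BCP/BULK INSERT approach. If you're using the SqlBulkCopy script from earlier in the thread, a comparable effect is available through explicit column mappings - a hedged sketch with made-up column names, for a four-column CSV going into a wider table whose remaining columns are nullable or have defaults (the DataTable construction in the script would need a matching adjustment, since it currently mirrors the full destination schema):

    # Hypothetical mappings: only the four CSV columns are mapped; unmapped table
    # columns must allow NULL or have defaults for the insert to succeed
    [void]$bulkcopy.ColumnMappings.Add('CustomerName', 'CustomerName')
    [void]$bulkcopy.ColumnMappings.Add('Amount', 'Amount')
    [void]$bulkcopy.ColumnMappings.Add('OrderDate', 'OrderDate')
    [void]$bulkcopy.ColumnMappings.Add('Region', 'Region')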

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
