Importing Tricky CSV Files with PowerShell

  • Bruin - Thursday, July 5, 2018 4:05 AM

    "Creating specific datatypes for the datatable's columns is another exercise altogether, though I would do it by querying the schema of the table."

    Can you drop an example of how this is done?

    Another issue I always run into is field lengths in the CSV versus the SQL definition. How can you
    1) detect a mismatch and write it to an error log, or
    2) truncate on insert and log the offending data row?

    I was testing using your example and ran into this condition

    Exception calling "WriteToServer" with "1" argument(s): "Received an invalid column length from the bcp client

    Thanks. Very interested in how people handle these conditions without affecting the performance of a large import.

    I tell BCP or BULK INSERT to use the built-in features to sequester bad rows in a separate file.  The bad part is that the built-in error indicators that get stored are rather cryptic and, even if you understand them, you come to find out that the stored error messages are rather useless for troubleshooting.  They are nowhere near as helpful as when you run things manually, which tells you the row, the column number, the column name, and what the error is.

    My workaround is to have the machine "pretend" that a human is running it.  I use xp_CmdShell to call SQLCMD and capture the output from the xp_CmdShell call.  It not only identifies the errors in a human-readable form that's easy to find and "parse", but it also provides confirmation of the row counts, number of errors, and duration.
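    A rough sketch of that capture pattern, purely for illustration - this is not Jeff's actual code, and the server, database, and script path are made up. Jeff drives this from T-SQL directly; it is wrapped in PowerShell here only to keep the thread's examples in one language. The useful part is that xp_CmdShell returns SQLCMD's console output as a single-column result set, so the human-readable error text can be read line by line:

    $conn = New-Object System.Data.SqlClient.SqlConnection('Server=.;Database=MyDb;Integrated Security=SSPI')
    $conn.Open()
    try
    {
      $cmd = $conn.CreateCommand()
      # Hypothetical SQLCMD call; -b makes SQLCMD stop the batch on error so failures are obvious
      $cmd.CommandText = "EXEC xp_cmdshell 'sqlcmd -S . -d MyDb -i C:\Import\LoadStaging.sql -b'"
      $rdr = $cmd.ExecuteReader()
      while ($rdr.Read())
      {
        # Each row is one line of SQLCMD output - row counts, durations, and error text
        if (-not $rdr.IsDBNull(0)) { $rdr[0] }
      }
    }
    finally
    {
      ($rdr, $cmd, $conn) | Where-Object { $_ } | ForEach-Object { $_.Dispose() }
    }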

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, July 5, 2018 7:34 AM

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    I agree, if a single entity owns both the database and the server.  The risk comes into play with a scenario like:  you have multiple customers in separate databases on one server.  In that case you may want a customer to have control over their database, but not over the server (such that they could compromise security or stability of other databases).  In such a case, granting the customer xp_CmdShell does constitute a security risk.  Same thing is true for TRUSTWORTHY databases, "unsafe" CLR assemblies, etc.

    This would be a topic worthy of its own discussion / article / etc.

  • @bruin - here you go. It's somewhat longer now, but it does type checking and (optionally, via a command-line switch) char column width checking with truncation.

    param
    (
      [string]$Path,
      [string]$ConnectionString,
      [string]$Table,
      [int]$BatchSize = 75000,
      [switch]$StopOnError,
      [switch]$TruncateLongValues
    )

    $ErrorActionPreference = 'Stop'

    $columnLengthSql = @"
      select
      c.[max_length]
      from sys.columns c
      inner join sys.objects o
      on c.object_id = o.object_id and o.type = 'U'
      inner join sys.types t on c.user_type_id = t.user_type_id -- join on user_type_id to avoid duplicate rows for types sharing a system_type_id
      where o.Name = N'{0}'
      order by c.column_id

    "@
    try
    {
      # Connect to SQL Server
      $conn = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
      $conn.Open()

      # Read the column widths, which we will use to check string (char type) columns
      $columnWidths = Invoke-Command -NoNewScope -ScriptBlock {

       try
       {
        $cmd = $conn.CreateCommand()
        $cmd.CommandText = [string]::Format($columnLengthSql, $Table)
        $cmd.CommandType = 'Text'
        $rdr = $cmd.ExecuteReader()

        while ($rdr.Read())
        {
          # Emit width into array
          [int]$rdr[0]
        }
       }
       finally
       {
        ($rdr, $cmd) |
          ForEach-Object {
          $_.Dispose()
        }
       }
      }

      $bcpOption = ([System.Data.SqlClient.SqlBulkCopyOptions]::TableLock -bor [System.Data.SqlClient.SqlBulkCopyOptions]::UseInternalTransaction)
      $bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($conn, $bcpOption, $null)
      $bulkcopy.DestinationTableName = $Table
      $bulkcopy.BulkCopyTimeout = 0
      $bulkcopy.BatchSize = $BatchSize
      $datatable = New-Object System.Data.DataTable

      # Get the column data for the given table
      # Sneakily selecting 1 row from the table puts the schema into the datatable
      try
      {
       $sql = 'select top 1 * from [' + $Table + ']'
       $dad = New-Object System.Data.SqlClient.SqlDataAdapter($sql, $conn)
       [void]$dad.Fill($datatable)  # discard the row count Fill returns so it doesn't pollute output
      }
      finally
      {
       $dad.Dispose() 
      }

      # If we read a row, clear it.
      $datatable.Clear()

      # Init row counter
      $i = 0
     
      # Let Import-Csv deal with delimiter nonsense!
      Import-Csv -Path $Path |
       ForEach-Object {

       if ($TruncateLongValues)
       {
        # Check column widths - this will slow things down somewhat
        for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
        {
          if ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and $_.PSObject.Properties.Value[$col].ToString().Length -gt $columnWidths[$col])
          {
           Write-Warning "Row $($i + 1), Col $($col + 1): Value truncated"
           $_."$($_.PSObject.Properties.Name[$col])" = $_.PSObject.Properties.Value[$col].ToString().Substring(0, $columnWidths[$col])
          }
        }
       }

       try
       {
        [void]$datatable.Rows.Add($_.PSObject.Properties.Value)
       }
       catch
       {
        # Column datatype mismatch
        if ($StopOnError)
        {
          # Stop immediately
          throw
        }

        # Warn the user a row didn't go in and continue
        Write-Warning "Row $($i + 1): $($_.Exception.Message)"
       }

       if ((++$i % $BatchSize) -eq 0)
       {
        # Write batch
        $bulkcopy.WriteToServer($datatable)
        $datatable.Clear()
       }
      }

      if ($datatable.Rows.Count -gt 0)
      {
       # Write remaining rows
       $bulkcopy.WriteToServer($datatable)
       $datatable.Clear()
      }
    }
    catch
    {
      # A good script will handle exceptions here,
      # e.g. failure to connect to SQL server, incorrect/un-castable datatype for column etc.
      # for now, just re-throw
      throw
    }
    finally
    {
      # Clean up in finally block - ensures resources are released even in event of errors.
      ($bulkcopy, $datatable, $conn) |
       Where-Object { $_ -ne $null } |
       ForEach-Object {
     
       # Cast to IDisposable: SqlBulkCopy implements Dispose() explicitly
       ([System.IDisposable]$_).Dispose()
      }
    }

  • Very cool.. I'll run some tests and report back results.

     Thanks again.

  • Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

  • David Rueter - Thursday, July 5, 2018 11:23 AM

    Jeff Moden - Thursday, July 5, 2018 7:34 AM

    As for those that think using xp_CmdShell for such things constitutes a security risk, you really need to reconsider because, when done properly, it's not and it brings a whole 'nuther world of functionality into play that most only dream of or have to write some pretty nasty external code to do some of the same things.

    I agree, if a single entity owns both the database and the server.  The risk comes into play with a scenario like:  you have multiple customers in separate databases on one server.  In that case you may want a customer to have control over their database, but not over the server (such that they could compromise security or stability of other databases).  In such a case, granting the customer xp_CmdShell does constitute a security risk.  Same thing is true for TRUSTWORTHY databases, "unsafe" CLR assemblies, etc.

    This would be a topic worthy of its own discussion / article / etc.

    That's what I mean.  People don't know how to use xp_CmdShell correctly.  You NEVER grant a customer privs to execute xp_CmdShell directly nor provide them with the ability to create/modify a proc that does (and, yes... that does mean who or what you make as the owner of the database).  You only grant such privs to stored procedures that they can use but cannot change (which is sometimes a bit of a trick).  And, yes... I totally agree that the use of TRUSTWORTHY in such a case is a bad thing to do because it is a mostly hidden security risk that people frequently miss.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bruin - Thursday, July 5, 2018 7:46 PM

    Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

    Yes... stop trying to insert into the final table.  I ALWAYS use a staging table so I can pre-validate and delouse the data I've imported.  Then I'll move the validated data to the final table.  You would do such a thing just before you do the move.
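    A minimal sketch of that staging-to-final move, run from the same PowerShell script once the bulk copy has finished. Table and column names are hypothetical; the point is that the defaulting (N/A for an empty value) happens in SQL, after the import, rather than during the bulk copy:

    # Hypothetical: move validated rows from staging to final, defaulting empties to N/A
    $moveSql = "
      INSERT INTO dbo.FinalTable (CustomerName, Amount)
      SELECT
        ISNULL(NULLIF(LTRIM(RTRIM(CustomerName)), ''), N'N/A'),  -- empty CSV field becomes N/A
        Amount
      FROM dbo.StagingTable;
      TRUNCATE TABLE dbo.StagingTable;"

    $cmd = $conn.CreateCommand()   # $conn is the open SqlConnection from the import script
    $cmd.CommandText = $moveSql
    [void]$cmd.ExecuteNonQuery()
    $cmd.Dispose()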

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bruin - Thursday, July 5, 2018 7:46 PM

    Is there a way to force a value (N/A) into a SQL field defined as NOT NULL if the CSV field is empty?

    The error is:

    does not allow DBNull.Value.

      Thanks.

    If the DB column is NOT NULL and the CSV field is empty, then it probably should error. You'd need more logic (slowing down the import even more) to check for empty values and then supply a suitable default value for the column's data type, which would vary from case to case.
    I'd agree with Jeff's point above: bulk import into a staging table that has fewer constraints, then massage the data into the final table using SQL, i.e. a stored proc designed for the task.
    That way you're also not taking out a table lock on your production table.
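    If you would rather patch it inside the PowerShell loop instead, a hedged sketch (assuming 'N/A' is an acceptable default for every string column) is to add something like this just before the $datatable.Rows.Add call, alongside the existing truncation check:

    # Hypothetical addition inside the ForEach-Object block:
    # replace empty values with 'N/A' on string columns so NOT NULL inserts succeed
    for ($col = 0; $col -lt $datatable.Columns.Count; ++$col)
    {
      if ($datatable.Columns[$col].DataType.FullName -eq 'System.String' -and [string]::IsNullOrWhiteSpace($_.PSObject.Properties.Value[$col]))
      {
        $_."$($_.PSObject.Properties.Name[$col])" = 'N/A'
      }
    }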

  • Agreed.. a staging table is the preferred method before loading the live (Prod) table.

    What's the best solution for files with no headers?

    Thanks again for the script and comments.. just getting started with PS, and it's very cool what you created to load CSV to SQL.

  • We need to give Import-Csv the -Header argument with dummy column names. The actual column names in the datatable are determined from the table schema, so it doesn't matter what the headers are in the CSV.

    Because -Header should only be supplied for headerless files, we pass it to Import-Csv via argument splatting. When -Header is present, Import-Csv assumes all input rows are data.

    Add a new argument to the script to indicate you want to load a file without headers:

    [switch]$NoHeaders

    Since we know how many columns to expect because we've read the column widths, we can do this (after the bit that gets the column widths)

    $headerArgument = @{}
    if ($NoHeaders)
    {
        $headings = Invoke-Command -NoNewScope -ScriptBlock {
            for($i = 0; $i -lt $columnWidths.Length; $i++)
            {
                  "Dummy$($i)"
            }
        }
        $headerArgument = @{ Header = $headings }
    }

    And now change the Import-Csv line to read

         Import-Csv -Path $Path @headerArgument |

    https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/import-csv?view=powershell-3.0
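    With those changes in place, a headerless file would be loaded with something like this (the script and file names below are made up):

    .\Import-CsvToTable.ps1 -Path .\customers_noheader.csv -ConnectionString 'Server=.;Database=MyDb;Integrated Security=SSPI' -Table CustomerStaging -NoHeaders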

  • many thanks!!!

     From this reply

    "The actual column names in the datatable are now determined from the table schema so it doesn't matter what the headers are in the CSV"

    It really doesn't need headers?

    Thanks.

  • Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

  • Brian F - Tuesday, July 3, 2018 7:04 AM

    I used a VB script inside an SSIS package to accomplish the same functionality. This is a smart way to take care of this issue.

    With the data I needed to import, the pipe ended up not being a good delimiter. So, especially in data fields full of user input, it's worth keeping in mind different alternatives. My solution was using a triple semicolon, which you should be hard pressed to find in normal user input, but I'd be interested to hear if anyone else had similar issues with different workaround delimiters.

    What about input files where the absence of a value is indicated by an empty string?  We get raw files that look like this:
    "Some String";value1;value2;;;;;value7;value8

  • Bruin - Friday, July 6, 2018 12:04 PM

    Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

    To this, and the other questions...

    • The order of the columns in the CSV must match that of the table schema.
    • The number of columns in the CSV must also match the table schema.
    • If, as previously discussed, you use a staging table for the import, then the two constraints above are satisfied.
    • If the CSV has no headers and you modify my script as detailed above, it should still work fine. The datatable columns are populated by the select top 1 * on the table you're importing to.
    • Note also that if you want a delimiter other than a comma, Import-Csv has a -Delimiter argument for that; see the sketch just below for wiring it into the script.
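    A small, untested sketch of exposing that from the script itself: add a parameter and splat it into Import-Csv alongside the optional header argument (the parameter name is my own choice):

    # Hypothetical addition to the param() block
    [char]$Delimiter = ','

    # ...and where the CSV is read, pass it through with the other arguments
    $csvArgs = @{ Path = $Path; Delimiter = $Delimiter }
    Import-Csv @csvArgs @headerArgument |
      ForEach-Object {
        # existing row-processing body continues unchanged
      }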

  • Bruin - Friday, July 6, 2018 12:04 PM

    Does the order of the CSV have to match the SQL schema, and what if the schema has 8 fields and the CSV has 4?

    Many thanks again!!!

    Yes, but there is a workaround.  Import into a view that does the mapping between the table and the data being imported.
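    Jeff's view trick is the BCP/BULK INSERT approach. If you're using the SqlBulkCopy script from earlier in the thread, a comparable effect is available through explicit column mappings - a hedged sketch with made-up column names, for a four-column CSV going into a wider table whose remaining columns are nullable or have defaults (the DataTable construction in the script would need a matching adjustment, since it currently mirrors the full destination schema):

    # Hypothetical mappings: only the four CSV columns are mapped; unmapped table
    # columns must allow NULL or have defaults for the insert to succeed
    [void]$bulkcopy.ColumnMappings.Add('CustomerName', 'CustomerName')
    [void]$bulkcopy.ColumnMappings.Add('Amount', 'Amount')
    [void]$bulkcopy.ColumnMappings.Add('OrderDate', 'OrderDate')
    [void]$bulkcopy.ColumnMappings.Add('Region', 'Region')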

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
