SQLServerCentral is supported by Redgate
Importing Tricky CSV Files with PowerShell

By Stewart Campbell

Every so often, one is required to process a comma-delimited file (.csv) from an external source (e.g. a supplier's price list).

In general, this is relatively easy to do using a variety of tools, e.g. BCP, BULK INSERT, or SSIS, to name but a few.

One way is to use PowerShell, making use of the SqlBulkCopy class, as shown here:

[void][Reflection.Assembly]::LoadWithPartialName("System.Data") 
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient") 

$Date = $args[0]
$ConnectionString = $args[1]
$Directory = $args[2] 
$Table = $args[3]
$file="$Directory\$Table.csv" #The file name just happens to be the same as the table name in this instance

$delim = "," 
$FirstRowColumnNames = $true 

$batchsize = 75000  #This could, theoretically, be any number - I found that 75000 has the least drain on CPU and memory

$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($Connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock) 
$bulkcopy.DestinationTableName = $Table 
$bulkcopy.bulkcopyTimeout = 0 
$bulkcopy.batchsize = $batchsize 
$datatable = New-Object System.Data.DataTable 
$reader = New-Object System.IO.StreamReader($file) 
$columns = (Get-Content $file -First 1).Split($delim) 
if ($FirstRowColumnNames -eq $true) {$null = $reader.readLine() } 

foreach ($column in $columns) {  
    $null = $datatable.Columns.Add() 
} 

$i = 0  
while (($line = $reader.ReadLine()) -ne $null)  { 
    $null = $datatable.Rows.Add($line.Split($delim)) 
    $i++
    if (($i % $batchsize) -eq 0) {  
        $bulkcopy.WriteToServer($datatable)  
        $datatable.Clear()  
    }  
}  
  
if($datatable.Rows.Count -gt 0) { 
    $bulkcopy.WriteToServer($datatable) 
    $datatable.Clear() 
} 
$reader.Close()
$reader.Dispose() 
$bulkcopy.Close()
$bulkcopy.Dispose() 
$datatable.Dispose() 

This script should import several million rows in a matter of minutes, with minimal load on the server.
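For reference, the script reads its four inputs from positional arguments. A hypothetical invocation might look like this (the script file name, server, and paths are illustrative only, not part of the original article):

```powershell
# Hypothetical parameter values - substitute your own
$Date             = "20240101"
$ConnectionString = "Server=DEVSQL01;Database=Staging;Integrated Security=True"
$Directory        = "C:\Imports"
$Table            = "PriceList"

# Inside the script, the source file path is built from the directory and table name
$file = "$Directory\$Table.csv"
$file   # C:\Imports\PriceList.csv

# The call itself would then be:
# .\Import-CsvBulk.ps1 $Date $ConnectionString $Directory $Table
```

Note that the destination table name doubles as the file name, so the .csv file must be named to match the target table.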

However, every so often one encounters an issue with the import process, for example where the file contains a narrative or description field which, although delimited by quotes, contains embedded commas:

ProductKey,Code,Description,Narrative,Comment
123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"
345,fgh,product2,"Product withdrawn, issue with supplier","Product no longer available, issue logged re supplier by Soap, Joe"

These fields will cause the script above to break, complaining about unexpected columns or an end-of-row in an unexpected location. We therefore need a means to identify the commas within quoted text and ignore them during the import process. The simplest way is to replace the commas that are meant to be column delimiters with another delimiter (e.g. a pipe).
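To see why a plain split breaks, consider splitting one of the sample lines above on every comma:

```powershell
# A sample line whose quoted fields contain embedded commas
$line = '123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"'

# A naive split on every comma yields 7 fields instead of the expected 5,
# because the commas inside the quoted fields are treated as delimiters too
$fields = $line.Split(',')
$fields.Count   # 7
```

The two extra "columns" are what cause SqlBulkCopy to reject the row.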

To begin, we need to replace the $delim variable with two new variables:

$olddelimiter = "," 
$newdelimiter = "|"

The first issue is to distinguish the commas that are meant to be column delimiters from those that are part of a text field. We do this by converting each line into an array of characters:

[char[]]$larr = $line

We then need to iterate through this list, locating each quote and incrementing a counter implemented for this purpose:

     if($item -eq """"){$sp++}

We then run through every character, replacing a comma with a pipe wherever one is encountered and the quote counter modulo two equals zero (an even count means we are outside a quoted field):

     if($item -eq $olddelimiter){
        if ($sp%2 -eq 0) {
            $line = $line.Remove($point,1).Insert($point,$newdelimiter)}

We can then clean up the line by removing the now-unnecessary quotation marks:

     $Line = $line.Replace("""","")
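Putting these pieces together, the per-line transformation can be sketched as a small helper function. The function name and parameter names are mine, introduced here for illustration; the logic matches the quote-counting approach described above:

```powershell
function Convert-DelimitedLine {
    param(
        [string]$Line,
        [string]$OldDelimiter = ",",
        [string]$NewDelimiter = "|"
    )
    $sp = 0      # running count of quotation marks seen so far
    $point = 0   # current character position within the line
    foreach ($item in [char[]]$Line) {
        if ($item -eq '"') { $sp++ }   # we are inside quotes whenever $sp is odd
        if ($item -eq $OldDelimiter -and ($sp % 2 -eq 0)) {
            # an even quote count means this comma is outside quotes,
            # i.e. a genuine column delimiter - swap it for the new delimiter
            $Line = $Line.Remove($point, 1).Insert($point, $NewDelimiter)
        }
        $point++
    }
    return $Line.Replace('"', '')   # strip the now-redundant quotation marks
}

$in = '123,abc,product1,"Product reclassified, now freely available","reclassification by Bloggs, Bill"'
Convert-DelimitedLine -Line $in
# 123|abc|product1|Product reclassified, now freely available|reclassification by Bloggs, Bill
```

Because each replacement swaps exactly one character for one character, the positions in the original character array stay aligned with the string being modified.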

The entire script follows:

[void][Reflection.Assembly]::LoadWithPartialName("System.Data") 
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient") 
$Date = $args[0]
$ConnectionString = $args[1]
$Directory = $args[2] 
$Table = $args[3]

$file="$Directory\$Table.csv" #This could be anything, the file just happens to be the same as the table name in this instance
$oldDelimiter = "," 
$newDelimiter = "|"

$FirstRowColumnNames = $true 
$batchsize = 75000 #This could, theoretically, be any number - I found that 75000 has the least drain on CPU and memory
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($Connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock) 
$bulkcopy.DestinationTableName = $Table 
$bulkcopy.bulkcopyTimeout = 0 
$bulkcopy.batchsize = $batchsize 
$datatable = New-Object System.Data.DataTable 
$reader = New-Object System.IO.StreamReader($file) 
$columns = (Get-Content $file -First 1).Split($oldDelimiter) 
if ($FirstRowColumnNames -eq $true) { $null = $reader.readLine() }
foreach ($column in $columns) {  
    $null = $datatable.Columns.Add() 
} 
$i = 0  
while (($line = $reader.ReadLine()) -ne $null)  { 
    $sp = 0
    $point = 0
    [char[]]$larr = $line #create a character array of the current line
    foreach ($item in $larr) { #iterate through the array to locate all quotes and commas
        if ($item -eq """") { $sp++ } #increment whenever a quotation mark is found
        if ($item -eq $oldDelimiter) { #is this character a comma?
            if ($sp % 2 -eq 0) { #an even quotation count means we are outside quotes
                $line = $line.Remove($point,1).Insert($point,$newDelimiter) #so this comma is a column delimiter - replace it
            }
        }
        $point++
    }
    $line = $line.Replace("""","") #clean up the now-unnecessary quotation marks in the line of data
    
    $null = $datatable.Rows.Add($line.Split($newDelimiter)) 
    $i++
    if (($i % $batchsize) -eq 0) {  
        $bulkcopy.WriteToServer($datatable)  
        $datatable.Clear()  
    }  
}  
 
if($datatable.Rows.Count -gt 0) { 
    $bulkcopy.WriteToServer($datatable) 
    $datatable.Clear() 
}  

$reader.Close()
$reader.Dispose() 
$bulkcopy.Close()
$bulkcopy.Dispose() 
$datatable.Dispose()

This is perhaps not the most elegant way of eliminating issues with comma-separated fields, but it works, and it is not excessively slow: it increased the run time on 9 million rows to circa 40 minutes on the dev server.

I am sure someone will be able to post a drastically improved method.

All comments and feedback welcome.

 