Remove Quotes from CSV

  • I'm importing a bunch of CSVs. They are tab-delimited and most of the char columns are quoted (I haven't looked yet to see what happens with any embedded quotes ... they might be doubled up and need sorting out too).

    I don't particularly want to use FORMAT files because of the hassle of maintaining them externally to the normal code base.

    I could knock up a pre-processor that strips out the chaff before the files are imported (sketched at the end of this post), but that's going to take me longer than "a bit of SQL".

    However, my concern is that the "bit of SQL" might be a performance/TLog/etc. hog. They are multi-million-row tables. I'm looking at either STUFF/REVERSE or SUBSTRING/LEN:
    UPDATE MyTable
    SET
        -- Option 1: STUFF removes the leading quote; the double REVERSE removes the trailing one
        [MyCharCol1] = NullIf(
                              CASE WHEN [MyCharCol1] LIKE '"%"'
                                   THEN REVERSE(STUFF(
                                                      REVERSE(STUFF([MyCharCol1], 1, 1, ''))
                                                    , 1, 1, '')
                                         )
                                   ELSE [MyCharCol1]
                                   END
                              , '')
        -- Option 2: SUBSTRING takes everything between the first and last characters
       , [MyCharCol2] = NullIf(CASE WHEN [MyCharCol2] LIKE '"%"'
                                    THEN SUBSTRING([MyCharCol2], 2, LEN([MyCharCol2])-2)
                                    ELSE [MyCharCol2]
                                    END
                               , '')
       , ...

    Any suggestions as to which would perform better? I suppose it depends on whether LEN has to count characters or has that metadata available anyway, or whether there are any other (i.e. undesirable) side effects of SUBSTRING/LEN.

    Do I need to use DATALENGTH() rather than LEN()? I've got a mixture of NVarchar and Varchar columns to process, and some of them will probably have trailing spaces ... I probably ought to RTrim() those too.
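    For what it's worth, here's the sort of pre-processor I mean. A minimal PowerShell sketch, assuming placeholder file paths and no embedded tabs inside quoted fields; it also un-doubles embedded quotes:

    # Hypothetical pre-processor: unwrap quoted fields in a tab-delimited file.
    # $inFile / $outFile are placeholder paths.
    $inFile  = 'YourPath\input.csv'
    $outFile = 'YourPath\clean.csv'

    Get-Content -Path $inFile | ForEach-Object {
        # Split on tabs, strip one pair of wrapping quotes per field,
        # collapse doubled embedded quotes, then re-join with tabs
        ($_ -split "`t" | ForEach-Object {
            if ($_ -match '^".*"$') { $_.Substring(1, $_.Length - 2).Replace('""', '"') }
            else { $_ }
        }) -join "`t"
    } | Set-Content -Path $outFile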

  • I think BCP needs a format file for this.

    I tend to use PowerShell for this type of processing with the Out-DataTable function:

    https://gallery.technet.microsoft.com/scriptcenter/4208a159-a52e-4b99-83d4-8048468d29dd

    I then use something like:

    . "YourPath\Out-DataTable.ps1"

    $Folder = 'YourFolder'
    $NameDate = "$((Get-Date -Format s).substring(0,13).replace('T', '-'))"
    $ConnectionString = "Data Source=YourDS; Database=YourDB; Trusted_Connection=True;";

    $Files = Get-ChildItem -Path $Folder | Where-Object{!($_.PSIsContainer) -and ($_.Name -like "$NameDate*.csv")} | Sort-Object Name
    $bulkCopy = new-object ("Data.SqlClient.SqlBulkCopy") $ConnectionString
    $connection = New-Object System.Data.SqlClient.SqlConnection($ConnectionString)
    $connection.Open()
    $sqlCommand = $connection.CreateCommand()

    foreach ($File in $Files)
    {
      # Derive the table name from the file name: skip the 19-character prefix and drop the extension
      $Table = $File.Name.Substring(19).replace('.csv','')
      #echo $Table
      $sqlCommand.CommandText = "TRUNCATE TABLE dbo." + $Table
      $result = $sqlCommand.ExecuteNonQuery()
      # Import-CSV parses the quoted fields (stripping the quotes) before the bulk copy
      $csvDataTable = Import-CSV -Path $File.FullName | Out-DataTable
      $bulkCopy.DestinationTableName = $Table
      $bulkCopy.WriteToServer($csvDataTable)
    }

    $bulkCopy.Close()

    $connection.Close()

    This works well enough for me, as my files are at most 65K lines.

  • Thanks Ken, I'll give it a go 🙂

  • As your file is tab-delimited, you need to add
    -Delimiter "`t"
    to the Import-Csv command.

    Also, if volumes are high, you will want to set a batch size on the bulk copy:
    $bulkCopy.BatchSize = 50000   # 50k rows per batch

    If for some reason your file is not fully CSV-compliant, you may need to look at other methods, such as
    Microsoft.VisualBasic.FileIO.TextFieldParser (minimal sketch below), with an example at https://www.powershellgallery.com/packages/dbatools/0.7.9.8/Content/functions%5CCsvSqlimport.ps1

    or a combination of the above plus regex.
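    A minimal sketch of the TextFieldParser approach ('YourPath\input.csv' is a placeholder; this shows only the parsing loop, not the bulk load):

    # Hedged sketch: read a tab-delimited file with quoted fields via TextFieldParser
    Add-Type -AssemblyName Microsoft.VisualBasic
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser('YourPath\input.csv')
    $parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
    $parser.SetDelimiters("`t")
    $parser.HasFieldsEnclosedInQuotes = $true   # unwraps quotes and un-doubles embedded ones

    while (-not $parser.EndOfData)
    {
        $fields = $parser.ReadFields()   # string[] with the quotes already stripped
        # ... add $fields as a row to a DataTable for SqlBulkCopy
    }
    $parser.Close()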

