Join Two CSV Files with a Common Column

  • Comments posted to this topic are about the item Join Two CSV Files with a Common Column

  • calbimonte.daniel - Sunday, January 22, 2017 7:40 AM

    Comments posted to this topic are about the item Join Two CSV Files with a Common Column

    no bcp love to import the files?

  • Thanks for your detailed and informative article. Just a FYI: if the files don't have headers then the trivial solution is to run

    COPY File1.CSV + File2.CSV OutputFile.CSV

    to stick two files together.

  • nick.mcdermaid - Monday, January 23, 2017 12:09 AM

    Thanks for your detailed and informative article. Just a FYI: if the files don't have headers then the trivial solution is to run

    COPY File1.CSV + File2.CSV OutputFile.CSV

    to stick two files together.

    I don't think that the result of this command will be what was expected.
    the article describes a situation where you have 2 files containing related data, that can be matched using a key column present in both files. example, a account,name file and an address file.
    different data and different structure except that both files have an account number (accId) column linking the data to specific account.  if this were the database table you would simply use "Join" to get the data set out.  but with flat files you need some extra steps as described in article.

    that said,however, there is an alternative method you can use bypassing the database altogether. 
    you can use PowerShell and manipulate files directly using  system objects like code below, 
    you can also  use powershell to import files into SQL database as well , see function CreateStagingTable below , without use of SSIS which can be more easily automated and you also have more control on how files are imported

    I am not truly a big fan of PowerShell as it is pushed on us by MS,   but I can appreciate the power of it, and give credit, where credit is do. 

    ## PowerShell file manipulation loop

      $reader = [System.IO.File]::OpenText($inputfilename)
      #  $writer = New-Object System.IO.StreamWriter $outputfilename
        $record = $reader.ReadLine() 
        while ($record -ne $null)
        {....
         do processing here
      ....
         $record = $reader.ReadLine()
        }
      $reader.Close()
      $reader.Dispose()

      Function CreateStagingTable($location, $file,$delim, $extension, $server, $database)
      {
        $full = $location + $file + $extension
        $all = Get-Content $full
        $columns = $all[0]
        $columns = $columns.Replace(" ","")
        $columns = $columns.Replace($delim,"] VARCHAR(100), [")
        $table = "CREATE TABLE " + $file + "([" + $columns + "] VARCHAR(100))"
        $table =$table.Replace("[]","[ExtraColumn]")
       
        $connection = New-Object System.Data.SqlClient.SqlConnection
        $buildTable = New-Object System.Data.SqlClient.SqlCommand
        $connection.ConnectionString = "Data Source=" + $server + ";Database=" + $database + ";integrated security=true"
        $buildTable.CommandText = $table
        $buildTable.Connection = $connection
        $connection.Open()
        $buildTable.ExecuteNonQuery()
        $connection.Close()
      }
      CreateStagingTable -location "<your source file path>" -file "<Source File Name>" -delim "<Data Delimiter>" -extension "<Source File Extencion>" -server "<SQL Server Name>" -database "SQL Database Name"

    PS>> you can also just use PowerShell to process the files and get output directly,but it might be slower. and does require quite a lot of code to accomplish what a few line of SQL can. however, if you want to make it more flexible and db independent it may be a better option.

  • Good morning all.  

    Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

  • John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

    Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

    Good one,  
    completely forgot about "OPENROWSET "  even though used it extensively last year.

  • John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

    Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

    How do you define the delimiter to be a semi-colon as in the example data provided? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Monday, January 23, 2017 8:28 AM

    John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

    Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

    How do you define the delimiter to be a semi-colon as in the example data provided? 😉

    well for "OPENROWSET" or "OPENDATASOURCE" to work, regardless of data provider, 
    you also need a file "schema.ini" to exists in the same folder where the source files are located.
    within that file you set a file definition config that is used by the functions above 

    so for example if you have file1.csv and file2.csv

    than with in shema.ini you would have 2 sections defining the file format and column names,
    the starting raw if needed to skip the header etc.
    without that, your data might/will get messed up or not processed at all.

    couple of years back, I had a project where I needed to automate an import/export of data in and out of SQL database. since back than, we run on windows server 2008 and old powershell version, using PS scripts were not possible,  so I build. up solution using the  "OPENROWSET" and "OPENDATASOURCE"  procedures on MS-SQL 2010. 

    last year we had a huge server crash which took out our SQL server with it. as it happened we also lost our backup of the processing DB with all my import/export code (we had a full up to date backup of all main DBs but one that I build and used for intermediate data processing.)
    so once I had the new server build out and moved to SQL 2014, was fairly easy to upgrade from 2010.
    I had to either rebuild whole processing DB from scratch(lost all my scripts few month before the crush to a cryptolocker), or figure out an easier way. was looking into PowerShell to get a feel of it at the time,
    tried a few things with other flat files, and though I had nothing to loose by trying.
    other than being slow, I like the PS way of doing things.   also I am sure that it is slow simply because I haven't found a better way of doing things I need do to the luck of knowledge, not the PS itself.  

    [file1.csv]
    FORMAT=CSVDelimited
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol =' '
    CurrencyNegFormat = 2
    STARTROW = 1
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200

    [file2.csv]
    FORMAT=DELIMITED(;)
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "NONE"
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol = ' '
    CurrencyNegFormat = 2
    STARTROW = 2
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200
    COL4 = col4  CHAR WIDTH 200

  • vl1969-734655 - Monday, January 23, 2017 8:55 AM

    Jeff Moden - Monday, January 23, 2017 8:28 AM

    John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

    Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

    How do you define the delimiter to be a semi-colon as in the example data provided? 😉

    well for "OPENROWSET" or "OPENDATASOURCE" to work, regardless of data provider, 
    you also need a file "schema.ini" to exists in the same folder where the source files are located.
    within that file you set a file definition config that is used by the functions above 

    so for example if you have file1.csv and file2.csv

    than with in shema.ini you would have 2 sections defining the file format and column names,
    the starting raw if needed to skip the header etc.
    without that, your data might/will get messed up or not processed at all.

    couple of years back, I had a project where I needed to automate an import/export of data in and out of SQL database. since back than, we run on windows server 2008 and old powershell version, using PS scripts were not possible,  so I build. up solution using the  "OPENROWSET" and "OPENDATASOURCE"  procedures on MS-SQL 2010. 

    last year we had a huge server crash which took out our SQL server with it. as it happened we also lost our backup of the processing DB with all my import/export code (we had a full up to date backup of all main DBs but one that I build and used for intermediate data processing.)
    so once I had the new server build out and moved to SQL 2014, was fairly easy to upgrade from 2010.
    I had to either rebuild whole processing DB from scratch(lost all my scripts few month before the crush to a cryptolocker), or figure out an easier way. was looking into PowerShell to get a feel of it at the time,
    tried a few things with other flat files, and though I had nothing to loose by trying.
    other than being slow, I like the PS way of doing things.   also I am sure that it is slow simply because I haven't found a better way of doing things I need do to the luck of knowledge, not the PS itself.  

    [file1.csv]
    FORMAT=CSVDelimited
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol =' '
    CurrencyNegFormat = 2
    STARTROW = 1
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200

    [file2.csv]
    FORMAT=DELIMITED(;)
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "NONE"
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol = ' '
    CurrencyNegFormat = 2
    STARTROW = 2
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200
    COL4 = col4  CHAR WIDTH 200

    Well, Interestingly enough... I've never had to use a schema.ini file.  I'll need to use DBCC TIMEWARP but I'll try to find an old example for what I'm talking about.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • well if you never needed the ini, than how did your data got processed?
    I have found that without ini file, the data is not properly formatted, also it must be a comma separated file, anything else gets scrambled. and needs to be heavily processed in SQL after import.

  • vl1969-734655 - Tuesday, January 24, 2017 5:51 AM

    well if you never needed the ini, than how did your data got processed?
    I have found that without ini file, the data is not properly formatted, also it must be a comma separated file, anything else gets scrambled. and needs to be heavily processed in SQL after import.

    I tried to find the very old code that I used to use and couldn't.  There's apparently been too many computer changes in my life.  For those rare instances where we wanted to use OPENROWSET instead of any of the many other method there are for oddities that used other than Tab or Comma Delimited files, we did the registry-hack trick of reading the registry to remember the current delimiter, changing it to the delimiter that we needed, and changing it back when we were done.  It was all done through SQL Server using XP_REGREAD/WRITE and (later) XP_INSTANCE_REGREAD/WRITE.  It's a bloody shame that I can't find the code because it worked very well.  Of course, it may be just as well because most DBAs would rightfully flip their wigs if a stored procedure would do anything with the registry.  I didn't mind because I was the DBA and I'm the one that wrote and controlled the code. 😉

    On the PS thing, rumor has it that SSMS 2016 has some methods in it that make life a whole lot easier when it comes to writing results to tables.  I've not tried them yet and, because PS is interpretive, I suspect that your earlier observation of it being a bit slow compared to other things may be true.  Again, I've not confirmed that yet.  Not sure I'll take the time, either, because the other methods (BULK INSERT/BCP and OPENROWSET using the ACE drivers (for spreadsheets) do work just fine).  If you throw in a BCP FORMAT file, there's a whole lot of control available although it won't handle the occasional double-quoted entry that was double quoted because it contains a delimiter.  There are methods to homogenize that problem using the ImportCSV and ExportCSV cmdlets to double quote all entries so that a BCP FORMAT file will work nicely.  You can even use a FORMAT FILE with OPENROWSET although it's not my nature to do so because the other methods are faster and I don't have a need to update files directly from SQL Server.

    Thank you for posting the info on using a Schema.Ini file for "non-standard" delimiters.  Unless the OP can't put such a file out there on disk due to privs or some "corporate rules", that's probably just what the doctor ordered.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • vl1969-734655 - Tuesday, January 24, 2017 5:51 AM

    well if you never needed the ini, than how did your data got processed?
    I have found that without ini file, the data is not properly formatted, also it must be a comma separated file, anything else gets scrambled. and needs to be heavily processed in SQL after import.

    The Microsoft Developer Network provides a useful article describing the schema.ini file at: https://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx

    I have only had success using OPENROWSET in the 32-bit versions of SQL Server, as I have been unable to locate a 64-bit version of the Microsoft Text Driver specified in the command OPENROWSET('MSDASQL', 'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Path\', 'SELECT * FROM "MyDataFile.csv"').

    --Gordon Rogers

  • gorthog - Tuesday, January 24, 2017 8:36 AM

    vl1969-734655 - Tuesday, January 24, 2017 5:51 AM

    well if you never needed the ini, than how did your data got processed?
    I have found that without ini file, the data is not properly formatted, also it must be a comma separated file, anything else gets scrambled. and needs to be heavily processed in SQL after import.

    The Microsoft Developer Network provides a useful article describing the schema.ini file at: https://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx

    I have only had success using OPENROWSET in the 32-bit versions of SQL Server, as I have been unable to locate a 64-bit version of the Microsoft Text Driver specified in the command OPENROWSET('MSDASQL', 'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Path\', 'SELECT * FROM "MyDataFile.csv"').

    --Gordon Rogers

    You can use the 64-bit ACE engine...

    if you're doing this first on a PC, and it's running 64-bit Windows, but 32-bit Office, the 64-bit ACE engine will mess with 32-bit Office apps badly. The 32-bit ACE and 64-bit ACE on the same 64-bit OS server (without Office on it!) did not seem to step on each other in my experience, but I wouldn't count on it on a production server.

  • Just a note to anyone that may be watching... I think it's a mortal sin to install MS Office on the same "box" that SQL Server lives on.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, January 24, 2017 10:32 PM

    Just a note to anyone that may be watching... I think it's a mortal sin to install MS Office on the same "box" that SQL Server lives on.

    I agree with you Mr. Moden. But it at least gets done on (to?) SSIS servers to help deal with Excel files, especially with old .xls (XBIFF) files. There are at least two libraries that can directly read and write Excel files, though. (EPPlus.dll and I forget the other one)... It may be worth spending some time writing a Custom Excel component or two for SSIS that uses either of them... or just automate the process through Powershell like I did if your company won't spring for CozyRoc's components (SSIS calls powershell to run the script...)

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply