Join Two CSV Files with a Common Column

  • Daniel Calbimonte

    SSCarpal Tunnel

    Points: 4757

    Comments posted to this topic are about the item Join Two CSV Files with a Common Column

  • corey lawson

    Hall of Fame

    Points: 3732

    calbimonte.daniel - Sunday, January 22, 2017 7:40 AM

    Comments posted to this topic are about the item Join Two CSV Files with a Common Column

    no bcp love to import the files?

  • nick.mcdermaid

    Hall of Fame

    Points: 3779

Thanks for your detailed and informative article. Just an FYI: if the files don't have headers, then the trivial solution is to run

    COPY File1.CSV + File2.CSV OutputFile.CSV

    to stick two files together.
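If the files do have headers, a small PowerShell sketch (hypothetical file names) can drop the second file's header row before appending:

```powershell
# Plain COPY keeps both files verbatim, including both header rows.
# This variant copies File1 as-is, then appends File2 minus its header.
Get-Content .\File1.CSV | Set-Content .\OutputFile.CSV
Get-Content .\File2.CSV | Select-Object -Skip 1 | Add-Content .\OutputFile.CSV
```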

  • vl1969-734655

    Mr or Mrs. 500

    Points: 545

    nick.mcdermaid - Monday, January 23, 2017 12:09 AM

Thanks for your detailed and informative article. Just an FYI: if the files don't have headers, then the trivial solution is to run

    COPY File1.CSV + File2.CSV OutputFile.CSV

    to stick two files together.

I don't think the result of this command will be what was expected.
The article describes a situation where you have 2 files containing related data that can be matched using a key column present in both files: for example, an account/name file and an address file.
Different data and different structures, except that both files have an account number (accId) column linking the data to a specific account. If these were database tables you would simply use a JOIN to get the data set out, but with flat files you need some extra steps, as described in the article.

That said, however, there is an alternative method you can use, bypassing the database altogether:
you can use PowerShell and manipulate the files directly using system objects, like the code below.
You can also use PowerShell to import files into a SQL database (see the CreateStagingTable function below) without using SSIS, which can be more easily automated and gives you more control over how the files are imported.

I am not truly a big fan of PowerShell as it is pushed on us by MS, but I can appreciate the power of it, and give credit where credit is due.

## PowerShell file manipulation loop

  $reader = [System.IO.File]::OpenText($inputfilename)
  # $writer = New-Object System.IO.StreamWriter $outputfilename
  $record = $reader.ReadLine()
  while ($null -ne $record)
  {
      # do processing on $record here
      $record = $reader.ReadLine()
  }
  $reader.Close()
  $reader.Dispose()

  Function CreateStagingTable($location, $file, $delim, $extension, $server, $database)
  {
    # Build a CREATE TABLE statement from the file's header row
    $full = $location + $file + $extension
    $all = Get-Content $full
    $columns = $all[0]
    $columns = $columns.Replace(" ","")
    $columns = $columns.Replace($delim,"] VARCHAR(100), [")
    $table = "CREATE TABLE " + $file + "([" + $columns + "] VARCHAR(100))"
    $table = $table.Replace("[]","[ExtraColumn]")   # handle a trailing delimiter

    # Run the CREATE TABLE against the target database
    $connection = New-Object System.Data.SqlClient.SqlConnection
    $buildTable = New-Object System.Data.SqlClient.SqlCommand
    $connection.ConnectionString = "Data Source=" + $server + ";Database=" + $database + ";integrated security=true"
    $buildTable.CommandText = $table
    $buildTable.Connection = $connection
    $connection.Open()
    $buildTable.ExecuteNonQuery()
    $connection.Close()
  }
  CreateStagingTable -location "<your source file path>" -file "<Source File Name>" -delim "<Data Delimiter>" -extension "<Source File Extension>" -server "<SQL Server Name>" -database "<SQL Database Name>"

PS: you can also just use PowerShell to process the files and get the output directly, but it might be slower, and it does require quite a lot of code to accomplish what a few lines of SQL can. However, if you want to make it more flexible and database-independent, it may be the better option.
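As a minimal sketch of that pure-PowerShell route, this joins two files on a shared key column using a hashtable lookup (the file names, column names, and the accId key are hypothetical, matching the account/address example above):

```powershell
# Index the address file by accId for O(1) lookups
$addresses = @{}
Import-Csv .\addresses.csv | ForEach-Object { $addresses[$_.accId] = $_ }

# Stream the account file and emit one combined record per match (an inner join)
Import-Csv .\accounts.csv | ForEach-Object {
    $addr = $addresses[$_.accId]
    if ($addr) {
        [pscustomobject]@{
            accId   = $_.accId
            name    = $_.name
            address = $addr.address
        }
    }
} | Export-Csv .\joined.csv -NoTypeInformation
```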

  • John Bigler

    SSC-Addicted

    Points: 411

    Good morning all.  

Nothing wrong with these solutions, but I prefer something more direct, like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of creating linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

  • vl1969-734655

    Mr or Mrs. 500

    Points: 545

    John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

Nothing wrong with these solutions, but I prefer something more direct, like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of creating linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

Good one,
completely forgot about "OPENROWSET", even though I used it extensively last year.

  • Jeff Moden

    SSC Guru

    Points: 997311

    John Bigler - Monday, January 23, 2017 6:49 AM

    Good morning all.  

Nothing wrong with these solutions, but I prefer something more direct, like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of creating linked servers.  An example:

    SELECT
         Identifier
         , FirstName
         , Surname
         , PostalCode
      FROM OPENROWSET('MSDASQL'
         ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\'
         ,'SELECT * FROM "MyDataFile.csv"')

    How do you define the delimiter to be a semi-colon as in the example data provided? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • vl1969-734655

    Mr or Mrs. 500

    Points: 545

    Jeff Moden - Monday, January 23, 2017 8:28 AM

    How do you define the delimiter to be a semi-colon as in the example data provided? 😉

Well, for "OPENROWSET" or "OPENDATASOURCE" to work, regardless of the data provider, you also need a "schema.ini" file to exist in the same folder where the source files are located.
Within that file you set the file definition config that is used by the functions above.

So, for example, if you have file1.csv and file2.csv, then within schema.ini you would have 2 sections defining each file's format and column names, the starting row if needed to skip the header, etc.
Without that, your data might get messed up or not be processed at all.

A couple of years back, I had a project where I needed to automate an import/export of data in and out of a SQL database. Since back then we ran on Windows Server 2008 with an old PowerShell version, using PS scripts was not possible, so I built up a solution using the "OPENROWSET" and "OPENDATASOURCE" procedures on MS-SQL 2010.

Last year we had a huge server crash which took out our SQL Server with it. As it happened, we also lost our backup of the processing DB with all my import/export code (we had a full, up-to-date backup of all the main DBs except the one that I built and used for intermediate data processing).
Once I had the new server built out and had moved to SQL 2014 (it was fairly easy to upgrade from 2010), I had to either rebuild the whole processing DB from scratch (I had lost all my scripts a few months before the crash to a cryptolocker) or figure out an easier way. I was looking into PowerShell to get a feel for it at the time, tried a few things with other flat files, and thought I had nothing to lose by trying.
Other than it being slow, I like the PS way of doing things. Also, I am sure that it is slow simply because I haven't found a better way of doing the things I need to do, due to my lack of knowledge, not PS itself.

    [file1.csv]
    FORMAT=CSVDelimited
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol =' '
    CurrencyNegFormat = 2
    STARTROW = 1
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200

    [file2.csv]
    FORMAT=DELIMITED(;)
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "NONE"
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol = ' '
    CurrencyNegFormat = 2
    STARTROW = 2
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200
    COL4 = col4  CHAR WIDTH 200

  • Jeff Moden

    SSC Guru

    Points: 997311

    vl1969-734655 - Monday, January 23, 2017 8:55 AM

Well, for "OPENROWSET" or "OPENDATASOURCE" to work, regardless of the data provider, you also need a "schema.ini" file to exist in the same folder where the source files are located.
Within that file you set the file definition config that is used by the functions above.

So, for example, if you have file1.csv and file2.csv, then within schema.ini you would have 2 sections defining each file's format and column names, the starting row if needed to skip the header, etc.
Without that, your data might get messed up or not be processed at all.

A couple of years back, I had a project where I needed to automate an import/export of data in and out of a SQL database. Since back then we ran on Windows Server 2008 with an old PowerShell version, using PS scripts was not possible, so I built up a solution using the "OPENROWSET" and "OPENDATASOURCE" procedures on MS-SQL 2010.

Last year we had a huge server crash which took out our SQL Server with it. As it happened, we also lost our backup of the processing DB with all my import/export code (we had a full, up-to-date backup of all the main DBs except the one that I built and used for intermediate data processing).
Once I had the new server built out and had moved to SQL 2014 (it was fairly easy to upgrade from 2010), I had to either rebuild the whole processing DB from scratch (I had lost all my scripts a few months before the crash to a cryptolocker) or figure out an easier way. I was looking into PowerShell to get a feel for it at the time, tried a few things with other flat files, and thought I had nothing to lose by trying.
Other than it being slow, I like the PS way of doing things. Also, I am sure that it is slow simply because I haven't found a better way of doing the things I need to do, due to my lack of knowledge, not PS itself.

    [file1.csv]
    FORMAT=CSVDelimited
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol =' '
    CurrencyNegFormat = 2
    STARTROW = 1
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200

    [file2.csv]
    FORMAT=DELIMITED(;)
    CHARACTERSET=OEM
    COLNAMEHEADER=FALSE
    TEXTDELIMITER = "NONE"
    DateTimeFormat = "MM/DD/YYYY"
    CurrencyDigits = 2
    CurrencySymbol = ' '
    CurrencyNegFormat = 2
    STARTROW = 2
    MAXSCANROWS = 0
    COL1 = col1  CHAR WIDTH 200
    COL2 = col2  Short
    COL3 = col3  CHAR WIDTH 200
    COL4 = col4  CHAR WIDTH 200

Well, interestingly enough... I've never had to use a schema.ini file.  I'll need to use DBCC TIMEWARP, but I'll try to find an old example of what I'm talking about.

    --Jeff Moden



  • vl1969-734655

    Mr or Mrs. 500

    Points: 545

Well, if you never needed the ini, then how did your data get processed?
I have found that without the ini file the data is not properly formatted. Also, it must be a comma-separated file; anything else gets scrambled and needs to be heavily processed in SQL after import.

  • Jeff Moden

    SSC Guru

    Points: 997311

    vl1969-734655 - Tuesday, January 24, 2017 5:51 AM

Well, if you never needed the ini, then how did your data get processed?
I have found that without the ini file the data is not properly formatted. Also, it must be a comma-separated file; anything else gets scrambled and needs to be heavily processed in SQL after import.

I tried to find the very old code that I used to use and couldn't.  There have apparently been too many computer changes in my life.  For those rare instances where we wanted to use OPENROWSET instead of any of the many other methods there are, for oddities that used other than tab- or comma-delimited files, we did the registry-hack trick: reading the registry to remember the current delimiter, changing it to the delimiter that we needed, and changing it back when we were done.  It was all done through SQL Server using XP_REGREAD/WRITE and (later) XP_INSTANCE_REGREAD/WRITE.  It's a bloody shame that I can't find the code, because it worked very well.  Of course, it may be just as well, because most DBAs would rightfully flip their wigs if a stored procedure did anything with the registry.  I didn't mind because I was the DBA and I'm the one that wrote and controlled the code. 😉

On the PS thing, rumor has it that SSMS 2016 has some methods in it that make life a whole lot easier when it comes to writing results to tables.  I've not tried them yet and, because PS is interpretive, I suspect that your earlier observation of it being a bit slow compared to other things may be true.  Again, I've not confirmed that yet.  Not sure I'll take the time, either, because the other methods (BULK INSERT/BCP, and OPENROWSET using the ACE drivers for spreadsheets) do work just fine.  If you throw in a BCP format file, there's a whole lot of control available, although it won't handle the occasional double-quoted entry that was double-quoted because it contains a delimiter.  There are methods to homogenize that problem using the Import-Csv and Export-Csv cmdlets to double-quote all entries so that a BCP format file will work nicely.  You can even use a format file with OPENROWSET, although it's not my nature to do so because the other methods are faster and I don't have a need to update files directly from SQL Server.
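That Import-Csv/Export-Csv round trip for homogenizing quoting is a near one-liner; a sketch with hypothetical file names (note that Windows PowerShell's Export-Csv quotes every field by default, which is exactly what a fixed format file wants):

```powershell
# Import-Csv correctly parses fields that were quoted because they contain
# the delimiter; Export-Csv then re-writes the file with all fields quoted.
Import-Csv .\raw.csv | Export-Csv .\homogenized.csv -NoTypeInformation
```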

    Thank you for posting the info on using a Schema.Ini file for "non-standard" delimiters.  Unless the OP can't put such a file out there on disk due to privs or some "corporate rules", that's probably just what the doctor ordered.

    --Jeff Moden



  • gorthog

    Ten Centuries

    Points: 1034

    vl1969-734655 - Tuesday, January 24, 2017 5:51 AM

Well, if you never needed the ini, then how did your data get processed?
I have found that without the ini file the data is not properly formatted. Also, it must be a comma-separated file; anything else gets scrambled and needs to be heavily processed in SQL after import.

    The Microsoft Developer Network provides a useful article describing the schema.ini file at: https://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx

    I have only had success using OPENROWSET in the 32-bit versions of SQL Server, as I have been unable to locate a 64-bit version of the Microsoft Text Driver specified in the command OPENROWSET('MSDASQL', 'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Path\', 'SELECT * FROM "MyDataFile.csv"').
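One way to check which text drivers (and what bitness) a box actually has is a sketch like this; Get-OdbcDriver ships with the Wdac module on Windows 8 / Server 2012 and later:

```powershell
# Lists installed ODBC drivers whose name contains "Text",
# showing whether each is the 32-bit or 64-bit build.
Get-OdbcDriver -Name "*Text*" | Select-Object Name, Platform
```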

    --Gordon Rogers

  • corey lawson

    Hall of Fame

    Points: 3732

    gorthog - Tuesday, January 24, 2017 8:36 AM

    The Microsoft Developer Network provides a useful article describing the schema.ini file at: https://msdn.microsoft.com/en-us/library/ms709353(v=vs.85).aspx

    I have only had success using OPENROWSET in the 32-bit versions of SQL Server, as I have been unable to locate a 64-bit version of the Microsoft Text Driver specified in the command OPENROWSET('MSDASQL', 'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\Path\', 'SELECT * FROM "MyDataFile.csv"').

    --Gordon Rogers

You can use the 64-bit ACE engine...

If you're doing this first on a PC that's running 64-bit Windows but 32-bit Office, the 64-bit ACE engine will mess with the 32-bit Office apps badly. The 32-bit ACE and 64-bit ACE on the same 64-bit OS server (without Office on it!) did not seem to step on each other in my experience, but I wouldn't count on it on a production server.

  • Jeff Moden

    SSC Guru

    Points: 997311

    Just a note to anyone that may be watching... I think it's a mortal sin to install MS Office on the same "box" that SQL Server lives on.

    --Jeff Moden



  • corey lawson

    Hall of Fame

    Points: 3732

    Jeff Moden - Tuesday, January 24, 2017 10:32 PM

    Just a note to anyone that may be watching... I think it's a mortal sin to install MS Office on the same "box" that SQL Server lives on.

I agree with you, Mr. Moden. But it at least gets done on (to?) SSIS servers to help deal with Excel files, especially with old .xls (XBIFF) files. There are at least two libraries that can directly read and write Excel files, though (EPPlus.dll, and I forget the other one)... It may be worth spending some time writing a custom Excel component or two for SSIS that uses either of them, or just automate the process through PowerShell like I did, if your company won't spring for CozyRoc's components (SSIS calls PowerShell to run the script...).

Viewing 15 posts - 1 through 15 (of 22 total)
