RE: Join Two CSV Files with a Common Column

SSC Guru

Points: 1003863

January 23, 2017 at 3:55 pm

#1924501

vl1969-734655 - Monday, January 23, 2017 8:55 AM
Jeff Moden - Monday, January 23, 2017 8:28 AM
John Bigler - Monday, January 23, 2017 6:49 AM
Good morning all.
Nothing wrong with these solutions, but I prefer something more direct like just treating the CSV files as virtual tables and reading them using OPENROWSET instead of created linked servers. An example:
SELECT Identifier , FirstName , Surname , PostalCode FROM OPENROWSET('MSDASQL' ,'Driver={Microsoft Access Text Driver (*.txt, *.csv)};DefaultDir=C:\LoadDir\' ,'SELECT * FROM "MyDataFile.csv"')
How do you define the delimiter to be a semi-colon as in the example data provided? 😉
well for "OPENROWSET" or "OPENDATASOURCE" to work, regardless of data provider,
you also need a file "schema.ini" to exists in the same folder where the source files are located.
within that file you set a file definition config that is used by the functions above
so for example if you have file1.csv and file2.csv
than with in shema.ini you would have 2 sections defining the file format and column names,
the starting raw if needed to skip the header etc.
without that, your data might/will get messed up or not processed at all.
couple of years back, I had a project where I needed to automate an import/export of data in and out of SQL database. since back than, we run on windows server 2008 and old powershell version, using PS scripts were not possible, so I build. up solution using the "OPENROWSET" and "OPENDATASOURCE" procedures on MS-SQL 2010.
last year we had a huge server crash which took out our SQL server with it. as it happened we also lost our backup of the processing DB with all my import/export code (we had a full up to date backup of all main DBs but one that I build and used for intermediate data processing.)
so once I had the new server build out and moved to SQL 2014, was fairly easy to upgrade from 2010.
I had to either rebuild whole processing DB from scratch(lost all my scripts few month before the crush to a cryptolocker), or figure out an easier way. was looking into PowerShell to get a feel of it at the time,
tried a few things with other flat files, and though I had nothing to loose by trying.
other than being slow, I like the PS way of doing things. also I am sure that it is slow simply because I haven't found a better way of doing things I need do to the luck of knowledge, not the PS itself.
[file1.csv]
FORMAT=CSVDelimited
CHARACTERSET=OEM
COLNAMEHEADER=FALSE
TEXTDELIMITER = "
DateTimeFormat = "MM/DD/YYYY"
CurrencyDigits = 2
CurrencySymbol =' '
CurrencyNegFormat = 2
STARTROW = 1
MAXSCANROWS = 0
COL1 = col1 CHAR WIDTH 200
COL2 = col2 Short
COL3 = col3 CHAR WIDTH 200
[file2.csv]
FORMAT=DELIMITED(;)
CHARACTERSET=OEM
COLNAMEHEADER=FALSE
TEXTDELIMITER = "NONE"
DateTimeFormat = "MM/DD/YYYY"
CurrencyDigits = 2
CurrencySymbol = ' '
CurrencyNegFormat = 2
STARTROW = 2
MAXSCANROWS = 0
COL1 = col1 CHAR WIDTH 200
COL2 = col2 Short
COL3 = col3 CHAR WIDTH 200
COL4 = col4 CHAR WIDTH 200

Well, Interestingly enough... I've never had to use a schema.ini file. I'll need to use DBCC TIMEWARP but I'll try to find an old example for what I'm talking about.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)