SQLServerCentral Article

Join Two CSV Files with a Common Column

,

Introduction

Sometimes, we need to get information of two files with common columns. This is very easy using relational databases like SQL Server, Oracle, MySQL, etc., but it is not that easy when the information is stored in flat files. In this article we will show 2 ways to solve this problem:

  1. Importing the csv files to SQL Server using SSIS and then exporting the results to a csv file.
  2. Using Linked Servers.

Requirements

  1. SQL Server installed.
  2. SSIS and SQL Server Data Tools (SSDT).
  3. SQL Server Management Studio (SSMS).

Importing the csv files into SQL Server using SSIS

In the first example, we will use SSIS to import the two .csv files into SQL Server. Once imported, creating queries with joins will be a straightforward process. We will use two CSV files named products.csv and sales.csv both files are included in resource files at the bottom of this article (csv files.zip).

products.csv

ProductID;ProductName;Price
1;PC;500
2;Tablet;600
3;Mouse;20
4;Keyboard;20
5;Printer;130

The file contains the ProductID, ProductName and price.

The second file is sales.csv and contains the ProductID of the products sold and the invoice number:

sales.csv

ProductID;InvoiceNumber
3;3456644
2;3244446
4;5434444
2;3435543
1;6534224
3;2355444
5;7663333
1;4344555
2;5345666
3;4389756
1;3555675

To convert these two files into two tables, we will import them using SSIS, which is a tool to import and export files from different sources. In SSDT, go to File>New Project and select SSIS Project. In the Solution Explorer right click SSIS Packages and select SSIS Import and Export Wizard:

In the Wizard, click Next:

In Data Source, select Flat File Source. In File name select the sales.csv file:

Click the Columns page. You will be able to see a Previous view. Press Next:

In Destination, select Microsoft OLE DB Provider for SQL Server. In Server name, specify the SQL Server name. In Authentication, specify the credentials to access to SQL Server. In Database, select any database of your preference:

If everything is OK, you will receive a success message:

To import the data in Solution Explorer, right click in the package created and select Execute Package. This action will import the data:

In solution Explorer, run the SSIS Import and Export Wizard again to import the products.csv file to SQL Server:

In SQL Server Management Studio (SSMS), check that the tables dbo.sales and dbo.producs were created:

You can query and make the reports you want in SQL Server. In relational databases, you use the join statement to join two or more tables. In this example, we will create a query to show the ProductID, InvoiceNumber and ProductName. This query retrives information from 2 tables. We will export the results to a csv file using SSIS.

In our SSIS project, run the SSIS Import and Export Wizard again:

Press next:

Select the Microsft OLE DB Provider for SQL Server Data Source and specify the SQL Server Name, credentials and database where the products and sales tables were created:

In Destination, select a Flat File Destination and in File name, specify the path and name of the file with the information of sales and products tables:

We need to write a query to specify the data to transfer:

This is the most important part of this article. We are selecting the ProductID, InvoiceNumber, ProductName from sales and sales is joined with the table products in the column that is common in both tables (ProductID):

In Column delimiter select the delimiter for the columns:

Once completed, press Finish:

In Solution Explorer, execute the package:

Now you have a csv file named salesinfo with the information of products.csv and sales.csv files:

Linked Servers

Linked Servers are another option to query flat files. We will use the files named linked server files.zip in the Resource files section.

EXEC sp_AddLinkedServer @server     = 'csvlinkedserver', 
                        @srvproduct = '',
                        @provider   = 'Microsoft.ACE.OLEDB.12.0', 
                        @datasrc    = 'C:\sql\data', 
                        @provstr    = 'Text'

sp_AddLinkedServer is the stored procedure used to add a linked server.

@server is the name of the Linked Server. It can be any name.

@provider is the OLEDB data source used to connect to the external source like a flat file or other database.

@datasrc in this example is the folder that contains the sales.csv and products.csf files.

@provstr is the connection string.

You can also configure Linked Server Logins using sp_AddLinkedSrvLogin:

EXEC sp_AddLinkedSrvLogin csvlinkedserver, FALSE
go

The following command when it is set to false, it means that no mapping will be used. When it is set to true, you can map your SQL Server logins with other accounts.

Once configured, you can query the files like SQL Tables:

SELECT s.ProductID,Productname,InvoiceNumber 
FROM csvlinkedserver...sales#csv s
INNER JOIN 
csvlinkedserver...products#csv p
ON
s.ProductID=p.ProductID

The query will show the ProductID, Product name, Invoice number of the sales.csv file and products.csv file. The files are joined using the ProductID column, which is the common column.

If you receive the following error, it means that the Microsoft Database Engine is not installed:

Microsoft.ACE.OLEDB.12.0 has not been registered

You can download the Database Engine here:

Download Microsoft Database Engine Redistributable

Make sure to download the 64 bits installer if your machine is 64 bits.

If everything is OK, you will have the following query results:

Conclusions

In this article, two ways to join the information of csv files were explained. One way is importing the files to SQL Server. This option will allow running queries in few seconds or milliseconds depending on the number of rows and columns. The queries will be very fast.

This option is recommended if we will need multiple queries to the data of the csv files.

SSIS is a very sophisticated tool to import files and you can convert, transform from different formats to import to SQL Server or other database of your preference.

The second option is very easy and simple to configure because you do not need to link file by file. If all the csv files are in the same folder a single Linked Server is enough. The problem is that the queries are slower and the structure of the file is not flexible. It is possible that you will need to convert and transform some data before creating the linked server.

This second option is a good choice if you need some few queries with few rows and columns.

References

For more information, refer to these links:

Resources

Rate

3.67 (6)

You rated this post out of 5. Change rating

Share

Share

Rate

3.67 (6)

You rated this post out of 5. Change rating