SQLServerCentral Article

ETL Magic with SSIS – Part I


Introduction

This paper will demonstrate a process for loading a comma-delimited data file into a staging table for further processing, then validating the data and moving specific data into host tables. There are several excellent papers and tutorials that individually describe simple data loads, error handling, logging, etc. using SSIS, but I haven’t found a source demonstrating a process from start to finish; that gap is what motivated me to write this paper. The intended audience includes beginning SSIS developers, DBAs, and anyone with a desire to experience the magic of data management with this remarkable tool.

A significant challenge in producing this paper was the identification of a data set that was non-trivial, not proprietary, yet allowed the demonstration of important techniques available in SSIS.

I settled on a data set I downloaded from a U.S. government web site containing data produced using taxpayer dollars, and thus publicly available. The data set is the Worldwide M1+ Earthquakes, Past 7 Days CSV file available at http://explore.data.gov/catalog/raw/.

I’ve included ample screen shots where appropriate (please note that the screen shots reflect my local system’s directory structure, etc.). I have also attempted to describe the process as clearly and concisely as possible, without insulting anyone’s intelligence; be forewarned, however, that finding the balance between too much and not enough detail is a challenge unto itself.

In developing this paper it quickly became apparent that a complete treatment of the sample ETL project would be too involved for a single paper; therefore, I have divided the paper into two parts, at a place where I believe the division makes sense. Taken together, the two parts provide, in my (admittedly biased) opinion, a thorough example of a non-trivial ETL project using “real” data. For the record, I use the terms Task and Transformation synonymously. SSIS purists would say that this is incorrect, and I accept their judgment, although I would argue that the definition of a “task” may generically include a transformation. For the novice: a Task acts upon the Control Flow, while a Transformation acts upon the Data Flow.

Preliminaries: Setting up SQL Server Configuration

Depending on the environment, some values used by the package may change when it moves from development to production. Personally, I store my packages in SQL Server after development and have no desire to revisit a package in BIDS just to change connections or other values. I choose SQL Server configuration rather than environment variables, XML files, or registry settings; that’s just my choice, and it doesn’t have to be yours.

To set up SQL Server configuration, we first want to create an Environment Variable to define the connection string to the package’s configuration database. Call the Environment Variable SSISConfigurationDatabase (or whatever takes your fancy) and set its value to the connection string for the server/instance/database where you will store the other configuration values. What does this do? It allows us to control where the package will look for its configuration data.
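On Windows, the Environment Variable can be created through System Properties or from an elevated command prompt. The sketch below is illustrative only; the server, instance, and database names are placeholders you would replace with your own:

    REM Create a machine-level Environment Variable (illustrative values; /M requires an elevated prompt)
    setx SSISConfigurationDatabase "Data Source=MYSERVER\MYINSTANCE;Initial Catalog=SSISConfig;Provider=SQLNCLI10.1;Integrated Security=SSPI;" /M

Note that BIDS reads environment variables at startup, so you may need to restart it before the new variable is visible.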

With the Environment Variable defined, we now create a new Integration Services Project in BIDS. Create a connection manager for the database that either contains, or will contain, the configuration table. In my case, the table already exists because I created it for other projects. If you don’t have such a table, we’ll create one in a few minutes.

With the connection manager defined, right-click on the design surface and select Package Configurations…, ensuring that the Enable package configurations checkbox is checked. Select the Add… button, then configure the selections as shown below:

When you click the Next > button, the workflow will ask you to “Select the property which will have its value set by the configuration” as shown below:

Select the ConnectionString property belonging to the connection manager we just created.

That’s it! We now have an externally configurable connection to the database that stores the other configuration properties for the package.

Priming the Process

Although configuring a data source in SSIS to retrieve the data file from a web site is a straightforward process, the more realistic scenario, in my experience, is that a client delivers a data file via FTP or a web service, for example. For this reason, our example will assume that the data file is delivered to a specific directory designated to receive FTP transmissions.

So, we want to look in a directory for a file that will be processed by our package. Do we know when the file will be delivered? Do we know when the transmission of the file will be complete? If so, then the problem is somewhat simplified, because we can just pick the file up out of the directory. In my experience, however, the file may not be there yet, or may still be arriving, which leaves us two options: 1) fail the package if the file is not available, then execute the package again at a later time, or 2) wait for the file to be ready.

As with any programming endeavor, there is always more than one approach to solving a problem. In this case I tried a few approaches before I found a very reliable component called the File Watcher Task, contributed to the SSIS community by Konesans and available here (please note the component’s accompanying license). This clever little component greatly simplified my packages, allowing execution at time t even though the data source file may not be available until sometime after t.

First, we need to create a Flat File connection manager to our data. Enter double quotes (“) in the Text qualifier property’s textbox and check the Column names in the first data row checkbox as shown below:

Take a peek at the columns by selecting the Columns view:

Not bad, with very little effort SSIS has parsed the data file for us. If we wanted to change the names of the columns or the data types assigned to them, we select the Advanced view:

Personally, I prefer to give SSIS the first shot at defining the data types. To accomplish this I select the Suggest Types button:

I change the Number of rows value to 1000 then click OK.

Examine the individual column definitions; it appears that everything is acceptable except for the Datetime column, which is defined as a string(42). Before we leave the Advanced view, note that this is where the names of the columns can be modified and columns can be added or deleted.

Click on the Preview view to examine the expected output:

From this we can see why the Datetime field is string(42).
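Converting this string to a true DATETIME is a job for Part II’s transformations, but as a taste, a T-SQL sketch is shown below. It assumes the feed’s values resemble 'Saturday, June 16, 2012 05:54:28 UTC' (the table and column names are illustrative); it strips the leading day name and the trailing ' UTC' before converting:

    SELECT CONVERT(DATETIME,
             REPLACE(
               SUBSTRING([Datetime], CHARINDEX(',', [Datetime]) + 2, 42),
               ' UTC', '')) AS EventTimeUTC
    FROM [dbo].[SQL Server EarthQuake Data Table]

The default CONVERT parses the remaining 'June 16, 2012 05:54:28' form under a us_english language setting; other languages may require an explicit style.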

Storing the Flat File ConnectionString in the Configuration Database

To access the Package Configurations Organizer, right-click on the Control Flow design surface, select Package Configurations…, then click the Add… button.

Earlier we defined the Environment Variable that stores the location of the configuration database; now we will create a new table to store the package configurations, such as the FlatFile Connection Manager’s connection string.

If you have previously created a Configuration table, select it; otherwise select the New… button. The Configuration table is defined as:

CREATE TABLE [dbo].[SSIS Configurations]
(
    ConfigurationFilter NVARCHAR(255) NOT NULL,
    ConfiguredValue NVARCHAR(255) NULL,
    PackagePath NVARCHAR(255) NOT NULL,
    ConfiguredValueType NVARCHAR(20) NOT NULL
)

The ConfigurationFilter is an identifier allowing the segregation of package configurations within the table; type in a meaningful identifier for this value. The storage system for package configuration values is now complete.

Select the values from the FlatFile Connection Manager to store in the configuration table as shown below:

The Package Configuration Organizer should now look like this:

Execute the following query to examine the configuration values stored in the configuration table for our package:

SELECT [ConfigurationFilter]
      ,[ConfiguredValue]
      ,[PackagePath]
      ,[ConfiguredValueType]
  FROM [test].[dbo].[SSIS Configurations]
  WHERE [ConfigurationFilter] LIKE 'Earth%'

ConfigurationFilter   EarthQuakeData
ConfiguredValue       C:\SSIS Paper\worldwide M1+ Earthquakes - 7 Days.txt
PackagePath           \Package.Connections[EarthquakeData].Properties[ConnectionString]
ConfiguredValueType   String

Begin When Ready

Follow the directions for installing the Konesans File Watcher Task component and adding it to the toolbox. Then, drag the File Watcher Task onto the Control Flow design surface. My policy is to rename a component as soon as I drop it onto the design surface, because I find this helps me stay organized, so rename the File Watcher Task accordingly. While we’re at it, let’s set the file name Filter, Path, and change the value of the Find Existing Files property as shown below:

Once the File Watcher Task is configured you can run the package; the File Watcher Task will find the EarthQuakeData file in the specified directory and turn green – Success! If the data file is not found, then the File Watcher Task will turn yellow, waiting for the file to become available, making this component a very welcome addition to the SSIS toolbox.

Staging

Now that our package will control its own execution, based on when a file is available from a client, our focus shifts to the processing of the data into a staging table. First, drag a Data Flow Task onto the design surface, then, connect the EarthQuakeData File Watcher Task by dragging its green output arrow onto the Data Flow Task. The Control Flow design surface should look like this:

Double click on the Data Flow Task to switch to the Data Flow design surface. Our destination is a SQL Server instance and the source is a flat file, so drag both from the toolbox onto the design surface. Connect the Flat File Source to the SQL Server Destination by dragging its green output arrow onto the SQL Server Destination. The Data Flow design surface should look like this:

Notice the red circle with the ‘X’; this is telling us that the component is not properly configured.

Because SSIS uses the metadata from upstream components to seed the configuration values of the downstream components, we start configuration with the Flat File Source component. Double click on the component to open this dialog, pre-populated with the EarthQuakeData connection manager we previously created:

Press the Preview button to view the data from the flat file. 

Click on the Columns view to select/unselect and rename columns for output.

Click on the Error Output view to define the action the component should take if an error occurs during the processing of a row of data from the flat file input. For our example we will accept the defaults.

Double click the SQL Server Destination component. As mentioned above, SSIS uses the metadata created during the configuration of the Flat File Source to seed the configuration values of the SQL Server Destination. Create a connection manager for the database where the Earthquake data will be loaded. If the data is to be loaded into an existing table, then select that table from the Use a table or view drop down; if not, and you want it to be loaded into a new table, press the New button which will generate the Create Table script as shown below:

Add an ID column of type int IDENTITY(1,1) to the CREATE TABLE command.
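With the ID column added, the generated script would look something like the following. The column list here is illustrative; yours will mirror whatever columns the Flat File Source produced:

    CREATE TABLE [dbo].[SQL Server EarthQuake Data Table]
    (
        [ID] INT IDENTITY(1,1) NOT NULL,   -- surrogate key added by hand
        [Src] NVARCHAR(50),
        [Eqid] NVARCHAR(50),
        [Datetime] NVARCHAR(42),           -- still a string; converted in Part II
        [Lat] FLOAT,
        [Lon] FLOAT,
        [Magnitude] FLOAT,
        [Depth] FLOAT,
        [Region] NVARCHAR(100)
    )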

The SQL Destination Editor should look like this:

Click on the Mappings view to see how the output columns from the Flat File Source will map to the columns in the database table. Take a look at the other available configuration settings in the Advanced view, then click the OK button. (Note: this connection manager may also be included in the package configurations as simply as the Flat File Connection Manager; I leave doing so to the reader.)

Run the package. The two tasks on the Data Flow design surface should turn green – Success! Examine the database; the flat file data has been loaded into the table SQL Server EarthQuake Data Table.

With just a few mouse clicks we have created a process that reads the source file into the database. Some housekeeping remains, such as dropping (or clearing) the SQL Server EarthQuake Data Table containing the imported data between runs.
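One simple approach, assuming the staging table is fully reloaded on each run, is an Execute SQL Task at the start of the package that clears it:

    TRUNCATE TABLE [dbo].[SQL Server EarthQuake Data Table]

TRUNCATE preserves the table definition and resets the IDENTITY seed, whereas dropping the table would force you to recreate it on every run.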

Summary

Pretty straightforward, in my opinion. Again, in practice I do not like to modify a tested package, which is why I use the package configurations as I do. I would argue that externalizing the package configurations is an “SSIS Best Practice”. The ETL tasks included in Part I of this paper are an introduction to the power and simplicity of SSIS; Part II takes this example to the next level, introducing validations, transformations, error handling, and distribution of the data.
