SQLServerCentral Article

Extract multiple files from multiple directories with a data flow task


Author's note: Comments on my previous related article indicated that it should have had error checking. This article adds an addendum and an additional file to address that.

A previous article demonstrated how to extract multiple files from a single subdirectory with a data flow task. This article demonstrates how to extract multiple files that reside in more than one subdirectory.

The quickest way to create an SSIS package that extracts data from a flat file and inserts it into a table is to use the SSIS Import and Export Wizard, because the wizard generates a complete Data Flow Task, including column metadata, far faster than you could build one by hand.

To run the import/export wizard from an SSIS project in Business Intelligence Development Studio, right-click on the "SSIS Packages" node and select "SSIS Import and Export Wizard."

In the first screen, choose the "Flat File Source" data source and browse to the file you wish to import, then click the "Next" button.

On the following screen click on the "Advanced" node followed by the "Suggest Types" and "OK" buttons to allow the wizard to generate appropriate data types for the fields in the file.

Click "Next," then enter the server name and the database containing the table where you wish to store the data.

In the next screen you choose the table where the data will be stored.

The following screen shows how the data will be mapped from the source file to the destination table.

Click "Next," then the "Finish" button on the following screen to complete the creation of the package.

The resulting package contains a Data Flow Task. Execution of this package results in data being extracted from the source file and inserted into the destination table.

By default, a Data Flow Task written by the wizard is hard-coded to extract data from a specific file location. Sometimes, however, a system may transfer a continuous stream of data files to one or more network subdirectories that require periodic batch processing.

The following screen shot shows an SSIS package that reads the paths of all relevant files in designated subdirectories and feeds them to a Data Flow Task that extracts the data in each file to the destination table.

The above package can serve as a template: any Data Flow Task can be made capable of processing multiple files by pasting it into the Foreach loop container.

The first SSIS component in the package is the VB.net Script Task "Get list of data files" containing the following code:

Option Strict On
Imports System
Imports System.Collections
Imports System.Data
Imports System.Data.SqlClient
Imports System.Math
Imports System.Windows.Forms
Imports Microsoft.SqlServer.Dts.Runtime

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum
    Public Sub Main()
        'Get file name root from programParameters table with variable FileRootParameter as selection criteria
        Dim FileRootParameter As String = Dts.Variables("FileRootParameter").Value.ToString
        Dim mySqlStatement As String = "SELECT value FROM Test.dbo.programParameters WHERE parameterName = '" & FileRootParameter & "'"
        Dim myADONETConnection As SqlClient.SqlConnection = DirectCast(Dts.Connections("TestADONET").AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
        Dim myCommand As New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
        Dim reader As SqlDataReader = myCommand.ExecuteReader(CommandBehavior.CloseConnection)
        Dim FileRoot As String = ""
        Do While reader.Read()
            FileRoot = reader("value").ToString()
        Loop
        reader.Close()
        myADONETConnection.Close()
        'Get ArrayList of source directories from programParameters table with variable SourceDirectoryParameter as selection criteria
        Dim SourceDirectoryParameter As String = Dts.Variables("SourceDirectoryParameter").Value.ToString
        mySqlStatement = "SELECT value FROM Test.dbo.programParameters WHERE parameterName = '" & SourceDirectoryParameter & "'"
        myADONETConnection = DirectCast(Dts.Connections("TestADONET").AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
        myCommand = New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
        reader = myCommand.ExecuteReader(CommandBehavior.CloseConnection)
        Dim DirectoryList As New ArrayList
        Do While reader.Read()
            DirectoryList.Add(reader("value").ToString)
        Loop
        reader.Close()
        myADONETConnection.Close()
        'Loop source directory ArrayList to load file paths into a file list ArrayList
        Dim DataFiles As New ArrayList
        Dim file As IO.FileInfo
        Dim dir As String
        For Each dir In DirectoryList
            Dim DirInfo As New IO.DirectoryInfo(dir)
            Dim FileList As IO.FileInfo() = DirInfo.GetFiles(FileRoot & "*.csv")
            For Each file In FileList
                DataFiles.Add(dir & "\" & file.Name)
            Next
        Next
        'Set value of DataFileList object variable to file list ArrayList
        Dts.Variables("DataFileList").Value = DataFiles
        Dts.TaskResult = ScriptResults.Success
    End Sub
End Class

The above Script Task queries a configuration table called "programParameters" to acquire the root name of the data files and a list of all the subdirectories that the SSIS package should search for data files with that root name.

The purpose of the "programParameters" configuration table is to enable the root file name to be changed and search directories added or removed without having to revise and redeploy the production SSIS package.
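The configuration lookup can be modeled in a few lines of Python. This is a sketch only: it uses an in-memory SQLite table as a stand-in for the SQL Server "programParameters" table, and the parameter names and values shown are hypothetical, not taken from the article's demo data. Note that it also uses a parameterized query, which is safer than the string concatenation in the VB.net task above.

```python
import sqlite3

# In-memory stand-in for the programParameters table; the real package
# queries SQL Server through the "TestADONET" connection manager.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE programParameters (parameterName TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO programParameters VALUES (?, ?)",
    [
        # Hypothetical demo values.
        ("FileRoot", "VendorSales"),
        ("SourceDirectory", r"c:\dir1"),
        ("SourceDirectory", r"c:\dir2"),
    ],
)

def get_parameter_values(parameter_name):
    """Return every configured value for the given parameter name."""
    rows = conn.execute(
        "SELECT value FROM programParameters WHERE parameterName = ?",
        (parameter_name,),
    )
    return [value for (value,) in rows]

file_root = get_parameter_values("FileRoot")[0]        # single-valued parameter
directories = get_parameter_values("SourceDirectory")  # multi-valued parameter
```

Because both the file name root and the directory list live in one table keyed by parameter name, adding a search directory is just an INSERT; no package change is needed.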

The attached zip file contains a sql script named "programParameters.sql" that you can use to create this table, and a data file named "programParameters.csv" that you can use to load the demo data (or you can enter it manually).

This is what the programParameters table looks like with the demo data loaded.

The Script Task loads the list of subdirectories returned by the query into an ArrayList named "DirectoryList." It then loops through the directory list and creates a list of data files that it stores in an ArrayList named "DataFiles." This data file list is passed to the Foreach loop container through the "DataFileList" SSIS global package variable.
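The directory-scanning loop in the Script Task can be sketched in Python as follows. This is an illustrative model of the same logic (match `<root>*.csv` in each configured directory and collect the full paths), not the package's actual code:

```python
import fnmatch
import os

def list_data_files(directories, file_root):
    """Collect paths of files matching <file_root>*.csv from each directory."""
    data_files = []
    for directory in directories:
        # Sort for a deterministic order; DirectoryInfo.GetFiles makes no
        # ordering guarantee either.
        for name in sorted(os.listdir(directory)):
            if fnmatch.fnmatch(name, file_root + "*.csv"):
                data_files.append(os.path.join(directory, name))
    return data_files
```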

The "Collection" node of the Foreach loop is configured with enumerator type "Foreach From Variable Enumerator," and the Variable value is configured with the "User::DataFileList" variable.

The "Variable Mappings" node of the Foreach loop is configured with the Variable value "User::DataFilePath" and the Index value zero, so each iteration assigns the current file path to that variable.
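Conceptually, the Foreach From Variable Enumerator behaves like an ordinary loop over the list: each iteration takes one item from the collection variable and hands it to the loop body. A minimal Python model of that wiring (the file paths shown are hypothetical):

```python
# Hypothetical paths standing in for the contents of User::DataFileList.
data_file_list = [r"c:\dir1\VendorSales1.csv", r"c:\dir2\VendorSales2.csv"]

processed = []
for data_file_path in data_file_list:
    # Each pass plays the role of one Foreach iteration: the Data Flow Task
    # extracts data_file_path, then the "Delete data file" task removes it.
    processed.append(data_file_path)
```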

The key to allowing the Data Flow Task to extract multiple files is to set an expression on the ConnectionString property of the "SourceConnectionFlatFile" Connection Manager that references the "User::DataFilePath" variable, as shown here.

To configure the ConnectionString property, click on the "SourceConnectionFlatFile" Connection Manager, click on the "Expressions" property, then click on the ellipsis button that appears.

The "Delete data file" VB.net Script Task that follows the Data Flow Task deletes each data file after its data has been extracted.

Option Strict On
Imports System
Imports System.IO
Imports Microsoft.SqlServer.Dts.Runtime

<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
<System.CLSCompliantAttribute(False)> _
Partial Public Class ScriptMain
    Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    Enum ScriptResults
        Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
        Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
    End Enum
    Public Sub Main()
        Try
            Dim DataFilePath As New FileInfo(Dts.Variables("DataFilePath").Value.ToString)
            DataFilePath.Delete()
        Catch ex As Exception
            'Deletion failures (for example, a locked file) are silently
            'ignored here; the addendum adds error checking
        End Try
        Dts.TaskResult = ScriptResults.Success
    End Sub
End Class
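The empty Catch block above means a failed deletion never surfaces. A more defensive variant, sketched here in Python for brevity, reports the failure so the caller can log it or fail the task (this is an illustration, not code from the attached package):

```python
import os

def delete_data_file(path):
    """Delete an extracted data file, returning an error message on failure
    instead of silently swallowing the exception."""
    try:
        os.remove(path)
        return None
    except OSError as exc:
        return "could not delete {}: {}".format(path, exc)
```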

To run the MultipleFileExtraction SSIS package in the attached zip file, create four subdirectories (c:\dir1, c:\dir2, c:\dir3 and c:\dir4) and copy five different files from the twenty files in the "VendorSalesData" folder into each of them. Next, run the "SalesData.sql" script to create the destination table.

Open the appropriate SSIS package (MultipleFileExtraction2005.dtsx, MultipleFileExtraction2008.dtsx or MultipleFileExtraction2008R2.dtsx) in Business Intelligence Development Studio, configure the database connections to the database where the demo tables have been created, and execute it.

After successful execution, the package will look like this:

The subdirectories where you copied the data files will now be empty, and the destination table will contain the data.

Conclusion

There are situations where multiple data files that need to be imported into a database table may be streaming into multiple network locations. Without a dynamic file path, this data must be imported one file at a time. The SSIS package outlined in this article shows how to configure a dynamic file path to conveniently import this type of data.

Addendum

Here is a version of the preceding package that includes error checking.

When it executes successfully it will show the following message.

The package will show a similar message upon failure.

This version is attached as MultipleFileExtractionWithValidation.dtsx.

It is an SSIS 2005 package, which should import into later versions of SSIS.
