Dynamically create and populate staging tables from CSV files

  • Comments posted to this topic are about the item Dynamically create and populate staging tables from CSV files

  • What will the data types of the columns be? Will it assign appropriate types, such as int for an integer field?

    Thanks

    bhav

  • The code creates all the fields as varchar(20), though you could do an IsNumeric() check and then create the column as an int data type instead.

  • I have a concern with this approach.

    Since the objective is to "get data from a file into a (staging) table with as little fuss as possible", I presume you want little or no validation during this step.

    The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

    I consider SQL, as a set-based language, to be relatively poorly suited to the task of identifying and excluding specific problematic rows.

    Have I understood correctly?

    How would you handle validation moving/copying data to the destination tables?

  • A great little technique but I would probably only use this kind of thing for an initial Proof of Concept. Rarely would you want to (or be allowed to) create dynamic table structures with possibly inaccurate data types and column sizes in a production database. Moreover, I suspect the ETL routines would fail if you have changing data types.

    Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?

  • I feel it is overly complicated when we have the import/export functionality available.

    ----------
    Ashish

  • Regarding the datatype discussion:

    I think it's a great idea to copy data to staging tables using a varchar datatype in the first step. You can check the datatypes in the next step with SQL (SQL is not poor ;-)) or in a dataflow. The important thing is that you have a place within the database where you have all source data (including the bad data) decoupled from its source system(s).

    In most circumstances this makes the subsequent operations easier (statistics on bad data, determination of deltas, restarting after an aborted load, et cetera).
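    As a sketch of that second step (the table and column names here are hypothetical, and TRY_CONVERT requires SQL Server 2012 or later), the type checks can be done set-based against the varchar staging table:

```sql
-- Hypothetical example: every staging column is varchar, so flag rows whose
-- values will not convert to the destination types before moving them on.
SELECT s.*,
       CASE WHEN TRY_CONVERT(int, s.OrderQty)       IS NULL THEN 'Bad OrderQty'
            WHEN TRY_CONVERT(datetime, s.OrderDate) IS NULL THEN 'Bad OrderDate'
            ELSE NULL
       END AS RowError          -- NULL means the row passed all checks
FROM   dbo.Staging_Orders AS s;
```

    Rows with a non-NULL RowError can then be routed to an errata table while the rest move on to the destination.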

  • Something inside me really wanted to hate this technique, but I know it has its uses in a pure "staging" environment (i.e. one where you have already committed to do the T/L part of ETL in procedures between the stage and the ODS).

    That said, I do want to offer some pointers with the .NET code itself, which I feel would make this better as a teaching article or reusable code block:

    1. Remove all MessageBox.Show calls and replace them with Dts.Log(message) -- this might be obvious, but no one wants to press OK 30,000 times to load a file

    2. In any case, favour StringBuilder over message += "blah"

    3. Don't populate an ArrayList with the lines, they have no requirement to stay in memory once they have been read/inserted -- instead, do your processing in the Do/While Not (sLine Is Nothing) loop and discard each row once it is created. You can use an isFirst Boolean, or a lineCount Integer to handle your first-line semantics inside that loop -- this should allow for files far larger than 10,000 rows

    4. Consider System.IO.Path.Combine over (di.ToString & fi.ToString) when building paths programmatically

    4a. For what it is worth, fi.FullName is the full path to the file (so the di.ToString & fi.ToString combo is not needed)

    5. Consider System.IO.Path.GetFileNameWithoutExtension(fi.Name) over fi.ToString.Substring(0, fi.ToString.Length - 4)

    6. Wrap your StreamReader code in a Using/End Using block [i.e. Using objReader As StreamReader = New StreamReader(...)] -- if the ReadLine code throws an exception (e.g. if a line length is > 2bn), the End Using will automatically clean up

    7. Consider batching your inserts, running 10-100 at a time. Build the insert statements up in a StringBuilder, separating with a semi-colon/newline combination [builder.Append(sql).AppendLine(";")] and run them when ((lineCount - 1) Mod batchSize = 0)

    7a. Consider offering the batchSize as a Dts.Variable

    8. Properly escape table/column names, either with a RegEx replacement for any non-alphanumeric character, or by using a SqlClient.SqlCommandBuilder object, with QuotePrefix = "[", QuoteSuffix = "]", using the QuoteIdentifier method.

    9. Properly escape CSV data values, with a SqlClient.SqlCommandBuilder object, with QuotePrefix = "'", QuoteSuffix = "'", using the QuoteIdentifier method.

    9a. Alternatively, consider building your insert string as INSERT INTO table VALUES (@p1, @p2, @p3, ...); and using the Parameters collection on the SqlCommand object to populate the values

    10. Consider NVARCHAR(MAX) instead of VARCHAR(20) as your default data type -- .NET strings are UTF16/UCS2 as standard, so it makes sense to store them as such, and the MAX length prevents data truncation

    11. Catch specific exceptions, if you are going to handle them, e.g. IO.IOException is thrown by FileInfo.MoveTo if a file already exists

    12a. Consider checking that the backup folder exists before attempting to copy to it [If Not Directory.Exists(Path.Combine(di.FullName, "Backup")) Then Directory.CreateDirectory(Path.Combine(di.FullName, "Backup"))]

    12b. Consider checking whether the backup file already exists before attempting to move the source file, and deleting it if so [If File.Exists(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name)) Then File.Delete(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name))]

    12c. Consider storing the backup folder location in a variable to reduce the number of Path.Combine calls -- and allow the user to configure this location with a Dts.Variable

    13. Consider making an IDENTITY PK on your imported tables, to aid in row identification later in the T/L process, either named tableName_id or according to a Dts.Variable -- you could turn this feature on or off by whether the variable is set or not (String.IsNullOrEmpty is your friend here).

    14. Consider a recreateSchema Dts.Variable which triggers the drops of existing tables -- this will allow further development of the staging database, e.g. indexing, partitioning, which would be retained between data loads
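    As a rough sketch of how points 3, 6 and 7 fit together (BuildInsertFor and RunBatch are hypothetical helpers standing in for the article's statement-building and execution code; fi is the current FileInfo as in the article):

```vbnet
' Sketch only: stream the file instead of buffering it in an ArrayList,
' wrap the reader in Using so it is cleaned up on error, and batch the INSERTs.
Dim batchSize As Integer = 100
Dim builder As New System.Text.StringBuilder()
Dim lineCount As Integer = 0
Using objReader As New IO.StreamReader(fi.FullName)
    Dim sLine As String = objReader.ReadLine()
    Do While Not sLine Is Nothing
        lineCount += 1
        If lineCount = 1 Then
            ' first line: column names -- handle header semantics here
        Else
            builder.Append(BuildInsertFor(sLine)).AppendLine(";")
            If (lineCount - 1) Mod batchSize = 0 Then
                RunBatch(builder.ToString())   ' execute the accumulated batch
                builder.Length = 0             ' clear it (pre-.NET 4 idiom)
            End If
        End If
        sLine = objReader.ReadLine()
    Loop
End Using
If builder.Length > 0 Then RunBatch(builder.ToString())  ' final partial batch
```

    Because each line is discarded once its INSERT has been built, memory use stays flat regardless of file size.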

  • Thanks for posting the article.

    I think that everyone is misunderstanding. It is not a fully fleshed-out implementation but, as the author mentioned, only something to be used for a proof of concept, for small data loads.

    Thanks...Chris

  • craig 81366 (1/10/2012)


    The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

    I consider SQL, as a set-based language, to be relatively poorly suited to the task of identifying and excluding specific problematic rows.

    Have I understood correctly?

    How would you handle validation moving/copying data to the destination tables?

    Gosh Craig, the only reason why I use staging tables to begin with is to do the very things that you've stated that SQL has a problem with (and it doesn't). I can easily validate data types, ranges and domains of data, do checks against known/expected values in lookup tables, etc., ad infinitum, and all before I let possibly bad data anywhere near my real tables. Each row is premarked for insert, update, or error (along with the type of error) and the "move" code consists of two inserts (one for the good rows, one for the error rows to an errata table) and an update.

    The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
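  • The premark-and-move pattern Jeff describes can be sketched in T-SQL (all table and column names are hypothetical, and TRY_CONVERT requires SQL Server 2012 or later):

```sql
-- Hypothetical sketch: validation premarks each staging row as
-- 'I' (insert), 'U' (update) or 'E' (error); the move is then
-- two set-based inserts and one update.
UPDATE s
SET    RowAction = CASE WHEN TRY_CONVERT(int, s.Qty) IS NULL THEN 'E'
                        WHEN t.Id IS NOT NULL               THEN 'U'
                        ELSE 'I' END
FROM   dbo.Staging s
LEFT JOIN dbo.Target t ON t.Id = TRY_CONVERT(int, s.Id);

INSERT INTO dbo.Target (Id, Qty)                       -- good new rows
SELECT TRY_CONVERT(int, Id), TRY_CONVERT(int, Qty)
FROM   dbo.Staging WHERE RowAction = 'I';

INSERT INTO dbo.Errata (Id, Qty, ErrorType)            -- bad rows, kept aside
SELECT Id, Qty, 'Bad Qty'
FROM   dbo.Staging WHERE RowAction = 'E';

UPDATE t                                               -- existing rows
SET    t.Qty = TRY_CONVERT(int, s.Qty)
FROM   dbo.Target t
JOIN   dbo.Staging s ON t.Id = TRY_CONVERT(int, s.Id)
WHERE  s.RowAction = 'U';
```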

  • Hi,

    Thanks Stan for submitting this. I regularly encounter csv documents that do not have a set number of columns or where the column names are changed regularly. This solution is ideal for my purposes. I can then change the column names more easily using T-SQL.

    Thanks 🙂

    I have enhanced this script to do the following:-

    Determine data types

    Skip blank rows on the csv

    Insert null for missing columns on any csv row

    Insert 1000 rows at a time

    I'm sure that there will be some bugs in this script and I will probably find them in the coming months. Please treat this as a beta version if you use it.

    You must first amend the constant variables at the top of this code to use it.

    ' Microsoft SQL Server Integration Services Script Task
    ' Write scripts using Microsoft Visual Basic 2008.
    ' The ScriptMain is the entry point class of the script.
    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Runtime
    Imports System.IO

    <System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _
    <System.CLSCompliantAttribute(False)> _
    Partial Public Class ScriptMain
        Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase

        Enum ScriptResults
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        End Enum

        ' Amend these constants before use.
        Const _SampleRows As Long = 100000
        Const source_directory As String = "C:\Path to your directory"
        Const SchemaName As String = "Schema For Csv Imports On Your Sql Server"
        Const DBConnection As String = "Name Of ADO.Net Connection In Your Package"

        ' Tracks what kinds of values a column has held and derives the
        ' narrowest SQL Server data type that can hold all of them.
        Class DBType
            Private _HasDate As Boolean = False
            Private _HasFloat As Boolean = False
            Private _HasTiny As Boolean = False
            Private _HasSmall As Boolean = False
            Private _HasInt As Boolean = False
            Private _HasBig As Boolean = False
            Private _HasNumber As Boolean = False
            Private _HasString As Boolean = False
            Private _MaxLen As Long = 0
            Private _DBType As String
            Private _Container As String

            Public Property HasDate() As Boolean
                Get
                    Return _HasDate
                End Get
                Set(ByVal value As Boolean)
                    _HasDate = value
                End Set
            End Property

            Public Property HasFloat() As Boolean
                Get
                    Return _HasFloat
                End Get
                Set(ByVal value As Boolean)
                    _HasFloat = value
                End Set
            End Property

            Public Property HasTiny() As Boolean
                Get
                    Return _HasTiny
                End Get
                Set(ByVal value As Boolean)
                    _HasTiny = value
                End Set
            End Property

            Public Property HasSmall() As Boolean
                Get
                    Return _HasSmall
                End Get
                Set(ByVal value As Boolean)
                    _HasSmall = value
                End Set
            End Property

            Public Property HasInt() As Boolean
                Get
                    Return _HasInt
                End Get
                Set(ByVal value As Boolean)
                    _HasInt = value
                End Set
            End Property

            Public Property HasBig() As Boolean
                Get
                    Return _HasBig
                End Get
                Set(ByVal value As Boolean)
                    _HasBig = value
                End Set
            End Property

            Public Property HasString() As Boolean
                Get
                    Return _HasString
                End Get
                Set(ByVal value As Boolean)
                    _HasString = value
                End Set
            End Property

            Public Property MaxLen() As Long
                Get
                    Return _MaxLen
                End Get
                Set(ByVal value As Long)
                    _MaxLen = value
                End Set
            End Property

            Public ReadOnly Property DBType() As String
                Get
                    Return _DBType
                End Get
            End Property

            Public ReadOnly Property Container() As String
                Get
                    Return _Container
                End Get
            End Property

            ' Pick the final data type (and the quote character, if any, to
            ' wrap values in) once all sampled values have been seen.
            Public Sub Prepare()
                If _HasString Or (_HasDate And _HasNumber) Then
                    _DBType = "NVARCHAR(" & IIf(_MaxLen >= 50, "Max", _MaxLen) & ")"
                    _Container = "'"
                ElseIf _HasDate Then
                    _DBType = "DATETIME"
                    _Container = "'"
                ElseIf _HasFloat Then
                    _DBType = "FLOAT"
                    _Container = ""
                ElseIf _HasBig Then
                    _DBType = "BIGINT"
                    _Container = ""
                ElseIf _HasInt Then
                    _DBType = "INT"
                    _Container = ""
                ElseIf _HasSmall Then
                    _DBType = "SMALLINT"
                    _Container = ""
                ElseIf _HasTiny Then
                    _DBType = "TINYINT"
                    _Container = ""
                Else
                    _DBType = "NVARCHAR(Max)"
                    _Container = "'"
                End If
            End Sub

            ' Inspect one value and widen the candidate type flags accordingly.
            Public Sub ComputeValue(ByVal Value As String)
                If IsDate(Value) Then
                    _HasDate = True
                ElseIf IsNumeric(Value) And Not IsDate(Value) Then
                    _HasNumber = True
                    Dim dbl As Double = CDbl(Value)
                    If dbl Mod 1 <> 0 Then
                        _HasFloat = True
                    ElseIf dbl >= 0 And dbl <= 255 Then
                        _HasTiny = True
                    ElseIf dbl >= (0 - 32768) And dbl <= 32767 Then
                        _HasSmall = True
                    ElseIf dbl >= (0 - 2147483648) And dbl <= (2147483648 - 1) Then
                        _HasInt = True
                    ElseIf dbl >= (0 - 9223372036854775807) And dbl <= (9223372036854775807) Then
                        _HasBig = True
                    Else
                        _HasFloat = True
                    End If
                Else
                    _HasString = True
                End If
                ' Track the longest value as it will be stored (with quotes doubled).
                If _MaxLen < Value.Replace("'", "''").Length Then _MaxLen = Value.Replace("'", "''").Length
            End Sub
        End Class

        Public Sub Main()
            Dim source_file As String = "*.csv"
            Dim di As New System.IO.DirectoryInfo(source_directory)
            Dim aryFi As IO.FileInfo() = di.GetFiles(source_file)
            Dim fi As IO.FileInfo
            For Each fi In aryFi
                ' Read the whole file into memory.
                Dim sLine As String
                Dim arrText As New ArrayList()
                Dim objReader As New StreamReader(fi.FullName)
                Dim DBTypes As New Collection
                Do
                    sLine = objReader.ReadLine()
                    If Not sLine Is Nothing Then
                        arrText.Add(sLine)
                    End If
                Loop Until sLine Is Nothing
                objReader.Close()

                ' Sample up to _SampleRows data rows to determine each column's type.
                Dim FirstLine As String = ""
                Dim lngSampled As Long = 0
                For Each sLine In arrText
                    If FirstLine = "" Then
                        FirstLine = sLine
                    ElseIf FirstLine <> "" And _SampleRows >= lngSampled Then
                        Dim ID As Long = 0
                        Dim DTyp As DBType
                        For Each s As String In sLine.Split(",")
                            If DBTypes.Contains("Col_" & ID) = False Then
                                DTyp = New DBType
                                DBTypes.Add(DTyp, "Col_" & ID)
                            Else
                                DTyp = DBTypes("Col_" & ID)
                            End If
                            DTyp.ComputeValue(s)
                            ID += 1
                        Next
                        lngSampled += 1
                    End If
                Next
                For Each dt As DBType In DBTypes
                    dt.Prepare()
                Next

                ' Build and run the DROP/CREATE TABLE statement.
                Dim FieldNames() As String = Split(FirstLine, ",")
                Dim count As Integer = Split(FirstLine, ",").Length
                Dim mySqlStatement As String = "BEGIN TRY DROP TABLE " & SchemaName & "." & fi.Name.Replace(".", "_") & " END TRY BEGIN CATCH END CATCH CREATE TABLE " & SchemaName & "." & fi.Name.Replace(".", "_") & " ("
                Dim comma As String = ""
                For index As Integer = 0 To count - 1
                    mySqlStatement += comma & "[" & FieldNames(index) & "]" & " " & CType(DBTypes("Col_" & index), DBType).DBType
                    comma = ","
                Next
                mySqlStatement += ")"
                Dim myADONETConnection As SqlClient.SqlConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
                Dim myCommand As New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
                Try
                    myCommand.ExecuteNonQuery()
                    myADONETConnection.Close()
                Catch ex As Exception
                    Dts.TaskResult = ScriptResults.Failure
                    Throw ex
                    Exit Sub
                End Try

                ' Insert the data rows, 1000 per INSERT statement.
                FirstLine = ""
                Dim line_count As Integer = 1
                mySqlStatement = "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES "
                Dim comma2 As String = ""
                Dim c As Long = 0
                For Each sLine In arrText
                    If Trim(sLine) <> "" Then    ' skip blank rows
                        If FirstLine = "" Then
                            FirstLine = sLine    ' header row
                        Else
                            line_count += 1
                            Dim fields() As String = Split(sLine, ",")
                            Dim FieldCount As Integer = Split(sLine, ",").Length
                            mySqlStatement &= comma2 & "("
                            comma = ""
                            For FieldIndex As Integer = 0 To FieldCount - 1
                                If FieldIndex <= FieldNames.Length - 1 Then
                                    Dim dtyp As DBType
                                    dtyp = DBTypes("Col_" & FieldIndex)
                                    ' Pattern must be upper case: with the default
                                    ' Option Compare Binary, Like is case sensitive.
                                    If Not dtyp.DBType Like "*VARCHAR*" And Trim(fields(FieldIndex)) = "" Then
                                        mySqlStatement += comma & "NULL"
                                    Else
                                        If dtyp.DBType = "DATETIME" Then
                                            mySqlStatement += comma & dtyp.Container & CDate(fields(FieldIndex)).ToString("yyyy-MM-dd hh:mm:ss tt") & dtyp.Container
                                        Else
                                            mySqlStatement += comma & dtyp.Container & Replace(fields(FieldIndex), "'", "''") & dtyp.Container
                                        End If
                                    End If
                                    comma = ","
                                End If
                            Next
                            ' Pad short rows with NULLs for any missing columns.
                            If FieldCount < FieldNames.Length Then
                                For i = FieldCount To FieldNames.Length - 1
                                    mySqlStatement += comma & "NULL"
                                Next
                                comma = ","
                            End If
                            mySqlStatement += ")"
                            c += 1
                            If c = 1000 Then
                                Try
                                    myADONETConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
                                    myCommand = New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
                                    myCommand.ExecuteNonQuery()
                                    myADONETConnection.Close()
                                Catch ex As Exception
                                    Dts.TaskResult = ScriptResults.Failure
                                    Throw ex
                                    Exit Sub
                                End Try
                                mySqlStatement = "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES "
                                comma2 = ""
                                c = 0
                            Else
                                comma2 = ","
                            End If
                        End If
                    End If
                Next

                ' Run any remaining partial batch.
                If mySqlStatement <> "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES " Then
                    Try
                        myADONETConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection)
                        myCommand = New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
                        myCommand.ExecuteNonQuery()
                        myADONETConnection.Close()
                    Catch ex As Exception
                        Dts.TaskResult = ScriptResults.Failure
                    End Try
                End If

                ' Delete the processed file. fi.FullName fixes the missing "\"
                ' in the original di.ToString & fi.ToString concatenation.
                Dim file As New System.IO.FileInfo(fi.FullName)
                Try
                    file.Delete()
                Catch ex As Exception
                    Dts.TaskResult = ScriptResults.Failure
                End Try
                DBTypes = Nothing
            Next
            Dts.TaskResult = ScriptResults.Success
        End Sub
    End Class

  • surreydude. (1/10/2012)


    A great little technique but I would probably only use this kind of thing for an initial Proof of Concept. Rarely would you want to (or be allowed to) create dynamic table structures with possibly inaccurate data types and column sizes in a production database. Moreover, I suspect the ETL routines would fail if you have changing data types.

    Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?

    I agree... it's a rare thing. However, I have worked with vendors such as "double-click.net" that will only provide the data in a CSV format as if it were exported from a reporting system or a spreadsheet where there are a handful of key columns on the left and an unknown quantity of columns on the right. The unknown columns actually needed to be paired in the final import. Each column was marked with a name which included the type of column it was and pairs of like-named columns could be paired based on those names. In our case, the files could be anywhere from 11 columns (9 key columns) to hundreds of columns wide based on how many active internet ads a customer may have for any given week.

    Although I don't particularly care for SSIS nor the fact that the technique in this article had to use a script to do the job (I did my thing 100% in T-SQL with no problems), the article is a reasonable introduction as to how you might begin to approach such a thing in SSIS. The author included verification popups for experimentation purposes and once such a system is in place, these popups can be easily removed for hands-off scheduled imports.

    The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.

    Of course, you could also pre-process tables before they come anywhere near SQL Server but I've found that writing such code is a bit of a fright because you might not have set based, declarative tools such as are available in SQL Server to keep from having to write read/write routines with lots of loops and decisions on your own.

    --Jeff Moden



  • Stan,

    Nice step-by-step article with some timely graphics to show how to do things. Although it wasn't specifically named as such, I believe that folks will come to realize that this is a very good "spackle" article that demonstrates the beginnings of a technique. Like any "spackle" article, it's not meant to be a complete solution but it'll sure get some thought juices going for people who have ever had to import "unknown number of columns" files like I've had to do in the past. You could have saved yourself a bit of commentary by identifying that the popups were there only for verification during development and, perhaps, by more clearly identifying when this technique might be used along with suggested "next steps", but it's a good article. Thanks for taking the time to write it and share it.

    --Jeff Moden



  • The only reason I wrote this is because a responder to one of my previous articles asked how to do it. I don't consider it to be something all that useful for all the reasons you stated. I think it is cool how fast it creates and populates the tables. The code works. As a developer, I just enjoy making something work sometimes, even if it is silly.

  • Thanks Stan for taking the time to write the article and post the script. I often have to import files from wide-ranging sources with little information regarding the specific contents of these files. You've given me lots of food for thought in how to do this more efficiently. I'll definitely experiment with the code and technique you've shown here.
