|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Friday, January 11, 2013 12:37 PM
Points: 1,
Visits: 29
|
|
Hi,
Thanks Stan for submitting this. I regularly encounter csv documents that do not have a set number of columns or where the column names are changed regularly. This solution is ideal for my purposes. I can then change the column names more easily using T-SQL.
Thanks 
I have enhanced this script to do the following:-
Determine data types Skip blank rows on the csv Insert null for missing columns on any csv row Insert 1000 rows at a time
I'm sure that there will be some bugs in this script and I will probably find them in the coming months. Please treat this as a beta version if you use it.
You must first amend the constant variables at the top of this code to use it.
' Microsoft SQL Server Integration Services Script Task ' Write scripts using Microsoft Visual Basic 2008. ' The ScriptMain is the entry point class of the script.
Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Runtime Imports System.IO
<System.AddIn.AddIn("ScriptMain", Version:="1.0", Publisher:="", Description:="")> _ <System.CLSCompliantAttribute(False)> _ Partial Public Class ScriptMain Inherits Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
Enum ScriptResults Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure End Enum
Const _SampleRows As Long = 100000 Const source_directory As String = "C:\Path to your directory" Const SchemaName As String = "Schema For Csv Imports On Your Sql Server" Const DBConnection As String = "Name Of ADO.Net Connection In Your Package"
Class DBType
Private _HasDate As Boolean = False Private _HasFloat As Boolean = False Private _HasTiny As Boolean = False Private _HasSmall As Boolean = False Private _HasInt As Boolean = False Private _HasBig As Boolean = False Private _HasNumber As Boolean = False Private _HasString As Boolean = False Private _MaxLen As Long = 0 Private _DBType As String Private _Container As String
Public Property HasDate() As Boolean Get Return _HasDate End Get Set(ByVal value As Boolean) _HasDate = value End Set End Property
Public Property HasFloat() As Boolean Get Return _HasFloat End Get Set(ByVal value As Boolean) _HasFloat = value End Set End Property
Public Property HasTiny() As Boolean Get Return _HasTiny End Get Set(ByVal value As Boolean) _HasTiny = value End Set End Property
Public Property HasSmall() As Boolean Get Return _HasSmall End Get Set(ByVal value As Boolean) _HasSmall = value End Set End Property
Public Property HasInt() As Boolean Get Return _HasInt End Get Set(ByVal value As Boolean) _HasInt = value End Set End Property
Public Property HasBig() As Boolean Get Return _HasBig End Get Set(ByVal value As Boolean) _HasBig = value End Set End Property
Public Property HasString() As Boolean Get Return _HasString End Get Set(ByVal value As Boolean) _HasString = value End Set End Property
Public Property MaxLen() As Long Get Return _MaxLen End Get Set(ByVal value As Long) _MaxLen = value End Set End Property
Public ReadOnly Property DBType() As String Get Return _DBType End Get End Property
Public ReadOnly Property Container() As String Get Return _Container End Get End Property
Public Sub Prepare() If _HasString Or (_HasDate And _HasNumber) Then _DBType = "NVARCHAR(" & IIf(_MaxLen >= 50, "Max", _MaxLen) & ")" _Container = "'" ElseIf _HasDate Then _DBType = "DATETIME" _Container = "'" ElseIf _HasFloat Then _DBType = "FLOAT" _Container = "" ElseIf _HasBig Then _DBType = "BIGINT" _Container = "" ElseIf _HasInt Then _DBType = "INT" _Container = "" ElseIf _HasSmall Then _DBType = "SMALLINT" _Container = "" ElseIf _HasTiny Then _DBType = "TINYINT" _Container = "" Else _DBType = "NVARCHAR(Max)" _Container = "'" End If End Sub
Public Sub ComputeValue(ByVal Value As String) If IsDate(Value) Then _HasDate = True ElseIf IsNumeric(Value) And Not IsDate(Value) Then
_HasNumber = True
Dim dbl As Double = CDbl(Value)
If dbl Mod 1 <> 0 Then _HasFloat = True ElseIf dbl >= 0 And dbl <= 255 Then _HasTiny = True ElseIf dbl >= (0 - 32768) And dbl <= 32767 Then _HasSmall = True ElseIf dbl >= (0 - 2147483648) And dbl <= (2147483648 - 1) Then _HasInt = True ElseIf dbl >= (0 - 9223372036854775807) And dbl <= (9223372036854775807) Then _HasBig = True Else _HasFloat = True End If Else _HasString = True End If
If _MaxLen < Value.Replace("'", "''").Length Then _MaxLen = Value.Replace("'", "''").Length
End Sub
End Class
Public Sub Main()
Dim source_file As String = "*.csv"
Dim di As New System.IO.DirectoryInfo(source_directory) Dim aryFi As IO.FileInfo() = di.GetFiles(source_file) Dim fi As IO.FileInfo
For Each fi In aryFi
Dim sLine As String Dim arrText As New ArrayList() Dim objReader As New StreamReader(di.ToString & "\" & fi.ToString)
Dim DBTypes As New Collection
Do sLine = objReader.ReadLine() If Not sLine Is Nothing Then arrText.Add(sLine) End If Loop Until sLine Is Nothing
objReader.Close()
Dim FirstLine As String = ""
Dim lngSampled As Long = 0
For Each sLine In arrText If FirstLine = "" Then FirstLine = sLine ElseIf FirstLine <> "" And _SampleRows >= lngSampled Then Dim ID As Long = 0 Dim DTyp As DBType
For Each s As String In sLine.Split(",") If DBTypes.Contains("Col_" & ID) = False Then DTyp = New DBType DBTypes.Add(DTyp, "Col_" & ID) Else DTyp = DBTypes("Col_" & ID) End If
DTyp.ComputeValue(s)
ID += 1
Next
lngSampled += 1
End If Next
For Each dt As DBType In DBTypes dt.Prepare() Next
Dim FieldNames() As String = Split(FirstLine, ",") Dim count As Integer = Split(FirstLine, ",").Length Dim mySqlStatement As String = "BEGIN TRY DROP TABLE " & SchemaName & "." & fi.Name.Replace(".", "_") & " END TRY BEGIN CATCH END CATCH CREATE TABLE " & SchemaName & "." & fi.Name.Replace(".", "_") & " (" Dim comma As String = ""
For index As Integer = 0 To count - 1 mySqlStatement += comma & "[" & FieldNames(index) & "]" & " " & CType(DBTypes("Col_" & index), DBType).DBType comma = "," Next
mySqlStatement += ")"
Dim myADONETConnection As SqlClient.SqlConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection) Dim myCommand As New SqlClient.SqlCommand(mySqlStatement, myADONETConnection)
Try myCommand.ExecuteNonQuery() myADONETConnection.Close() Catch ex As Exception Dts.TaskResult = ScriptResults.Failure Throw ex Exit Sub End Try
FirstLine = "" Dim line_count As Integer = 1
mySqlStatement = "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES "
Dim comma2 As String = "" Dim c As Long = 0
For Each sLine In arrText
If Trim(sLine) <> "" Then
If FirstLine = "" Then FirstLine = sLine Else line_count += 1
Dim fields() As String = Split(sLine, ",") Dim FieldCount As Integer = Split(sLine, ",").Length
mySqlStatement &= comma2 & "("
comma = ""
For FieldIndex As Integer = 0 To FieldCount - 1
If FieldIndex <= FieldNames.Length - 1 Then
Dim dtyp As DBType
dtyp = DBTypes("Col_" & FieldIndex)
If Not dtyp.DBType Like "*varchar*" And Trim(fields(FieldIndex)) = "" Then mySqlStatement += comma & "NULL" Else If dtyp.DBType = "DATETIME" Then mySqlStatement += comma & dtyp.Container & CDate(fields(FieldIndex)).ToString("yyyy-MM-dd hh:mm:ss tt") & dtyp.Container Else mySqlStatement += comma & dtyp.Container & Replace(fields(FieldIndex), "'", "''") & dtyp.Container End If End If
comma = "," End If Next
If FieldCount < FieldNames.Length Then For i = FieldCount To FieldNames.Length - 1 mySqlStatement += comma & "NULL" Next comma = "," End If
mySqlStatement += ")"
c += 1
If c = 1000 Then Try myADONETConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection) myCommand = New SqlClient.SqlCommand(mySqlStatement, myADONETConnection) myCommand.ExecuteNonQuery() myADONETConnection.Close() Catch ex As Exception Dts.TaskResult = ScriptResults.Failure Throw ex Exit Sub End Try
mySqlStatement = "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES " comma2 = "" c = 0 Else comma2 = "," End If
End If End If Next
If mySqlStatement <> "INSERT INTO " & SchemaName & "." & fi.Name.Replace(".", "_") & " VALUES " Then Try myADONETConnection = DirectCast(Dts.Connections(DBConnection).AcquireConnection(Dts.Transaction), SqlClient.SqlConnection) myCommand = New SqlClient.SqlCommand(mySqlStatement, myADONETConnection) myCommand.ExecuteNonQuery() myADONETConnection.Close() Catch ex As Exception Dts.TaskResult = ScriptResults.Failure End Try
End If Dim file As New System.IO.FileInfo(di.ToString & fi.ToString)
Try
file.Delete()
Catch ex As Exception
Dts.TaskResult = ScriptResults.Failure
End Try
DBTypes = Nothing
Next
Dts.TaskResult = ScriptResults.Success End Sub
End Class
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 11:54 PM
Points: 33,113,
Visits: 27,041
|
|
surreydude. (1/10/2012) A great little technique but I would probably only use this kind of thing for an initial Proof of Concept. Rarely would you want to (or be allowed to) create dynamic table structures with possibly inaccurate data types and columns sizes in a production database. Morever, I suspect the ETL routines would fail if you have changing data types.
Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?
I agree... it's a rare thing. However, I have worked with vendors such as "double-click.net" that will only provide the data in a CSV format as if it were exported from a reporting system or a spreadsheet where there are a handful of key columns on the left and an unknown quantity of columns on the right. The unknown columns actually needed to be paired in the final import. Each column was marked with a name which included the type of column it was and pairs of like-named columns could be paired based on those names. In our case, the files could be anywhere from 11 columns (9 key columns) to hundreds of columns wide based on how many active internet ads a customer may have for any given week.
Although I don't particularly care for SSIS nor the fact that the technique in this article had to use a script to do the job (I did my thing 100% in T-SQL with no problems), the article is a reasonable introduction as to how you might begin to approach such a thing in SSIS. The author included verification popups for experimentation purposes and once such a system is in place, these popups can be easily removed for hands-off scheduled imports.
The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.
Of course, you could also pre-process tables before they come anywhere near SQL Server but I've found that writing such code is a bit of a fright because you might not have set based, declarative tools such as are available in SQL Server to keep from having to write read/write routines with lots of loops and decisions on your own.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 11:54 PM
Points: 33,113,
Visits: 27,041
|
|
Stan,
Nice step-by-step article with some timely graphics to show how to do things. Although it wasn't specifically named as such, I believe that folks will come to realize that this is a very good "spackle" article that demonstrates the beginnings of a technique. Like any "spackle" article, it's not meant to be a complete solution but it'll sure get some thought juices going for people who have ever had to import "unknown number of columns" files like I've had to do in the past. You could have saved yourself a bit of commentary by identifying that the popups where there only for verification during development and, perhaps, more clearly identifying not only when this technique might be used along with suggested "next steps", but it's a good article. Thanks for taking the time to write it and share it.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: 2 days ago @ 6:07 AM
Points: 81,
Visits: 628
|
|
| The only reason I wrote this is because a responder to one of my previous articles asked how to do it. I don't consider it to be something all that useful for all the reasons you stated. I think it is cool how fast it creates and populates the tables. The code works. As a developer, I just enjoy making something work sometimes, even if it is silly.
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: 2 days ago @ 1:59 PM
Points: 402,
Visits: 489
|
|
| Thanks Stan for taking the time to write the article and post the script. I often have to import files from wide-ranging sources with little information regarding the specific contents of these files. You've given me lots of food for thought in how to do this more efficiently. I'll definitely experiment with the code and technique you've shown here.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Thursday, June 13, 2013 11:22 AM
Points: 6,
Visits: 227
|
|
Hi Stan, I very much like the article.
Sometimes I just want to ram some data into a database simply to explore the data. This technique can be very helpful when that is my aim.
Nice work.
John R
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Yesterday @ 3:11 PM
Points: 205,
Visits: 709
|
|
jimbobmcgee (1/10/2012)
That said, I do want to offer some pointers with the .NET code itself, which I feel would make this better as a teaching article or reusable code block:
[enumerated list 1-14 of constructive feedback]
Thank you for the additional insights. If the original is a solution-starter, your comments are a reminder to keep a quality end in mind during development. Since we rarely go back to improve these kinds of utilities, it's a good idea to get them as near-optimal as possible on the first effort. (time permitting)
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: 2 days ago @ 6:07 AM
Points: 81,
Visits: 628
|
|
That's some really nice code.
I just wrote mine to show that the idea would work.
Yours would work well.
Thanks!
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: 2 days ago @ 6:07 AM
Points: 81,
Visits: 628
|
|
@bhavjay
To me, a staging table is just that, the first stage of processing the data.
Since we don't really know what kind of data the fields are going to hold, we poke the data into VARCHARs just to get it into a table.
The next stage would be actually looking at the data and writing an insert into/select statement that would cast the fields to the desired types and insert the records into the destination table.
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: Wednesday, January 16, 2013 4:23 PM
Points: 415,
Visits: 2,333
|
|
Jeff Moden (1/10/2012) Of course, you could also pre-process tables before they come anywhere near SQL Server but I've found that writing such code is a bit of a fright because you might not have set based, declarative tools such as are available in SQL Server to keep from having to write read/write routines with lots of loops and decisions on your own.
Actually, performing loops and tests in other languages is much easier. T-SQL is not that great a language in regards to either performance or feature sets. T-SQL's main advantage is that it has a database system hooked up to it.
I don't think you should give up on loops and tests just because T-SQL is not that great a programming environment, unless you're really just not that interested in programming in general.
/opinion
|
|
|
|