Dynamically create and populate staging tables from CSV files
Posted Monday, January 9, 2012 10:10 PM


SSC-Enthusiastic

Comments posted to this topic are about the item Dynamically create and populate staging tables from CSV files
Post #1232909
Posted Monday, January 9, 2012 10:31 PM
Forum Newbie

What will the data types of the columns be? Will it use appropriate types, such as int for integer fields?
Thanks
bhav
Post #1232916
Posted Monday, January 9, 2012 11:27 PM
SSC Rookie

The code creates all the fields as varchar(20), though you could do an IsNumeric() check and then create the column as an int data type instead.
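A minimal sketch of that idea, assuming a hypothetical InferColumnType helper inside the Script Task (the helper name and the float fallback are my own illustration, not the article's code):

' Hypothetical helper: pick a column data type from a sample value.
' Falls back to varchar(20), the article's default, when the value is not numeric.
Private Function InferColumnType(ByVal sampleValue As String) As String
    Dim intValue As Integer
    If Integer.TryParse(sampleValue, intValue) Then
        Return "int"
    ElseIf IsNumeric(sampleValue) Then
        Return "float"
    Else
        Return "varchar(20)"
    End If
End Function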


Post #1232927
Posted Tuesday, January 10, 2012 1:00 AM
SSC Rookie

I have a concern with this approach.

Since the objective is to "get data from a file into a (staging) table with as little fuss as possible", I presume you want little or no validation during this step.

The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

I consider SQL, as a set-based language, to be relatively poorly suited to the task of identifying and excluding specific problematic rows.

Have I understood correctly?
How would you handle validation moving/copying data to the destination tables?
Post #1232961
Posted Tuesday, January 10, 2012 1:06 AM
SSC Veteran

A great little technique, but I would probably only use this kind of thing for an initial proof of concept. Rarely would you want to (or be allowed to) create dynamic table structures with possibly inaccurate data types and column sizes in a production database. Moreover, I suspect the ETL routines would fail if the data types change between loads.

Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?
Post #1232964
Posted Tuesday, January 10, 2012 2:35 AM


SSC Eights!

This feels overly complicated to me when the built-in import/export functionality is already available.

----------
Ashish
Post #1232994
Posted Tuesday, January 10, 2012 3:55 AM
SSC Journeyman

Regarding the datatype discussion:
I think it's a great idea to copy data to staging tables using a varchar data type in the first step. You can check the data types in the next step with SQL (SQL is not poorly suited to this) or in a data flow. The important thing is that you have a place within the database that holds all of the source data (including the bad data), decoupled from its source system(s).
In most circumstances this makes the subsequent operations easier (statistics on bad data, determination of deltas, restarting after an aborted load, et cetera).
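A minimal T-SQL sketch of that kind of check, using a hypothetical staging table (stg.Customer and its columns are illustrative names, and ISNUMERIC/ISDATE are only rough tests):

-- Hypothetical profiling query against the all-varchar staging table:
-- counts the rows whose values would not convert cleanly in the next step.
SELECT COUNT(*)                                                 AS TotalRows,
       SUM(CASE WHEN ISNUMERIC(Amount)  = 0 THEN 1 ELSE 0 END)  AS BadAmounts,
       SUM(CASE WHEN ISDATE(OrderDate)  = 0 THEN 1 ELSE 0 END)  AS BadOrderDates
FROM   stg.Customer;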
Post #1233026
Posted Tuesday, January 10, 2012 4:47 AM
Valued Member

Something inside me really wanted to hate this technique, but I know it has its uses in a pure "staging" environment (i.e. one where you have already committed to do the T/L part of ETL in procedures between the stage and the ODS).

That said, I do want to offer some pointers with the .NET code itself, which I feel would make this better as a teaching article or reusable code block:

1. Remove all MessageBox.Show calls. Replace with Dts.Log(message) -- might be obvious, but no one wants to press OK 30,000 times to load a file

2. In any case, favour StringBuilder over message += "blah"

3. Don't populate an ArrayList with the lines; they have no requirement to stay in memory once they have been read/inserted -- instead, do your processing in the Do/While Not (sLine Is Nothing) loop and discard each row once it is created. You can use an isFirst Boolean or a lineCount Integer to handle your first-line semantics inside that loop -- this should allow for files far larger than 10,000 rows (see the sketch after this list)

4. Consider System.IO.Path.Combine over (di.ToString & fi.ToString) when building paths programmatically

4a. For what it is worth, fi.FullName is the full path to the file (so the di.ToString & fi.ToString combo is not needed)

5. Consider System.IO.Path.GetFileNameWithoutExtension(fi.Name) over fi.ToString.Substring(0, fi.ToString.Length - 4)

6. Wrap your StreamReader code in a Using/End Using block [i.e. Using objReader As StreamReader = New StreamReader(...)] -- if the ReadLine code throws an exception (e.g. if a line length is > 2bn), the End Using will automatically clean up

7. Consider batching your inserts, running 10-100 at a time. Build the insert statements up in a StringBuilder, separating with a semi-colon/newline combination [builder.Append(sql).AppendLine(";")], and run them when ((lineCount - 1) Mod batchSize = 0)

7a. Consider offering the batchSize as a Dts.Variable

8. Properly escape table/column names, either with a RegEx replacement for any non-alphanumeric character, or by using a SqlClient.SqlCommandBuilder object, with QuotePrefix = "[", QuoteSuffix = "]", using the QuoteIdentifier method.

9. Properly escape CSV data values, with a SqlClient.SqlCommandBuilder object, with QuotePrefix = "'", QuoteSuffix = "'", using the QuoteIdentifier method.

9a. Alternatively, consider building your insert string as INSERT INTO table VALUES (@p1, @p2, @p3, ...); and using the Parameters collection on the SqlCommand object to populate the values

10. Consider NVARCHAR(MAX) instead of VARCHAR(20) as your default data type -- .NET strings are UTF16/UCS2 as standard, so it makes sense to store them as such, and the MAX length prevents data truncation

11. Catch specific exceptions, if you are going to handle them, e.g. IO.IOException is thrown by FileInfo.MoveTo if a file already exists

12a. Consider checking that the backup folder exists before attempting to copy to it [If Not Directory.Exists(Path.Combine(di.FullName, "Backup")) Then Directory.CreateDirectory(Path.Combine(di.FullName, "Backup"))]

12b. Consider deleting any existing backup file before attempting to move the source file [If File.Exists(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name)) Then File.Delete(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name))]

12c. Consider storing the backup folder location in a variable to reduce on Path.Combine calls -- allow the user to configure this location with a Dts.Variable

13. Consider making an IDENTITY PK on your imported tables, to aid in row identification later in the T/L process, either named tableName_id or according to a Dts.Variable -- you could turn this feature on or off by whether the variable is set or not (String.IsNullOrEmpty is your friend here).

14. Consider a recreateSchema Dts.Variable which triggers the drops of existing tables -- this will allow further development of the staging database, e.g. indexing, partitioning, which would be retained between data loads
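A minimal sketch pulling together points 3, 6 and 7 above, under the assumption of hypothetical helpers (csvPath, batchSize, RunSql, BuildCreateTableSql and BuildInsertSql are illustrative stand-ins, not the article's code):

Imports System.IO
Imports System.Text

Public Sub LoadCsv(ByVal csvPath As String, ByVal batchSize As Integer)
    ' Stream the file line by line inside a Using block and batch the generated INSERTs.
    Using objReader As New StreamReader(csvPath)
        Dim batch As New StringBuilder()
        Dim lineCount As Integer = 0
        Dim sLine As String = objReader.ReadLine()

        Do While Not (sLine Is Nothing)
            lineCount += 1
            If lineCount = 1 Then
                ' First line holds the column names -> build and run the CREATE TABLE.
                RunSql(BuildCreateTableSql(sLine))
            Else
                ' One INSERT per data row; flush every batchSize rows.
                batch.Append(BuildInsertSql(sLine)).AppendLine(";")
                If (lineCount - 1) Mod batchSize = 0 Then
                    RunSql(batch.ToString())
                    batch.Length = 0
                End If
            End If
            sLine = objReader.ReadLine()
        Loop

        ' Flush whatever is left in the final, partial batch.
        If batch.Length > 0 Then RunSql(batch.ToString())
    End Using
End Sub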
Post #1233057
Posted Tuesday, January 10, 2012 5:45 AM
SSCommitted

Thanks for posting the article.

I think that everyone is misunderstanding. It is not a fully fleshed-out implementation but, as the author mentioned, only to be used for a proof of concept, for small data loads.

Thanks...Chris
Post #1233091
Posted Tuesday, January 10, 2012 5:55 AM


SSC-Dedicated

craig 81366 (1/10/2012)
The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

I consider SQL, as a set-based language, to be relatively poorly suited to the task of identifying and excluding specific problematic rows.

Have I understood correctly?
How would you handle validation moving/copying data to the destination tables?


Gosh Craig, the only reason I use staging tables to begin with is to do the very things that you've stated SQL has a problem with (and it doesn't). I can easily validate data types, ranges, and domains of data, do checks against known/expected values in lookup tables, etc., ad infinitum, and all before I let possibly bad data anywhere near my real tables. Each row is premarked for insert, update, or error (along with the type of error), and the "move" code consists of two inserts (one for the good rows, one for the error rows to an errata table) and an update.

The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.
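A minimal T-SQL sketch of that premark-then-move pattern, with hypothetical staging, destination, and errata tables (the names, the Action column, and the ISNUMERIC test are illustrative only, not Jeff's actual code):

-- 1) Premark every staged row as INSERT, UPDATE, or ERROR.
UPDATE s
SET    Action = CASE
                    WHEN ISNUMERIC(s.Amount) = 0    THEN 'ERROR'
                    WHEN t.CustomerCode IS NOT NULL THEN 'UPDATE'
                    ELSE 'INSERT'
                END
FROM   stg.Customer AS s
LEFT JOIN dbo.Customer AS t
       ON t.CustomerCode = s.CustomerCode;

-- 2) The "move" is then two inserts and an update, driven by the marks.
INSERT INTO dbo.Customer (CustomerCode, Amount)
SELECT CustomerCode, CAST(Amount AS int)
FROM   stg.Customer
WHERE  Action = 'INSERT';

UPDATE t
SET    t.Amount = CAST(s.Amount AS int)
FROM   dbo.Customer AS t
JOIN   stg.Customer AS s
       ON s.CustomerCode = t.CustomerCode
WHERE  s.Action = 'UPDATE';

INSERT INTO dbo.Customer_Errata (CustomerCode, Amount, ErrorReason)
SELECT CustomerCode, Amount, 'Amount is not numeric'
FROM   stg.Customer
WHERE  Action = 'ERROR';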


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1233097