SQLServerCentral is supported by Redgate
 
Dynamically create and populate staging tables from CSV files



Stan Kulp-439977
Comments posted to this topic are about the item Dynamically create and populate staging tables from CSV files
bhavjay
What will be the data type of the columns? Will it use appropriate types, such as int for integer fields?
Thanks
bhav
Rogman
The code creates all the fields as varchar(20), though you could do an IsNumeric() check and then create the column as an int data type instead.
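That sampling idea can be sketched quickly. This is a hedged Python illustration only (the article's actual code is a VB.NET Script Task, and `infer_sql_type` is a made-up helper name):

```python
import csv
import io

def infer_sql_type(values, default="VARCHAR(20)"):
    """Guess a column type from sample values: INT if every non-empty
    value parses as an integer, otherwise fall back to the default."""
    non_empty = [v.strip() for v in values if v.strip()]
    if non_empty and all(v.lstrip("+-").isdigit() for v in non_empty):
        return "INT"
    return default

# A small CSV standing in for a staged file
sample = io.StringIO("id,name\n1,alice\n2,bob\n")
rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]

# Infer a type per column from the values seen in this load
types = {col: infer_sql_type([r[i] for r in data])
         for i, col in enumerate(header)}
# types -> {'id': 'INT', 'name': 'VARCHAR(20)'}
```

As later posts point out, this is only as reliable as the sample: nulls in the early rows, or longer values arriving in a later load, can make the guess wrong.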



craig 81366
I have a concern with this approach.

Since the objective is to "get data from a file into a (staging) table with as little fuss as possible", I presume you want little or no validation during this step.

The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

I consider SQL, as a set-based language, to be relatively poorly suited to the task of identifying and excluding specific problematic rows.

Have I understood correctly?
How would you handle validation moving/copying data to the destination tables?
surreydude.
A great little technique, but I would probably only use this kind of thing for an initial proof of concept. Rarely would you want to (or be allowed to) create dynamic table structures, with possibly inaccurate data types and column sizes, in a production database. Moreover, I suspect the ETL routines would fail if you have changing data types.

Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?
crazy4sql
I feel it is overly complicated when we have the import/export functionality available.

----------
Ashish
tobe_ha
Regarding the datatype discussion:
I think it's a great idea to copy data to staging tables using a varchar datatype in the first step. You can check the datatypes in the next step with SQL (SQL is not poor ;-)) or in a dataflow. The important thing is that you have a place within the database holding all the source data (including the bad rows), decoupled from its source system(s).
In most circumstances this makes the subsequent operations easier (statistics on bad data, determination of deltas, restarting after an aborted run, et cetera).
jimbobmcgee
Something inside me really wanted to hate this technique, but I know it has its uses in a pure "staging" environment (i.e. one where you have already committed to do the T/L part of ETL in procedures between the stage and the ODS).

That said, I do want to offer some pointers with the .NET code itself, which I feel would make this better as a teaching article or reusable code block:

1. Remove all MessageBox.Show calls. Replace with Dts.Log(message) -- might be obvious, but no one wants to press OK 30,000 times to load a file

2. In any case, favour StringBuilder over message += "blah"

3. Don't populate an ArrayList with the lines, they have no requirement to stay in memory once they have been read/inserted -- instead, do your processing in the Do/While Not (sLine Is Nothing) loop and discard each row once it is created. You can use an isFirst Boolean, or a lineCount Integer to handle your first-line semantics inside that loop -- this should allow for files far larger than 10,000 rows

4. Consider System.IO.Path.Combine over (di.ToString & fi.ToString) when building paths programmatically

4a. For what it is worth, fi.FullName is the full path to the file (so the di.ToString & fi.ToString combo is not needed)

5. Consider System.IO.Path.GetFileNameWithoutExtension(fi.Name) over fi.ToString.Substring(0, fi.ToString.Length - 4)

6. Wrap your StreamReader code in a Using/End Using block [i.e. Using objReader As StreamReader = New StreamReader(...)] -- if the ReadLine code throws an exception (e.g. if a line length is > 2bn), the End Using will automatically clean up

7. Consider batching your inserts, running 10-100 at a time. Build the insert statements up in a StringBuilder, separating with a semi-colon/newline combination [builder.Append(sql).AppendLine(";"c)] and run them when ((lineCount - 1) Mod batchSize = 0)

7a. Consider offering the batchSize as a Dts.Variable

8. Properly escape table/column names, either with a RegEx replacement for any non-alphanumeric character, or by using a SqlClient.SqlCommandBuilder object, with QuotePrefix = "[", QuoteSuffix = "]", using the QuoteIdentifier method.

9. Properly escape CSV data values, with a SqlClient.SqlCommandBuilder object, with QuotePrefix = "'", QuoteSuffix = "'", using the QuoteIdentifier method.

9a. Alternatively, consider building your insert string as INSERT INTO table VALUES (@p1, @p2, @p3, ...); and using the Parameters collection on the SqlCommand object to populate the values

10. Consider NVARCHAR(MAX) instead of VARCHAR(20) as your default data type -- .NET strings are UTF16/UCS2 as standard, so it makes sense to store them as such, and the MAX length prevents data truncation

11. Catch specific exceptions, if you are going to handle them, e.g. IO.IOException is thrown by FileInfo.MoveTo if a file already exists

12a. Consider checking that the backup folder exists before attempting to copy to it [If Not Directory.Exists(Path.Combine(di.FullName, "Backup")) Then Directory.CreateDirectory(Path.Combine(di.FullName, "Backup"))]

12b. Consider checking that the backup file does not exist before attempting to move the source file [If Not File.Exists(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name)) Then File.Delete(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name))]

12c. Consider storing the backup folder location in a variable to reduce on Path.Combine calls -- allow the user to configure this location with a Dts.Variable

13. Consider making an IDENTITY PK on your imported tables, to aid in row identification later in the T/L process, either named tableName_id or according to a Dts.Variable -- you could turn this feature on or off by whether the variable is set or not (String.IsNullOrEmpty is your friend here).

14. Consider a recreateSchema Dts.Variable which triggers the drops of existing tables -- this will allow further development of the staging database, e.g. indexing, partitioning, which would be retained between data loads
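Several of these points combine naturally: stream the file instead of buffering every line (tip 3), batch the inserts (tip 7), and use parameterised statements (tip 9a). Here is a hedged Python/sqlite3 sketch of that pattern, not the article's actual SSIS script; the table and column names are invented:

```python
import csv
import io
import sqlite3

BATCH_SIZE = 100  # per tip 7a, this could come from a package variable

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (col1 TEXT, col2 TEXT)")

# A 250-row CSV standing in for the source file
src = io.StringIO("col1,col2\n" + "\n".join(f"a{i},b{i}" for i in range(250)))
reader = csv.reader(src)
header = next(reader)  # first-line semantics handled once, up front

batch = []
for row in reader:  # stream: no list of all lines is held in memory
    batch.append(row)
    if len(batch) == BATCH_SIZE:
        # parameterised insert, per tip 9a, run one batch at a time
        conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)
        batch.clear()
if batch:  # flush the final partial batch
    conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
# count -> 250
```

The same shape carries over to the VB.NET version with a StreamReader loop and a parameterised SqlCommand.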
CGSJohnson
Thanks for posting the article.

I think that everyone is misunderstanding. It is not a fully fleshed-out implementation but, as the author mentioned, only to be used for a proof of concept, for small data loads.

Thanks...Chris
Jeff Moden
craig 81366 (1/10/2012)
The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.

I consider SQL, as a set based language to be relatively poorly suited to the task of identifying & excluding specific problematic rows.

Have I understood correctly?
How would you handle validation moving/copying data to the destination tables?


Gosh Craig, the only reason why I use staging tables to begin with is to do the very things that you've stated SQL has a problem with (and it doesn't). I can easily validate data types, ranges, and domains of data, do checks against known/expected values in lookup tables, etc., ad infinitum, and all before I let possibly bad data anywhere near my real tables. Each row is premarked for insert, update, or error (along with the type of error), and the "move" code consists of two inserts (one for the good rows, one for the error rows to an errata table) and an update.

The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.
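The premark-then-move flow described above can be sketched end to end. This is a hedged Python/sqlite3 illustration only; in practice it would be T-SQL against the real staging tables, and every table and column name here is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (id TEXT, qty TEXT, action TEXT, err TEXT);
CREATE TABLE target  (id INTEGER PRIMARY KEY, qty INTEGER);
CREATE TABLE errata  (id TEXT, qty TEXT, err TEXT);
INSERT INTO target VALUES (1, 5);
INSERT INTO staging (id, qty) VALUES ('1', '7'), ('2', '3'), ('x', 'bad');

-- premark every staged row for insert, update, or error (with the reason)
UPDATE staging SET action = 'error', err = 'non-numeric value'
 WHERE id NOT GLOB '[0-9]*' OR qty NOT GLOB '[0-9]*';
UPDATE staging SET action = 'update'
 WHERE action IS NULL AND CAST(id AS INTEGER) IN (SELECT id FROM target);
UPDATE staging SET action = 'insert' WHERE action IS NULL;

-- the "move" is then two inserts and an update
INSERT INTO errata SELECT id, qty, err FROM staging WHERE action = 'error';
INSERT INTO target
 SELECT CAST(id AS INTEGER), CAST(qty AS INTEGER)
   FROM staging WHERE action = 'insert';
UPDATE target
   SET qty = (SELECT CAST(s.qty AS INTEGER) FROM staging s
               WHERE s.action = 'update' AND CAST(s.id AS INTEGER) = target.id)
 WHERE id IN (SELECT CAST(id AS INTEGER) FROM staging WHERE action = 'update');
""")
```

After the move, the good rows have landed in the target table and the bad row sits in the errata table with its error type, ready for reporting.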

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
