|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Yesterday @ 2:08 PM
Points: 79,
Visits: 600
|
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: 2 days ago @ 7:13 AM
Points: 1,
Visits: 72
|
|
What will be the data type of columns? Will it take the appropriate types such as for integer field int etc... Thanks bhav
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Monday, April 29, 2013 10:35 AM
Points: 35,
Visits: 197
|
|
The code creates all the fields as varchar(20), though you could do an IsNumeric() check and then create the column as an int data type instead.
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Wednesday, May 01, 2013 11:07 AM
Points: 39,
Visits: 221
|
|
I have a concern with this approach.
Since the objective is to "get data from a file into a (staging) table with as little fuss as possible", I presume you want little or no validation during this step.
The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.
I consider SQL, as a set based language to be relatively poorly suited to the task of identifying & excluding specific problematic rows.
Have I understood correctly? How would you handle validation moving/copying data to the destination tables?
|
|
|
|
|
SSC-Enthusiastic
      
Group: General Forum Members
Last Login: Friday, April 19, 2013 4:50 AM
Points: 109,
Visits: 114
|
|
A great little technique but I would probably only use this kind of thing for an initial Proof of Concept. Rarely would you want to (or be allowed to) create dynamic table structures with possibly inaccurate data types and columns sizes in a production database. Morever, I suspect the ETL routines would fail if you have changing data types.
Consider that the CSV has nulls in the first row on the first load - how would you know what the data type is then? What happens if the data is longer than 20 characters?
|
|
|
|
|
SSC Eights!
      
Group: General Forum Members
Last Login: Thursday, May 16, 2013 8:10 PM
Points: 876,
Visits: 3,731
|
|
i feel it as too much complicated when we have the import/export functionality available.
---------- Ashish
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Monday, January 30, 2012 6:49 AM
Points: 94,
Visits: 359
|
|
Regarding the datatype discussion: I think it's a great idea to copy data to staging tables using a varchar datatype in the first step. You can check the datatypes in the next step with sql (sql is not poor ) or in a dataflow. The important thing is, that you have a place within the database where you have all source data (also the bad one) decoupled from its source system(s). In most circumstances this makes the subsequent operations easier (statistics on bad data, determination of deltas, restarting after an abortion, et cetera)
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Today @ 4:25 AM
Points: 36,
Visits: 556
|
|
Something inside me really wanted to hate this technique, but I know it has its uses in a pure "staging" environment (i.e. one where you have already committed to do the T/L part of ETL in procedures between the stage and the ODS).
That said, I do want to offer some pointers with the .NET code itself, which I feel would make this better as a teaching article or reusable code block:
1. Remove all MessageBox.Show calls. Replace with Dts.Log(message) -- might be obvious but noone wants to press OK 30,000 times to load a file
2. In any case, favour StringBuilder over message += "blah"
3. Don't populate an ArrayList with the lines, they have no requirement to stay in memory once they have been read/inserted -- instead, do your processing in the Do/While Not (sLine Is Nothing) loop and discard each row once it is created. You can use an isFirst Boolean, or a lineCount Integer to handle your first-line semantics inside that loop -- this should allow for files far larger than 10,000 rows
4. Consider System.IO.Path.Combine over (di.ToString & fi.ToString) when building paths programatically
4a. For what it is worth, fi.FullName is the full path to the file (so the di.ToString & fi.ToString combo is not needed)
5. Consider System.IO.Path.GetFileNameWithoutExtension(fi.Name) over fi.ToString.Substring(0, fi.ToString.Length - 4)
6. Wrap your StreamReader code in a Using/End Using block [i.e. Using objReader As StreamReader = New StreamReader(...)] -- if the ReadLine code throws an exception (e.g. if a line length is > 2bn), the End Using will automatically clean up
7. Consider batching your inserts, running 10-100 at a time. Build the insert statements up in a StringBuilder, separating with a semi-colon/newline combination [builder.Append(sql).AppendLine(";"c)] and run them when ((lineCount - 1) Mod batchSize = 0)
7a. Consider offering the batchSize as a Dts.Variable
8. Properly escape table/column names, either with a RegEx replacement for any non-alphanumeric character, or by using a SqlClient.SqlCommandBuilder object, with QuotePrefix = "[", QuoteSuffix = "]", using the QuoteIdentifier method.
9. Properly escape CSV data values, with a SqlClient.SqlCommandBuilder object, with QuotePrefix = "'", QuoteSuffix = "'", using the QuoteIdentifier method.
9a. Alternatively, consider building your insert string as INSERT INTO table VALUES (@p1, @p2, @p3, ...); and using the Parameters collection on the SqlCommand object to populate the values
10. Consider NVARCHAR(MAX) instead of VARCHAR(20) as your default data type -- .NET strings are UTF16/UCS2 as standard, so it makes sense to store them as such, and the MAX length prevents data truncation
11. Catch specific exceptions, if you are going to handle them, e.g. IO.IOException is thrown by FileInfo.MoveTo if a file already exists
12a. Consider checking that the backup folder exists before attempting to copy to it [If Not Directory.Exists(Path.Combine(di.FullName, "Backup")) Then Directory.CreateDirectory(Path.Combine(di.FullName, "Backup"))]
12b. Consider checking that the backup file does not exist before attempting to move the source file [If Not File.Exists(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name)) Then File.Delete(Path.Combine(Path.Combine(di.FullName, "Backup"), fi.Name))]
12c. Consider storing the backup folder location in a variable to reduce on Path.Combine calls -- allow the user to configure this location with a Dts.Variable
13. Consider making a IDENTITY PK on your imported tables, to aid in row identification later in the T/L process, either named tableName_id or according to a Dts.Variable -- you could turn this feature on or off by whether the variable is set or not (String.IsNullOrEmpty is your friend here).
14. Consider a recreateSchema Dts.Variable which triggers the drops of existing tables -- this will allow further development of the staging database, e.g. indexing, partitioning, which would be retained between data loads
|
|
|
|
|
Ten Centuries
      
Group: General Forum Members
Last Login: Monday, May 20, 2013 1:31 AM
Points: 1,385,
Visits: 1,085
|
|
Thanks for posting the article.
I think that everyone is mis-understanding. It is not a fully-flushed out implementation, but, as the author mentioned, only to be used for a proof-of-concept, for small data loads.
Thanks...Chris
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Today @ 5:13 AM
Points: 32,906,
Visits: 26,793
|
|
craig 81366 (1/10/2012) The trouble then is that validation will only be performed when getting data from the staging tables to the destination tables.
I consider SQL, as a set based language to be relatively poorly suited to the task of identifying & excluding specific problematic rows.
Have I understood correctly? How would you handle validation moving/copying data to the destination tables?
Gosh Craig, the only reason why I use staging tables to begin with is to do the very things that you've stated that SQL has a problem with (and it doesn't). I can easily validate data types, ranges and domains of data, do checks against known/expected values in lookup table, etc, ad infinitum, and all before I let possibly bad data anywhere near my real tables. Each row is premarked for insert, update, or error (along with the type of error) and the "move" code consists of two inserts (one for the good rows, one for the error rows to an errata table) and an update.
The "final move" to real tables would be done using a dynamic unpivot (you already have the column names in a separate table to make this a bit easier) and then dynamically/conditionally repivoted to meet the structure of the final tables for insert/update. But before you can do all of that, you have to get the data in to work on it and this article shows one possible method for doing such a thing.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|