
SSIS Lookup using a Script Component
Posted Tuesday, December 30, 2008 5:55 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, November 04, 2013 1:49 AM
Points: 164, Visits: 188
kh (12/30/2008)
I find this related article very interesting:
http://blogs.msdn.com/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx


It describes the performance differences between the Lookup, Merge, and Script Component approaches used to solve the same logic. Very cool reference.


Indeed a good reference. I am surprised that I never came across it while googling around for SSIS lookup.





Regards,
Sachin Dedhia
Post #627273
Posted Tuesday, December 30, 2008 6:17 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, November 04, 2013 1:49 AM
Points: 164, Visits: 188
Hi Rob,

The data I have does not have a PK. It is basically a year's dump from many locations, used for analysis on the available fields.

While the missing list is expected to contain hardly anything, there can be a case where the reference data is missing completely, making the Sort component (and maybe the server as well) fall over.

Moreover, if even one of the CourseID values in a row is missing, the entire record should be discarded. At the same time, the other (non-CourseID) fields should still be validated, and all failed validations should be consolidated into a comments field for the user.

Lastly, the reference dataset I use is filtered and has ~15K rows in one scenario and ~50K rows in another. The underlying table may have much more data.
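
For illustration, a rough C# sketch of that discard-and-comment rule. All names here are made up for the example and are not taken from the article or the package:

    using System;
    using System.Collections.Generic;

    public static class RowValidator
    {
        // Returns true when the row should be kept. Any missing or unknown
        // CourseID discards the whole record; every failure (CourseID or
        // otherwise) is still collected into a single comments string.
        public static bool Validate(
            IEnumerable<string> courseIds,        // the CourseID columns of one row
            Func<string, bool> courseExists,      // reference lookup, e.g. a dictionary probe
            IEnumerable<string> otherFieldErrors, // failures from the non-CourseID checks
            out string comments)
        {
            var messages = new List<string>(otherFieldErrors);
            bool discard = false;

            foreach (string id in courseIds)
            {
                if (string.IsNullOrEmpty(id) || !courseExists(id))
                {
                    discard = true;               // one bad CourseID drops the record
                    messages.Add("CourseID not found: " + id);
                }
            }

            comments = string.Join("; ", messages.ToArray());
            return !discard;
        }
    }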





Regards,
Sachin Dedhia
Post #627283
Posted Tuesday, December 30, 2008 9:42 AM
Grasshopper


Group: General Forum Members
Last Login: Tuesday, November 27, 2012 12:43 PM
Points: 12, Visits: 35
Ah hah... well, in the case of a missing PK you could make a surrogate key representing the import job, filename, etc. Your solution is obviously working, and it performs. Like I said in the beginning, I am curious to know how my approach would perform, since I originally believed it would use less memory than the lookup, but I realize now that, after the pivot, my solution would use more memory per column. I'll have to do some reading about the SSIS memory "back-pressure" controls I heard about at a SQL meeting; I think those would help keep the memory needed for the entire task to a minimum.
Post #627444
Posted Tuesday, December 30, 2008 4:30 PM
Ten Centuries


Group: General Forum Members
Last Login: Friday, August 10, 2012 6:08 PM
Points: 1,156, Visits: 801
Nicely illustrated article, with step-by-step explanations. Easy to follow. It also demonstrates the nomenclature and usage of script components. There are little caveats one must know to successfully navigate the script component, and it appears you have worked them out.

This approach is an RBAR approach... even more so, not just row by row, but field by field within each row. As such, it would not be very flexible if the column count changes.

Having dealt with this type of data many times in the education industry, I know there are often multiple course selections. In practice, normalizing the data has always proved better, faster, and more flexible for future use/re-use of both the component and the data, no matter what the final output format may be. In a normalized format, it does not matter whether there are 1 or 100 CourseID columns, nor how large the comparison set is.
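
As a rough T-SQL illustration of that normalization step, with invented staging table and column names, and assuming the CourseID columns share a data type:

    -- Unpivot repeating CourseID1..CourseIDn columns into one row per course.
    -- UNPIVOT silently drops NULL slots, and the downstream lookup/validation
    -- no longer cares how many course columns the feed happens to have.
    SELECT StudentKey, CourseSlot, CourseID
    FROM dbo.StagingEnrollment
    UNPIVOT (CourseID FOR CourseSlot IN (CourseID1, CourseID2, CourseID3)) AS u;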
Post #627732
Posted Saturday, January 03, 2009 8:42 AM


Mr or Mrs. 500


Group: General Forum Members
Last Login: Thursday, March 13, 2014 10:40 AM
Points: 573, Visits: 107
It's a real shame the script component doesn't expose a columns collection for the Row object. That way you could iterate over the columns very easily and not have to repeat the same code.

You could do that in a custom component, but that's another level of work.

This is a great solution, but I would question the performance of the Find method on the DataTable.

You might find it quicker to populate a Dictionary object and do the lookup that way.
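
For readers who want to try that, here is a minimal C#-flavoured sketch under assumed names (the reference column is called CourseID here only for illustration): load the reference DataTable into a Dictionary once, then probe it for each incoming row instead of calling DataTable.Rows.Find().

    using System;
    using System.Collections.Generic;
    using System.Data;

    public class CourseLookup
    {
        private readonly Dictionary<string, bool> courseIds =
            new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

        // Load once, e.g. from the script component's PreExecute,
        // using the same reference DataTable the Find() call works against.
        public void LoadReference(DataTable referenceTable)
        {
            foreach (DataRow row in referenceTable.Rows)
            {
                courseIds[row["CourseID"].ToString()] = true;   // hashed key, O(1) probes
            }
        }

        // Probe once per incoming value, e.g. from ProcessInputRow.
        public bool Exists(string courseId)
        {
            return courseId != null && courseIds.ContainsKey(courseId);
        }
    }

A HashSet<string> would do the same job slightly more directly when only existence matters; the Dictionary form is shown because that is what the post suggests.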



Simon Sabin
SQL Server MVP

http://sqlblogcasts.com/blogs/simons
Post #629274
Posted Monday, January 05, 2009 7:26 AM


Forum Newbie


Group: General Forum Members
Last Login: Thursday, April 10, 2014 5:39 AM
Points: 5, Visits: 33
Hi Sachin

Being a DW developer, I am always on the lookout for the simplest way of doing things. Another suggestion/question (if I understand the scenario correctly): why not add an intelligent key column... something like checksum(col1 + col2 + col3... + coln)? Then just add the same key logic to your source data query and to the lookup, so that you compare one set of keys. That way you are passing the logic out of SSIS and to T-SQL, where the queries will be more malleable and agile.
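
A rough T-SQL sketch of that idea, with made-up table and column names. Note that CHECKSUM accepts a column list directly, and that checksums can collide, so a matching hash should still be confirmed against the underlying columns when it matters:

    -- Reference side: persist the hash so it can be indexed.
    ALTER TABLE dbo.CourseReference
        ADD RowHash AS CHECKSUM(CourseID, LocationID, TermCode) PERSISTED;

    -- Source query feeding the lookup: compute the same hash the same way.
    SELECT CourseID, LocationID, TermCode,
           CHECKSUM(CourseID, LocationID, TermCode) AS RowHash
    FROM dbo.SourceFeed;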

Great article though... Please post more of these "out-of-the-box" articles!!

Thanks.



_____________________________________________________________________
Select 'Remember, we were all newbies once'
From Truth
Order by Experience

_____________________________________________________________________
Post #629786
Posted Monday, January 05, 2009 7:52 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, November 04, 2013 1:49 AM
Points: 164, Visits: 188
Hi Simon

I am not a .NET guy, so there may be a better approach for caching the reference data, like the Dictionary object you mention.

You may want to refer to http://www.eggheadcafe.com/articles/20030830.asp for some input on the Find method.

HTH!





Regards,
Sachin Dedhia
Post #629810
Posted Monday, January 05, 2009 7:57 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, November 04, 2013 1:49 AM
Points: 164, Visits: 188
Zanonil,

Can you elaborate more on CHECKSUM? An example would help.






Regards,
Sachin Dedhia
Post #629818
Posted Monday, January 05, 2009 8:16 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, November 04, 2013 1:49 AM
Points: 164, Visits: 188
Hi dphillips,

I appreciate your observation on the "nomenclature"! And as far as the article is concerned, thanks to Steve for his valuable comments.

I am not sure why this is an RBAR approach. In almost all cases, ETL involves transformations that work on row-level data.

While I agree with you on having a normalized database structure, my requirement does not call for doing much with normalized data. If I normalized it, I would have to de-normalize the data again to generate a file as a feed to another application.





Regards,
Sachin Dedhia
Post #629835
Posted Monday, January 05, 2009 9:24 AM
Ten Centuries


Group: General Forum Members
Last Login: Friday, August 10, 2012 6:08 PM
Points: 1,156, Visits: 801
Sachin Dedhia (1/5/2009)
I am not sure why this is an RBAR approach. In almost all cases, ETL involves transformations that work on row-level data.


Being an RBAR code process is strictly separate from whether this is the fastest execution path for your given needs. I was not speaking to speed, but rather to flexibility and re-usability. Your code component is fast, but not dynamic.

An ETL plan doesn't have to process via SSIS one row at a time, nor one column at a time, especially if you are dealing with something that can and will change (such as the number of course columns). It is usually more prudent to seek an approach where one does not have to modify code and re-deploy to the package server to process the next file, or, worse yet, end up in the nasty scenario of having a different package for each source when all of them do the same type of task.

If your incoming format will remain stable (a rare thing indeed), then it likely doesn't matter. However, if it can change, then ETL-ing into a database and using the SQL engine to process the matches, by having SSIS call the appropriate procedures, is far more flexible, and it still stays within the realm of SSIS ETL.

I have also learned (the hard way) that the data is better kept for other purposes. Rare indeed are the projects I've come across that serve an "enrichment"- or "counts"-only purpose (one example of enrichment is data cleansing against a third-party tool while importing and normalizing incoming lead data).

Your case may be just such a case, so your mileage may vary. But in principle, the approach as it stands relies on custom code, making it flex only by brute code force, even for simple changes such as more (or fewer) columns of the same type, which gives it a high probability of failure over time.

Output (usually the fastest step in the process) doesn't preclude appending to the current file, writing a new one in-process, or even using an SSIS task that pushes data from the database to a file.

I have to admit that dumping notes en masse about all the courses in the row makes me shudder and question the process as a whole.

I do think the article is useful on several points, however, as your article was about the use of the component and not necessarily the particular data process.
Post #629920