SSIS Lookup using a Script Component

  • Sachin Dedhia

    SSCrazy

    Points: 2104

    Comments posted to this topic are about the item SSIS Lookup using a Script Component


    Regards,
    Sachin Dedhia

  • Robert C. Wafle

    SSC Veteran

    Points: 214

    Hello Sachin.

    Sounds like your solution works very well.

    Did you consider creating a transformation where you transform the multiple columns into say two or three columns, so you would have just one lookup component? I believe the advantage of this approach would be that you would think you would be able to process lookups against a larger set of values by letting SSIS take care of the caching, or by turning off the caching option in SSIS.

    Your original table structure would be something like:

    PK course_1, course_2, course_3, ...

    1, a, b, c

    Your new table structure would end up being something like:

    PK, column_name, column_value

    1, course_1, a

    1, course_2, b

    1, course_3, c

    You could easily process the data in this format and provide the desired output into your two tables. I would be curious to know if this approach ends up being any faster or slower in your case.

    If you don't know how to perform that kind of a transformation, I will be happy to point you in the correct direction.

    Thanks

    Rob

  • Anipaul

    SSC-Insane

    Points: 24681

    Nice article with good explanation...

  • Jeff Moden

    SSC Guru

    Points: 995703

    I'm curious... how many rows in the Course Reference table and any chance of you attaching the first 10 rows of the 10 million row file you imported? I'd like to play... 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sachin Dedhia

    SSCrazy

    Points: 2104

    Hello Rob,

    If I understand it right, you mean Pivot the CourseID columns and perform the lookup operation. If so, I would have to Unpivot it again as my destination table is not normalized. Also, the input would have many more fields like Name, Age, etc...., the CourseID columns may or may not have a value, may be in any upper case or lower case. Moreover, I needed to build a unique list of the values missing in the reference table which I could do in the same script component as well. And, all this coupled with the huge volume of 10+ million rows.

    That said, the approach you mentioned is definitely a good try.


    Regards,
    Sachin Dedhia

  • kh-619447

    SSC Journeyman

    Points: 88

    I find this related article very interesting:

    http://blogs.msdn.com/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx

    It describes the performace differences from Lookup, Merge, and Script Component used to solve the same logic. Very cool reference.

  • Robert C. Wafle

    SSC Veteran

    Points: 214

    Hi Sachin.

    Yes, pivot and unpivot.. This would might make your script component contain less code in the case where there are 20 columns. The, you could lookup or merge and split, like in the article KH suggested. Perhaps a novice SSIS user with no coding experience could modify the pivot to pivot a few more columns easier than editing the script component... Perhpaps not.. But, since you have 10-15m rows.. the performance gain given by the script task will likely outweigh any need for simplicity in maintaining the SSIS package. Alternatively, if you had many different file formats to work with, one could adapt just the first component to the differing schemas instead of adapting the code which is three transformations into the package..

    The unique list of values that didn't match could be created with a distinct sort against the rows where the lookup failed... as long as too many rows didn't actually fail. I've experienced performance issues with the distinct sort along with anyone else who has tried to use it.

    Please note you can greatly reduce the amount of memory used by the lookup in many cases simply by specifying exactly what columns you'll actually be using in the lookup using a select statement instead of specifying the whole table with the drop down. I doubt your production server will have nearly as much free memory as your workstation.

    In the end, the number of rows in the course reference table will dictate if you can continue to load all the rows into your datatable or not. It is likely this table is very small. How many rows do you have there?

    Thanks

    Rob

    Rob

  • Jeff Moden

    SSC Guru

    Points: 995703

    Like I said, I'm curious... how many rows in the Course Reference table and any chance of you attaching the first 10 rows of the 10 million row file you imported? I'd like to play...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sachin Dedhia

    SSCrazy

    Points: 2104

    Jeff,

    I can't get you the exact sample of what I have, but I'll try and get you something near to it soon.

    BTW, the reference dataset used for lookup is filtered in my case and varies from ~15K rows in one scenario to ~50K rows in another. The table can have much more data.


    Regards,
    Sachin Dedhia

  • Jeff Moden

    SSC Guru

    Points: 995703

    Thanks... guess the rowcounts you gave will do just fine. Don't worry too much about the data... I guess I can gin something up.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sachin Dedhia

    SSCrazy

    Points: 2104

    kh (12/30/2008)


    I find this related article very interesting:

    http://blogs.msdn.com/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx

    It describes the performace differences from Lookup, Merge, and Script Component used to solve the same logic. Very cool reference.

    Indeed a good reference. I am surprised that I never came across it while googling around for SSIS lookup.


    Regards,
    Sachin Dedhia

  • Sachin Dedhia

    SSCrazy

    Points: 2104

    Hi Rob,

    The data I have does not have a PK. It is basically a dump for a year from many locations and used for analysis on the avaliable fields.

    While the missing list is expected to have hardly anything, there can be a case where the reference data may be missing completely making the Sort component (and may be the server as well) go for a toss.

    Moreover, if even one of the values from a row (for CourseID) is missing, the entire record should be discarded. At the same time, other fields (non CourseID fields) should still be validated and all the failed validations should be consolidated in a comments field for the user.

    Lastly, the reference dataset I use is filtered and has ~15K rows in one scenario and ~50K rows in another. The underlying table may have much more data.


    Regards,
    Sachin Dedhia

  • Robert C. Wafle

    SSC Veteran

    Points: 214

    Ah hah.. well in the case of a missing PK you could make a surrogate key representing the import job / filename, etc... Your solution is ovbiously working, and performs. Like I said in the beginning, I am curious to know how my approach would perform since I believed in the beginning that it would use less memory from the lookup, but I realize now after the pivot that my solution would use more memory per column.. I'll have to do some reading about the SSIS memory "back-pressure" controls I heard about at an SQL meeting.. I think that would help keep the memory needed for the entire task to a minimum.

  • DPhillips-731960

    Hall of Fame

    Points: 3904

    Nicely illustrated article, and stepped explanations. Easy to follow. It also demonstrates nomenclature and useage of script components. There are little trinkets of caveats that one must know to succesfully navigate the script component, and it appears you have worked it out.

    This approach is an RBAR approach... even more so, not just row by row, but field by field in each row. As such would not be very flexible for change in column count.

    Having dealt with this type of data many times in the education industry, there are often multiple course selections. In practice it has always served better, faster, and more flexible for future use/re-use of both the component and the data to normalize the data, no matter what the final output format may be. In a normalized format, it matters not if there are 1 or 100 CourseID columns, nor the size of the comparison set.

  • Simon Sabin

    SSCrazy Eights

    Points: 8142

    Its a real shame the script component doesn't expose a columns collection for the Row object. That way you could iterate over the columns very easily and not have to repeat the same code.

    You could do that in a custom component but thats another level of work.

    This is a great solution, I would question the performance of the Find method for the Data Table.

    You might find it quicker to populate a Dictionary object and do the lookup that way.


    Simon Sabin
    SQL Server MVP

    http://sqlblogcasts.com/blogs/simons

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply