April 8, 2008 at 4:37 am
I'm pulling records from a SQL database and a mainframe database. The SQL database exports some (but not all) of its records to the mainframe, while mainframe records never get sent back to SQL. So I can have matching records in both systems because of the SQL export, but each system also has records the other doesn't.
SQL records have an identity field, ContractID, that gets copied over to the mainframe. The mainframe records (both the originals and the ones imported from SQL) have their own identity field called CertNo.
When I pull from both systems in my Data Flow task, I'm getting dupes (obviously), but not "true" dupes, because the SQL records don't have a CertNo in them. I'm doing a MERGE between the two data sets, but I'm trying to figure out a way to remove the duplicated SQL records. I want to keep the mainframe records that have both the ContractID and the CertNo, but I don't see a way in the SORT to choose which records get kept and which get removed. In fact, I can't even figure out how to tell the SORT to see the records as duplicates.
"Duplicated" Data example is as follows:
ContractID   CertNo    LastName   FirstName   MidInitial
112356       NULL      Mouse      Timothy     Q            <-- delete this
112356       ABC1255   Mouse      Timothy     Q            <-- keep this
225288       NULL      Duck       Donald                   <-- delete this
225288       12dd58    Duck       Donald                   <-- keep this
The ones with the NULL CertNo value come from SQL; the others come from the mainframe.
Now, I could do this with an Execute SQL Task in the Control Flow (a fairly simple DELETE statement). However, I'd rather get rid of the duplicates before I load the data into the staging table instead of after. Does that make sense?
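Just to illustrate, the DELETE I have in mind would look something like this (dbo.CustomerStaging is a made-up name for my staging table):

DELETE stg
FROM dbo.CustomerStaging AS stg
WHERE stg.CertNo IS NULL                        -- the SQL-sourced copy of the row
  AND EXISTS (SELECT 1
              FROM dbo.CustomerStaging AS mf    -- a mainframe row for the same contract
              WHERE mf.ContractID = stg.ContractID
                AND mf.CertNo IS NOT NULL);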
Any thoughts on how I can do this in Data Flow instead of as a Control Flow Execute SQL Task?
Thanks in advance!
April 8, 2008 at 6:17 am
Hi Brandie,
How about this... I apologise if this seems a little sketchy, but I think it could work. Multicast straight after your Merge into 2 pipelines. Pass one through an Aggregate transform, counting the ContractIds. Then pass it through a Conditional Split: rows with a count of 1 get merged back into the pipeline, inner joining on ContractId and CertId, while rows with a higher count go to another transform. Add a Derived Column to these that basically says ImADupe = TRUE.
Back in your previous pipeline, add a Derived Column to this one too. This one is called MyCertIdIsNotNull and the expression looks something like this: "!ISNULL(CertId) ? TRUE : FALSE". Now you just need to perform a final join, this time joining on ContractId = ContractId and ImADupe = MyCertIdIsNotNull. Hopefully you will only end up with the records where ContractId matches and MyCertIdIsNotNull is true.
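If it helps to see the intent in plain SQL, the set-based equivalent of what that pipeline is doing would be roughly this (the table name is just a placeholder for the merged output):

WITH Counted AS (
    SELECT ContractId, CertId, LastName, FirstName, MidInitial,
           COUNT(*) OVER (PARTITION BY ContractId) AS RowsPerContract
    FROM dbo.MergedNames              -- placeholder for the merged SQL + mainframe rows
)
SELECT ContractId, CertId, LastName, FirstName, MidInitial
FROM Counted
WHERE RowsPerContract = 1             -- singletons pass straight through
   OR CertId IS NOT NULL;             -- for dupes, keep only the mainframe row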
It sounds like it would work. Let us know how it goes 😉
Kindest Regards,
Frank Bazan
April 8, 2008 at 7:54 am
Frank,
Hmmm. The Multicast doesn't like being put after my Merge. It comes up with a totally blank screen on both sides of the divide. Apparently it can't see any outputs to put on the left side for me to choose from.
I'll have to think about this one.
April 8, 2008 at 11:12 am
Okay, the Execute SQL Task might be the way to go. Since your suggestion didn't seem to like me, I tried adding a Lookup to the first pipeline (right before the final MERGE) to verify whether or not the records were in the second dataset, then sent the "failure" records over to the MERGE.
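Logically, that Lookup is just pre-filtering the SQL-side rows before the merge, along these lines (table names are made up):

SELECT s.ContractID, s.LastName, s.FirstName, s.MidInitial
FROM dbo.SqlNamesList AS s                  -- rows coming from the SQL side
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.MainframeNames AS m
                  WHERE m.ContractID = s.ContractID);  -- drop rows already on the mainframe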
Running this took 1 hour 6 minutes for approximately 17 million records. At least 20 minutes of this time was taken up by the FULL CACHE process on the Lookup transformation that I added.
Now I'm testing without the Lookup transformation, and it's already working through every single transformation task, which doesn't necessarily mean anything yet. We'll see how long this plus the Execute SQL Task take together, and then I'll know which is the more efficient way to go.
Interesting how a person's original implementation thoughts can sometimes be better than the multiple options they try to implement instead. @=)
I'll post again when I'm sure what the better solution is.
April 8, 2008 at 12:59 pm
It's official. Doing the delete in the Execute SQL Task rather than trying to filter the rows ahead of time is faster by 26 minutes.
WOW.
Of course, I'm sure it depends on what a person is doing with their SSIS package. Here's a 'pic' of my data flow task (attached). I did my Lookup test with the Lookup transformation between the "Merge SQL Names List" and the "Merge Customer Names" transformations. And since the mainframe files are on a whole other server, I'm sure that slowed some things down.
Thanks for your suggestion, Frank. And I hope this thread helps someone else out in the future!
April 9, 2008 at 5:25 am
Hi Brandie, thanks for the feedback.
Looking at your data flow, there is something that I think could improve the performance even further: if you don't have a requirement for the rows to come out in sorted order, you could swap the Merges for Union All transforms. These tend to be quicker because they can pass records on as soon as they arrive. With a small recordset it probably wouldn't be a significant improvement, but with your 17 million rows you may shave some more time off the data flow.
Kindest Regards,
Frank Bazan
April 9, 2008 at 5:28 am
Frank,
As a matter of fact, I just realized that an hour ago. @=) I was about to post the change (especially as I forgot to add in one of my Conditional Split success paths) as soon as I finished altering and testing my package.
I don't know what my issue was yesterday (Brandie had CPU blockage somewhere @=), but my brain is working with a much higher thread count this morning. @=)
Thanks for the suggestion. I'll post a new picture when I'm finished fixing the step.