I am validating a large database of addresses. I've already deployed a number of ways to validate city, state and zipcode against a lookup database. As a final step, I want to take entries with invalid and missing zip codes, and attempt to validate the city and state to see if there is only 1 valid zip code for that city/state combination. This has turned out to be a much more complex operation that I initially thought.
Excellent, thank you for helping me understand the use case for the logic you're applying. It'll help me get you untangled. I think you've just overthought it.
What I've come up with so far is pretty convoluted. Because I need to log off any non-matches, I have to multicast the input rows.
Nope. You need to use error record redirection. Hang on.
I then have to sort the rows and merge with the lookup database.
Sort? Why are you sorting?
The result is multicasted, with one output going to an aggregate transformation that counts the rows grouped by the unique identifier.
Okay, I get why you might want this for a rowcount, but why is this re-merging?
This is then merged with the original input rows. A conditional split separates the rows with only 1 result from the others. The rows with 0 or more than 1 match are sent of to a log. The good rows are then joined back with the lookup output to get the valid lookup zip code.
What seems like a simple transformation actually takes over 10 transformation steps with lines crossing each other at several points. It's virtually unreadable, will be difficult to maintain down the road, and is pretty slow.
Is there a better way to do this that I am not thinking of?
... I'm not even sure where this was going anymore.
For the original use case, I would do the following:
Bring the records in from the source database you're checking into the stream. You need the key and the City/State/Zip Fields from the source, that should be it. Next, apply a lookup for valid C/S/Z fields on all three columns. Redirect error rows via Error Output and Lookup Match Output.
This leaves your 'good' records in an intact stream. Ignore this for now. What we want is the error path (red arrow). Now, feed this into another lookup. This lookup should be fed by a query that looks something like this:
ValidCSZList AS v
COUNT( DISTINCT [Zip]) = 1
ON v.[City] = drv.[City]
AND v.[State] = drv.[State]
That query will give you only city/state combinations with a single zip code.
Re-lookup the error stream with this data, hooking only off city/state. Output the Zip from the lookup along with the original data in the main stream. Again, Error control the lookup and redirect failures.
You now have 3 streams of data.
The first stream is good data. Not sure what you want to do with that other than ignore it, it needs no changes.
The second stream, the success off the second lookup, are invalid records in the source that have a valid correction that can be applied. You can feed this back into an SCD2 change control or a list of some kind to alter the origination records with the correct zip code. This is why you brought the key along from the origination system.
The third stream (the error stream from the second lookup) contains the records that have invalid C/S/Z combinations and cannot be matched up via C/S to a unique Zip. This needs to be logged off to a table somewhere for manual intervention of the source data. Most likely by calling the client/contact/whatever to get it corrected.
If you need to do more with this data from that point, work from there where you have the unique streams. You shouldn't need sorts of any kind, and the only aggregator in SSIS I recommend is COUNT(*) for a recordset total, which will avoid sorting.
- Craig Farrell
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake. For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. |Tally TablesTwitter: @AnyWayDBA