July 4, 2025 at 7:43 am
Dear All,
Good Day.
I have a requirement to implement de-duplication on customer data in our policy system before loading it into the target, ensuring data is clean.
I initially suggested using SSIS Fuzzy Lookup, but the client is not satisfied with this approach. Could you please suggest alternative methods or tools to implement de-duplication with a higher success ratio?
Your guidance on this would be much appreciated.
Best Regards,
Balamurugan D
July 4, 2025 at 5:47 pm
My opinion - it depends on what they consider a "duplicate" and what they mean by de-duplication.
Do they mean at the row level or the column level? At the column level, if you have a lot of character-based data that repeats, normalizing it (moving the repeated values into their own lookup table and referencing them by key) is probably a good approach, as it will save disk space and can improve lookup times depending on how you build your queries and indexes.
If it is at the row level, my approach is to pick the columns that would mark a row as a duplicate (may be all, may not be all) and generate a hash over those values. Once complete, use the hash column to determine how many duplicate rows there are and remove the duplicates. Finally, insert what is left into the target. It may not be the most efficient approach, but it is reliable and much faster than comparing the row contents one by one.
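To make the hash idea concrete, here is a rough sketch in Python. It is only an illustration: the sample rows, the column names (policy_no, name, dob, city), the choice of name and dob as the "duplicate" columns, the trim and lower-case step, and SHA-256 as the hash are all my assumptions, not anything from your system.

```python
import hashlib
from collections import Counter

# Hypothetical sample rows; column names are assumptions for illustration.
rows = [
    {"policy_no": "P001", "name": "Ann Lee", "dob": "1980-01-02", "city": "Pune"},
    {"policy_no": "P002", "name": "ann lee ", "dob": "1980-01-02", "city": "Mumbai"},
    {"policy_no": "P003", "name": "Raj Kumar", "dob": "1975-06-30", "city": "Chennai"},
]

# Columns that decide whether two rows are duplicates (assumption).
KEY_COLUMNS = ("name", "dob")


def row_hash(row, columns=KEY_COLUMNS):
    """Hash the chosen columns after trimming and lower-casing each value."""
    # The unit separator keeps column boundaries unambiguous in the input.
    normalized = "\x1f".join(str(row[c]).strip().lower() for c in columns)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(source_rows):
    """Keep the first row seen for each hash and drop the rest."""
    seen = set()
    unique = []
    for row in source_rows:
        h = row_hash(row)
        if h not in seen:
            seen.add(h)
            unique.append(row)
    return unique


# How many rows share each hash: any count above 1 is a duplicate group.
hash_counts = Counter(row_hash(r) for r in rows)
duplicate_groups = sum(1 for n in hash_counts.values() if n > 1)

deduped = deduplicate(rows)
```

Which row survives in each group (here, simply the first one seen) is a business decision you would need to agree with the client.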
Alternatively, if hashing is not possible for some reason or another, you can concatenate the values in the columns that are used to determine duplication into one extra column and then compare on that concatenated column. Single-column lookups for finding duplicates are going to be faster than multi-column lookups. Hash generation and string concatenation are both fairly quick operations.
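One thing to watch with concatenation is that, without a separator, different rows can collapse to the same string. A small Python sketch of that, where the pipe separator, the NULL handling, and the helper name concat_key are my own assumptions for illustration:

```python
def concat_key(values, sep="|"):
    """Join column values into one comparison key, treating None as empty."""
    # The separator is assumed not to appear in the data itself.
    return sep.join("" if v is None else str(v) for v in values)


# Two different rows that look identical if joined without a separator.
row_a = ("ab", "c")
row_b = ("a", "bc")

naive_a = "".join(row_a)   # no separator
naive_b = "".join(row_b)   # no separator
safe_a = concat_key(row_a)
safe_b = concat_key(row_b)

# De-duplicate a small batch on the concatenated key, keeping the first row.
batch = [("Ann", "1980-01-02"), ("Ann", "1980-01-02"), ("Raj", "1975-06-30")]
first_by_key = {}
for values in batch:
    first_by_key.setdefault(concat_key(values), values)
unique_batch = list(first_by_key.values())
```

Here naive_a and naive_b compare equal even though the rows differ, while safe_a and safe_b do not, so pick a separator you are sure cannot occur in the data.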
ALTERNATELY, if you are using all columns to determine duplicates, you could do a SELECT DISTINCT on the data, which will remove every row that is an exact copy of another row.
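A quick way to see what DISTINCT will and will not catch is to try it on a throwaway table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for your real database, and the staging and target table names and columns are made up for the example.

```python
import sqlite3

# In-memory database as a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (name TEXT, dob TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [
        ("Ann Lee", "1980-01-02", "Pune"),
        ("Ann Lee", "1980-01-02", "Pune"),    # exact copy, removed by DISTINCT
        ("Ann Lee", "1980-01-02", "Mumbai"),  # differs in one column, kept
        ("Raj Kumar", "1975-06-30", "Chennai"),
    ],
)

# Load only the distinct rows into the target table.
conn.execute(
    "CREATE TABLE target AS SELECT DISTINCT name, dob, city FROM staging"
)

target_rows = conn.execute(
    "SELECT name, dob, city FROM target ORDER BY name, city"
).fetchall()
conn.close()
```

Note that the two "Ann Lee" rows with different cities both survive, because DISTINCT only treats rows as duplicates when every selected column matches exactly.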
The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there are any negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.