MDM workflow

  • Hi Folks,

    I'm currently looking to define an MDM workflow and I'm having some challenges.

    I would seem to have a number of options;

    1. Match in DQS (which only supports creation of clusters, as opposed to resolution to masters)

    2. Match in SSIS (which has both fuzzy matching and fuzzy grouping)

    3. Match in T-SQL either using the MDS functions, or a roll-your-own matching solution

    4. Match in Excel using the MDS add-in with DQS matching (which again only supports clusters)

    Here's where it gets tricky. Simply resolving to master is fine except for the unmatched. The unmatched may in fact be multiple instances of the same entity, so a second process using clustered matching would seem to be appropriate for these.

    Clustering the new records unioned with the masters gives me matching of new to master and successfully clusters the new rows that don't relate to a master. The trouble is in the event of a master being a part of a cluster, I'd want it as the pivot and survivor. And if I end up with two masters in a cluster, well that's a problem.

    With the SSIS and DQS option, there is also the issue of neither treating NULLs correctly. Let's say we're looking at postcodes. A NULL and a NULL is indeterminate and shouldn't be considered equal and contribute to the matching score. A NULL and a non-NULL is also indeterminate and shouldn't contribute to the matching score. Only two non-NULLs should be considered for equality and contribute to the matching score.

    Using the Excel add-in for MDS and DQS matching is a farce. There's no way for a data steward to approve or reject a match, and there's no kind of key management. All in all, I find the plug-in next to useless for a real world scenario. It's not just the master data that is all important; the key mapping from source to master is absolutely critical.

    Another issue is survivorship. Why should a data steward, or even an automated process not be able to build the most complete master with consideration to all of the matching rows. If one row has postcode and another has country, why can't I take the best from each? Even having a data steward select a single desired survivor isn't supported.

    After some prototyping in DQS and MDS, I can only conclude that they're not fit for purpose.

    Does anyone have a design pattern that works for this? Everything I've read is singing the praises of DQS, but none address real world problems. Nowhere have I seen a post that considers initial load versus deltas, and nowhere have I seen anyone talk about the key mapping.

    I look forward to hearing of the experiences of others.

Viewing 0 posts

You must be logged in to reply to this topic. Login to reply