Tim Mitchell

Tim Mitchell is a business intelligence consultant, author, trainer, and SQL Server MVP with over a decade of experience. Tim is the principal of Tyleris Data Solutions and is a Linchpin People teammate. Tim has spoken at international, regional, and local venues including the SQL PASS Summit, SQLBits, SQL Connections, SQL Saturday events, and various user groups and webcasts. He is a board member at the North Texas SQL Server User Group in the Dallas area. Tim is coauthor of the book SSIS Design Patterns, and is a contributing author on MVP Deep Dives 2. You can visit his website and blog at TimMitchell.net or follow him on Twitter at twitter.com/Tim_Mitchell.

DQS Domain Validation Rules

A compelling feature of the new Data Quality Services in SQL Server 2012 is the ability to apply rules to fields (domains) to describe what makes up a valid value.  In this brief post, I’d like to walk through the concepts of domain validation and demonstrate how this can be implemented in DQS.

Domain validation essentials

Let’s ponder domain validation by way of a concrete example.  Consider the concept of age: it’s typically expressed in discrete, non-negative whole numbers.  However, the expected values of the ages of things will vary greatly depending on the context.  An age of 10 years seems reasonable for a building, but sounds ridiculous when describing fossilized remains.  A date of “1/1/1950” is a valid date and would be appropriate for classifying a person’s date of birth, but would be out of context if describing when a server was last restarted.  In a nutshell, the purpose of domain validation is to allow context-specific rules to provide reasonableness checks on the data.

A typical first step in data validation would involve answering the following questions:

  • Is the data of the right type?  This helps us to eliminate values such as the number “purple” and the date “3.14159”.
  • Does the data have the right precision? This is similar to the point above: if I’m expecting to store the cost of goods at a retail store, I’m probably not going to configure the downstream elements to store a value of $100 million for a single item.
  • Is the data present where required?  When expressing address data, the first line of an address might be required while a second line could be optional.

Domain validation goes one step further by answering the question, “Is a given value valid when used in this context?”  It takes otherwise valid data and validates it to be sure it fits the scenario in play.
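To make the distinction concrete, here is a minimal sketch of these two layers of checks in Python. This is purely illustrative and is not how DQS works internally; the field names (`age`, `unit_cost`, `address_line1`, `birth_date`) and the 1880 cutoff are assumptions invented for the example.

```python
from datetime import date

def baseline_checks(record: dict) -> list[str]:
    """Type/precision/presence checks; field names are made up for illustration."""
    problems = []
    # Right type? Age should be a whole number.
    if not isinstance(record.get("age"), int):
        problems.append("age must be an integer")
    # Right precision? A single retail item shouldn't cost $100 million.
    cost = record.get("unit_cost")
    if cost is None or not (0 <= cost < 1_000_000):
        problems.append("unit_cost missing or outside plausible range")
    # Present where required? Address line 1 is mandatory, line 2 optional.
    if not record.get("address_line1"):
        problems.append("address_line1 is required")
    return problems

def domain_rule_checks(record: dict) -> list[str]:
    """Context-specific reasonableness: a person's date of birth."""
    problems = []
    dob = record.get("birth_date")
    if dob is not None and not (date(1880, 1, 1) <= dob <= date.today()):
        problems.append("birth_date is not reasonable for a living person")
    return problems

record = {"age": 41, "unit_cost": 19.99, "address_line1": "1 Main St",
          "birth_date": date(1950, 1, 1)}
print(baseline_checks(record) + domain_rule_checks(record))  # prints []
```

Note that `1/1/1950` sails through the baseline checks; only the second, context-aware layer would flag it if the field were, say, a server restart time instead of a birth date.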

Domain validation in DQS

Even if you don’t use this term to describe it, you’re probably already doing some sort of domain validation as part of your ETL or data maintenance routines.  Every well-designed ETL system has some measure of sanity checking to make sure data fits semantically as well as technically.

The downside to many of these domain validation scenarios is that they can be inconsistent and are usually decentralized.  Perhaps they are implemented at the outer layer of the ETL before data is passed downstream.  Maybe the rules are applied as stored procedures after they are loaded, or even as (yikes!) triggers on the destination tables.

Data Quality Services seeks to remedy the inconsistency and decentralization issue, as well as make the process easier, by way of domain validation rules.  When creating a domain in DQS, you are presented with the option of creating domain rules that govern what constitutes a valid value for that domain.  For the example below, I’m using data for automobile makes and models, and am implementing a domain rule to constrain the value for the number of doors for a given model.

[Screenshot: creating the domain rule in DQS domain management]

With the rule created, I can apply one or more conditions to each of the rules.  As shown, I am going to constrain the valid values to lie between 1 and 9 inclusive, which should account for the smallest and largest automobile types (such as limousines and buses).

[Screenshot: the rule conditions constraining the value between 1 and 9]

For this rule, I’m setting the conditions that the value must be greater than zero and less than ten.  Note that there is no requirement to use this bookend qualification process; you can specify a single qualifier (for example, greater than zero) or have multiple conditions strung together in the same rule.  You can even change the AND qualifier to an OR if the rule should be met if either condition is true – though I would caution you when mixing three or more conditions using both AND and OR, as the behavior may not yield what you might expect.
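Expressed as code, the rule boils down to a simple conjunction.  The sketch below is a hypothetical stand-in for the DQS rule, not anything DQS generates, but it also shows why carelessly flipping AND to OR can surprise you:

```python
def doors_is_valid(value: int) -> bool:
    """Stand-in for the DQS rule: both conditions are ANDed, so a value
    passes only when it is greater than zero AND less than ten."""
    return value > 0 and value < 10

# Flipping the qualifier to OR makes this particular rule vacuous:
# (value > 0) or (value < 10) is true for every integer.
print([v for v in (0, 1, 4, 9, 12) if doors_is_valid(v)])  # prints [1, 4, 9]
```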

That’s all there is to creating a simple domain validation rule.  Remember that for the condition qualifiers, you can choose greater than, less than, greater than or equal to, and so on, when dealing with numerical or date domain data types.  For string data types, the number of options is even greater, as shown below:

[Screenshot: condition options available for string domains]

Of particular interest here is that you can leverage regular expressions and patterns to look for partial or pattern matches within the string field.  You can also check the string value to see if it can be converted to numeric or date/time.
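The string-oriented conditions map onto familiar operations.  As a rough sketch (in Python rather than DQS, with helper names of my own invention), a pattern match and the two convertibility checks might look like this:

```python
import re
from datetime import datetime

def matches_pattern(value: str, pattern: str) -> bool:
    """Partial match anywhere in the string, in the spirit of the
    regular-expression condition in the DQS condition list."""
    return re.search(pattern, value) is not None

def convertible_to_numeric(value: str) -> bool:
    """Can the string be interpreted as a number?"""
    try:
        float(value)
        return True
    except ValueError:
        return False

def convertible_to_date(value: str, fmt: str = "%m/%d/%Y") -> bool:
    """Can the string be interpreted as a date in the given format?"""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

print(matches_pattern("Model T", r"^Model"),
      convertible_to_numeric("3.14159"),
      convertible_to_date("1/1/1950"))  # prints True True True
```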

The rule in action

With the new domain validation rule in place, let’s run some test data through it.  I’m going to create a few test records, some of which violate the rule we just created, and run them through a new DQS project using the knowledge base we modified with this rule.

I’ll start off with the dirty data as shown below.  You can probably infer that we’ve got a few rows that do not comply with the rule we created, on both ends of the value scale:

[Screenshot: the sample input data, including out-of-range door counts]

After creating a new data cleansing project, I use the data shown above to test the rule constraining the number of doors.  As shown below in the New output tab, we have several rows that comply with this new rule:

[Screenshot: the New output tab showing rows that comply with the rule]

In addition, there are two distinct values found that do not meet the criteria specified in the new rule.  Selecting the Invalid tab, I see the values 0 and 12 have failed validation, as they fall outside the range specified by the rule.  In the Reason column, you can see that we get feedback indicating that our new rule is the reason that these records are marked as Invalid:

[Screenshot: the Invalid tab showing the values 0 and 12, with the rule named in the Reason column]

So by implementing this rule against my data, I am able to validate not only that the value is present and of the correct type, but that it is reasonable for this scenario.
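The way the cleansing project splits the output can be mimicked in a few lines.  This is only a toy model of the New/Invalid tabs, assuming made-up field names and a hypothetical `cleanse` helper, not any DQS API:

```python
def cleanse(records, rule, rule_name):
    """Split rows the way the DQS output tabs do: passing rows on one
    side, invalid rows annotated with the failing rule as the reason."""
    valid, invalid = [], []
    for rec in records:
        if rule(rec["doors"]):
            valid.append(rec)
        else:
            invalid.append({**rec, "reason": f"Failed domain rule: {rule_name}"})
    return valid, invalid

rows = [
    {"make": "Honda", "model": "Civic", "doors": 4},
    {"make": "Test", "model": "NoDoors", "doors": 0},
    {"make": "Test", "model": "BigBus", "doors": 12},
]
valid, invalid = cleanse(rows, lambda d: 0 < d < 10, "Doors between 1 and 9")
print([r["doors"] for r in invalid])  # prints [0, 12]
```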

Conclusion

In this post we’ve reviewed the essentials of domain validation and how we can implement these checks through domain rules in SQL Server Data Quality Services.  In my next post, I’ll continue the discussion around domain rules by reviewing how these rules can be applied to composite domains in DQS.
