Data Quality - Addressing non-stated requirements

  • Comments posted to this topic are about the item Data Quality - Addressing non-stated requirements

  • Good article, and good point about the difficulty in getting non-technical people to realize how much they can contribute. I recall that an Auditor once complimented us on the consistent formatting of our address data, saying that it gave him confidence in the rest of the system!

    BTW: Typo in third blue box, UK Postcode --- A99 should be 9AA (or NAA as the PAF documentation uses.)

  • RobertMcClean (7/28/2016)


    BTW: Typo in third blue box, UK Postcode --- A99 should be 9AA (or NAA as the PAF documentation uses.)

    Thanks and good spot.

    Excellent point about confidence. It's too easy to underestimate the corrosive effect of little errors on perceived quality and, therefore, on the confidence people have in the system.

  • RobertMcClean (7/28/2016)


    Good article, and good point about the difficulty in getting non-technical people to realize how much they can contribute. I recall that an Auditor once complimented us on the consistent formatting of our address data, saying that it gave him confidence in the rest of the system!

    BTW: Typo in third blue box, UK Postcode --- A99 should be 9AA (or NAA as the PAF documentation uses.)

    Yes, there is definitely a need to apply business domain knowledge in determining these rules. Keeping with the postcode example, there are a number of non-standard postcodes, e.g. Girobank in Bootle has the postcode GIR 0AA. This was actually the very first postcode, and it is non-standard because the format was not standardised until some time later.

    Any postcode validation routine should handle these. As there are only a very small number of them, I would suggest holding them in a separate table and validating against that as exceptions (see the sketch after this post).

    This is the sort of thing that I would not expect IT specialists, whether application or SQL developers or architects, to realise. Even a BA may not understand these, but should probably be asking the business users about them.
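
    A minimal sketch of the exceptions-table idea, assuming T-SQL and hypothetical table and column names of my own; the format patterns shown cover only the standard outward/inward variants:

        -- Hypothetical table holding the handful of non-standard postcodes (e.g. GIR 0AA).
        CREATE TABLE dbo.PostcodeException
        (
            Postcode varchar(8)   NOT NULL PRIMARY KEY,
            Reason   varchar(200) NOT NULL
        );

        INSERT INTO dbo.PostcodeException (Postcode, Reason)
        VALUES ('GIR 0AA', 'Girobank, Bootle - pre-dates the standardised format');

        -- A postcode passes if it matches one of the standard patterns
        -- or is listed as a known exception.
        DECLARE @Postcode varchar(8) = 'GIR 0AA';

        SELECT CASE
                 WHEN EXISTS (SELECT 1 FROM dbo.PostcodeException WHERE Postcode = @Postcode)
                   THEN 'Valid (known exception)'
                 WHEN @Postcode LIKE '[A-Z][0-9] [0-9][A-Z][A-Z]'
                   OR @Postcode LIKE '[A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
                   OR @Postcode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
                   OR @Postcode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
                   OR @Postcode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
                   OR @Postcode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
                   THEN 'Valid (standard format)'
                 ELSE 'Fails format check'
               END AS PostcodeCheck;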

  • Good topic, and you have also written in a style that's accessible and relevant not just for data professionals but also for non-technical team members and stakeholders. Lack of definition and implementation of constraints, or lack of proper master data management in general, is perhaps the primary cause of project overruns in terms of time and cost. You definitely want to confront data quality in the requirements gathering phase and bake it into the development effort, rather than kicking the can down the road and letting QA or end users call it to your attention.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Unfortunately, my experience has been that no one knows what they don't know. So it comes down to putting something together that matches the requirements, having it tested, discovering what was missed, fixing it, and trying again... until everyone is satisfied. Then, hopefully, someone along the way did some documentation.

  • Frequently you do know what you don't know.

    There are four categories of knowledge:

    Known knowns

    Unknown knowns

    Known unknowns

    Unknown Unknowns

    Most people will understand the first and last of these.

    The second is typically identified where you have a Eureka moment.

    The third is where you know what you want to achieve but don't know how to achieve it and generally is resolved by either research or delegation.

    The last is where you get bugs.

  • Nice article. I have a quibble with the latitude and longitude 'rules'. Other than at the poles, latitude degrees should never be higher than 89, and anything over 80 should probably be investigated. Similarly, minutes and seconds should never be higher than 59, not 60. A longitude of 180 degrees is very suspicious and I would set the threshold at 179, as there are only a few land places at exactly 180 degrees (https://en.wikipedia.org/wiki/180th_meridian), and there, of course, the minutes or fractional degrees must be zero. Our systems still operate in terms of absolutes, so you would not want to code some of these limits as constraints. But it would be nice to flag them as 'Are you sure?' and have a way to identify that they have been investigated and validated.
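
    A minimal sketch of that "flag, don't constrain" idea in T-SQL; the dbo.Location table, its columns and the Reviewed flag are assumptions of mine, using the thresholds above:

        -- Hard CHECK constraints only for the absolute limits; softer thresholds are reported, not enforced.
        CREATE TABLE dbo.Location
        (
            LocationId int IDENTITY(1,1) PRIMARY KEY,
            Latitude   decimal(9,6) NOT NULL CHECK (Latitude  BETWEEN -90  AND 90),
            Longitude  decimal(9,6) NOT NULL CHECK (Longitude BETWEEN -180 AND 180),
            Reviewed   bit NOT NULL DEFAULT (0)  -- set once a suspicious value has been investigated and accepted
        );

        -- The "Are you sure?" report: values that are legal but worth a second look.
        SELECT LocationId, Latitude, Longitude
        FROM   dbo.Location
        WHERE  Reviewed = 0
          AND  (ABS(Latitude) > 80 OR ABS(Longitude) >= 179);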

  • Great article, David. I'm currently working on a series of articles called "The Devils in the Data" (intentionally pluralized rather than the contraction "Devil's"), and one of the points is exactly the one you've touched on. Of course, this is one of the many places where having an excellent QA team in place really helps. I embrace it as an opportunity to learn when they break something I've done, rather than blasting them with "it meets the written requirements" as so many people do.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • p.s. I'd like to add a reference link to this article in mine, if that's OK with you.

    --Jeff Moden



  • Jeff Moden (7/28/2016)


    p.s. I'd like to add a reference link to this article in mine, if that's OK with you.

    Go ahead. In an ideal world we'd have a repository so we could build up the data types and rules to go with them as a shared wiki.

  • I don't think postcodes are the best example when it comes to data quality assessment rules. Even if one succeeds in reverse-engineering the postcode rules, the next time the postcodes change the rules need to be changed as well. There are also cases in which the postcode system changes from the ground up. Imagine having to code rules for postcodes from all over the world!

    At least as far as postcodes and address information are concerned, one should, where possible, use an address validator or raw data provided by an authoritative source. I know that such services exist. If possible, one should implement such address validators directly in the source systems. This also depends on the number of new addresses added each year: sometimes a validator is cost-effective, other times it isn't.

    The same approach should also be used for credit cards. Checking only the prefix and length of a credit card number doesn't help much, though it is a step towards assessing the quality (see the sketch after this post).

    Usually I assess data quality based on the following dimensions: duplication (aka uniqueness), completeness, consistency, conformity, accuracy, integrity and, where relevant, timeliness.

    I have a series of posts on this theme: http://sql-troubles.blogspot.de/search?q=data+quality
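
    Picking up the credit-card point above: one step beyond checking prefix and length is a checksum test. Below is a minimal T-SQL sketch of the Luhn check; the variable name, the well-known test number and the use of sys.all_columns as an ad hoc row source are illustrative assumptions, not anything from the post:

        -- Luhn check: starting from the rightmost digit, double every second digit,
        -- subtract 9 when the doubled value exceeds 9, then sum; a valid number sums to a multiple of 10.
        DECLARE @CardNumber varchar(19) = '4111111111111111';

        SELECT CASE WHEN SUM(d.LuhnDigit) % 10 = 0
                    THEN 'Passes Luhn check'
                    ELSE 'Fails Luhn check'
               END AS LuhnResult
        FROM  (SELECT TOP (LEN(@CardNumber))
                      ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
               FROM   sys.all_columns) AS t                              -- tally of digit positions
        CROSS APPLY (SELECT CAST(SUBSTRING(REVERSE(@CardNumber), t.n, 1) AS int) AS Digit) AS x
        CROSS APPLY (SELECT CASE WHEN t.n % 2 = 0
                                 THEN CASE WHEN x.Digit * 2 > 9 THEN x.Digit * 2 - 9 ELSE x.Digit * 2 END
                                 ELSE x.Digit
                            END AS LuhnDigit) AS d;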

  • sql-troubles (7/29/2016)


    I don't think postcodes are the best example when it comes to data quality assessment rules. Even if one succeeds in reverse-engineering the postcode rules, the next time the postcodes change the rules need to be changed as well. There are also cases in which the postcode system changes from the ground up. Imagine having to code rules for postcodes from all over the world!

    At least as far as postcodes and address information are concerned, one should, where possible, use an address validator or raw data provided by an authoritative source. I know that such services exist. If possible, one should implement such address validators directly in the source systems. This also depends on the number of new addresses added each year: sometimes a validator is cost-effective, other times it isn't.

    The same approach should also be used for credit cards. Checking only the prefix and length of a credit card number doesn't help much, though it is a step towards assessing the quality.

    Usually I assess data quality based on the following dimensions: duplication (aka uniqueness), completeness, consistency, conformity, accuracy, integrity and, where relevant, timeliness.

    I have a series of posts on this theme: http://sql-troubles.blogspot.de/search?q=data+quality

    When you are going to use SaaS validation you have to consider how frequently the service is going to be called, what latency it introduces and what it costs.

    If you know categorically that a data entity which fails to conform to a known pattern cannot be valid, then you can avoid calling the validation service because you already know the request is going to fail. Having worked for an information aggregator, I know that they had a comprehensive set of filters that prejudged whether to submit a request for information to the various systems. They tracked these "filter fails" for two reasons:

    1. Validation that the rule was operating as intended

    2. Revenue opportunity for sharing with the information source provider.

    I would not try to write a global postcode rule-set. I would have a separate rule-set per country (sketched below) because, as you say, postcodes and their intended use vary greatly across the world.

    Postcode rules DO change, but very slowly. Ditto telephone number rules. As long as the thing that checks a particular data entity has a single point of maintenance, I don't see this as an issue. If you had a plethora of components copy/pasted between software projects then yes, maintaining the rules would become a problem, but then again, you probably have bigger problems in that scenario.
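
    A minimal sketch of that kind of data-driven, per-country prefilter in T-SQL; the table, columns and patterns are illustrative assumptions of mine and nowhere near a complete rule-set:

        -- One row per country and acceptable pattern; maintained as data, not code.
        CREATE TABLE dbo.PostcodePattern
        (
            CountryCode char(2)     NOT NULL,
            Pattern     varchar(50) NOT NULL,
            PRIMARY KEY (CountryCode, Pattern)
        );

        INSERT INTO dbo.PostcodePattern (CountryCode, Pattern)
        VALUES ('GB', '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'),
               ('US', '[0-9][0-9][0-9][0-9][0-9]'),
               ('NL', '[0-9][0-9][0-9][0-9] [A-Z][A-Z]');

        -- Only values that pass the cheap pattern test are worth a (costly) call to the
        -- external validation service; the rest are rejected, and counted, locally.
        DECLARE @CountryCode char(2) = 'GB', @Postcode varchar(10) = 'SW1 2AA';

        IF EXISTS (SELECT 1
                   FROM   dbo.PostcodePattern
                   WHERE  CountryCode = @CountryCode
                     AND  @Postcode LIKE Pattern)
            PRINT 'Pattern OK - worth sending to the validation service';
        ELSE
            PRINT 'Filter fail - do not spend a service call on it';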

  • Doing basic prevalidation before calling a validation service seems an acceptable approach. Personally, I try to avoid building too much logic into a validation. If a rule can't be built as a simple pattern or set of patterns, it may be a case for another approach. Sure, one sometimes has to compromise or be inventive... I have met cases in which a rule wasn't so easy to write, especially as an SQL query. Some of these were cases in which a way to standardise the values was needed first. Then one has to check with the business whether something can be done in that direction.

    If validation against a service takes time, one can use the second approach I mentioned: use raw data from a data provider. For this it is enough to load the data into an internal database and have it synchronised weekly, or on demand when needed. I prefer this second approach because it is fast, reliable and can be used against multiple systems (see the sketch after this post). It is sufficient to store postcodes, streets, regions, countries and cities. One can use these data for full validation, or for prevalidation before accessing a validation service.

    I also try to avoid hardcoding too many values. I prefer to store the values I use for validations in a table that is easy for me or the users to maintain.

    Some validations can also be built "asynchronously", e.g. run against a copy of the source data.
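
    A minimal sketch of validating against locally held reference data; the dbo.RefPostcode and dbo.CustomerAddress tables and their columns are assumptions of mine:

        -- Reference data loaded from the provider and refreshed weekly or on demand.
        CREATE TABLE dbo.RefPostcode
        (
            Postcode varchar(10)   NOT NULL,
            City     nvarchar(100) NOT NULL,
            PRIMARY KEY (Postcode, City)
        );

        -- The addresses to be checked (in practice a copy of the source data, as described above).
        CREATE TABLE dbo.CustomerAddress
        (
            AddressId int IDENTITY(1,1) PRIMARY KEY,
            Postcode  varchar(10)   NOT NULL,
            City      nvarchar(100) NOT NULL
        );

        -- Exception report: addresses whose postcode/city combination is not in the reference data.
        SELECT a.AddressId, a.Postcode, a.City
        FROM   dbo.CustomerAddress AS a
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.RefPostcode AS r
                           WHERE  r.Postcode = a.Postcode
                              AND r.City     = a.City);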

  • Definitely

    Perform basic validation of datatype and format in the UI or input routines, identify any exceptions and handle them appropriately. Fail the validation immediately if these checks fail, so we don't incur the cost of full validation for the obviously wrong, but leave the final validation of the remainder to well-proven routines if that level of checking is required.
