But, but, but, my data is clean!

  • @eric, You've got some good points there

    "How data is conformed, constraints, and exceptioned is definately something that needs to be documented upfront." - I wholeheartedly agree.

    The "Date" constraint. I like the idea of your date constraint. I wish I had thought of it before. If it is considered too risky to convert the column to a proper "date" type, then what you suggest is a good alternative. I'm going to put that in the kitbag.

    And you are completely right: dates are one of the places where one finds invalid data, and other types being used to hold dates allow for some really strange data!:-)

    Tom Gillies LinkedIn Profilewww.DuhallowGreyGeek.com[/url]

  • Tom Gillies (6/23/2014)


    @Eric, You've got some good points there

    "How data is conformed, constraints, and exceptioned is definately something that needs to be documented upfront." - I wholeheartedly agree.

    The "Date" constraint. I like the idea of your date constraint. I wish I had thought of it before. If it is considered too risky to convert the column to a proper "date" type, then what you suggest is a good alternative. I'm going to put that in the kitbag.

    And you are completely right: dates are one of the places where one finds invalid data, and other types being used to hold dates allow for some really strange data!:-)

    If a table is referenced directly by an ORM or SSIS, then changing the data type will typically break column mapping in stuff down stream, even if the dates are valid and properly formatted. When that happens, it's perceived as the DBA's fault. However, if the data type remains the same, but an exception is thrown because the application entered an ambiguous or invalid date like "2/3/12", then it's a lot easier to shift responsibility back to the developers where it belongs.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • My top two "bad data" lessons both come under the heading of "my own damn fault"

    May may years ago I was running some final tests to confirm a move to production went smoothly. I was being lazy so I entered in my own address on several different test records. I then forgot to delete them. And shortly there after started getting junk mail from my own company. In fact when I moved several years later I was still getting junk mail! Those "test" records I had created had been used as templates by some lazy data entry people and my address was in the system in so many places I never did find them all.

    Then several years later I was working on some code and one of the managers came up and said "Is there any possible way we can get a demo of what you are working on?" I told them I wasn't quite done but these several parts were close enough to demo. I then got back to work and didn't think about it. I was a bit surprised when I got called into the CIOs office late that day and was told how the client had been highly offended that I called them "stupid". I had put "they're just stupid" in several columns as my test data days earlier. I mean it was a test system right? No one would ever see it but me. Turns out I was wrong!

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • I battle "DO NOT USE" data every day.

    I just ran a query to get a list of part numbers for the purpose of labeling containers of produced parts. More than half the records include some verbiage in the description indicating that the parts are no longer made.

    "Do not use"

    "Obsolete"

    "Old part"

    "Part has been replaced by________"

    "Invalid"

    etc.

    ARGH!

  • GilaMonster (6/23/2014)


    andrew gothard (6/23/2014)


    GilaMonster (6/21/2014)


    I love arguing this one with devs, so much 'agile' nonsense sometimes. 😀 "Creating validation in the database is a violation of the DRY principal"

    The concept's not nonsense - their interpretation is.

    True, I should have been clearer.

    My problem isn't with agile, it's with people who incorrectly use agile to justify poor coding practices.

    hell, yeah. I've had situations where I'm pointing out flaws so big you can see them from Pluto.

    "No, no, we've got to be agile"

    "If you can show me the bit in the Agile manifest about "Just doing It Wrong", fine. It's still going to fail mind"

    "you're not being agile - we'll refactor"

    etc, etc. This was a horrible sub-eav with 10 tables from the original design rammed into one (sort of) multi self referencing pig's ear added to the mix, I was totally ignored and it went into production. For several million users. I was wrong about one thing - I gave it 20 minutes. 7 IIRC

    But every issue raised was just met with we've got to be Agile, that's not Agile, You Don't Understand Agile.

    Actually I do - if done right (and doing it right is actually HARD - it's not just slinging throwaway prototypes together and putting them live*)

    * For some interpretations of Agile

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • VarChar "date" columns.

    AAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRGGGGGGGGGGGGGHHHHHHHHHHHHHH. Andrew ANGRY, Andrew SMASH!

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • andrew gothard (6/23/2014)


    VarChar "date" columns.

    AAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRGGGGGGGGGGGGGHHHHHHHHHHHHHH. Andrew ANGRY, Andrew SMASH!

    /raises hand

    I second that "AAAAAARRRGH". I'm responsible for maintaining a web site that basically runs Crystal reports against a "data warehouse" (and yes, the quotes are there to imply sarcasm). The web dev who created the site is actually rather good except for one tiny little thing. He made every single field either a varchar or a float and then converted it to whatever he needed it to via dynamic sql calls on the web page side and in the crystal reports. Dates...varchar. Identity columns...float. boolean flags...varchar where the values were either "Y" or blank.

    There was that one identity column that he did make an int! Of course, it should have been a bigint...that was a fun day tracking that one down.

    Rant aside, data typing all columns appropriately (not just shoving dates into varchars) goes a long way to keeping data clean.

    -G

  • Greg A Goss (6/24/2014)


    andrew gothard (6/23/2014)


    VarChar "date" columns.

    AAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRGGGGGGGGGGGGGHHHHHHHHHHHHHH. Andrew ANGRY, Andrew SMASH!

    /raises hand

    I second that "AAAAAARRRGH". I'm responsible for maintaining a web site that basically runs Crystal reports against a "data warehouse" (and yes, the quotes are there to imply sarcasm). The web dev who created the site is actually rather good except for one tiny little thing. He made every single field either a varchar or a float and then converted it to whatever he needed it to via dynamic sql calls on the web page side and in the crystal reports. Dates...varchar. Identity columns...float. boolean flags...varchar where the values were either "Y" or blank.

    There was that one identity column that he did make an int! Of course, it should have been a bigint...that was a fun day tracking that one down.

    Rant aside, data typing all columns appropriately (not just shoving dates into varchars) goes a long way to keeping data clean.

    -G

    Yes, correct data constraint begins with correct data typing.

    I've seen things like: MaritalStatus BigInt or MaritalStatus VarChar(1). Something like that should be either TinyInt or Char(1).

    I once saw a data warehouse where every key column was BigInt, and the fact table had about 50 foreign keys. When I asked the DBA why BigInt was chosen as the data type for all those columns, he responded "Because a 64 bit CPU can process 64 bit integers are more effciently." This was back in 2005 and he was proud of his shiny new 64 bit SQL Server stack. But of course he could offer no proof of the assertion that 64 bit integers would offer some performance gain. A 64 bit integer is simply not a new and improved version of a 32 bit integer.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • I'm gonna go with the un-agile option. 😀

    In the fight against bad data I love the retort "We don't care if it is bad or incomplete, just put it in the database."

    To which I reply, "Are you sure you want questionable data, or obviously bad data?"

    Inevitably I get "We'll deal with it later if it becomes an issue."

    I just laugh to myself about it.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • GilaMonster (6/23/2014)


    andrew gothard (6/23/2014)


    GilaMonster (6/21/2014)


    I love arguing this one with devs, so much 'agile' nonsense sometimes. 😀 "Creating validation in the database is a violation of the DRY principal"

    The concept's not nonsense - their interpretation is.

    True, I should have been clearer.

    My problem isn't with agile, it's with people who incorrectly use agile to justify poor coding practices.

    So true. The problem often lies with laziness justified (incorrectly) by misquoting principles and standards.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Eric M Russell (6/23/2014)


    Often times developers are reluctant to hard code domain or relational constraints, because there are no written specifications formally defining what should be considered valid. How data is conformed, constraints, and exceptioned is definately something that needs to be documented upfront.

    One common issue I see are VarChar "date" columns. What I sometimes do (when refactoring the column to Date type isn't an option) is place a check constraint on the varchar column. First it casts the value to Date type (which will throw an exception if the value isn't a real date), and then it compares the value to a YYYYMMDD formatted conversion of itself (which verifies the value conforms to this standard string format).

    For example:

    create table foo

    (

    foo_date varchar(30) not null

    constraint ck_foo_date_yyyymmdd

    check (foo_date = convert(char(8),cast(foo_date as datetime),112))

    );

    insert into foo (foo_date) values ('2011/02/28');

    Msg 547, Level 16, State 0, Line 1

    The INSERT statement conflicted with the CHECK constraint "ck_foo_date_yyyymmdd".

    insert into foo (foo_date) values ('20110229');

    Msg 241, Level 16, State 1, Line 1

    Conversion failed when converting date and/or time from character string.

    insert into foo (foo_date) values ('20120229');

    (1 row(s) affected

    Not using the correct types for values is not Agile. That is a complete fallacy. I hate it when dates aren't stored as dates as then you might have difficulty with timezones and internationalisation e.g. is it DDMMYYYY or MMDDYYYY when someone stores 06062014?

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Eric M Russell (6/24/2014)


    Greg A Goss (6/24/2014)


    andrew gothard (6/23/2014)


    VarChar "date" columns.

    AAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRGGGGGGGGGGGGGHHHHHHHHHHHHHH. Andrew ANGRY, Andrew SMASH!

    /raises hand

    I second that "AAAAAARRRGH". I'm responsible for maintaining a web site that basically runs Crystal reports against a "data warehouse" (and yes, the quotes are there to imply sarcasm). The web dev who created the site is actually rather good except for one tiny little thing. He made every single field either a varchar or a float and then converted it to whatever he needed it to via dynamic sql calls on the web page side and in the crystal reports. Dates...varchar. Identity columns...float. boolean flags...varchar where the values were either "Y" or blank.

    There was that one identity column that he did make an int! Of course, it should have been a bigint...that was a fun day tracking that one down.

    Rant aside, data typing all columns appropriately (not just shoving dates into varchars) goes a long way to keeping data clean.

    -G

    Yes, correct data constraint begins with correct data typing.

    I've seen things like: MaritalStatus BigInt or MaritalStatus VarChar(1). Something like that should be either TinyInt or Char(1).

    I once saw a data warehouse where every key column was BigInt, and the fact table had about 50 foreign keys. When I asked the DBA why BigInt was chosen as the data type for all those columns, he responded "Because a 64 bit CPU can process 64 bit integers are more effciently." This was back in 2005 and he was proud of his shiny new 64 bit SQL Server stack. But of course he could offer no proof of the assertion that 64 bit integers would offer some performance gain. A 64 bit integer is simply not a new and improved version of a 32 bit integer.

    It was the case when converting WIN16 applications to WIN32 that on the then shiny brand new 32 bit processors that 32 bit integers were processed faster due to the register manipulation required for 16 bit values. I am not going to bother look up the evidence as it is years decades since I have suffered the pain of thunking!!!

    It is not true that 64 bit processors handle 64 bit integers a lot quicker than 32 bit integers. This also has been documented. I suspect that those who were involved in migrating applications from WIN16 to WIN32 might be assuming the same will apply. In their defence it is intuitive to think that it would follow suit, however, it doesn't. Wonder why? Please ask someone who deals with processors at a low level.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • "I know - why not just put the entire address in one column, what could possibly go wrong there ... "

    The number of times I've seen that bit of idiocy :angry:

    I'm a DBA.
    I'm not paid to solve problems. I'm paid to prevent them.

  • I've been amused by some of the comments about dates. :-):-)

    I like this too:

    andrew gothard (6/26/2014)


    "I know - why not just put the entire address in one column, what could possibly go wrong there ... "

    🙂

    Here's another family of problems (or maybe several):

    Inappropriate use of number fields for especially for currency.

    Numbers where there ought to be units documented somewhere (but nobody knows where)

    Numbers where simple common-sense defensive range checking is possible, but isn't done. A trivial example is where for some reason the system wants a "day number". A small or tiny integer is more than adequate and checking that it between 1 (or maybe 0) and 366 would trap nonsense at almost no cost to a programmer.

    Unlike most of you here, I'm not a DBA, I don't claim to be and don't really want to be, but I have had to deal with the consequences of dirty data. Mind you, I suppose it gave me a job for a while.;-)

    Tom Gillies LinkedIn Profilewww.DuhallowGreyGeek.com[/url]

  • Gary Varga (6/26/2014)


    Eric M Russell (6/24/2014)


    Greg A Goss (6/24/2014)


    andrew gothard (6/23/2014)


    VarChar "date" columns.

    AAAAAAAAAAAAAAAAARRRRRRRRRRRRRRRRGGGGGGGGGGGGGHHHHHHHHHHHHHH. Andrew ANGRY, Andrew SMASH!

    /raises hand

    I second that "AAAAAARRRGH". I'm responsible for maintaining a web site that basically runs Crystal reports against a "data warehouse" (and yes, the quotes are there to imply sarcasm). The web dev who created the site is actually rather good except for one tiny little thing. He made every single field either a varchar or a float and then converted it to whatever he needed it to via dynamic sql calls on the web page side and in the crystal reports. Dates...varchar. Identity columns...float. boolean flags...varchar where the values were either "Y" or blank.

    There was that one identity column that he did make an int! Of course, it should have been a bigint...that was a fun day tracking that one down.

    Rant aside, data typing all columns appropriately (not just shoving dates into varchars) goes a long way to keeping data clean.

    -G

    Yes, correct data constraint begins with correct data typing.

    I've seen things like: MaritalStatus BigInt or MaritalStatus VarChar(1). Something like that should be either TinyInt or Char(1).

    I once saw a data warehouse where every key column was BigInt, and the fact table had about 50 foreign keys. When I asked the DBA why BigInt was chosen as the data type for all those columns, he responded "Because a 64 bit CPU can process 64 bit integers are more effciently." This was back in 2005 and he was proud of his shiny new 64 bit SQL Server stack. But of course he could offer no proof of the assertion that 64 bit integers would offer some performance gain. A 64 bit integer is simply not a new and improved version of a 32 bit integer.

    It was the case when converting WIN16 applications to WIN32 that on the then shiny brand new 32 bit processors that 32 bit integers were processed faster due to the register manipulation required for 16 bit values. I am not going to bother look up the evidence as it is years decades since I have suffered the pain of thunking!!!

    It is not true that 64 bit processors handle 64 bit integers a lot quicker than 32 bit integers. This also has been documented. I suspect that those who were involved in migrating applications from WIN16 to WIN32 might be assuming the same will apply. In their defence it is intuitive to think that it would follow suit, however, it doesn't. Wonder why? Please ask someone who deals with processors at a low level.

    Yes, even if there were a non-trivial CPU performance benefit for 64 bit integers (let's assume that all others things being equal it required 30% less CPU utilization), there would still be a strong and compelling reason for a DBA to implement 32, 16, or 8 bit integer data types. Disk I/O or perhaps memory are far more likely to be the biggest bottleneck in a database. Even when CPU is the bottleneck, it's something like execution plan recompilation or cxpacket wait states. There are exceptions, but the vast majority of line of business databases simply don't perform complex integer crunching; they mostly traverse indexes and move pages of data around the disk, memory, and network. The usage of 64 bit integers makes the pages twice or mayben even three times are big, which doubles or tripples the amount of buffer caching, logging, and I/O.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 15 posts - 16 through 30 (of 32 total)

You must be logged in to reply to this topic. Login to reply