SQL query help

  • I think we all struggle because there is no good answer to this problem.

    There are three types of formats for data that serve different purposes:

    Storage - using the built-in data types from sql server and the DBA is responsible.

    Display - usually the front-end app or SSRS is responsible.

    Validation/Edit Mask - usually the front-end app is responsible.

    Often we would like to validate a data item but information has been lost from the front-end so we don't know how it should be validated. For example, are we validating a Canadian Postal Code or and Australian one? We would like to save space on a storage format but stripping out delimiters might eliminate even more information from the data. Data becomes less homogeneous as we become more internationalized. Extreme normalization adds overhead. Other than going to an OODB, does anyone have a good solution to this dilemma?

  • Stripping out delimiters should never eliminate information from the data. If you need to, you can set a tinyint code that reflects the meaning of the delimiters. I'm going to avoid storing delimiters unless 100% absolutely necessary.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (4/23/2015)


    Stripping out delimiters should never eliminate information from the data. If you need to, you can set a tinyint code that reflects the meaning of the delimiters. I'm going to avoid storing delimiters unless 100% absolutely necessary.

    I hope you don't need to store ISBN numbers. The delimiters in an ISBN number are actually significant since each group of numbers are dynamic in length.

  • Lynn Pettis (4/23/2015)


    ScottPletcher (4/23/2015)


    Stripping out delimiters should never eliminate information from the data. If you need to, you can set a tinyint code that reflects the meaning of the delimiters. I'm going to avoid storing delimiters unless 100% absolutely necessary.

    I hope you don't need to store ISBN numbers. The delimiters in an ISBN number are actually significant since each group of numbers are dynamic in length.

    True, because they are separate elements. Thus, to satisfy 1NF, the isbn elements should be stored separately, not as a single, concatenated value. Particularly since you could need to search by a single one of those values; for example, I believe one of the isbn segments effectively identifies the book's language.

    But, again, to me the delimiters themselves should not be stored.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (4/23/2015)


    Stripping out delimiters should never eliminate information from the data. If you need to, you can set a tinyint code that reflects the meaning of the delimiters. I'm going to avoid storing delimiters unless 100% absolutely necessary.

    But it does eliminate information; that is my whole point above.

    And where do you set the tiny code? Codds rules state no column should depend on another column except the primary key.

    I worked for a place with a user defined account numbering system; I told them it was inherently flawed because they wanted to strip out the delimiters on save:

    AAAA-BBB = 1234-567 = 1234567

    AAA-BBBB = 123-4567 = 1234567

    So which account number did we store?

  • Bill Talada (4/23/2015)


    ScottPletcher (4/23/2015)


    Stripping out delimiters should never eliminate information from the data. If you need to, you can set a tinyint code that reflects the meaning of the delimiters. I'm going to avoid storing delimiters unless 100% absolutely necessary.

    But it does eliminate information; that is my whole point above.

    And where do you set the tiny code? Codds rules state no column should depend on another column except the primary key.

    I worked for a place with a user defined account numbering system; I told them it was inherently flawed because they wanted to strip out the delimiters on save:

    AAAA-BBB = 1234-567 = 1234567

    AAA-BBBB = 123-4567 = 1234567

    So which account number did we store?

    If they are different elements, you should have stored:

    1234, 567 --i.e. different columns

    123,4567

    ISBN is displayed with dashes for human convenience, but it's actually composed of elements: Prefix element; Registration group element; Registrant element; Publication element; Check digit. In a relational model, again, under the rules for 1NF, each of those should be stored in separate columns. They can be formatted as necessary for display, just like a date, ssn, or other data would be.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • CELKO (4/23/2015)


    ISBN is displayed with dashes for human convenience, but it's actually composed of elements .. again, under the rules for 1NF, each of those should be stored in separate columns.

    No. Atomic and scalar are different concepts. Does breaking the value into parts destroy the meaning of the data element? In the case of an ISBN, it sure does! I cannot find the book without all of the atomic and it is scalar ISBN. But the {longitude, latitude} pair is atomic because it locates one and only one point on the globe and a pair is not scalar.

    This is trickier than people think when they first learn 1NF.

    Not necessarily, because I might need to search by only a single element, particularly the one for English-language. Or to find books by a specific publisher.

    You can find a single given book by comparing all the separate parts, even if they are specified as one string with dashes. How to do the lookup is a development issue, not a db design issue.

    By your "logic", first name and last name would have to be stored together, since I can't find a single person by only one or the other (in simple systems where name, rather than id, is used for lookup).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • CELKO (4/23/2015)


    I might need to search by only a single element, particularly the one for English-language. Or to find books by a specific publisher.

    The ISBN is atomic; it identifies one and only one book. The book is my โ€œunit of workโ€ which cannot be decomposed. The language and publisher are levels of aggregation up the hierarchy. Thanks to the encoding scheme they are easy to find with substrings and regular expressions.

    In the early days of the ISBN, there were ambiguous encoding, but that was decades ago. Today, each ISBN has a unique parsing and we do not need the dashes. In SQL, the {YEAR, MONTH, DAY, HOUR MINUTE, SECOND} are called fields, so let's use that term. Some encodings have fixed length fields and some have varying length. The varying length ones are tricky.

    By your "logic", first name and last name would have to be stored together, since I can't find a single person by only one or the other (in simple systems where name, rather than id, is used for lookup).[/quote]

    Yes, in the same row just like {longitude, latitude}; again read the stuff about atomic versus scalar. ๐Ÿ™‚ Think about display and abstraction; numerals versus numbers, etc.

    [/quote]

    According to ISBN.org:

    Every ISBN will consist of thirteen digits in 2007. The thirteen digit number is divided into five parts of variable length, each part separated by a hyphen.

    I guess you know more about ISBN than they do???

    From what I read, it's more involved to convert the non-hyphenated, combined ISBN back into the version with hyphens.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • CELKO (4/23/2015)

    Getting back to check digits, the encoding has to be in a string because the weights are assigned by position. There is no concept of a position in an abstract, SQL style numeric data type.

    Of course there is, since you can cast the number to a string any time you want, although you'd never need to compute a check digit that's already stored, the point of a check digit being to verify the data before it is used (edit: stored/used).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • CELKO (4/24/2015)


    Of course there is, since you can cast the number to a string any time you want, although you'd never need to compute a check digit that's already stored,

    Was I casting that integer from its internal format number to decimal, octal or Roman numerals? See the problem?

    A straw man argument. Scott stated explicitly that you can convert the number to a string. Nothing there about octal, Roman numerals or any other format.

  • CELKO (4/24/2015)


    Of course there is, since you can cast the number to a string any time you want, although you'd never need to compute a check digit that's already stored,

    Was I casting that integer from its internal format number to decimal, octal or Roman numerals? See the problem?

    the point of a check digit being to verify the data before it is used.

    Agreed.

    No, it's ridiculous to even try to bring Roman numerals into this.

    Btw, earlier you lectured:

    We do not store computed values in a table

    So why are you storing the check digit, since by definition it can be computed??

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • CELKO (4/21/2015)


    We do not store computed values in a table and we seldom use FLOAT for anything; the rounding errors are illegal in most commercial applications. Do you really measure miles in nanometers?

    nSpouting nonsense again, I see. The roiunding errors in FLOAT are far less than they are in a type like DECIMAL(10,2). Presumably you are objecting to the representation errors, which are not rounding errors. But using the current floating point standard one can choose to have exactly the same representation errors as are forced on one when using a fixed point decimal-oriented type. Unfortunately our RDBMSs are way out of date on this, partly because the SQL standards bodies aren't interested in conforming to the latest FLOAT standard since it would require new versions of the FLOAT type (probably splitting it into two separate types, and adding a greater range of variation in precision than is currently allowed, and of course providing decent error detection for operations on the types) and they aren't interested in conforming with international standards when it requires that much effort.

    Storing computed values in tables is a perfectly reasonable thing to do in circumstances where repeatedly re-computing a value would cost too much performance; that's why we have the computed columns feature with the option of realising the column in the table's permanent storage, and also the original reason for providing manifest views (indexed views in T-SQL-speak). Claiming that computed values should never be stored in tables is just pretending that performance never matters. Obviously storing computed values in such a way that the DBMS doesn't always ensure that the computed value is currently valid would be a mistake, but that's not what you said - you made the claim that storing computed values in tables is always wrong.

    Tom

  • Lynn Pettis (4/23/2015)


    CELKO (4/22/2015)


    you really can divide miles by hours and get speed! ๐Ÿ™‚

    Also, since you are so picky about using the correct terms, distance divided by time equals velocity!

    Maybe you should brush up on your math/science terms.

    He got it right for once, Lynn. He divided miles (a scalar distance) by hours (a scalar time) and got a scalar called speed. It's rather difficult to get a vector (velocity) by dividing one scalar by another.

    I'm well aware that the term "velocity" in non-scientific use often means speed, but I don't see how that would make it wrong to call a speed a speed instead of calling it a velocity and you did in fact specify that you were using velocity in the scientific sense. According to OED velocity in scientific usage is " speed together with the direction of travel, as a vector quantity" (and that's what it is used to mean in every physics, maths, or engineering document I've ever seen it used in). Collins agrees with the OED that it's a vector quantity in scientific usage and Chambers reckons it's a vector in anything other than loose usage.

    Tom

  • CELKO (4/23/2015)


    ISBN is displayed with dashes for human convenience, but it's actually composed of elements .. again, under the rules for 1NF, each of those should be stored in separate columns.

    No. Atomic and scalar are different concepts. Does breaking the value into parts destroy the meaning of the data element? In the case of an ISBN, it sure does! I cannot find the book without all of the atomic and it is scalar ISBN. But the {longitude, latitude} pair is atomic because it locates one and only one point on the globe and a pair is not scalar.

    This is trickier than people think when they first learn 1NF.

    The whole point of the ISBN is that it its individual components identify several attributes of a book which when taken in conjunction provide identification of the book, just as the whole point of the logtitude:latitude position identifier is that its individual components identify several attributes of a position which when taken in conjunction identify the book. I can no more find the position without all of the position identifier. If I choose to assign some simple means of parsing (fixed number of places after decimal point, for example) I can store the whole thing in one column and it is just as atomic and just as scalar as an ISBN. In fact an ISBN is an ordered quintet or for older ISBNs an ordered quartet of numbers, and an ordered quartet or quintet is just as scalar and just as atomic as an ordered pair. Separators in the ISBN case are for use with the printed representation of the ISBN, and a separator is usually either white space or a short dash although in theory any non-numeric character could be used - but claiming that this presentation form is the ISBN would be somewhat perverse, to say the least.

    ISBN's and Positions can be treated as atomic if the individual components are not to be accessible in the database, which will depend on what use is being made of them.

    Calling either a scalar is outrageous: neither is "a quantity having magnitude but no direction, and representable by a single real number" since neither is a quantity and neither has magnitude (neither has direction and each can be represented by a single real number in all sorts of ways, so teh rest of the definition is stisfied); and if you are using some other definition of scalar you are completely out of line both with mathematics and with decent English dictionaries.

    Understanding data structures that are composed of several components appears to be somewhat trickier than you think.

    Tom

  • Lynn Pettis (4/23/2015)


    Also, since you are so picky about using the correct terms, distance divided by time equals velocity!

    Maybe you should brush up on your math/science terms.

    Not true, it equals speed. Velocity is a vector that also includes direction.

    Don Simpson



    I'm not sure about Heisenberg.

Viewing 15 posts - 16 through 30 (of 76 total)

You must be logged in to reply to this topic. Login to reply