Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Stairway to Database Design - STEP 1: Data Elements Expand / Collapse
Author
Message
Posted Tuesday, April 27, 2010 9:54 AM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: 2 days ago @ 6:21 PM
Points: 1,945, Visits: 3,059
>> Suppose my customers want the following four column names in all the inputs, all the reports, and all the result sets: open, high, low, close.<<

Let xx = {open, high, low, close}
Was that "xx_price", "xx_volume", "xx_rating" or something else? Those four names are ambiguous in a financial context, and even more so if we have "xx_temperature", "xx_rainfall", etc.

It is easy enough to use a text editor to global changes. And data dictionary tools can find data elements without definitions in source code today.

>> If I change my column names to something else and rename columns as they are selected, what benefits shall I get for my additional effort, more complexity in my code, my longer code? <<

When I was at AIRMICS, we did research on naming conventions and formatting, we found that you can save 8-12% on maintenance. This was confirmed over and over by TRW, DoD and other large software users. Since 80% or more of the lifetime cost of a system is in maintaining it, this is a lot of money

>> I have the impression that your research was done a while ago and is not completely relevant to modern realities. Considering a modern team in Silicon Valley, where every team member may come from a different country, and many people simply don't have capital letters in their first languages, how can we apply the research done several decades ago, most likely in a much more homogeneous team? <<

Only four languages have cases in their alphabets -- Latin, Greek, Cyrillic and Arabic (initial, middle, terminal and stand-alone forms).

The basics of visual psychology were established by typographers long before computers. The modern research began with Newspapers looking for typefaces and Military research in WWI and WWII.

What has been added in the last decade is better equipment -- Springer had a book out this year on eye movement measurements. Brain wave language studies are less than a decade old.

But the basic laws of visual psychology have held up. Proximity, linear ordering, Boumas, etc. are just getting better measurements.

What is interesting about non-English programmers their logic. Jerry Weinberg touched on this in his classic PSYCHOLOGY OF COMPUTER PROGRAMMING in terms of people writing the new language as if it were the one they know best.

I made the observation that Arabs tended to put the sort columns on the right hand side of reports, even in English and that they would decrement subscripts in arrays to get a right-to-left ordering in their mind.

Chinese students tended to write:
IF <exceptional condition>
THEN <error handling code)
ELSE <normal processing>;

and
DO <statement> UNTIL <exit condition>;

while Westerners would prefer:

IF <expected condition>
THEN <normal processing>
ELSE <error handling code>;

WHILE <continue condition> DO <statement>;



Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #911256
Posted Tuesday, April 27, 2010 11:05 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Tuesday, September 16, 2014 8:22 AM
Points: 1,623, Visits: 359
"But track and standardize what you do."

This is by far the most useful sentence in this article. Over half of the article is jokes and a listing of the author's personal preferences for naming. And the statement that the listed postfixes have exact meanings is only true for those persons that have access to the definitions for them. For example "_loc" is give for a location/address. I work for an engineering company whose primary focus is on nuclear power. Those letters would cause anyone in my company to immediately think "loss of coolant", as this is a common acronym in our work. Since we find that the end product is much better when our developers understand the thing that they're developing, this particular postfix would be potentially confusing to our developers and their managers.

I totally agree that defined standards are necessary. And that is the main point of the article. However, those standards need to make sense for the environment in which they are used.

Edit: By the way, that was a very interesting post on the way that different cultures affect the logical structure of programming. Thanks!
Post #911307
Posted Tuesday, April 27, 2010 11:28 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, October 23, 2012 8:18 AM
Points: 28, Visits: 101
Joe Celko (4/27/2010)

The old IDEF standard used singular names for files because you process the records of the file one at a time. In the RDBMS world, we process an entire set of rows at a time, so we use collective names (or plural if no collective is available). If you do have only one Customer, then that would be the correct table name.


I must respectfully disagree. A singular name is appropriate because what is being modeled is a single relation. You are naming what every row in a table represents. It is not up to the person who has the role of designing the data model to predict how many rows will exist in any given table; that task falls to the person who has the role of implementing the data model. It does not matter if you wind up with a single Customer or a million of them; what each row in the table represents is a single customer, not several customers, and not a "client collection."

But I should say that I while I prefer singular relation names, I can see the logic behind using plural or collective names. This matter is probably one of individual taste. The most important thing is that you choose one or the other convention and stick with it throughout your design.

Nice article, by the way!

Adam
Post #911317
Posted Tuesday, April 27, 2010 11:46 AM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Tuesday, October 14, 2014 7:36 AM
Points: 235, Visits: 728
People who write in weakly typed languages often put a prefix or postfix on data element names to tell them what the original data type was in case they change type in the program. This is a violation of the ISO-11179 metadata rules. This standard comes down to the idea that you name a data element for what it is by its nature and not where it is located in storage or how it is used in one table.

The ISO_11179 format is “[<role>_]<attribute>_<property>” for data elements. A data element has one and only one name in the entire schema. Better, it should have one and only one name in the enterprise.


In my organization, we prefix data element names with a data type abbreviation; e.g., lProjectID is the identifier of a construction project, and the "l" (lower case el) indicates an integer data type. I suppose this is a carryover from an older file-base process where the "l" represented a "long" data type?

Date data elements have "dt" in front of the name, money data elements an "m", etc.

So, my question is, according to the author's presentation of standards, this is sort of silly, correct?

On the other hand, it's nice to immediately know what data type I'm dealing with when it comes to query tuning . . .
Post #911331
Posted Tuesday, April 27, 2010 11:50 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, July 23, 2014 8:16 AM
Points: 1,035, Visits: 410
Joe Celko (4/27/2010)
>> I have used the "Id" only name for most of my databases because while you might end up with a whole bunch of "Id" columns in the database, each one is given an exact context by the table it belongs to. <<

So sometimes "id" is squid, sometimes "id" is an automobile, etc. and you think this is a good thing? Imagine a creature that changes form as it moves from room to room. Fun for Science Fiction, but not for data modeling.

Did you notice that a VIN is always a VIN? Validated by a regular expression from ISO and verified by the manufacturers (or more likely, the DMV, CarFax or dealership).


Yes, as a matter of fact, as the context changes i.e. we are looking in different tables, the MEANING changes. This shouldn't be such a difficult concept.

VIN is also a system generated key, why do you think it makes a difference when a system generated key is generated? You seem to think that if a key is generated only after information about that thing is recorded in a database, that somehow makes that key invalid. I don't.

VIN's, like almost all other Serial Numbers also contains information that record the order in which THAT particular vehicle rolled off the assembly line (kind of like IDENTITY records when the data hit the database...) Yes, VIN can be validated by a regular expression...sort of...VIN's in their current form have only existed for about the last 25 years, if your database can contain information on cars built prior to that and some smaller (and non-U.S.) manufacturers you are out of luck. VIN is also a "complex" key with lots of additional meaning hidden within it which breaks the scalar value rule, so you might want to think of a different example to use...


>> I always name tables in the singular "Customer" instead of "Customers" and then in any referencing tables such as "Invoice" the Customer.Id column is referenced as a foreign key from the Invoice.CustomerId column. <<

The old IDEF standard used singular names for files because you process the records of the file one at a time. In the RDBMS world, we process an entire set of rows at a time, so we use collective names (or plural if no collective is available). If you do have only one Customer, then that would be the correct table name. Sorry business is so bad :)

Consider this:
Employee = bad, only one guy
Employees = better, but your mindset is still locked into record at a time process; you see individuals, trees, not the forest.
Personnel = Best. This is an abstract, higher level set and it shapes your mindset in that direction.



I prefer Date's concept of tables as being variables and thus the rows in a table are a single value (a complex value to be sure, but a value nonetheless) Thus it makes sense to think of the Customer table as containing the current value of the customer variable regardless of how many rows there are.



IDENTITY and other auto-increment features are not attributes of an entity or a relationship. In the case of IDENTITY in the SQL Server family, it is a count of the physical insertion attempts -- not even the successes. It is generated by the local computer and totally without verification or validation. It has nothing to do with a logical data model.

In Kabbalah magic, God puts a Hebrew number on all things in his creation. So if your computer is God and has a Hebrew font, then IDENTITY makes sense in a magical world :)

But did you notice that your insurance company, DMV and automobile chassis all were able to agree on the VIN,even tho your automobile does not have a network connection? Logical identifiers are not anything like physical locators.


I don't recall saying anything about IDENTITY...and although I somewhat share your view on system generated identity values, I'm not quite as averse to them as you are. I prefer to use the inherently "identifying" characteristics of a thing to identify it, but in some cases the set of identifying characteristics is large enough to become cumbersome, so using IDENTITY as a surrogate is a viable option, as long as the natural key is still enforced!

I object to data models where every table has an IDENTITY as a PK mostly because it's just wasteful and lazy, and causes unnecessary joins in queries. This type of model is usually built by developers who know next to nothing about data management and who don't even know enough to realize that surrogate keys don't enforce uniqueness, they create it so you wind up with duplicate rows without even knowing it.

If I'm not mistaken, your real beef with IDENTITY is that it is a platform specific function and not part of any ANSI specification. That argument might have some merits, but in my experience, the actual impact of avoiding all platform specific solutions in favor of ANSI compliant features is mostly negative. You buy some portability (which is a terrific concept, but very seldom used) at the cost of a whole bunch of performance.

I know you've been round and round on this subject (and all the others) and I don't expect to change your mind. I do want to make sure that other people know that there are other ideas out there that have merit as well.



/*****************

If most people are not willing to see the difficulty, this is mainly because, consciously or unconsciously, they assume that it will be they who will settle these questions for the others, and because they are convinced of their own capacity to do this. -Friedrich August von Hayek



*****************/
Post #911336
Posted Tuesday, April 27, 2010 2:20 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: 2 days ago @ 6:21 PM
Points: 1,945, Visits: 3,059
>> In my organization, we prefix data element names with a data type abbreviation; e.g., lProjectID is the identifier of a construction project, and the "l" (lower case el) indicates an integer data type. I suppose this is a carryover from an older file-base process where the "l" represented a "long" data type? <<

Actually, it is a carry-over from FORTRAN II. Six character names that begin with I to N are INTEGERs and the rest are FLOATs.


Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #911457
Posted Tuesday, April 27, 2010 2:22 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: 2 days ago @ 6:21 PM
Points: 1,945, Visits: 3,059
>> I must respectfully disagree. A singular name is appropriate because what is being modeled is a single relation. You are naming what every row in a table represents. <<

You should be naming the table as a whole -- "Sherwood Forest" and not "Sherwood Trees" is the mindset. See the difference?


Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #911458
Posted Tuesday, April 27, 2010 2:25 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Tuesday, October 23, 2012 8:18 AM
Points: 28, Visits: 101
Joe Celko (4/27/2010)
>> I must respectfully disagree. A singular name is appropriate because what is being modeled is a single relation. You are naming what every row in a table represents. <<

You should be naming the table as a whole -- "Sherwood Forest" and not "Sherwood Trees" is the mindset. See the difference?


I see the difference, but I would probably call the table "Sherwood Tree". =)
Post #911459
Posted Tuesday, April 27, 2010 2:44 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: 2 days ago @ 6:21 PM
Points: 1,945, Visits: 3,059
>> VIN is also a system generated key, why do you think it makes a difference when a system generated key is generated? You seem to think that if a key is generated only after information about that thing is recorded in a database, that somehow makes that key invalid. I don't. <<


Most of the parts of a VIN (make, model, etc.) are physical facts which can be verified by looking at the vehicle. The sequence number sub-field in the VIN is physically stamped in the chassis of the vehicle; it is immediately verifiable.

The value of an IDENTITY is non-deterministic. Do an INSERT INTO .. VALUES (), (), ..(); with a list of (n) arguments and you want (n!) possible orderings. I think a key is a property of the entity, which we "discover" before we put it in the database. This is also a property of an OID (Object Identifier).




Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #911473
Posted Tuesday, April 27, 2010 3:33 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Wednesday, October 15, 2014 7:27 AM
Points: 387, Visits: 435
This is a bit of a side track from the article, however…

Joe Celko (4/27/2010)
IDENTITY and other auto-increment features are not attributes of an entity or a relationship. In the case of IDENTITY in the SQL Server family, it is a count of the physical insertion attempts -- not even the successes. It is generated by the local computer and totally without verification or validation. It has nothing to do with a logical data model.

In Kabbalah magic, God puts a Hebrew number on all things in his creation. So if your computer is God and has a Hebrew font, then IDENTITY makes sense in a magical world :)

I think I finally get it. I’ve known for a while your view on using IDENTITY, but never really understood the reasoning why. I’ve been using it for a while to produce SURROGATE KEYS (I primarily work on data warehousing / reporting stuff).

What you have been saying, which I’ll attempt to restate, is that a SURROGATE KEY must be deterministic! Using fields in the record, which ever combination of them it is, has to reproduce the exact same value each time the process / function to create it is run. That is why IDENTITY fails, because the value produced changes depending on when the data is loaded into the table (in relation to other records), not the value(s) of the fields in the record.

I’m not sure, but producing the surrogate key via IDENTITY (ergo non-deterministically) probably violates Codd’s rule #9, maybe #8?


Beer's Law: Absolutum obsoletum
"if it works it's out-of-date"
Post #911502
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse