Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Stairway to Database Design Level 9: Normalization Expand / Collapse
Author
Message
Posted Thursday, November 17, 2011 11:03 AM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 10:45 AM
Points: 1,945, Visits: 2,996
archie flockhart (11/14/2011)
Rows, columns, fields, records: Microsoft themselves use the terms pretty interchangeably


That drives me nuts. In ANSI Standad SQL, the term "field" is the temporal subpart of a DATE, TIME or TIMESTAMP. Today, we have more problems with columnar and distributed databases,. A single logical row is implemented in multiple physical records. And those records might be moved all over the network or cloud. Now flip it around; a virtrual row or column has no physical record at all; it just pops into being in the presentaiton layer, not the database.

Here are three (there are plenty more) :

1: Your own example of identifying vehicles. Outside the car repair industry, VIN is not a useful identifier as most people don't know their VIN even if they know where to find it; licence plate may work in some circumstances but there can be complications caused by transferrable number plates.

2: What unique identifier do you use for people - let's say, members of a club ?

3: When you say you created an ID for artworks based on material, artist and a sequence number, were you not just creating a slightly fancier ( and harder to use) form of "count of physical insertion attempts" ?


1) I use the VIN in my car insurance, my DMV forms and when i get a traffic ticket or apply for a parking sticker. Now that the VIN is on the dashboardinstead of the engine block

2) That deponds on the role that the person plays. The two I like is their email address (validation is regular expression; validation is a ping) and the DUNS. I had to get a DUNS to consult at DELL (it is free) and it is used internationally.

3) Unh? My artwork_id had nothing to do with the physical location of the record(s) that contain. And it was much easier to use than a machine generated number of some kind. It was easy to pull out the pieces by artist to compute payout, track sales, etc.

Think about using a random number as the key. I still have to have the artist and the media in the table. The locator is just extra storage. The sequence (called an accession number by librarians) was derived by sorting on submission date and alphabetical title . This made a physical inventory search a bit easier. And all of this was shorter than a GUID!


Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #1207762
Posted Thursday, November 17, 2011 11:43 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, March 12, 2014 2:56 PM
Points: 12, Visits: 107
First off- I greatly admire Joe Celko's work. I have his books. I've conversed with him on a SQL challenge years ago. He is the world's SQL guru!

Two Issues:

First, I echo Tom's comments on being overly harsh on surrogate keys. I guess I just don't see the economic downside to relating entities using an IDENTITY as the primary key (an int is an int on all systems, effectively. And I can count on one finger the number of times over the last 15 years that I've switched RDBMS on a project). How does that approach leave the potential for getting boxed into a corner, down the road, from a modeling perspective?

To me, if there aren't significant economic reasons against using a single unique column as the primary (surrogate) key, that leaves only academic reasons, which few of us have time or cost allowance for in the real world (disk space isn't a factor today except for the largest systems).

It costs me more to code the application to work with multiple column primary keys than to use a single IDENTITY column. And *not once* have I ever exclaimed: "Damn! why oh why didn't I use the natural key instead of the IDENTITY column!" (although I have wondered why I didn't take the blue pill).

When implementing web applications, where a record's detail is viewed using URL parameters, I'd much rather code a URL as

somewhere.org/studentcourses/id/1233

rather than

somewhere.org/studentcourses/student_name/smith/course_nbr/ENG101, section_nbr/2

For my projects, I always use IDENTITY columns as primary keys on every entity (INT or BIGINT or some form of GUID as the case may dictate). I always relate entities using the primary key. If there are natural keys, they get a unique constraint. If the application's primary retrieval query (including any joins) for multiple rows needs to be sorted, and no other requirements create issues, a clustered index goes on the sort columns.

If anyone has a visceral reaction to what I just said, please enlighten me how I'm screwing it up. I've made a lot of money on dozens of projects over 20 years for me to give it up without a clear sense of the economic impact to the client for initial project costs, costs relating to refactoring (which happens not a majority of the time), or costs for migration to new hardware or RDBMS (which statistically never happens).

-------------------------------

The second issue is with NULL columns. Time is money. It is too expensive to split out attributes into distinct entities for a record that has a lifecycle where, at the beginning will have many attributes containing NULL values but where, at the end of the record's lifecycle most attributes will have a domain value.

My current project tracks Ideas. 5 different people will eventually be assigned to work on the idea, but assignment to those 5 people will trickle in over time.

Should I create a separate assignment table with a personId and roleId column? That sure is more extensible if I need to assign a 6th person. But adding a column to Idea is cheaper (on the app dev side).

Or should I just have null-able columns in the Idea table? For economic reasons I'm doing this. No views required. No complex 6 table join (Idea to IdeaAssignment 5 times) to get all assignments for a single row select of the Idea table. No sub form on my Idea form to list assignments in a grid control, or some such. And, since I'm using Visual Studio and ADO.NET for a Win Forms app, I don't have to worry about this join getting executed a bazillion times as a form's underlying DataSet gets refreshed. I'm not getting paid to be pure in my design. I'm getting paid to solve a business problem quickly.


Post #1207799
Posted Friday, November 18, 2011 2:42 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, July 9, 2014 2:52 AM
Points: 1,153, Visits: 1,047

1) I use the VIN in my car insurance, my DMV forms and when i get a traffic ticket or apply for a parking sticker. Now that the VIN is on the dashboardinstead of the engine block

2) That deponds on the role that the person plays. The two I like is their email address (validation is regular expression; validation is a ping) and the DUNS. I had to get a DUNS to consult at DELL (it is free) and it is used internationally.

3) Unh? My artwork_id had nothing to do with the physical location of the record(s) that contain. And it was much easier to use than a machine generated number of some kind. It was easy to pull out the pieces by artist to compute payout, track sales, etc.

Think about using a random number as the key. I still have to have the artist and the media in the table. The locator is just extra storage. The sequence (called an accession number by librarians) was derived by sorting on submission date and alphabetical title . This made a physical inventory search a bit easier. And all of this was shorter than a GUID!


Interesting, but all 3 have some issues:

1 - doesn't work the same internationally. I'm in the UK, and although I know my car has a VIN, I have no idea what it is and have never explicitly been asked for it.

2 - people (especially family members) may share email addresses or have none at all; and >99% certainly won't know what DUNS is. People in the same family can also share all of name, address and phone number. For a club or I'd usually allocate a meaningless membership number , and we're back to IDENTITY again as the simplest way to do that ...

3 - The physical location of the artwork is "in our art collection". The sequence number that you generate is essentially "the order in which we acquired the item" and is not a property of the artwork itself .I don't see the conceptual difference between an essentially meaningless IDENTITY which the database allocates when you get an item, and an essentially meaningless artwork number that you create in a more complex way.

I'm also (partly) with QuickDraw on the balancing the theory with some practicalities: real-world developments need to deal with inconvenient people who don't have a DUNS, or don't know their VIN, or who turn up to start a job without their National Insurance (Social Security) number.
Post #1208164
Posted Friday, November 18, 2011 2:46 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Wednesday, July 9, 2014 2:52 AM
Points: 1,153, Visits: 1,047
@QuickDraw: Not sure on the multiple fields: I don't see why you'd need multiple joins to the same table with a normalised design; and if you put 5 "Person" fields in the main table, how do you search to find a person who may appear in any one of the fields ?
Post #1208168
Posted Friday, November 18, 2011 10:25 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: 2 days ago @ 4:52 AM
Points: 8,813, Visits: 9,370
quickdraw (11/17/2011)
And I can count on one finger the number of times over the last 15 years that I've switched RDBMS on a project). How does that approach leave the potential for getting boxed into a corner, down the road, from a modeling perspective?

In the last 20 years I've seem more such changes than I could count on 2 fingers using binary notation (indeed more than the smallest number that would allow me to say that). Since I've seen five times as many switches as you in only 133% of the time I doubt if either your frequency of seeing such changes or mine, or even the two put together and averaged, gives us a large enough sample to get any idea of how often such changes do in fact happen - - it's the usual problem, there are lies, damn lies, and statistics based on inadequate or biased samples, and here we have definitely an inadequate sample.

When implementing web applications, where a record's detail is viewed using URL parameters, I'd much rather code a URL as

somewhere.org/studentcourses/id/1233

rather than

somewhere.org/studentcourses/student_name/smith/course_nbr/ENG101, section_nbr/2

I might have an issue with that; why should URLs have any structural relationship to the actual details of the schema? If you let that happen, then if you ever want to chacge the schema even a little bit you are porobably going to have to change masses of application code; if you hide the base schema behind stored procedures and views, the only extras you need to change when you want to change the base schema are the stored procedures and the views - no application code; of course if the change is to provide new features, you will have some new application code - but existing application code will not be affected by the schema change. And I wouldn't feel that asking the student or whoever is using the website to type a URL or to remember that 1233 is the course number for part 2 of ENG01, instead of allowing him to select ENG101 and 2 from drop down boxes, so the application might be happier with a URL like
somewhere.org/studentcourses.aspx?student_name=smith&detcourse_nbr=ENG101§ion_nbr=2 anyway.

For my projects, I always use IDENTITY columns as primary keys on every entity (INT or BIGINT or some form of GUID as the case may dictate). I always relate entities using the primary key. If there are natural keys, they get a unique constraint. If the application's primary retrieval query (including any joins) for multiple rows needs to be sorted, and no other requirements create issues, a clustered index goes on the sort columns.

If I'm using a surrogate key, I tend to make the natural key the primary key in the defining table, and put unique and not null constraints on the surrogate (also of course for each candidate key that isn't the primary key a unique constraint on the column list plus not null constraints on each column). I believe this makes it clear to anyone who looks at the schema (my stuff is often - not always - designed for long life - other people will have to cope with it after I have moved on) that the natural primary key is exactly that - the natural primary key. If I most commonly want to deal in natural key order, I'll cluster on that (and make sure to schedule reorganising that index now an again); if I want to make sure I insert records in increasing primary key order, I'll cluster on the surrogate; and if it's clear that my most common queries will want to scan (either the whole table or a range) in an order based on some other list of attributes, I'll create an index on that list and cluster on it. I guess our only difference really is in not maing the surrogate the primary key in the defining table - it's certainly got to be the key used in all foreign key relationships.

The second issue is with NULL columns. Time is money. It is too expensive to split out attributes into distinct entities for a record that has a lifecycle where, at the beginning will have many attributes containing NULL values but where, at the end of the record's lifecycle most attributes will have a domain value.

Looking at Joe's article, I don't see him saying anything nasty about nulls - he's a little less nasty about them at one point than I think is proper, but if I were writing something like that I might easily (accidentally) be a bit soft on the myself. I can't imagine how it is expensive to splt out entities except in extreme cases (but I know those extreme cases do exist - somewhere on this site is a big fun bust-up between me an one of the loony anti-null fundamentalist brigade).

My current project tracks Ideas. 5 different people will eventually be assigned to work on the idea, but assignment to those 5 people will trickle in over time.

Should I create a separate assignment table with a personId and roleId column? That sure is more extensible if I need to assign a 6th person. But adding a column to Idea is cheaper (on the app dev side).

Not if you have queries like "who is currently assigned to three or more ideas" and "list all the other ideas to which people assigned to this one are also assigned" or "list all the ideas to which at least 3 of this set of people are assigned". The multiple nullable columns in Idea make these a real pain to write, compared to trivially simple queries with the separate assignment table; they will also perform a lot worse than the corresponding queries on the assignment table. And if you want to get "list ideas to which Fred is assigned quickly" you will need to add five indexes to the ideas table if you don't have an assignment table - that slows down all your inserts and some of your updates, as well as making your DDL a bit more complicated.

Or should I just have null-able columns in the Idea table? For economic reasons I'm doing this. No views required. No complex 6 table join (Idea to IdeaAssignment 5 times) to get all assignments for a single row select of the Idea table. No sub form on my Idea form to list assignments in a grid control, or some such. And, since I'm using Visual Studio and ADO.NET for a Win Forms app, I don't have to worry about this join getting executed a bazillion times as a form's underlying DataSet gets refreshed. I'm not getting paid to be pure in my design. I'm getting paid to solve a business problem quickly.

I guess your 6 way join consists of the five outer joins to create the alternative version of the Idea table (the one with the 5 nullable assignment columns), and yes, if you split off the assignment table and need to return that information in that form you are stuck with that six way join; but that is just one query that becomes a bit more complex, and the query is only complex because you are doing formatting (conversion from 5 rows sharing a common first column to one row with the first column and all five of what were second columns) in the database instead of in the application; and of course by not doing the join you can return a smaller record set (more rows, but less total data) to the application for any idea that doesn't yet have more than 3 people assigned to it - which, if new ideas are the most frequently accessed ones, as seems likely, gives a net reduction in bandwidth between db and client.
So I think I disagree with you about where allowing nulls is appropriate - but of course it depends on all the details of the application and how often various different things are done in practise with that particular application, and we can't sensibly make hard and fast rules about it. I'm inclined to think that your example is a violation of 1NF - that this is one of those cases where using NULLs is not the best solution - look at the last paragraph of my 1NF article at on this site and I think you'll see what I'm getting at - but in the partcular cases where you've done it that may not be correct at all.


Tom
Post #1208524
Posted Friday, November 18, 2011 10:49 AM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: 2 days ago @ 4:52 AM
Points: 8,813, Visits: 9,370
archie flockhart (11/18/2011)
@QuickDraw: Not sure on the multiple fields: I don't see why you'd need multiple joins to the same table with a normalised design; and if you put 5 "Person" fields in the main table, how do you search to find a person who may appear in any one of the fields ?

If he wants to return a single record to the app listing all the people assigned to work on a particular idea he actually needs a query something like this:
 
;
WITH assigned_people(A.pID, P.name) as (
SELECT A.ideaID, name FROM People P INNER JOIN Assignment A ON P.pID=A.pID
)
SELECT I.ideaID, I.ideaname, AP1.name, AP2.name ,AP3.name, AP4.name,AP5.name
FROM idea I LEFT OUTER JOIN assigned_people AP1 on AP1.ideaID = I.idea_id
LEFT OUTER JOIN assigned_people AP2 on AP2.ideaID = I.idea_id
LEFT OUTER JOIN assigned_people AP3 on AP3.ideaID = I.idea_id
LEFT OUTER JOIN assigned_people AP4 on AP4.ideaID = I.idea_id
LEFT OUTER JOIN assigned_people AP5 on AP5.ideaID = I.idea_id

which is indeed a 6-way (actually 7-way, as I've include the People table) join. To get the (unnormalised) result to pass back to the app from the fully normalised database he needs that horrible looking (but actually harmlesds, if the indexing is right) join.


Tom
Post #1208544
Posted Friday, November 18, 2011 12:36 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, March 12, 2014 2:56 PM
Points: 12, Visits: 107
@archie: The need to answer a question like "show me all records that Bob Smith has anything to do with" comes up so rarely, I don't think about it in the design. I use a Query By Example form, where users would enter a name into the Assigned To field and the SQL would search just that attribute for the given name. If the customer insisted on such a unified search (enter any criteria into a single search field), I would have to build a more complex query that added where clauses (logical OR) for each searchable attribute (with any necessary joins).

If you want a single row in a spreadsheet style display (grid control) where one row lists all attributes (a column for each associated person), you have to join the main entity to the person table 5 times to get the people's names:

[Columns all in the main entity]
SELECT * FROM Idea AS i INNER JOIN 
Person AS p_ideasource ON i.ideaSource = p_ideasource.personId INNER JOIN
Person AS p_assigned ON i.assigned = p_assigned.personId INNER JOIN
Person AS p_budgetOwner ON i.budgetOwner = p_budgetOwner.personId INNER JOIN
Person AS p_legalReviewer ON i.legalReviewer = p_legalReviewer.personId INNER JOIN
Person AS p_financialReviewer ON i.financialReviewer = p_financialReviewer.personId INNER JOIN
...snip...

[Separate relation table]
SELECT * FROM Idea AS i INNER JOIN 
Idea_Person AS a1 ON i.ideaSource = a1.personId INNER JOIN
Person AS p_ideaSource ON a1.personId = p_assigned.personId AND a1.role = 'Idea Source' INNER JOIN
Idea_Person AS a2 ON i.assigned = a2.personId INNER JOIN
Person AS p_assigned ON a2.personId = p_assigned.personId AND a2.role = 'Assigned To' INNER JOIN
Idea_Person AS a3 ON i.budgetOwner = a3.personId INNER JOIN
Person AS p_budgetOwner ON a3.personId = p_budgetOwner.personId AND a3.role = 'Budget Owner' INNER JOIN
Idea_Person AS a4 ON i.legalReviewer = a4.personId INNER JOIN
Person AS p_legalReviewer ON a4.personId = p_legalReviewer.personId AND a4.role = 'Legal Reviewer' INNER JOIN
Idea_Person AS a5 ON i.financialReviewer = a5.personId INNER JOIN
Person AS p_financialReviewer ON a5.personId = p_financialReviewer.personId AND a5.role = 'Financial Reviewer' INNER JOIN
...snip...

[Disclaimer: never use SELECT * unless there is a good reason!]

The model of the later query is obviously superior from purely extensibility and academic standpoints.

So, you can take issue with the SQL complexity and extra joins, but really the app coding effort to do all the extra inserts and updates to the relation table is what I am trying to avoid (and mapping role name phrases to magic numbers the code will use to defend against role names being changed) . Instead, I'll just live with null-able columns in the idea table, since there is no effective cost, or none I've ever encountered anyway.




Post #1208627
Posted Saturday, November 19, 2011 12:08 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, March 12, 2014 2:56 PM
Points: 12, Visits: 107
I might have an issue with that; why should URLs have any structural relationship to the actual details of the schema? If you let that happen, then if you ever want to change the schema even a little bit you are probably going to have to change masses of application code; if you hide the base schema behind stored procedures and views, the only extras you need to change when you want to change the base schema are the stored procedures and the views - no application code; of course if the change is to provide new features, you will have some new application code - but existing application code will not be affected by the schema change.


You're absolutely right. Don't tie any part of the app directly to the schema. I'm just talking about keeping simple the code that writes the URLs. A URL for a db web app can get long as it is. Using natural keys all over the place just makes the URLs really long. I don't ever expect someone to type in a URL. But passing them in emails sure happens all the time, and you don't want a 5 line (200 character, whatever) URL in an email.

You *should* use sprocs, views, etc. But the sproc needs to come up with a number to receive as a valid query, and give one for using to create a URL, and that number needs to persist in the database. That means using IDENTITY values as references to entity instances (rows, sorry for the object-speak). So why not just make the IDENTITY value be the primary key?

Depending on the topology of the application<>db landscape, it may be hard (read: more expensive) for the app to be abstracted from the db design. If you have a local, embedded database like sqlite at the client, syncing to a central server (and in my current case, pushing and pulling changes directly between clients until the client apps connects back to internet and syncs with the central server), unifying the model between the disconnected app and the central server seems to me to make a lot of sense.

Not if you have queries like "who is currently assigned to three or more ideas" and "list all the other ideas to which people assigned to this one are also assigned" or "list all the ideas to which at least 3 of this set of people are assigned". The multiple null-able columns in Idea make these a real pain to write, compared to trivially simple queries with the separate assignment table;


I agree. And if the client would pay for it, I would design it with an assignment table. The difference is more than few hours of work, so I give the pros and cons of both ways, and let the client decide, based on their own assessment of future reporting needs. If I perceive they will have more needs than the design will allow smoothly, I try and talk them into a better design. I just don't say "This is definitely how you should do it!". A lot of companies I work with are non-profits, and cash is king! Sorry if I'm wandering off topic, but I think the cost of the application and the budget today play a bigger role than we generally want to allow in these discussions. I guess this site is more about the right, or pure way to do things, for larger companies that have lots of money.

This project and the assignment of people might not have been the best examples to use, as I didn't explain more how the data will be used, and how these (your) queries will never come up. I picked person assignment as an example of multiple null (initial value) columns for a record, but that might not have been the best example to use for making a general case about the use of nulls.

if new ideas are the most frequently accessed ones, as seems likely, gives a net reduction in bandwidth between db and client.


I agree. If bandwidth was king, instead of cash...(boy am I setting myself for beating with all this talk which is anathema in the design community!)

I'm inclined to think that your example is a violation of 1NF -


Ouch! It sounds so dirty to break 1NF! I never took well to authority. Based on the definition in the article you linked to, I'm not sure I'm breaking 1NF:

"A table is in 1NF when every column has a fixed simple data-type that is understood by the database system, and every row has a primary key."

How is having multiple assignment columns, or a surrogate primary key breaking that definition?

-----------------------------

The future scenario* I didn't share for phase two: assignment (person, to the idea) is a relation entity with its own attributes that are specific to the role:

(* Could write an article or three about designing databases where the application is built in phases, and cost dictates that you don't use the mac-daddy db design from day one).

Idea Assigned To:
Idea (key)
Person (key)
Initial Assessment Due Date (date)
Assessment (varchar)
Rating (int)

Idea Legal Reviewer:
Idea (key)
Person (key)
Waldorf Value Sale (decimal) *
Assessment (varchar)
Objections (varchar)

(and so on....)

(* I totally made this attribute up)






Post #1208837
Posted Saturday, November 19, 2011 7:35 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: 2 days ago @ 4:52 AM
Points: 8,813, Visits: 9,370
quickdraw (11/19/2011)
Ouch! It sounds so dirty to break 1NF! I never took well to authority. Based on the definition in the article you linked to, I'm not sure I'm breaking 1NF:

"A table is in 1NF when every column has a fixed simple data-type that is understood by the database system, and every row has a primary key."

How is having multiple assignment columns, or a surrogate primary key breaking that definition?

There is absolutely nothing wrong with surrogate keys, provided they are done properly (when they are not done properly - eg when there is no UNIQUE constraint on the natural key - all hell can break lose; I've had to fix some of those that I inherited).

Multiple assignment columns may of may not be a violation of the spirit of 1NF: are the multiple columns each a genuine attribue of the entity, or are they an attempt to describe a complex non-atomic attribute (the set of assignments)? In the former case they are probably not a violation of 1NF (there are some purists out there who would shoot me down). In the latter, they probably are. Why are these columns nullable? Is it because the value of the corresponding attribute is not recorded in the database, or is it because the attribute may not actually be an attribute of the entity represented by the row? In the former case, there isn't a 1NF issue with the nullability (but there are plenty of anti-null fundamentalists out there who would claim that there is); in the latter case, there certainly is. As I said it isn't clear that in your particular case there is a violation of 1NF - I suspected there was, but did say that there might not be; now that I've seen your code example with separate roles clearly identified as such it seems clear that there isn't (although there are plenty of people out there who will say that even so there is).

One other question: why would you want the surrogate to be the primary key in the table that defines it? That seems just perverse to me unless in a specific case there is no use for an index on the natural key. The surrogate can be be the target of foreign keys if it is subject to a unique constraint, foreign keys are not restricted to pointing to the primary key of the target table, and the primary key doesn't have to be the index you cluster on, so the only good reason for not using the natural key as primary key is that it is not useful to have an index on the natural key (which in my experience is a pretty rare occurrence). Of course there are entities whose natural key is an identifier provided somewhere in the system (not necessarily by the database - it could be generated manually; for example a payroll number could be the identifier for a person, and payroll numbers were used for that purpose long before the invention of databases; maybe current name, name at birth, date and time of birth, and place of birth could work as a natural key, but you are fairly unlikely to have that information about all your employees).


Tom
Post #1208872
Posted Sunday, November 20, 2011 1:59 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Wednesday, March 12, 2014 2:56 PM
Points: 12, Visits: 107
One other question: why would you want the surrogate to be the primary key in the table that defines it?


The best example is a value domain table (I call them lookup tables. Others I've seen: options table, domain table) used for restricting an attribute's value to a pre-defined set of values (CHECK constraint in SQL Server, ENUM in MySQL and other programming languages):

CREATE TABLE [dbo].[zIdeaPriority](
[id] [int] IDENTITY(1,1) NOT NULL,
[shortDescription] [varchar](15) NOT NULL,
[longDescription] [varchar](255) NULL,
[visibleFlag] [bit] NOT NULL, /* for deprecating old values with referencing rows in a main entity */
[ordering] [int] NULL, /* order to display in presentation layer */
[appKey] [varchar](20) NULL, /* magic value hard coded (w/ lookup func) in app */
CONSTRAINT [PK_zIdeaPriority] PRIMARY KEY NONCLUSTERED ([id] ASC),
CONSTRAINT [IXU_zIdeaPriority1] UNIQUE NONCLUSTERED ([shortDescription] ASC)
CONSTRAINT [IXU_zIdeaPriority2] UNIQUE NONCLUSTERED ([appkey] ASC)
)
CREATE CLUSTERED INDEX IXC_zIdeaPriority2 ON [zIdeaPriority]([ordering],[shortDescription] ASC)

[Aside: To separate main entity tables from tables use to support application presentation, I prefix lookup tables with the letter z. Also, the appkey is the magic value in application code so users can change shortDescription and code that events off a priority value doesn't break by not hard coding a shortDescription value, nor hard coding the id value. Not hard coding the id (attribute?) of a domain value allows you to truncate and reload the table with different shortDescription values but the same appKey values with no recompile necessary. This was very important on one project during user testing of various sets of a three tier classification scheme. Note: It costs more to code every time the app needs to "look up" an appKey value to get an id value.]

Here the id column is technically a surrogate. You can have a discussion all day about does the id column really represents an attribute of a domain value. If it does, then it should be a part of the natural key (along with shortDescription, or maybe not). If not, it's a surrogate and does that make shortDescription the natural key?

However, there is a big problem if you say the id column is part of the natural key. For a lookup table, which represents a value domain, a composite natural key of id + shortDescription is bad, bad, bad. {1,'High'} and {2,'High'} isn't allowed for a value domain, but is a perfectly legal natural key! (I don't say the id column should be part of the natural key, so no problem).

Anyway, I guess I consider the surrogate key generated for the value domain to be the domain value (mapping to it in code through the appKey attribute). Maybe you can't call that a surrogate key in that case. Maybe the IDENTITY column *is* the natural key! (I gotta stop here a sec and process that...) In any case, you need a unique on shortDescription.

I guess this gets down to philosophy about how you model a value domain. I'd love to get Joe's opinion on how he fits in value domains in his models. I gotta go thumb through his books again... Is there a natural key for these tables? Or is a value domain a special enough animal to say there is no natural key. Rows in these tables are just tokens. I don't know. I just know I use a lot of these tables in applications I build so that I standardized on this model. I always use this model for value domains, without even thinking about it.

Post #1208896
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse