Best practice question on using a primary key field opposed to using a foreign key field

Question

Best practice question on using a primary key field opposed to using a foreign key field

chocthree

Ten Centuries

Points: 1276
More actions
August 24, 2018 at 8:31 am

#367563

Hi,
Let's keep this simple (for my sake) and say we have two tables.
Table Employee contains:
e.EmployeeID e.Employee e.BadgeAwarded e.CentreID
Table Centre contains:
c.CentreID c.CentreName c.Region
The resultset required is:
EmployeeID CentreID
Should I retrieve the fields from just the Employee single table like:
SELECT e.EmployeeID, e.CentreID
FROM Employee AS e
Or should I use the the Primary Key (well the Primary Key in the original SQL db but it's since been loaded to Hadoop and now I'm extracting from Hadoop which has schema on read) like:
SELECT e.EmployeeID, c.CentreID
FROM Employee AS e JOIN Centre AS c
ON e.CentreID = c.CentreID
The latter uses a join which adds to the response time. However, I'm wondering if it is best practice to use the Centre's version of the attribute as, I expect, back in the SQL db it would've been the Primary Key and therefore better integrity against it. Plus it would've had an index being a PK but I'm not sure if an index has been placed against it in Hadoop.
Thanks.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

drew.allen SSC Guru Points: 76944 More actions · Answer 1

chocthree - Friday, August 24, 2018 8:31 AM
Hi,
Let's keep this simple (for my sake) and say we have two tables.
Table Employee contains:
e.EmployeeID e.Employee e.BadgeAwarded e.CentreID
Table Centre contains:
c.CentreID c.CentreName c.Region
The resultset required is:
EmployeeID CentreID
Should I retrieve the fields from just the Employee single table like:
SELECT e.EmployeeID, e.CentreID
FROM Employee AS e
Or should I use the the Primary Key (well the Primary Key in the original SQL db but it's since been loaded to Hadoop and now I'm extracting from Hadoop which has schema on read) like:
SELECT e.EmployeeID, c.CentreID
FROM Employee AS e JOIN Centre AS c
ON e.CentreID = c.CentreID
The latter uses a join which adds to the response time. However, I'm wondering if it is best practice to use the Centre's version of the attribute as, I expect, back in the SQL db it would've been the Primary Key and therefore better integrity against it. Plus it would've had an index being a PK but I'm not sure if an index has been placed against it in Hadoop.
Thanks.

Best practice is to only pull in necessary tables. A Foreign Key and a Primary Key will have the exact same integrity if the Foreign Key is trusted. If it's not trusted, you better have a VERY GOOD REASON for it not to be trusted.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

chocthree Ten Centuries Points: 1276 More actions · Answer 2

Ok. Thanks.
I don't have visual on the original SQL db so I can only presume the integrity was/is placed throughout. I'm actually guessing that CentreID was a PK. I would assume so, but I have seen tables without the constraints set.

Steve Jones - SSC Editor SSC Guru Points: 734434 More actions · Answer 3

Agree with Drew. I would only query employee, but I'd have a trusted FK here.

drew.allen SSC Guru Points: 76944 More actions · Answer 4

chocthree - Friday, August 24, 2018 9:12 AM
Ok. Thanks.
I don't have visual on the original SQL db so I can only presume the integrity was/is placed throughout. I'm actually guessing that CentreID was a PK. I would assume so, but I have seen tables without the constraints set.

In order for the FK constraint to be put in place, it has to match a unique constraint on the referenced table. It only matters that the constraint be unique, not that it's the PK.

Drew

J. Drew Allen
Business Intelligence Analyst
Philadelphia, PA

jcelko212 32090 SSCrazy Eights Points: 9294 More actions · Answer 5

>> Let's keep this simple (for my sake) and say we have two tables. <<

You are keeping things too simple. This is why the past 30+ years, posting DDL has been minimal netiquette. You also don’t know some of the basics of data modeling. For example unless you really have only one employee, then the name of your table should be a collective or plural noun because it models a set, and not a single entity. Then, by definition, not an option like your narrative, the table must have at least one key. Your awarded badges in center are not really attributes; they are entities that should have their own tables, and their relationship to your personnel should be expressed in other tables. This is part of normalization.

My spelling checker doesn’t like the French spelling, so I’m getting “center” instead of “centre”, that’s kind of minor.

CREATE TABLE Personnel (emp_id CHAR(10) NOT NULL PRIMARY KEY, ---Required! emp_name VARCHAR(35) NOT NULL, awarded_badge_nbr CHAR(5) NOT NULL, center_id CHAR(5) NOT NULL REFERENCES Centers (center_id)); --- Important!!

CREATE TABLE Centers (center_id center_id CHAR(5) NOT NULL PRIMARY KEY,---Required! center_name VARCHAR(25) NOT NULL, region_name CHAR(10) NOT NULL);
>> Should I retrieve the fields [sic] from just the Personnel single table like:

SELECT P.emp_id, P.center_id FROM Personnel AS P;
<<

Yes. It involves no joins,and the references clause that you didn’t show in your non-DDL narrative, will ensure that any center_id that is used has a match. In other SQL products, this would be enforced with pointer chains to a single occurrence of each center_id in the referenced table.

>> .. Or should I use the PRIMARY KEY (well the PRIMARY KEY in the original SQL db but it's since been loaded to Hadoop and now I'm extracting from Hadoop which has schema on read) <<

The lack of data integrity in Hadoop is one of the many reasons I never recommend it to a client. Those products are strictly for queries but have nothing to protect data integrity. My attitude is that if it doesn’t have to be right, then let’s do a Douglas Adams and always return 42 🙂

>>
SELECT P.emp_id, C.center_id FROM Personnel AS e INNER JOIN Centers AS C ON P.center_id = C.center_id;

The latter uses a join which adds to the response time. However, I'm wondering if it is best practice to use the Center's version of the attribute as, I expect, back in the SQL db it would've been the PRIMARY KEY and therefore better integrity against it. Plus it would've had an index being a PK but I'm not sure if an index has been placed against it in Hadoop. <<

Actually, if you have a decent SQL optimizer, it will detect that the INNER JOIN is redundant and remove it from the execution plan. All you’ve done is replace a simple query with a more complicated one that gained you absolutely nothing.[/code]>> Should I retrieve the fields [sic] from just the Personnel single table like:SELECT P.emp_id, P.center_id FROM Personnel AS P; <<Yes. It involves no joins,and the references clause that you didn’t show in your non-DDL narrative, will ensure that any center ID that is used has a match. In other SQL products, this would be enforced with pointer chains to a single occurrence of each center_id in the referenced table.>> .. Or should I use the PRIMARY KEY (well the PRIMARY KEY in the original SQL db but it's since been loaded to Hadoop and now I'm extracting from Hadoop which has schema on read) <<The lack of data integrity in Hadoop is one of the many reasons I never recommend it to a client. Those products are strictly for queries but have nothing to protect data integrity. My attitude is that if it doesn’t have to be right, then let’s do a Douglas Adams and always return 42 :-)>> SELECT P.emp_id, C.center_idFROM Personnel AS e INNER JOIN Centers AS CON P.center_id = C.center_id;The latter uses a join which adds to the response time. However, I'm wondering if it is best practice to use the Center's version of the attribute as, I expect, back in the SQL db it would've been the PRIMARY KEY and therefore better integrity against it. Plus it would've had an index being a PK but I'm not sure if an index has been placed against it in Hadoop. <<Actually, if you have a decent SQL optimizer, it will detect that the INNER JOIN is redundant and remove it from the execution plan. All you’ve done is replace a simple query with a more complicated one that gained you absolutely nothing.

Please post DDL and follow ANSI/ISO standards when asking for help.