Denormalization Strategies

  • Thx Lynn... good to know. Still, alligators and all that... :unsure:

    Dr. Les Cardwell, DCS-DSS
    Enterprise Data Architect
    Central Lincoln PUD

  • I'm not a big fan of de-normalization in OLTP databases. If I need to run queries, I'll run them against a structure that is optimized for queries, such as a dimensional model or a heavily indexed Operational Data Store. I think an OLTP database exists for one reason: to capture (and validate) transactional data as quickly as possible. The emphasis is as much on write performance as it is on read performance. Anything that gets in the way of that mission, such as locking associated with read-only queries, should be consciously reduced if not completely eliminated.

    I've actually found that it's easier and faster to develop the OLTP model and the ODS in parallel. It also allows you to build a data source for your down-stream data warehouse that can be accessed without impacting the OLTP database. This is, after all, why so many companies use transactional replication, i.e. in order to populate their ODS for querying/reporting purposes.

    One of the reasons I don't like de-normalization in the OLTP database is that the business intent of the logical data model can become obfuscated by the perceived need for short-cuts in the physical data model. There is a reason why each entity exists in the normalized data model. Referential integrity demonstrates how those entities are related to each other. When those relationships are eliminated and attributes are scattered redundantly around the model the reasoning behind those relationships is no longer clear. A normalized data model is very easy to "read". Once de-normalization occurs a level of unnecessary complexity is added.

    There was a time when the cost of physical referential integrity was quite high and locking algorithms were relatively primitive so de-normalization was a means of overcoming the shortcomings of the engine. That time is past.

    "Beliefs" get in the way of learning.

  • Robert Frasca (3/15/2010)


    I've actually found that it's easier and faster to develop the OLTP model and the ODS in parallel. It also allows you to build a data source for your down-stream data warehouse that can be accessed without impacting the OLTP database. This is, after all, why so many companies use transactional replication, i.e. in order to populate their ODS for querying/reporting purposes.

    I appreciate your purist view of this, and I agree that a transfer from OLTP to an OLAP/Warehouse environment is optimal and best practice for reporting and other analyses; if the resources are available, I totally agree with that notion. However, I have a number of customers and environments where operating that way is quite a challenge because of financial constraints and support and maintenance concerns (i.e. not having a DBA to support it).

  • I feel your pain. I should probably have prefaced my remarks with "when it's possible". It's much easier to take the approach I recommended at the beginning of a project; it's much more difficult to retrofit. However, one of the things I'm working on now is to create that ODS and gradually move queries, starting with the longest-running ones, off of the production OLTP database and onto the ODS to reduce stress. It'll probably take a couple of years, and only then will we be able to think about tweaking the underlying OLTP physical model. Still, it's a cool project with a number of interesting facets to it.

    "Beliefs" get in the way of learning.

  • Sometimes when you identify a one-to-many relationship there is a tendency toward dogmatic decision-making.

    I wrote code not long ago to help users categorize records. Instead of using a lookup table, I used a varchar field and an int field populated by a binary hash function triggered on insert or update. As users enter data, previous entries are selected DISTINCT on that binary hash to offer a drop-down option based on what they're typing. It seemed easier to me than creating a lookup table and writing code to insert into or update the lookup table. (A rough sketch follows at the end of this post.)

    I would only denormalize in rare cases though.
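
    Roughly, the pattern looks like this. Treat it as a sketch: the table, column, and trigger names are invented for illustration, and BINARY_CHECKSUM stands in for whatever hash function you prefer.

    -- Illustrative only: names are made up; any stable hash of the text would do.
    CREATE TABLE dbo.RecordCategory
    (
        RecordID     int IDENTITY(1,1) PRIMARY KEY,
        Category     varchar(100) NOT NULL,
        CategoryHash int          NOT NULL DEFAULT (0)
    );
    GO

    CREATE TRIGGER trg_RecordCategory_Hash ON dbo.RecordCategory
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recompute the hash for the rows just inserted or updated.
        UPDATE rc
        SET    CategoryHash = BINARY_CHECKSUM(UPPER(LTRIM(RTRIM(i.Category))))
        FROM   dbo.RecordCategory rc
        JOIN   inserted i ON i.RecordID = rc.RecordID;
    END
    GO

    -- The drop-down is fed by one representative row per distinct hash:
    SELECT   CategoryHash, MIN(Category) AS Category
    FROM     dbo.RecordCategory
    GROUP BY CategoryHash;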

  • I don't want to hit this too hard, but I see the example more as a case for improved design versus denormalization. Just a few observations, if I may:

    1) The Pet_Medication table should include the date the med was administered and the dosage, whether it's a one-time event or part of a series. The Pet_MedicationSeries table can be eliminated and the Pet ID and Med ID used as a composite key. That change would seem to me to satisfy a rather obvious business requirement. (A rough sketch follows at the end of this post.)

    2) I would speculate this hypothetical business would want some measure of inventory and there is no better place to put that than the Medication table. But that table should also include any requirement for readministering the medication. That would eliminate the need to write a separate select statement for each med and allow for changes without requiring a developer to write more SQL.

    I know the example is hypothetical, but this cat can be skinned a much better way. I think the changes I have suggested would eliminate any need for denormalization and provide a far more flexible application that is less DB developer dependent.
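
    A rough sketch of suggestion 1; the column names and types are my assumptions, not the article's actual definitions:

    -- Pet_Medication now records every administration, one-time or series,
    -- so Pet_MedicationSeries is no longer needed.
    CREATE TABLE Pet_Medication
    (
        FK_Pet           int         NOT NULL,
        FK_Medication    int         NOT NULL,
        AdministeredDate datetime    NOT NULL,
        Dosage           varchar(50) NULL,
        -- Composite key on pet and med as suggested above; add the date to the
        -- key if the same med can be administered to a pet more than once.
        CONSTRAINT PK_Pet_Medication PRIMARY KEY (FK_Pet, FK_Medication)
    );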

  • I agree more with trubolotta than the contents of the article. I do not see why there needs to be a separate table for one-time meds vs. recurring meds. In fact, Pet_Medication could contain a bit column to indicate it is recurring. The overhead of one bit is much less than an entire table along with the foreign keys to maintain it. Possibly another tinyint column could indicate recurring frequency (e.g. 1 = daily, 2 = weekly, 3 = monthly...); a sketch follows at the end of this post.

    A properly designed structure with proper indexes will perform amazingly fast, especially in this environment, unless it's a veterinary clinic with 50 doctors each seeing 30 patients a day, six days a week, with 20 years of medical history and each patient averaging 2 medications per visit.

    Even then, you have to realize that a large percentage of patients may not even require medication. But even so, this is very doable with good response time in SQL Server.

    I have designed HR systems that join up to 20 tables in certain queries with sub-second response time reading the data.
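
    Something along these lines is all it would take; the column names and frequency codes are illustrative, not from the article:

    -- Flag plus optional frequency instead of a separate recurring-meds table.
    ALTER TABLE Pet_Medication ADD
        IsRecurring        bit     NOT NULL DEFAULT (0),
        RecurringFrequency tinyint NULL;  -- e.g. 1 = daily, 2 = weekly, 3 = monthly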

  • Les Cardwell (3/15/2010)


    Since 'getdate()' is a non-deterministic function, like all non-deterministic functions, we've always assigned them to a scalar variable to ensure the dbms won't perform a table-scan...although admittedly, these days they seem to be more implementation dependent.

    Non-determinism isn't the issue. SQL Server can evaluate the expression once and seek on the result.

    Assigning to a variable might cause a poor plan since SQL Server cannot 'look inside' the variable in many cases.
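
    A contrived illustration of the difference (table and column names are assumed, not the article's):

    -- Expression form: GETDATE() is evaluated once per execution and an index
    -- seek on the result is still possible.
    SELECT pm.FK_Pet, pm.FK_Medication
    FROM   dbo.Pet_Medication AS pm
    WHERE  pm.AdministeredDate >= DATEADD(DAY, -30, GETDATE());

    -- Variable form: the optimizer cannot see the value inside the variable at
    -- compile time, so estimates are often generic; OPTION (RECOMPILE) is one
    -- workaround if a variable must be used.
    DECLARE @cutoff datetime;
    SET @cutoff = DATEADD(DAY, -30, GETDATE());

    SELECT pm.FK_Pet, pm.FK_Medication
    FROM   dbo.Pet_Medication AS pm
    WHERE  pm.AdministeredDate >= @cutoff;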

  • While I find the discussion of how to best use the getdate() function interesting, I also find it distressing that the pursuit of code efficiency brushes aside the more important issue of business practicality. As I mentioned in a previous post, the sample presented to demonstrate a "need for denormalization" more aptly demonstrates the need for better design, and that includes how the application might be used in the real world.

    The reason I would run such a report is to notify customers that their pets are due for a particular medication, probably a month in advance. The way the code was written to produce that list requires it be run every day so as not to miss any pet due to date gaps. More likely, the business will run that report once a month or once a week, but not always on some developer-conceived target due date. People do things early or late, and those gaps must be covered. Hence getdate() is totally inappropriate for use in that report (a parameterized alternative is sketched at the end of this post).

    I know my comment digresses from the topic, but I think it is important to use examples that illustrate very practical business considerations, especially those covering fundamentals of design that n00bs wish to learn. Someone reading the article or comments regarding the use of getdate() is going to walk away with the wrong lesson. They may write more efficient code as a result, but they will also write an application that puts undue burden on the business that uses it.
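
    To make that concrete, something along these lines would serve the business better; the procedure, parameter, and column names here are invented for illustration:

    -- Let the business pick the notification window instead of hard-coding
    -- "today", so early, late, or monthly runs leave no date gaps.
    CREATE PROCEDURE dbo.usp_MedicationsDue
        @FromDate datetime,
        @ToDate   datetime
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT pm.FK_Pet, pm.FK_Medication, pm.NextDueDate
        FROM   dbo.Pet_Medication AS pm
        WHERE  pm.NextDueDate >= @FromDate
          AND  pm.NextDueDate <  @ToDate;
    END

    -- Example: notify for everything coming due in April.
    -- EXEC dbo.usp_MedicationsDue @FromDate = '20100401', @ToDate = '20100501';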

  • First: Great article, interesting replies.

    My question:

    This query does not work in MsSql 2000:

    SELECT text, total_worker_time/execution_count AS [Avg CPU Time], execution_count
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY execution_count DESC

    Is there some kind of alternative for MsSql2000?

    (Sorry if this has been asked before, I have read the replies rather quickly)

  • In SQL 2000, it's the syscacheobjects table. There is a sql column and a usecounts column that can give you insights.
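
    Something like this should get you close; there is no worker-time column in syscacheobjects, so it only approximates the original query by ordering on plan reuse:

    -- SQL Server 2000 has no DMVs; syscacheobjects exposes the cached plans instead.
    SELECT   [sql], usecounts, cacheobjtype, objtype
    FROM     master.dbo.syscacheobjects
    WHERE    cacheobjtype = 'Compiled Plan'
    ORDER BY usecounts DESC;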

  • Gilles Willard (3/15/2010)


    Maybe this is due to my lack of experience, or the fact that it's been several years since I last studied database normalization, but:

    It seems to me that what Timothy is proposing is simply adding a few FKs to his database design to cut down on the JOINs required to get at some data.

    How is that denormalizing?

    I went back to my database design textbook (as well as the paper I wrote then on denormalization...) and could not figure out which form (1NF, 2NF or 3NF) was being violated?

    Also, and please forgive my fussiness with this but I have teachers in my family going back three generations, Timothy's article, while being interesting, could have used a quick run through spell-checking. This is not just a "purity" thing: it is a little jarring to find typos or spelling mistakes in a technical article while you're concentrating on understanding the author's point, something like stumbling over a rock while on a good walk.

    I agree about the rocks!

    Maybe your problem with the denormalisation is the way he describes it: he talks about "adding Foreign Keys" when what he actually means is "adding extra columns to support extra Foreign Key relationships" (and he has a barbarian naming convention for the columns which support foreign keys, so that they sound as if they were foreign keys, not columns). When he adds the column called "FK_Medication" to his Pet_MedicationSeries table that introduces a new functional dependency within that table: the value of the column FK_Medication is determined (according to the schema design) completely by the value of the column FK_PetMedication; but FK_PetMedication is not a key, so the table is now not in 3NF.

    Of course the proposed denormalisations are probably not the right way to solve the problem described - the first issue appears to arise largely because the date is not recorded in the pet_medication table (one consequence of this is that for medication that is not part of a series there is no means of discovering the date, so neither the query before the denormalisation nor the query after it can deliver the answer it is claimed to deliver). The solution is to put the date into the Pet_Medication table, and remove it from the Pet_MedicationSeries table. The first query now doesn't need to use the Pet_MedicationSeries table at all (and is now able to deliver what it was claimed to deliver) so there's no need to denormalise that table to help this query, and if it performs badly a clustered index on (date, id) [actually just on (date) would work too, but why not make sure it's unique and use it as the primary key?] will probably work wonders.

    Denormalising to fix a consequence of a schema design that puts attributes in the wrong tables is not a sensible use of denormalisation - it's just a deeper burying of the underlying design problem.

    I think Paul's 3 stars were over-generous.

    (edited - a spelling error)

    Tom
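
    To put the index suggestion above in concrete terms (a sketch; table and column names are assumed, not taken from the article):

    -- With the date moved into Pet_Medication, the due-date query can seek on
    -- this key; it could equally be made the unique clustered primary key, as
    -- suggested above.
    CREATE UNIQUE CLUSTERED INDEX IX_Pet_Medication_Date
        ON dbo.Pet_Medication (AdministeredDate, ID);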

  • Robert Frasca (3/15/2010)


    One of the reasons I don't like de-normalization in the OLTP database is that the business intent of the logical data model can become obfuscated by the perceived need for short-cuts in the physical data model. There is a reason why each entity exists in the normalized data model.

    If by denormalisation you mean violating EKNF (or violating 3NF) I agree with you, but if you mean violating one of the higher levels of normalisation (BCNF, 4NF, 5NF) while maintaining EKNF (or even just maintaining 3NF) I have to disagree. The higher normalisations can destroy the ability to use primary key constraints to enforce all the functional dependencies of business data, and denormalisation from them (in cases where they have resulted in something not already required by EKNF) is more often right than wrong. Long live the principle of representability!

    Tom
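
    For concreteness, the textbook student/subject/teacher case illustrates the trade-off; the sketch below is mine, not from the article:

    -- In 3NF but not BCNF: {Student, Subject} -> Teacher and Teacher -> Subject.
    -- The primary key enforces the first dependency directly.
    CREATE TABLE Tuition
    (
        Student varchar(50) NOT NULL,
        Subject varchar(50) NOT NULL,
        Teacher varchar(50) NOT NULL,
        CONSTRAINT PK_Tuition PRIMARY KEY (Student, Subject)
    );

    -- The BCNF decomposition splits this into (Student, Teacher) and
    -- (Teacher, Subject); after that, no key constraint in either table can
    -- enforce {Student, Subject} -> Teacher, which is the objection above.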

  • trubolotta (3/16/2010)


    While I find the discussion of how to best use the getdate() function interesting, I also find it distressing that the pursuit of code efficiency brushes aside the more important issue of business practicality. As I mentioned in a previous post, the sample presented to demonstrate a "need for denormalization" more aptly demonstrates the need for better design, and that includes how the application might be used in the real world.

    I agree completely.

    Tom

  • Tom.Thomson (3/22/2010)


    Gilles Willard (3/15/2010)


    Maybe this is due to my lack of experience, or the fact that it's been several years since I last studied database normalization, but:

    It seems to me that what Timothy is proposing is simply adding a few FKs to his database design to cut down on the JOINs required to get at some data.

    How is that denormalizing?

    I went back to my database design textbook (as well as the paper I wrote then on denormalization...) and could not figure out which form (1NF, 2NF or 3NF) was being violated?

    Also, and please forgive my fussiness with this but I have teachers in my family going back three generations, Timothy's article, while being interesting, could have used a quick run through spell-checking. This is not just a "purity" thing: it is a little jarring to find typos or spelling mistakes in a technical article while you're concentrating on understanding the author's point, something like stumbling over a rock while on a good walk.

    I agree about the rocks!

    Maybe your problem with the denormalisation is the way he describes it: he talks about "adding Foreign Keys" when what he actually means is "adding extra columns to support extra Foreign Key relationships" (and he has a barbarian naming convention for the columns which support foreign keys, so that they sound as if they were foreign keys, not columns). When he adds the column called "FK_Medication" to his Pet_MedicationSeries table that introduces a new functional dependency within that table: the value of the column FK_Medication is determined (according to the schema design) completely by the value of the column FK_PetMedication; but FK_PetMedication is not a key, so the table is now not in 3NF.

    Of course the proposed denormalisations are probably not the right way to solve the problem described - the first issue appears to arise largely because the date is not recorded in the pet_medication table (one consequence of this is that for medication that is not part of a series there is no means of discovering the date, so neither the query before the denormalisation nor the query after it can deliver the answer it is claimed to deliver). The solution is to put the date into the Pet_Medication table, and remove it from the Pet_MedicationSeries table. The first query now doesn't need to use the Pet_MedicationSeries table at all (and is now able to deliver what it was claimed to deliver) so there's no need to denormalise that table to help this query, and if it performs badly a clustered index on (date, id) [actually just on (date) would work too, but why not make sure it's unique and use it as the primary key?] will probably work wonders.

    Denormalising to fix a consequence of a schema design that puts attributes in the wrong tables is not a sensible use of denormalisation - it's just a deeper burying of the underlying design problem.

    I think Paul's 3 stars were over-generous.

    (edited - a spelling error)

    Tom, thank you.

    After a week and several pages of comments to this article, I was starting to despair of ever getting a clear, to-the-point answer to my question.

    You've now done that.

    You are of course correct in your assessment of how the author's new schema is not in 3NF, and also in describing the fundamental flaw that led to his denormalizing and how it should be resolved.
