Database Design Theory regarding best practices for querying tables

  • Hi all,

    Up until now, I've largely operated on the principle that any database-related work should be done 100% on the database server, and the application should query the database through specially-constructed procedures to access any information.

    My question is, for a large-scale application with a reasonable level of abstraction, is this really the ideal approach?

    Here's an example scenario:

    Let's say you could abstract out your general base-model for a table to the following:

    CREATE TABLE Generic
    (
     ID INT IDENTITY PRIMARY KEY,
     [Name] VARCHAR(200) NOT NULL,
     [Description] VARCHAR(500) NULL,
     [DateCreated] SMALLDATETIME DEFAULT GETDATE(),
     [DateModified] SMALLDATETIME,
     [DateDeleted] SMALLDATETIME
    )

    And let's say you had a specific implementation of that class as follows:

    CREATE TABLE [Project]
    (
     ID INT IDENTITY PRIMARY KEY,
     [GroupID] INT,
     [CreatorID] INT,
     [Name] VARCHAR(200) NOT NULL,
     [Description] VARCHAR(500) NULL,
     [DateCreated] SMALLDATETIME DEFAULT GETDATE(),
     [DateModified] SMALLDATETIME,
     [DateDeleted] SMALLDATETIME
    )

    Given that structure, I would, generally speaking, have created at least three different stored procedures, GetProjectList, GetProjectList_ByGroupID, and GetProjectList_ByMemberID. Each would, broadly speaking, implement a query of the form SELECT [Columns] FROM Project WHERE [Search Predicates], with the latter two filtering by GroupID and MemberID respectively.

    Now, an alternative approach that could be used is, to instead have just one procedure, GetProjectList, with the _ByGroupID and _ByMemberID versions being implemented using LINQ or some other application-level filtering of the resultset obtained from GetProjectList.

    This approach has a few advantages. It reduces the number of indices required on the tables to handle SARGable querying, and it reduces the number of procedures that need to be maintained. It of course, has the distinct disadvantage of returning potentially very large objects from the database server to the application level.

    So, for a well-designed, scalable, and maintainable application and database, which of those options is recommended for best-practices? Or are there others that should be considered?

  • In short, one size doesn't fit all. Situationally, a single query returning larger data sets may be the better way to go. Situationally, that same architecture could result in horrific performance headaches. There isn't a single, one-size fits all, approach I would take to designing how a system is queried. By and large, I tend to let the development team take the lead on this as long as I can have input on what they do and how they do it. I'm a pretty active proponent of Entity Framework and similar ORM tools. However, that's with full recognition that, again, situationally, the queries generated by ORM tools are woefully inadequate so we're going to need to write specific queries by hand and possibly put them into the database as stored procedures. I try to be flexible in the approach and demand that same degree of flexibility from the dev team. 

    The single glaring exception to this is allowing the ORM tool to design the database as if the database were objects. That doesn't fly. It leads to failed projects. A relational database system is not object storage and attempting to force that mechanism into the database leads to failure. I've seen it. It's ugly.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Hey Grant,

    I'm sure you've worked on some large-scale projects. My main concern with the idea you've suggested is that it seems to not be consistent across the board. Is that not a problem that you've faced? If certain tables use the multiple-procedures route while others use the return-all-records route, does it not become difficult to manage through a uniform policy for larger projects? Do the advantages of having the approach tailored to the particular set of data being returned outweigh the costs of not having a uniform policy?

  • kramaswamy - Monday, July 17, 2017 1:35 PM

    Hey Grant,

    I'm sure you've worked on some large-scale projects. My main concern with the idea you've suggested is that it seems to not be consistent across the board. Is that not a problem that you've faced? If certain tables use the multiple-procedures route while others use the return-all-records route, does it not become difficult to manage through a uniform policy for larger projects? Do the advantages of having the approach tailored to the particular set of data being returned outweigh the costs of not having a uniform policy?

    It gets down to doing what works. Having a single solution for everything is easier. It's just inflexible. It's been my experience that too dogmatic an approach leads to a lot more problems than allowing for flexibility. There is an assumption of competence. If your team is terribly inexperienced a dogmatic approach is better, at least in terms of delivery. Just know that some of what you deliver is not going to function very well. You'll be building technical debt that you'll have to pay for later.

    Same approach for the database system itself. All problems aren't solved by SQL Server. Some are better solved with a document database or a distributed data system. Understanding the problem you're solving is fundamental not only to code choice, but architecture choice.

    I've worked on large projects that failed. Most of these thoughts come from those, not the successful projects. Screwing up teaches me more than getting stuff right it seems.

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who give you the shirt off *someone else's* back.

  • ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

  • roger.plowman - Wednesday, July 19, 2017 7:06 AM

    ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

    I did say "almost always" 🙂.  There will no doubt be a few rare exceptions.

    As to the perceived advantages, the first is generally met well enough by the parent-as-clustering-key approach.  Order#s are created in order by definition, so clustering an Items table on order# first is generally in sequence as well.  Besides, minor fragmentation is inevitable and is not as big a deal as some people perceive it to be.  As another example, datetime for audit/log tables is often vastly more useful than identity, even simplifying it when the table eventually needs purged.  You can still use an identity column in the table, as the PK even too if you prefer, while still clustering by order# first.

    I will say again, however, that under no circumstances should you cluster on identity "by default".  Clustering is far too important to overall performance to blindly follow rote "rules".

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who give you the shirt off *someone else's* back.

  • ScottPletcher - Wednesday, July 19, 2017 8:01 AM

    roger.plowman - Wednesday, July 19, 2017 7:06 AM

    ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

    I did say "almost always" 🙂.  There will no doubt be a few rare exceptions.

    As to the perceived advantages, the first is generally met well enough by the parent-as-clustering-key approach.  Order#s are created in order by definition, so clustering an Items table on order# first is generally in sequence as well.  Besides, minor fragmentation is inevitable and is not as big a deal as some people perceive it to be.  As another example, datetime for audit/log tables is often vastly more useful than identity, even simplifying it when the table eventually needs purged.  You can still use an identity column in the table, as the PK even too if you prefer, while still clustering by order# first.

    I will say again, however, that under no circumstances should you cluster on identity "by default".  Clustering is far too important to overall performance to blindly follow rote "rules".

    Well, audit tables are a special case, since the change date and the identity would automatically synchronize. 🙂 In which case the identity only has a slight advantage over date/time (identities have an absolute precision date/time doesn't, i.e. two changes could happen within .333 seconds of one another).

    I'm a big fan of clustering on identities by default for just about every table. (always acknowledging rare exceptions). Take, for example, a table of patients at a dental office. Clustering them by name means random insertions which will thoroughly (and quickly) fragment the patient table. Clustering by just about anything except an identity would quickly fragment the table.

    Given that your PK is probably going to be an Int32 for efficiency anyway... 🙂

    Can you give me examples where non-identity clustering would make a significant impact on performance. At what scale would it become important (1K recs, 10k recs, 100k recs, etc.)

  • roger.plowman - Wednesday, July 19, 2017 8:23 AM

    ScottPletcher - Wednesday, July 19, 2017 8:01 AM

    roger.plowman - Wednesday, July 19, 2017 7:06 AM

    ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

    I did say "almost always" 🙂.  There will no doubt be a few rare exceptions.

    As to the perceived advantages, the first is generally met well enough by the parent-as-clustering-key approach.  Order#s are created in order by definition, so clustering an Items table on order# first is generally in sequence as well.  Besides, minor fragmentation is inevitable and is not as big a deal as some people perceive it to be.  As another example, datetime for audit/log tables is often vastly more useful than identity, even simplifying it when the table eventually needs purged.  You can still use an identity column in the table, as the PK even too if you prefer, while still clustering by order# first.

    I will say again, however, that under no circumstances should you cluster on identity "by default".  Clustering is far too important to overall performance to blindly follow rote "rules".

    Well, audit tables are a special case, since the change date and the identity would automatically synchronize. 🙂 In which case the identity only has a slight advantage over date/time (identities have an absolute precision date/time doesn't, i.e. two changes could happen within .333 seconds of one another).

    I'm a big fan of clustering on identities by default for just about every table. (always acknowledging rare exceptions). Take, for example, a table of patients at a dental office. Clustering them by name means random insertions which will thoroughly (and quickly) fragment the patient table. Clustering by just about anything except an identity would quickly fragment the table.

    Given that your PK is probably going to be an Int32 for efficiency anyway... 🙂

    Can you give me examples where non-identity clustering would make a significant impact on performance. At what scale would it become important (1K recs, 10k recs, 100k recs, etc.)

    I have very little time now, but briefly:
    Identity never automatically actually syncs with datetime, people just assume it does.  First, the first row to arrive to SQL is not absolutely guaranteed to get an earlier identity value.  Second, any identity value can be inserted at any time.  SQL must take that into account.

    >>I'm a big fan of clustering on identities by default for just about every table.<<
    Horrible mistake.  Don't let laziness and complacency destroy your overall performance.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who give you the shirt off *someone else's* back.

  • roger.plowman - Wednesday, July 19, 2017 8:23 AM

    ScottPletcher - Wednesday, July 19, 2017 8:01 AM

    roger.plowman - Wednesday, July 19, 2017 7:06 AM

    ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

    I did say "almost always" 🙂.  There will no doubt be a few rare exceptions.

    As to the perceived advantages, the first is generally met well enough by the parent-as-clustering-key approach.  Order#s are created in order by definition, so clustering an Items table on order# first is generally in sequence as well.  Besides, minor fragmentation is inevitable and is not as big a deal as some people perceive it to be.  As another example, datetime for audit/log tables is often vastly more useful than identity, even simplifying it when the table eventually needs purged.  You can still use an identity column in the table, as the PK even too if you prefer, while still clustering by order# first.

    I will say again, however, that under no circumstances should you cluster on identity "by default".  Clustering is far too important to overall performance to blindly follow rote "rules".

    Well, audit tables are a special case, since the change date and the identity would automatically synchronize. 🙂 In which case the identity only has a slight advantage over date/time (identities have an absolute precision date/time doesn't, i.e. two changes could happen within .333 seconds of one another).

    I'm a big fan of clustering on identities by default for just about every table. (always acknowledging rare exceptions). Take, for example, a table of patients at a dental office. Clustering them by name means random insertions which will thoroughly (and quickly) fragment the patient table. Clustering by just about anything except an identity would quickly fragment the table.

    Given that your PK is probably going to be an Int32 for efficiency anyway... 🙂

    Can you give me examples where non-identity clustering would make a significant impact on performance. At what scale would it become important (1K recs, 10k recs, 100k recs, etc.)

    That's like asking at what point do we need to worry about performance at all?  And sure to your example a dentist office with a few hundred or a thousand patients might not see any benefit from indexing anything at all.  Setting an identity as the primary key and letting SQL server cluster on it is the lazy way to have both a primary key and an ever increasing clustered index, it just happens to be effective a lot of the time.

    Now on the other hand think about a table that has a clustered index that is not always increasing on a table with hundreds of millions rows receiving millions of inserts a day.  Or a table with an identity set as the clustered that is often queried in large ranges that aren't defined by the identity.

  • ZZartin - Wednesday, July 19, 2017 8:42 AM

    roger.plowman - Wednesday, July 19, 2017 8:23 AM

    ScottPletcher - Wednesday, July 19, 2017 8:01 AM

    roger.plowman - Wednesday, July 19, 2017 7:06 AM

    ScottPletcher - Tuesday, July 18, 2017 11:12 AM

    The first required step in getting reasonably decent table designs is to get rid of the myth that every table has to have an identity column and, far worse, should always be clustered on that identity column ("ID INT IDENTITY PRIMARY KEY").  Indeed, it's most important to do logical data modeling before physical data modeling and not be in such a rush to get to tables and their physical requirements rather than just entities and the business data requirements.  Tables come and go, and undergo huge changes, but core business data lives on!

    In particular child / intersection tables should almost never be clustered on their own identity column, but on their parent(s) key column(s).

    For example, the Orders table would require an order_id that's identity and the table would likely -- though not always -- be clustered on order_id.  But the Items table would almost always be better clustered first on order_id and not on item identity.

    Um, your objection to intersection tables should probably be qualified with "usually" or "most of the time". Physical clustering using identities does have two advantages to consider. First, lack of fragmentation of the underlying table. Second, it makes auditing using a trigger simpler in cases where you need to track who changed it, when, and what the old value was. In those cases every time probably can and should be clustered on an identity. YMMV, of course.

    Depending on the application the key pairs in an intersection table probably aren't that large, nor are the parent groups likely to contain humongous numbers of children (usually). Given this, the lack of clustering isn't generally a big deal, although I imagine if you're dealing with 6 figure or greater scales of parents at a time the disk read overhead might need addressing.

    As always, scaling is a bitch. 😀

    I did say "almost always" 🙂.  There will no doubt be a few rare exceptions.

    As to the perceived advantages, the first is generally met well enough by the parent-as-clustering-key approach.  Order#s are created in order by definition, so clustering an Items table on order# first is generally in sequence as well.  Besides, minor fragmentation is inevitable and is not as big a deal as some people perceive it to be.  As another example, datetime for audit/log tables is often vastly more useful than identity, even simplifying it when the table eventually needs purged.  You can still use an identity column in the table, as the PK even too if you prefer, while still clustering by order# first.

    I will say again, however, that under no circumstances should you cluster on identity "by default".  Clustering is far too important to overall performance to blindly follow rote "rules".

    Well, audit tables are a special case, since the change date and the identity would automatically synchronize. 🙂 In which case the identity only has a slight advantage over date/time (identities have an absolute precision date/time doesn't, i.e. two changes could happen within .333 seconds of one another).

    I'm a big fan of clustering on identities by default for just about every table. (always acknowledging rare exceptions). Take, for example, a table of patients at a dental office. Clustering them by name means random insertions which will thoroughly (and quickly) fragment the patient table. Clustering by just about anything except an identity would quickly fragment the table.

    Given that your PK is probably going to be an Int32 for efficiency anyway... 🙂

    Can you give me examples where non-identity clustering would make a significant impact on performance. At what scale would it become important (1K recs, 10k recs, 100k recs, etc.)

    That's like asking at what point do we need to worry about performance at all?  And sure to your example a dentist office with a few hundred or a thousand patients might not see any benefit from indexing anything at all.  Setting an identity as the primary key and letting SQL server cluster on it is the lazy way to have both a primary key and an ever increasing clustered index, it just happens to be effective a lot of the time.

    Now on the other hand think about a table that has a clustered index that is not always increasing on a table with hundreds of millions rows receiving millions of inserts a day.  Or a table with an identity set as the clustered that is often queried in large ranges that aren't defined by the identity.

    Ok, let's posit a 100M rec table clustered on a non-increasing index and gets, say, 5M recs inserted (randomly) per day. The question becomes, at what point does the operation of block splitting and disk-read overhead become a bigger performance issue than large scale queries that can take advantage of clustering for performance?

    It was my impression that new blocks of records are NOT issued in physically contiguous chunks--which would mean logical clustering does nothing to aid the speed of drive seeks, especially at 100M rec scale. Unless the table in question was isolated to its own partition (database?) on a dedicated physical drive? In case you would still need to defragment the table on a regular basis to receive the benefit of clustering, right?

    So the issue becomes, at what point is an ever-increasing clustered key no longer superior to clustering by a key that causes fragmentation of the physical database?

    Speaking in general terms, isn't an identity column a superior choice for clustering in the vast majority of cases, making the mega-row table with large fragmenting insertions the rare exception?

Viewing 11 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply