SQL database practices

  • Hello friends, I am a casual learner and right now exploring the DBA field (I am a database developer). Now I recently had a discussion with my group of friends and we discussed about a problem where, in a dB server, there are some 8-10 databases for multiple projects.

    One such database is having multiple bulkier tables with years of data. Users complain at times about very slow processing, specially during month ends. I was wondering if such scenario comes to check such dB, what should be the approach ? What dB server and dB level configurations to check ? What should be MDF/ldf file strategy ? Should creating read replicas be any good ? Should partitybe introduced ? Should purging data be seen as the best option ?

    I know it's open ended question but I just want to know ur approach. (Please don't mind asking an open ended question and treat this as a question from buddy who's curious about learning this fascinating field of dB improvement of existing setup facing performance issues).

  • ps In some situations the following may be worth looking at:

    https://www.sqlshack.com/introduction-to-sql-server-filtered-indexes/

  • A large table, even a very large table, should not necessarily be slow.

    You need to review the clustering key(s) on those tables.  The clustered index key selection is the single most important factor for overall table performance.  Hint: the best clus key is most often *not* an identity column.

    Data compression can also be a huge help, but in SQL Server 2014, that's available to you only if you are on Enterprise Edition.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Ken McKelvey wrote:

    multiple bulkier tables with years of data

    https://www.sqlshack.com/database-table-partitioning-sql-server/

    https://www.brentozar.com/archive/2012/08/potential-problems-partitioning/%5B/quote%5D

    Partitioning is not a method of increasing performance of code.  In fact, it usually slows code down, sometimes by a far amount.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sqlenthu 89358 wrote:

    Hello friends, I am a casual learner and right now exploring the DBA field (I am a database developer). Now I recently had a discussion with my group of friends and we discussed about a problem where, in a dB server, there are some 8-10 databases for multiple projects.

    One such database is having multiple bulkier tables with years of data. Users complain at times about very slow processing, specially during month ends. I was wondering if such scenario comes to check such dB, what should be the approach ? What dB server and dB level configurations to check ? What should be MDF/ldf file strategy ? Should creating read replicas be any good ? Should partitybe introduced ? Should purging data be seen as the best option ?

    I know it's open ended question but I just want to know ur approach. (Please don't mind asking an open ended question and treat this as a question from buddy who's curious about learning this fascinating field of dB improvement of existing setup facing performance issues).

    The first thing to look at is the Actual Execution Plan for the code that has performance issues.  Then, realize that about 90% of the performance of code (or lack of) is in the code and related indexes... usually Non-Clustered Indexes.

    While the Clustered Index can certainly help with query performance, be very care if you decide to change the Clustered Index.  If you don't consider everything, you could be setting yourself up for a page splitting, fragmentation cause, excessive log file usage, major blocking nightmare.  Also remember that every non-clustered index also contains the keys of the Clustered Index.

    Also be very wary of any form of compression.  In some cases, it can make a remarkable change for the better.  It can also make things a whole lot worse.  And, because no two indexes are alike, it can cause both for queries and will certainly slow down index maintenance by a factor of 3 or 4 times.  While SELECTs may be used hundreds of times more than an insert or an update, remember that page splits (especially if on compressed data) are done in a system level transaction that will stop all access to given pages until the split(s) are complete.

    The bottom line is that you have to remember that a "Best Practice" is no guarantee of anything and there is almost never such a thing as a cure-all panacea.  Like Scott said, just because a table is big doesn't mean it should be a performance issue.

    A point there is that a whole lot of people think that partitioning is a good way to improve performance especially after they got a performance improvement after doing the partitioning.  It turns out that it wasn't the partitioning that did the trick... it was the change in indexes that did the trick and those changes are also appropriate for monolithic structures and will almost always be faster on such monoliths if the code is good because you won't suffer the overhead that partitioning places on tables and indexing.

    The BEST part of partition a table is if it's a history table of some sort where older data is never again changed.  If that's true, each partition or a set of partitions can be written to a separate file in a separate file group with optimal compression, etc, etc, and then set to READ ONLY.  That means you'll never have to even consider the indexes for any kind of index or statistics maintenance and, the BEST part is, you can take one final backup and never have to back that partition again.  Before you leap into that, remember that MS gives us nothing for free.  You seriously need to read up about how to do restores on partitioned tables and consider that it might actually be better to have partitioned views or even a hybrid of partitioned tables and views.  If it only take you an hour to learn everything about partitioning, then you don't actually know much about partitioning.

    As for the actual questions you ask, those are all great questions.  The first thing I'd do is hit the MS documentation on each subject and do some experiments.  The Google for the same things and try other people's opinions.  IMHO, Brent Ozar has some of the better "opinions" because he actually proves things with code on large databases.

    And be really careful about anyone saying "Well, it's a 'Best Practice" if you..."  Understand that half of all that is written is wrong and the other half is usually written in such a fashion where you can't tell.  I've seen a few famous and very well respected authors write "Holy Grail" articles that are incredibly well written and have test data and all ... and still be absolutely incorrect because the test data they created had an issue like ultra low cardinality, etc, etc.  And, over the last couple of years, I've destroyed many supposed "Best Practices" that that more 98% of the world has been following for more than 2 decades.

    If you want to see some of that, watch the following 90 minute video including the time after the Q'n'A slide.  It demonstrate a whole lot of what I just got done talking about.

    https://www.youtube.com/watch?v=rvZwMNJxqVo

    As with all else, only two things are true... "It Depends" and "Must Look Eye"! 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here's another example where it appears to be a well supported "Holy Grail" article.  But, it produces "week start dates" that are greater than the original date.  That's means it's incorrect even if you don't realize that the "week start dates" are actually incorrect to begin with.

    The only indication that there might be a problem is a comment that someone made in the discussion and yet the author nor any of the people that made 4 edits to the article made any corrections to the code to fix it even after being published about 3.5 years ago.  It's another fine example of "Half of all that has been written is wrong and the other half is written in such a fashion that you can't tell". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There's different school of thoughts over whether the identity column should be a candidate of clustered index or not (mostly in favour of having it as clustered index).

    What's your generic thought of the same @jeff ?

  • sqlenthu 89358 wrote:

    There's different school of thoughts over whether the identity column should be a candidate of clustered index or not (mostly in favour of having it as clustered index).

    What's your generic thought of the same @jeff ?

    I'll repeat the only generic truth there is in SQL Server... "It Depends".

    I first have to agree with a lot of people, including our own Scott Pletcher, that's it's a bit crazy to just up and add a numeric "ID" column to EVERY table.

    A good example of a table that doesn't need one is a "State" table.  You only need the abbreviation and the state name.  Technically, I wouldn't do that either because ISO 3166-2 lists 57 "subdivisions" of the U.S.A. as 50 States, 1 District, and 6 "Outlying areas".  The abbreviation for those 6 outlying areas is also the "country code" for the "subdivisions".  And, the abbreviations are all unique which qualifies them as a non-numeric code/"Natural Key" that can and should be used as both the PK and the Clustered Index.

    It would also be a bit ridiculous to add a numeric "ID" to a table that held the NPA (Area Code) and NXX (Exchange) for telephone numbers in the North American Numbering Plan for Canada, the U.S.A, DC, and the "territories" (outlying areas according to ISO) for telephone numbers.

    There are a whole lot of places where it just doesn't make sense to add a numeric (or other) "Identifier" to a table because of some very well though out natural keys.

    On the other hand, things like Employee and Customer tables are good places for numeric identifiers because the natural keys would be just to wide to include FirstName, LastName, MiddleName, and something else to break the ties.  Even SSNs wouldn't make a good key because they're not actually unique in some places where some pretty nasty mistakes were made.  And, just to say it out loud, they should never be "in clear text".  That means they need to be encrypted with a "salt", etc, and that's going to turn them into a pretty wide (a 32 byte encryption is the minimum suggested by MS and I don't care for minimums in such cases and would use 64 bytes).

    As for when to also make such a numeric identifier the Clustered Index Key, that seriously depends and there are a whole lot of opinions on the subject such as to put it on the columns most used for queries.  That can actually be an extraordinarily good idea or, just as extraordinarily, a bad idea typically depending on the nature of most of the queries (you can't rely on just one or a couple just because they're the most used although I've frequently used a combination of date and a "Identity" column) including the number of columns to be returned for any query, the relative size of each row, and a whole bunch more.  The "sort order" of the most queries may come into play but a lot of people over-play that without considering page splits and the massive blocking that may occur during INSERTs and UPDATEs, especially when "ExpAnsive" updates (which includes expansion of certain previous fixed with datatypes such as NULL dates that are later populated by an UPDATE) come into play.

    Even how "proper" you want your "normalization" to be and the subject of "alternate keys" can come into play but we'll leave those out of this conversation.

    And then there are "numeric" Identifiers such as Random GUIDs.  I've proven that fragmentation isn't really a problem with those by learning HOW to maintain them correctly when they're the Clustered Index BUT... should they be used as the Clustered Index?  The answer is frequently "Yes" (despite what some claim) but, just as frequently, the answer is "No" for the same reason that other numeric "Identifiers" might not be.

    On top of all of that, you have to remember the purpose of non-Clustered Indexes and other things like why adding INCLUDE columns was both a miracle breakthrough and the horror they can cause as well as the fact that the keys for the Clustered Index are added to every non-clustered index and whether or not to explicitly name them to make the non-clustered index unique or not.

    The bottom line is there is no panacea for when numeric "Identifiers" that are being used as the PK should also be the Clustered Index or not.  "It Depends" and "Must Look Eye".

    As to my "generic" thought on it, I've found that, more often than not, a numeric "Identifier" is needed to guarantee that a row is unique without have a Clustered Index that is tens or sometimes hundreds of bytes wide but I'll never call such a trend a "general recommendation" because I do agree with and follow the bottom line I previously stated..."It Depends" and "Must Look Eye".

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply