Is there any way to improve statistics on a table like this?

  • Hi there!

    After seeing a great session on statistics by Grant at SQL Saturday, I've been eagerly examining statistics on many of the vendor-created tables in my current workplace. Unfortunately, the statistics (and the indexes they spawned from) are fairly horrid.

    As a case in point, here's one of the main tables used by the application:

    CREATE TABLE [dbo].[tblObjects](
        [k_Id] [int] NOT NULL,
        [ParentId] [int] NULL,
        [Type] [int] NOT NULL,
        [Name] [nvarchar](255) NOT NULL,
        [Note] [nvarchar](255) NOT NULL,
        [State] [int] NOT NULL,
        [Lock] [smallint] NOT NULL,
        [LanguageId] [nchar](2) NULL,
        [Creation] [datetime] NULL,
        [Modification] [datetime] NULL,
        [Description] [nvarchar](100) NULL,
        CONSTRAINT [PK_tblObjects] PRIMARY KEY CLUSTERED
        (
            [k_Id] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 95) ON [PRIMARY]
    ) ON [PRIMARY]

    A chunk of sample data is attached as a text file; it's just under a thousand rows.

    Rest of the indexes:

    CREATE NONCLUSTERED INDEX [IX_IDType] ON [dbo].[tblObjects]
    (
        [k_Id] ASC,
        [Type] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [missing_index_13_12_tblObjects] ON [dbo].[tblObjects]
    (
        [Type] ASC
    )
    INCLUDE ([k_Id], [Name]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [missing_index_1729_1728_tblObjects] ON [dbo].[tblObjects]
    (
        [Type] ASC,
        [Name] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [missing_index_234_233_tblObjects] ON [dbo].[tblObjects]
    (
        [Name] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [missing_index_74_73_tblObjects] ON [dbo].[tblObjects]
    (
        [ParentId] ASC,
        [Type] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX [ParentID] ON [dbo].[tblObjects]
    (
        [ParentId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    And yep, those indexes are pretty darn horrible :-P.

    I have free rein to alter and toss indexes as needed, so any (or all!) of these could be vaporized; the main problem here, though, is the utter lack of selectivity anywhere in the table.

    In order, the columns are for the following purposes:

    k_id: It's an IDENTITY column, of sorts; the app stores the last-manufactured k_id in a table, adds one, and inserts it every time a new row is made.

    Parentid: It's the k_id of the "parent" object for the given row; more details later.

    Type: An INT that holds one of a few values; it basically denotes if an item is a project, a questionnaire, the system root objects, and a bunch of other stuff. If an item is a questionnaire, it's the child of a project, and so its Parentid will be the project's k_id.

    Name: Name of the object. There's redundancy, both partial and complete, all over the place here. If a project has specific workers assigned to it, their names will be part of the children for that project; thus, for each project, there's a copy of the worker names.

    Note: Pretty much entirely useless. Users can enter notes for a project here, but they don't show up anywhere in the app, just in the database.

    State: Has a value of 1 or 3; 1 means the item is locked on the front-end, 3 means it's usable. (Nice numbering scheme!)

    LanguageId: Denotes the object's language.

    Creation: Denotes when the object was made. Not used by the front-end at all.

    Modification: Denotes the last time any part of an object was modified. Not used by the front-end.

    Description: Pretty much the same thing as the Note field, though front-end users can actually see these.

    So... The basic problem is that I have no real selectivity here. The app pretty much just fires a SELECT * FROM tblObjects WHERE Name = ... for most of its queries, fetches the ParentId, and runs from there.

    The main thing I was interested in (and I'm not even sure there's a real solution) is how to get a good histogram on this table. All of the indexes, aside from the one with Name and the one with ParentId as the leading edge, have 200 steps with AVG_RANGE_ROWS counts of 1. This causes estimates in the execution plans to assume 1 row on every query, no matter what. Even doing a SELECT Name FROM tblObjects WHERE k_id BETWEEN 20000 AND 20002 gives an inaccurate estimate, painfully enough!
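    For reference, here's how I've been inspecting those histograms; a minimal sketch, assuming (as is typical) that the index-backed statistics carry the index names:

    DBCC SHOW_STATISTICS ('dbo.tblObjects', 'PK_tblObjects') WITH HISTOGRAM;

    The AVG_RANGE_ROWS column of that output is what I'm describing: it comes back as 1 on basically every one of the 200 steps.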

    I know the design of this whole thing is probably beyond my current station's ability to repair, but... Is it possible to potentially fix this by using my own statistics somehow? Could I create a gigantic pile of statistics to segment the k_id values out and restore some fragment of selectivity? I'm guessing doing this would have massive overhead, though, so that's probably not a great idea, either.
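    For concreteness, here's the kind of thing I was imagining; a purely hypothetical sketch, with the statistics name and the k_id boundaries made up, repeated once per band:

    -- One filtered statistic per "band" of k_id values (hypothetical name and range).
    CREATE STATISTICS st_tblObjects_kId_20000_39999
        ON dbo.tblObjects (k_Id)
        WHERE k_Id >= 20000 AND k_Id < 40000
        WITH FULLSCAN;

    Each filtered statistic would get its own 200-step histogram over just its slice of the table, so the steps would be far narrower than a single table-wide histogram could ever be. Multiply that by enough bands to cover the whole k_id range, though, and the creation and maintenance overhead is exactly what worries me.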

    Thanks for any help on this! It's really more of a conceptual question than anything; I'm interested in the mechanics of any sort of potential solution.

    - 😀

  • How many rows in the table?

    Can you post the execution plan (actual please) of that simple query with the inaccurate estimations?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for the interest in the problem, Gail!

    The table has about 160,000 rows right now; the number of rows will vary on any given day, as they're added and deleted when users and projects are created and removed.

    I'll be out of the office until Monday, so I won't be able to provide an actual execution plan until then; I really should have done that initially. Bad omission on my part, but I'll supply it as soon as I can. The actual plan for a query looking for a Name with k_id BETWEEN 20000 AND 20002, or a k_id with Name = @Name, ends up doing an index seek on k_id or Name, but the estimated number of rows for the seek is always a single row, whereas the actual number ends up being however many rows the query really returns. Seeing the actual plan will undoubtedly be more useful than this clunky summary, I'd imagine 😛

    - 😀

    First, please don't post a thousand rows in a code window. It drove even my 4-processor, 6GB-RAM laptop a little nuts because of the "prettifier" that SQL Server Central runs on such code windows. My older machine was absolutely paralyzed when I opened this thread, never mind trying to copy the code. For such large bits of code, attach them as a text file. If you can, you might want to go back and edit your original post to do that. 😉

    Second, (and I'm admittedly assuming that this is supposed to be a true "Adjacency List" where each node has one and only one parent) if the data is a representation of actual data, you might have a serious problem with the data. For example, a ParentID of "12" is rather prominent in your test data but is not to be found as a k_ID. If the real data has a similar problem, people should spend some time cleaning up the data for that system before they worry about such things as indexes and stats. It may be a simple omission on your part, but I don't see a self-referencing constraint (an FK, in this case) that requires the ParentID to exist as a k_ID, nor do I see a unique constraint on the k_ID/ParentID combination. I also don't see a constraint that requires that the k_ID not be equal to the ParentID. Based on those "findings", I'd also check to see if there are any "loops" built into the data and, again, I'd do all of that before I even thought of looking at indexes or stats. This also requires that you change the ParentID of the row that has a k_ID of "1" to NULL.
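    If it helps, here's a rough sketch of both the orphan check and the missing safeguards (the constraint names are just examples, and the FK won't take until the data has been cleaned):

    -- Find ParentIDs that don't exist as a k_ID (orphans).
     SELECT c.k_Id, c.ParentId
       FROM dbo.tblObjects c
       LEFT JOIN dbo.tblObjects p
         ON p.k_Id = c.ParentId
      WHERE c.ParentId IS NOT NULL
        AND p.k_Id IS NULL
    ;
    -- Once the data is clean, prevent future orphans...
    ALTER TABLE dbo.tblObjects
        ADD CONSTRAINT FK_tblObjects_ParentId
            FOREIGN KEY (ParentId) REFERENCES dbo.tblObjects (k_Id)
    ;
    -- ...and prevent a row from being its own parent.
    ALTER TABLE dbo.tblObjects
        ADD CONSTRAINT CK_tblObjects_NotOwnParent
            CHECK (ParentId <> k_Id)
    ;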

    Once all of that is resolved (again, only if this is meant to be a true "Adjacency List") and the correct constraints and other safeguards have been put in place to prevent orphans (such as any rows that have the ParentID of "12" in your test data) and "loops" in the data, then you can start on the indexes and stats. As Gail would remind us, the first question to ask is "is there a performance problem?" If there isn't, then you might not have to worry so much about indexes and stats. At that point, the only thing you might want to consider is a little index consolidation and, possibly, the removal of unused indexes for the sake of nightly maintenance and a bit of savings on hard-disk space used.

    I'd also consider changing Type, State, and Lock to TINYINT to save on index size and the overall footprint of the table, not only on the hard-disk, but for backup and restore sizes. I also don't see anything in the data that requires the use of NCHAR or NVARCHAR data-types. If that's true and there's no intent that there ever will be such a requirement, you're just wasting 50% of the space on those columns and you might want to consider changing them to just CHAR or VARCHAR. Of course, THAT will require a change on the front end because the implicit conversions of searching for NVARCHAR literal names (for example) on VARCHAR columns will prevent the use of indexes altogether. Disk space might be cheap, but the reduction of the underlying data-types could shrink each row to half or even a third of its current size, which means that the table will require less memory, indexes will be smaller and also require less memory, performance could go up a fair bit just because more rows will fit on each page, backup sizes will decrease, and a panic DR restore will take less time.

    If you're traversing the tree to get anything other than single-row lookups (for example, you want to return a whole "set" starting with a given ParentID and want to return the entire down-line of the parent), consider the fact that an "Adjacency List" (Parent/Child table) is great for easy maintenance but relatively sucks for hierarchical traversals and aggregations. It's also absolutely terrible for anything that might require traversals by level. If you're having performance problems with such things or the code has become a maintenance nightmare, consider combining the ease of maintenance of the "Adjacency List" with the blinding speed of "Nested Sets" for traversals and aggregations.
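    To show what I mean, here's the kind of recursive code an "Adjacency List" forces on you just to get the down-line of a single node (a sketch only; the starting k_ID is made up):

    DECLARE @StartId INT = 12345; -- hypothetical starting node

    WITH Downline AS
    (
     SELECT k_Id, ParentId, Name, 1 AS Lvl
       FROM dbo.tblObjects
      WHERE k_Id = @StartId
      UNION ALL
     SELECT c.k_Id, c.ParentId, c.Name, d.Lvl + 1
       FROM dbo.tblObjects c
       JOIN Downline d
         ON c.ParentId = d.k_Id
    )
     SELECT k_Id, ParentId, Name, Lvl
       FROM Downline
     OPTION (MAXRECURSION 0) -- the tree may be deeper than the default limit of 100
    ;

    Every level of the tree is another pass against the table. With "Nested Sets", the same down-line becomes a single range seek.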

    Most methods for maintaining "Nested Sets" are terrible in their complexity and most people use a slow RBAR "push-stack" method to do an initial conversion from "Adjacency List" to "Nested Sets". Please see the following two articles for a much better way to handle all of that. The initial conversion is so fast (4-5 seconds for 160K nodes as compared to nearly 50 minutes using the push-stack method) that some folks have taken to keeping the "Adjacency List" for ease of maintenance (each node is aware of one and only one other node) and just rebuilding the entire "Nested Sets" when there's a change rather than using any of the more complicated maintenance methods.

    Here are a couple of articles complete with code that will build a test "Adjacency List" (a million nodes in the article but is programmable for any size) and demonstrate how it all would work. Of course, the front end code would need to be modified to take full advantage of the "Nested Sets" (well, unless stored procedures were used to abstract the lookup functionality to begin with, in which case you'd be "golden").

    http://www.sqlservercentral.com/articles/Hierarchy/94040/

    http://www.sqlservercentral.com/articles/T-SQL/94570/

    Last but certainly not least and, again, if you're having performance problems, consider splitting all the variable-width "note" and similar columns off into a "sister table", which would have a PK of k_ID from your main table. That would do a couple of things... 1) you wouldn't experience page splits when someone made one of those "fields" larger with extra characters and 2) the basic traversals would run at light speed because it would greatly reduce the row size, allowing many more rows per page where it's important.
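    As a sketch of what that "sister table" might look like (the names here are made up for the example):

    CREATE TABLE dbo.tblObjectNotes
    (
     k_Id        INT           NOT NULL
        CONSTRAINT PK_tblObjectNotes PRIMARY KEY CLUSTERED
        CONSTRAINT FK_tblObjectNotes_tblObjects REFERENCES dbo.tblObjects (k_Id),
     Note        NVARCHAR(255) NOT NULL,
     Description NVARCHAR(100) NULL
    );

    The narrow columns used for traversals stay in tblObjects, and the wide, frequently edited text lives in pages that the seeks and scans never have to touch.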

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the reply, Jeff!

    I promptly nixed the code in the original post and tossed it into a text file; sorry about that! I didn't realize the IFCode window would cause chaos like that; noted, and I'll use text files for large data blobs in the future.

    As far as the table design goes, yep, it's all kinds of wonky. I can't really make any adjustments to it, though; it's vendor-designed, and they've got a view on the table with SELECT * in every database they make, which is referenced quite often :(. There's certainly some gaps, missing data, and other nonsense, as well; the data is only cleaned by an ON DELETE CASCADE foreign key on a few other tables, but specific types of rows are missed by the CASCADE, hence the orphans. Fun!

    Technically, there isn't a performance problem here at present. There was, however, a performance problem when the table had about 300,000 rows in it; at that point, I believe the combination of faulty statistics and horrible queries was starting to cause some mayhem. Reports run through the front-end app, littered with RBAR everywhere, were taking 20-30 minutes to return what should have been the results of a simple SELECT with a JOIN or two. It wasn't nearly that simple, sadly, hence the horrendous execution time!

    We've since trimmed the table down indirectly by archiving a slew of projects, which are the main source of rows in the table; since then, we fell from 300,000 rows to the aforementioned 160,000ish. With this, execution times on reports are... Slightly less horrible, I guess. It takes 15-18 minutes for a report, instead of 20-30.

    The main concern I have is the problem of the table's growth; if it gets back to 300,000 rows, performance will degrade, and beyond 300,000, I can't imagine how much worse it will get. I know this definitely falls into Dwain's line of thinking about sugary cat food, but, well... I've got a giant vat of sugar, and a ton of cat food, and somehow I have to make the users believe it's delicious :-P.

    That pretty much sums up my curiosity about somehow bending the statistics into working better for queries on the table; if I can help this rickety system out anywhere, anyhow, it's bound to be better than what it's currently doing. I probably won't be able to get the system into good enough shape that it's actually pleasant to use, but if I can future-proof it well enough that the performance is at least consistently bad instead of continually worsening, all the better. I don't know if this company will find a DBA with the mindset necessary to endure the craziness of this system if I should move on at some point, so I'd like, at the least, to set things up so they don't get worse.

    - 😀

    In that case, the best you can do is start monitoring and making a plan for the indexes and stats. Check which indexes are used or not and how many times they're used, with the idea of possibly consolidating or just dropping some of the indexes. Maybe run a server-side trace or use Extended Events to monitor for queries that are hitting the table so that you can test them for possible performance problems and as possible candidates to help identify any new indexes that may be needed. Even though the clustered index might not get fragmented, make sure that you at least update stats on it. And, of course, remember that REORGs don't update stats and that REBUILDs don't necessarily have a large enough sample on the stats. If the only thing that you can touch is the indexes and the related stats, then you really need to get used to eating cat food with just the right kind of sugar on it. 😉
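    As a starting point for the "which indexes are actually used" question, something like this will do (keep in mind that the DMV counts reset when the instance restarts, so let it gather a representative period of activity first):

     SELECT i.name AS IndexName,
            s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
       FROM sys.indexes i
       LEFT JOIN sys.dm_db_index_usage_stats s
         ON  s.object_id   = i.object_id
        AND  s.index_id    = i.index_id
        AND  s.database_id = DB_ID()
      WHERE i.object_id = OBJECT_ID('dbo.tblObjects')
    ;

    Anything with zero (or NULL) reads but a pile of user_updates is a candidate for consolidation or removal.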

    Also remember that REORGs only work on the leaf level of indexes. If you want to defrag the BTREE, then you'll need to do a REBUILD. Oddly enough, I've found that REBUILDs cause some fragmentation insofar as the number of fragments goes, and if I really need the indexes to be absolutely air-tight, I'll follow a REBUILD with a REORG. Sounds strange, but it has worked for me on several edge cases.
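    In other words, the weekend run for that table would look something like this (just a sketch):

    ALTER INDEX ALL ON dbo.tblObjects REBUILD;      -- defrags the whole BTREE, not just the leaf level
    ALTER INDEX ALL ON dbo.tblObjects REORGANIZE;   -- compacts the leaf-level fragments a REBUILD can leave behind
    UPDATE STATISTICS dbo.tblObjects WITH FULLSCAN; -- full-sample stats, including the column stats a REBUILD doesn't touch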

    As for bullet-proofing and ensuring consistent performance as the table grows, your intentions are certainly honorable, but performance is in the design of the table and the code that uses the table. Indexes can help if the code is capable of using indexes, but not always. There's only so much you can do with indexes and stats. Of course, you already knew that. :-)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Alright, back in the office today, so I'm attaching two query plans here.

    The first one is for a query like so:

    SELECT Name FROM VoxcoSystem.dbo.tblObjects
    WHERE ParentId = 137402

    It pulls all the workers on a given project; it runs fairly frequently, in order to determine whether a user is authorized to be on that project. The estimates on this one are actually fairly spot-on, but that's probably because the histogram for ParentId is actually very fleshed-out and usable, at least in this case.

    This next query... Not so much.

    SELECT Name FROM VoxcoSystem.dbo.tblObjects
    WHERE ParentId = 195150

    This will determine what projects are in a given directory; here, ParentId's histogram isn't quite so useful: the AVG_RANGE_ROWS for the histogram step this ParentId falls into works out to 14.4, which is exactly the estimated row count; the actual number of rows, however, is 113.

    I could provide some more examples if needed; basically, though, the estimates are all over the place, because the number of objects per parent can vary wildly, and, as Jeff noted, there's orphans and gaps all over the place, which further muddles the issue. In almost every case, the estimates are just going to worsen the problem rather than help, at least from my amateurish guesses :-P.

    Jeff, there's definitely maintenance being done routinely on the tables. It's not a vendor suggestion, and, in fact, we have to keep it a secret from them; they only tolerate their own maintenance solutions being on the server, which they charge a hefty price for :w00t:. Thankfully, the sorta-kinda-DBA before me ignored their demands and just made some maintenance plans himself. I've since replaced the maintenance plans with Ola Hallengren's awesome maintenance routines instead, so all the better!

    But yep, we update statistics nightly, and rebuild indexes on the weekends when the business is closed. Fragmentation, at least on this table, is fairly low, since it doesn't get too many inserts per week. There are some other tables that get pretty fragmented, though, since they experience thousands of inserts per day; however, fragmentation doesn't get much higher than 40% before the rebuilds hit. The histograms on those are absolutely horrid, though, since they're literally just tables with four INT columns and about 1.5 million rows each.

    I know indexing and statistics are a fairly moot point if the queries are horrible, but I'm definitely trying to do what I can to fix up even the slightest little bits. I'll resign myself to serving the fanciest, most delightful cat food this business has seen, if it's the best I can do! 😛

    EDIT: Actually adding the attachments this time!

    - 😀

  • It's odd how SQL Server sometimes recommends an index that already exists. :crazy:

    If that's a very common set of queries, I'd drop the index that contains only the ParentID and add an index keyed on ParentID that INCLUDEs the Name column. From the looks of things, I'd be tempted to drop the index on the NAME column too, but I don't know how often it's used.
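    In other words, something like this (the index name is just my own convention, not gospel):

    DROP INDEX [ParentID] ON dbo.tblObjects;

    CREATE NONCLUSTERED INDEX IX_tblObjects_ParentId
        ON dbo.tblObjects (ParentId)
        INCLUDE (Name);

    That would turn both of the ParentId queries you posted into covering seeks with no lookups back to the clustered index.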

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually, I was already considering doing some index consolidation since you mentioned it in your last post, Jeff :-). Scripted out all of the indexes, and ran some tests.

    I ended up tossing the Name index, which the system doesn't use anyhow; I use it ad-hoc sometimes if I can't figure out the crazy k_id numbering scheme, which is never really consistent. In fact, after some testing, I found I could actually fling out all of the indexes and just create a single one on ParentId, with Name and Type as INCLUDEd columns, and the results seemed to be just about as good as they were before the indexes were dropped.

    Now, the really weird thing is that, after I tossed all the duplicate indexes, the reporting side of the app suddenly got quite a bit faster :crazy:. It usually takes about 5 minutes to get through the navigation screens to set up a report, but now it takes maybe a minute and a half. But removing duplicate indexes shouldn't have done that, right? They only really slow down INSERT and DELETE operations because of the necessity to INSERT and DELETE on each index, if I recall correctly. It doesn't seem like the crazy SELECT operations in the reporting process should have been affected.

    Then again, the app does rely on the client's computer, rather than the server, for all of the heavy lifting, so that just confounds the issue further :w00t:. I'll ask the users to try it and see if they notice any improvements.

    Thanks for the pointers, Jeff! I'd been considering doing some index consolidation here, and having someone affirm my thoughts makes me feel a little less novice-y.

    - 😀

  • It's more likely that the new index is quite a bit better than the others and, since it was just built, has a recently updated set of stats.

    Thanks for the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
