Age-old question: Cluster on Date or ID?

Question

Age-old question: Cluster on Date or ID?

jordantx

Mr or Mrs. 500

Points: 588
More actions
November 17, 2010 at 12:45 pm

#231715

I'm becoming more & more impressed with the community here. I've been reading and learning quite a bit. Since you've all been so helpful & gracious, I would like to ask a followup question to my previous thread: http://www.sqlservercentral.com/Forums/Topic993129-338-1.aspx (Please note that the "Widgets" table I mention in that thread was actually a pseudonym for the table below.)
I have a Schedule table which looks something like this:
CREATE TABLE Schedule (
ID int IDENTITY(1,1) NOT NULL,
ClientID int NULL,
EmployeeID int NULL,
OfficeID int NULL,
ActiveSchedule bit,
Status varchar(5),
SchedDateTime smalldatetime NULL,
DataField1 varchar(50) NULL,
DataField2 varchar(50) NULL,
etc.
CONSTRAINT [PK_Schedule] PRIMARY KEY CLUSTERED
( [ID] ASC
) PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
CREATE NONCLUSTERED INDEX [IX_Schedule_SchedDateTime] ON [dbo].[Schedule]
( [SchedDateTime] ASC ) INCLUDE ( [ActiveSchedule], [Status], [ID])
For the majority of queries against this table, we're looking at a date range with SchedDateTime, but with several other criteria -- such as JOINs against Client, Employee and Office tables (yes we have FK constraints), plus WHERE clauses with ActiveSchedule flag, and often several more criteria. There are also many procs where we query the table without using SchedDateTime.
The table has 5.6 million rows and is growing quickly. It's also very wide with 75 columns, many of which are VARCHARs. I'm looking into partitioning and normalization (removing some columns into sub-tables), but the table is referenced in over 1000 pieces of code, so these changes will take awhile to implement.
The majority of days have at least 4000-5000 schedule records. The days with the most have over 9000 records. Many of these records will have the same exact value for SchedDateTime because the Employee begins scheduled work at, say, 9:00 A.M.
After posting the thread referenced above, I added the primary key on ID, because there was no clustered index at the time. Now I am wondering if I should've instead made a clustered index on SchedDateTime, or perhaps on (SchedDateTime, ID).
I know one of the criteria for a clustered is selectivity/uniqueness... Does 5000-9000 records count as unique? Oh also, the user can change the SchedDateTime value at any time, and while it's not extremely frequent, the values are modified somewhat often. Yet another reason not to use a clustered on date, I realize -- but I really just want to confirm that I've made the right choice (with clustered on ID).
I'm also wondering if I should've made the clustered index on ID as an index, not a primary key, so I could add more covering/include columns to it (EmployeeID, ClientID, etc.)
To optimize existing queries which perform date range lookups, I've been coding something like this:
CREATE TABLE #tmpSchedIDs (ID int not null)
INSERT #tmpSchedIDs
SELECT s.ID
FROM Schedule s
WHERE
s.SchedDateTime >= @FromDate AND
s.SchedDateTime < @ToDate
CREATE NONCLUSTERED INDEX IX_tmpSchedIDs ON #tmpSchedIDs (ID) WITH FILLFACTOR=100
-- (tried primary key clustered also; it performs about the same)
SELECT field1, field2, etc.
FROM#tmpSchedIDs tmp
JOIN Schedule s ON (tmp.ID = s.ID)
(plus other joins)
WHERE (lots of criteria)
...the idea being that the date range is (probably) the slowest and most selective of my criteria, so once that's out of the way, the smaller rowset will perform better. My criteria in the latter query will often look at a number of columns, many of which are individually indexed. There are too many columns and too many different combinations to successfully cover them all with covering indexes. I suppose SQL looks up the IDs in my clustered index and then for each row, pulls out the columns it needs for the WHERE.
So, to summarize my questions:
- For my situation, does it seem like a clustered on SchedDateTime or (SchedDateTime, ID) would be helpful, instead of on ID?
- Should I move my primary key into a regular clustered index so I can cover additional columns?
- Is my nonclustered on SchedDateTime set up correctly? For example, should I change it to (SchedDateTime, ID, ActiveSchedule, Status)? I would think SQL needs to look up IDs by date range for my queries, and the other fields may be getting in the way with the current ordering.
- Is there a better way than using my temp table code above -- does SQL 2005 do something like this automatically under-the-covers, if I use a certain syntax?
Any thoughts are welcome. Thanks a bunch!
Jordan
P.S. I realize that the SchedDateTime is a NULLable column, so I would need to change that in order to make it clustered. There are only 2 rows (out of millions) which have nulls, and that can be corrected.

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply

jordantx Mr or Mrs. 500 Points: 588 More actions · Answer 1

I'd also like to invite comments on the "hot spot" issue I've read about elsewhere: With a clustered index, all new rows are written to the same page, causing page locks and holding up all CRUD on the most recent records until the INSERT completes. In my case, a lot of users are simultaneously adding new Schedule data to the system, and performing updates to the most recent Schedule records. Could this explain why my new clustered on ID seems to be causing more timeouts & deadlocks than the previous nonclustered on ID? (I can't prove this, because I don't have a baseline for the timeouts before my index was added.)

GSquared SSC Guru Points: 260824 More actions · Answer 2

The "hot spot" thing has been fixed for a while. I think since SQL 2000. Not so much of an issue these days.

I'd say the bigger issue to avoid in a table with that many inserts per day and significant column width variability, is page splits, not page writes. Clustering on an identity value will probably help with that.

Are you in a position to create a test copy of the table and run load testing on it? Replay traces from production on a test box, after trying various configurations on the clustered index, and see what you get. All those varchar columns are going to create all kinds of page allocation randomity, so a real copy of the table with real transactions in it is going to be your best bet for optimizing.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

Welcome back, Jordan. You've come a long way since your last discussion. Glad to see your interest hasn't waned when you saw the size of the hill you've elected to climb. 🙂

Let's see what we can do for ya. Regarding hotspots: As mentioned above, don't worry about it so much. There's not a lot you can do about it anyway, even if it were still an issue. It was more an awareness then it was a reasonably fixable problem. (Breaking up your cluster to force page splits was not a reasonable fix to me. :hehe: )

jordantx (11/17/2010)
I have a Schedule table which looks something like this:
CREATE TABLE Schedule (
ID int IDENTITY(1,1) NOT NULL,
ClientID int NULL,
EmployeeID int NULL,
OfficeID int NULL,
ActiveSchedule bit,
Status varchar(5),
SchedDateTime smalldatetime NULL,
DataField1 varchar(50) NULL,
DataField2 varchar(50) NULL,
etc.
CONSTRAINT [PK_Schedule] PRIMARY KEY CLUSTERED
( [ID] ASC
) PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
CREATE NONCLUSTERED INDEX [IX_Schedule_SchedDateTime] ON [dbo].[Schedule]
( [SchedDateTime] ASC ) INCLUDE ( [ActiveSchedule], [Status], [ID])

Let's start with this. I personally would not make your Primary Key clustered here. I *would* create a clustered index on just the ID field, especially with the # of indexes you describe you carry. I would also create a non-clustered Primary Key that makes sense. In this case I believe it's ClientID, EmployeeID, OfficeID, and SchedDateTime.

The reason for this is your Primary Key is a business key (there's a lot of terms for the same thing, bear with). While you *can* in theory link all over the place with this, it's primary purpose here it to identify a row with business logic. It's also to help you make sure you don't dupe someone's schedule somewhere. Two for one in this instance.

Another thing I noticed is there's no explicit padding for this table. This means it's using default (90%). If your structure is as wide as you describe, how many rows are you really expecting to get per page? What's your row-width, max and average? That'll help with that component, especially if we start considering a Clustered Index that's not just the ID column. However, that's blind without usage information. I'm going to try to explain my thought process, even though it's somewhat spagetti-ized, here.

For the majority of queries against this table, we're looking at a date range with SchedDateTime, but with several other criteria -- such as JOINs against Client, Employee and Office tables (yes we have FK constraints), plus WHERE clauses with ActiveSchedule flag, and often several more criteria. There are also many procs where we query the table without using SchedDateTime.

Under 99% of the time, I could see you querying this table and having to link to these three tables, which selections of data from the sub tables (where claused). Specific employee, or specific office, or specific client... and/or combination of the above. What you want to do is try to figure these out in 1st, 2nd, and 3rd priority order. If you have a consistent one, and a consistent calling method, these three fields become (in my mind) viable as the clustered key. If you're coming in 1/3 of the time against each and ignoring the other two... not so much. Remember order matters.

I would however contemplate a tripleset of non-clustered, depending on the circumstances. Again this is a usage evaluation scenario. If every time I looked up an employee I also used a date range, I'd create a non-clustered in that combination (employee first). If I occassionally called Attribute4 while I did it, I wouldn't include it. If I always wanted Attr3 and Attr5 though, I would at least include them in the non-clustered, or perhaps even in the B-Tree if they got selected on sometimes, too.

You'll note this comes down to evaluating your actual data calls. Theory is all well and good, but you have to have hard statistics of access methods to truly decide what's best. Which means, you run a trace on the system and trap the calls. Then you evaluate them to figure out who's doing what, when, and how often.

The table has 5.6 million rows and is growing quickly. It's also very wide with 75 columns, many of which are VARCHARs.

This alone isn't a problem, either the statistics or the columns. The next part though...

I'm looking into partitioning and normalization (removing some columns into sub-tables), but the table is referenced in over 1000 pieces of code, so these changes will take awhile to implement.

Implies either data-duplication or that there's load concerns. Load concerns can be handled with better index usage (which, btw, does not fall under the header of 'one NC index per column'). It's not a bad idea though if you move columns with less than 5% usage off to another structure. IE: Out of 100 calls, this piece of data is only seen 5 times. When you use that thought process, though, keep ALL the items down to 5%. So if you shove 20 columns in there, make sure it's 5% across all 20 together, not 5% each... or you've defeated your purpose.

Btw, 5% is a personal rule of thumb, your milage may vary. You won't find that anywhere but my head.

The majority of days have at least 4000-5000 schedule records. The days with the most have over 9000 records. Many of these records will have the same exact value for SchedDateTime because the Employee begins scheduled work at, say, 9:00 A.M.

5k out of 5M? 1/1000... 0.1% selectivity on date. Might be more selective if you include times, but as you mention, a lot of repeats. Doubled to 0.2% selectivity for your 9-10k record days. A date alone will not guarantee seeks here. Most likely, yes, but not guarantee, especially since you'll likely pull a range.

After posting the thread referenced above, I added the primary key on ID, because there was no clustered index at the time. Now I am wondering if I should've instead made a clustered index on SchedDateTime, or perhaps on (SchedDateTime, ID).
I know one of the criteria for a clustered is selectivity/uniqueness... Does 5000-9000 records count as unique?

No. You're discussing two pieces at once here. Selectivity is the density of the same value across the recordset. It's how high a percentage does *1* value (or value combination, in the case of a composite key that's fully used) take up of the entire dataset. Uniqueness is literally that. Is the column forced unique. If not:

(From Chad Boyd's blog at msdn. http://blogs.msdn.com/b/chadboyd/archive/2007/04/08/non-unique-clustered-index-and-duplicate-value-limits.aspx

Many of you probably already realize that if you don't explicitly create a unique clustered index in Sql Server, the engine will unique-ify the clustered index for you anyhow, since it's used for pointers to data in nonclustered indexes.

You could leave it as a heap for the same results from the concept of the lookup value width. Actually, the heap would be a smaller lookup key. So realize the two concepts are different. That uniqueness is why you want to stick with the surrogate key (the Identity) unless you're ready to radically depart from it for your clustered index and live with the results for performance reasons... and get rid of 75 probably barely used single column indexes that have to stay maintained with your clustered index.

Oh also, the user can change the SchedDateTime value at any time, and while it's not extremely frequent, the values are modified somewhat often. Yet another reason not to use a clustered on date, I realize -- but I really just want to confirm that I've made the right choice (with clustered on ID).

Until you have solid usage statistics, I would say yes.

To optimize existing queries which perform date range lookups, I've been coding something like this:
<snip some Temp table reusage code>
...the idea being that the date range is (probably) the slowest and most selective of my criteria, so once that's out of the way, the smaller rowset will perform better.

Than what? How is it both the slowest, and the most selective? You've got me confused with this sentence. In theory, yes, the smaller rowsets will behave better after that. However, if your execution plan behaves itself, the query itself will be working only with the smaller rowset, so you could skip the IO step to the Temp Table.

My criteria in the latter query will often look at a number of columns, many of which are individually indexed. There are too many columns and too many different combinations to successfully cover them all with covering indexes.

But how often are they being used then ANYWAY? My guess is you're doing heavy scans under almost all circumstances, instead of the optimizer choosing a single column of these and heading back to the Heap/Cluster for the rest and then re-filtering. A few of these columns are probably highly selective enough for the optimizer to think a bookmark lookup is worth it. All of them? Highly unlikely. Something else to analyze via usage statistics.

I suppose SQL looks up the IDs in my clustered index and then for each row, pulls out the columns it needs for the WHERE.

After a fashion.

So, to summarize my questions:
- For my situation, does it seem like a clustered on SchedDateTime or (SchedDateTime, ID) would be helpful, instead of on ID?

No, not yet. If you DID do it, I would see a clustered on the Primary Key described above with SchedDateTime as the last component being the alternative to consider.

- Should I move my primary key into a regular clustered index so I can cover additional columns?

No. Primary Key and Clustered index may be the same thing, but should be evaluated independently.

- Is my nonclustered on SchedDateTime set up correctly? For example, should I change it to (SchedDateTime, ID, ActiveSchedule, Status)? I would think SQL needs to look up IDs by date range for my queries, and the other fields may be getting in the way with the current ordering.

For the way you're using it with the temp table structure and selections... yes. The other fields will not get in the way of the order, they are simply stored along with the data at the leaf level of the index. If it needs a column not listed in the B-Tree (indexed columns), the Clustered Index (which travels around all the indexes), or the INCLUDE (leaf level storage only), then yes, it will lookup back to the Cluster.

- Is there a better way than using my temp table code above -- does SQL 2005 do something like this automatically under-the-covers, if I use a certain syntax?

Yes, it will, but it depends on what you're trying to do against which index, and the vagaries of the optimizer. It's usually taken on a case by case basis after you've done generic evaluation.

P.S. I realize that the SchedDateTime is a NULLable column, so I would need to change that in order to make it clustered. There are only 2 rows (out of millions) which have nulls, and that can be corrected.

Nulls are allowed in a clustered index, just not preferred. Since you're removing your NULLs anyway, NOT NULL the column with an ALTER TABLE ALTER COLUMN statement and get a little better performance.

To prove it to yourself:

CREATE TABLE #blah ( xyz INT NULL)

INSERT INTO #blah VALUES (NULL)

INSERT INTO #blah VALUES (2)

CREATE CLUSTERED INDEX idx_blah on #blah (xyz)

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

jordantx Mr or Mrs. 500 Points: 588 More actions · Answer 4

Craig Farrell (11/17/2010)Welcome back, Jordan. You've come a long way since your last discussion. Glad to see your interest hasn't waned when you saw the size of the hill you've elected to climb. 🙂

Thanks! Like Pinnochio, I want to be a "real" DBA someday. 🙂 It seems there's a lot more to learn than I realized. I've been working with MSSQL since version 6.0 and have substantial experience, but you folks take it to a whole new level entirely.

Another thing I noticed is there's no explicit padding for this table. This means it's using default (90%). If your structure is as wide as you describe, how many rows are you really expecting to get per page? What's your row-width, max and average? That'll help with that component, especially if we start considering a Clustered Index that's not just the ID column.

Does padding mean the same thing as fillfactor? Sorry for my ignorance.

My average row width is 328 bytes, and the maximum is 1094 bytes.

However, that's blind without usage information.

I have many hundreds of sprocs accessing this data. I'd like to search through all my code and get a list of the most commonly used columns in WHERE clauses, JOINs and SELECT list. But I haven't found a tool out there to do this. I've been looking at some YACC-based tools which might help.

I would however contemplate a tripleset of non-clustered, depending on the circumstances. Again this is a usage evaluation scenario. (...) If I always wanted Attr3 and Attr5 though, I would at least include them in the non-clustered

Not sure what a tripleset is. I googled it and this thread is the first hit! 🙂

Anyway I think I understand what you are getting at. As I've read elsewhere, with my clustered on ID, it's as though every column in the table is at the leaf level (so to speak), and SQL has to retrieve the whole row in order to read individual attribute columns from it. I understand that this can perform faster using a nonclustered index, if you know which specific columns you'll be retrieving, and you're not trying to read back every column in the table. (Question: Will attribute columns also be returned faster after a non-clustered lookup, even in cases where they're not INCLUDEd in any index?)

In fact, in my current testing I've discovered that my temp table methodology is not working well at all for larger result sets. SQL is performing a Clustered Index Scan (in the final query of my code) against my clustered index on ID, which being a primary key has no covering columns. I'm willing to bet that certain queries in this system were running faster before I came along with my newfangled clustered index -- the queries were SELECTing columns which had individual, single-column nonclustered indexes on them, and SQL was using all the indexes together to form a result.

So I've learned something from this: A clustered index is not a magic bullet for performance. I still need my attribute columns to be covered in nonclustered indexes if I want to read them back quickly, especially for large rowsets, and especially for a wide table like this one... right?

Load concerns can be handled with better index usage (which, btw, does not fall under the header of 'one NC index per column')

Yes, as I mentioned above, this was the method used by the previous developer. There were no composite keys on most of the tables, just nonclustered indexes on individual columns. And usually no clustered index. My job now is to put something sensible in place.

My criteria in the latter query will often look at a number of columns, many of which are individually indexed. There are too many columns and too many different combinations to successfully cover them all with covering indexes.
But how often are they being used then ANYWAY? My guess is you're doing heavy scans under almost all circumstances, instead of the optimizer choosing a single column of these and heading back to the Heap/Cluster for the rest and then re-filtering. A few of these columns are probably highly selective enough for the optimizer to think a bookmark lookup is worth it. All of them? Highly unlikely. Something else to analyze via usage statistics.

Sounds right to me. But recently I've tried commenting out my WHERE clause, and discovered that if I'm including a lot of columns in the SELECT list and they aren't covered in a non-clustered index, it takes a lot longer (especially when I'm pulling back 50,000-200,000 rows). So I need a way to resolve this part of the issue.

I've re-read your post a few times and you've been quite helpful; now I just need to figure out what to do about the existing code, which tends to include 20+ columns from the Schedule table in the SELECT list, around half of which might have single-column indexes on them.

In fact, in my current testing I've reduced one large query down to this:

CREATE TABLE #tmpSchedIDs (ID int not null)

INSERT #tmpSchedIDs

SELECT s.ID

FROM Schedule s

WHERE

s.SchedDateTime >= @FromDate AND

s.SchedDateTime < @ToDate

CREATE NONCLUSTERED INDEX IX_tmpSchedIDs ON #tmpSchedIDs (ID) WITH FILLFACTOR=100

-- (tried primary key clustered also; it performs about the same)

SELECT ID, ClientID, EmpID

FROM #tmpSchedIDs tmp

JOIN Schedule s ON (tmp.ID = s.ID)

...and even with just 3 columns coming back, I am getting a clus index scan of the entire table!

(Please note that the Schedule table has non-clus indexes on both ClientID and EmpID (albeit with no covered/included columns), but the optimizer isn't using them at all in this case. Also note that PK_Schedule is my clustered on ID.)

This is confusing -- Why is it scanning so many rows? Why doesn't SQL first reduce the number of rows to those which overlap with the temp table, and then loop through only those rows to pull the ClientID/EmpID values? (This is why I have a Grasshopper tag -- I should know the answer to this by now, but it eludes me.)

Next, I create a nonclustered index on ID:

create nonclustered index IX_SchedClientEmp on Schedule (ID)

include (ClientID, EmpID)

...and now it runs very quickly & uses the above index automatically. Which is great, but still I'm concerned about the queries where I'm SELECTing 20+ columns from Schedule which may not be in any index.

Thanks so much for the lengthy and educational response! Let me know if you have any further thoughts.

Regards

Jordan

jordantx Mr or Mrs. 500 Points: 588 More actions · Answer 5

Followup: As a test, I dropped the composite nonclustered index (the one mentioned near the end of my last post above), and while leaving the PK clustered on ID only, I created a new nonclustered index on ID only:

create nonclustered index IX_SchedID on Schedule (ID)

...and then ran the SQL in my most recent post -- and guess what? No more clustered index scan! Instead, the optimizer uses the brand new index, and narrows down the results before hitting the clustered index for key lookups.

I'm sure this is all elementary stuff to the consummate experts on here, but to me it's still a bit baffling. Why does having the same index twice, both clustered & nonclustered on only ID column, perform better than having only the clustered?

Here's the result after adding the above index:

(In the "Index Seek" image, I've obscured the index name, but it's the same as my IX_SchedID found above.)

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 6

jordantx (11/18/2010)
Does padding mean the same thing as fillfactor? Sorry for my ignorance.

Yep. I use bad slang on occassion, sorry. I'll try to stop doing it in our discussions, all I'll end up doing is confusing you more, and you're probably already suffering from information overload. 🙂

My average row width is 328 bytes, and the maximum is 1094 bytes

Carry the 4... add the 8... Use a fillfactor of about 85%, that'll leave you room for a full row, and 3 average rows, before you have to deal with page splits. 1094/8000 for your row%/page estimate, I round to be conservative.

I have many hundreds of sprocs accessing this data. I'd like to search through all my code and get a list of the most commonly used columns in WHERE clauses, JOINs and SELECT list. But I haven't found a tool out there to do this. I've been looking at some YACC-based tools which might help.

This one's tricky, however, I think the line about 'How do you eat an elephant?' applies here. Don't try to get 'em all. Go for your top ten. Run a trace to track calls, drop that to a table, including all the ad-hocs. If you don't have many ad-hocs you can let that go (you hope this is the case). After that, just get a count on who your top ten callers are, and perhaps top ten in time. Concentrate on these. Find out which ones hit the table in question, then determine how they're hitting it. The parameters used in the trace textdata will help you maneuver any if logic you might run into.

Not sure what a tripleset is. I googled it and this thread is the first hit! 🙂

Yeaaaa.... Sorry 'bout that. More slang. Just meant 3 indexes approaching the same columns in different orders, to try to get the most bang for your buck. When you pull this stunt though you really need to keep those indexes tight. They bloat fast.

As I've read elsewhere, with my clustered on ID, it's as though every column in the table is at the leaf level (so to speak), and SQL has to retrieve the whole row in order to read individual attribute columns from it.

Information can become unwieldy with the Clustered Index. A lot of folks end up approaching it backwards. Think of it this way. A table with no indexes is a heap, right? It's just rows of data on internal memory (pages/extents, one thing at a time...). If you use a non-clustered index, it dupes whatever columns you add to it into another physical memory table that you can't directly access, with a search tree layered on top of that. The clustered index doesn't rebuild this table, it sits directly ON the main table's physical storage. It orders the data on the main physical table, and then sits a B-Tree on top of the main storage for searching.

Hopefully that helps with the confusion there. So, yes, any time you pull a record from the Clustered index, it needs to load the entire row to memory. There is a LOT more to it under the hood than that. It actually pulls the page to memory, and finds the row on the page. The Clustered index/heap, however, has a lot more pages than most NC indexes. Thus, more disk reading, more memory, etc etc...

I understand that this can perform faster using a nonclustered index, if you know which specific columns you'll be retrieving, and you're not trying to read back every column in the table. (Question: Will attribute columns also be returned faster after a non-clustered lookup, even in cases where they're not INCLUDEd in any index?)

I believe you've got your understanding on this a little off. Any column that's part of an index is also included at the leaf level as though it was INCLUDEd. So, if I make the index: CREATE INDEX idx_test ON test (colA, colB) INCLUDE (colC, colD). ColA and ColB make up the search tree in the B-Tree nodes. colA, colB, colC, and colD are all included at the final leaf level of that index. Again, think of NC Indexes as non-accessible tables, with their data always in sync with the main table.

So, armed with that knowledge, we can answer your question. If the attribute is included in the index, it will be returned fastest. If the rows are properly located via the NC index, and then a lookup is performed back to the Clustered to get the attribute not INCLUDEd, this is fast-ish. If it has to scan the entire Clustered Index because it just has no other 'good' way, this is slowest. The optimizer will make these choices for you without your input, unless you use hints. Avoid hints for now.

In fact, in my current testing I've discovered that my temp table methodology is not working well at all for larger result sets.

This doesn't necessarily surprise me.

SQL is performing a Clustered Index Scan (in the final query of my code) against my clustered index on ID, which being a primary key has no covering columns.

Clustered index, by definition, is completely covering. That's why it's the default index if the optimizer's not sure. It's basically doing a direct table scan.

I'm willing to bet that certain queries in this system were running faster before I came along with my newfangled clustered index -- the queries were SELECTing columns which had individual, single-column nonclustered indexes on them, and SQL was using all the indexes together to form a result.

Nope. You get one index (two if it's doing a bookmark lookup, and is part of the mechanic) per table per join.

So I've learned something from this: A clustered index is not a magic bullet for performance. I still need my attribute columns to be covered in nonclustered indexes if I want to read them back quickly, especially for large rowsets, and especially for a wide table like this one... right?

Correct, but you want to tailor them to specific needs.

But recently I've tried commenting out my WHERE clause, and discovered that if I'm including a lot of columns in the SELECT list and they aren't covered in a non-clustered index, it takes a lot longer (especially when I'm pulling back 50,000-200,000 rows). So I need a way to resolve this part of the issue.

This is where controlled application of new, properly covering indexes based on usage will help you out tremendously.

I've re-read your post a few times and you've been quite helpful; now I just need to figure out what to do about the existing code, which tends to include 20+ columns from the Schedule table in the SELECT list, around half of which might have single-column indexes on them.
In fact, in my current testing I've reduced one large query down to this:
CREATE TABLE #tmpSchedIDs (ID int not null)
INSERT #tmpSchedIDs
SELECT s.ID
FROM Schedule s
WHERE
s.SchedDateTime >= @FromDate AND
s.SchedDateTime < @ToDate
CREATE NONCLUSTERED INDEX IX_tmpSchedIDs ON #tmpSchedIDs (ID) WITH FILLFACTOR=100
-- (tried primary key clustered also; it performs about the same)
SELECT ID, ClientID, EmpID
FROM #tmpSchedIDs tmp
JOIN Schedule s ON (tmp.ID = s.ID)

This is actually equivalent of you having not used the temp table at all. It still needs to scan for the dates in question. You'd probably have a better shot in the second one just using your date where clause.

In this case, the following index would be your best bet:

CREATE NONCLUSTERED INDEX idx_Foobar ON <table> (SchedDateTime) INCLUDE ( ClientID, EmpID)

This is confusing -- Why is it scanning so many rows? Why doesn't SQL first reduce the number of rows to those which overlap with the temp table, and then loop through only those rows to pull the ClientID/EmpID values? (This is why I have a Grasshopper tag -- I should know the answer to this by now, but it eludes me.)

The Scan is actually how it's looping through those rows to find the dates. It's decided the scan is cheaper than going to an index, getting bookmarks, and then doing a lookup on the main table.

create nonclustered index IX_SchedClientEmp on Schedule (ID)

include (ClientID, EmpID)

This is an equivalent to your Clustered Index. It defeats your purpose unless you're just pulling back ClientID and EmpID based on a where clause of ID = @a and you needed a tighter pull from the drive.

Second post question:

...and then ran the SQL in my most recent post -- and guess what? No more clustered index scan! Instead, the optimizer uses the brand new index, and narrows down the results before hitting the clustered index for key lookups.

The short form is that because the optimizer decided it was cheaper to do the lookup than to do the scan. Has to do with selectivity, usually. I'm not necessarily surprised by the NC index's usage, I'm rather shocked that it decided to seek off one and not on the other. It makes a twisted kind of sense though, but one I'd have to reason through and possibly ask one of the other gurus here to help out with.

Index selection mechanics are part art form, and part mystic vagary... unless your name is Paul White. 😀 If you fire up a new post specific to that question with your details, and a way to test and show the occurence (IE: sample data and the like), I'm sure some folks will be very intereted to dig into that, and probably find an answer. At a guess, it's got to do with page size selection at the leaf level determining the B-Tree seek was more expensive than the scan, since it felt it was probably going to do a scan anyway.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 7

Craig Farrell (11/18/2010)
If the rows are properly located via the NC index, and then a lookup is performed back to the Clustered to get the attribute not INCLUDEd, this is fast-ish.

Depending on the number of rows that need key lookups. Lookups are done one row at a time and can get very slow if the optimiser misguesses the number of rows affected

I'm willing to bet that certain queries in this system were running faster before I came along with my newfangled clustered index -- the queries were SELECTing columns which had individual, single-column nonclustered indexes on them, and SQL was using all the indexes together to form a result.
Nope. You get one index (two if it's doing a bookmark lookup, and is part of the mechanic) per table per join.

SQL is capable of using more than one index per table in a query. It's called an index intersection. They're a little expensive, so typically not something you'll see for smaller tables.

Whether it's something you want is another matter

http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 8

GilaMonster (11/18/2010)
Depending on the number of rows that need key lookups. Lookups are done one row at a time and can get very slow if the optimiser misguesses the number of rows affected

Hm, I had though the key lookup would seek the index, for groups, not single row loop. Interesting. I'll have to explore more deeply into that. I had it stuck in my head that it would hash for some reason.

SQL is capable of using more than one index per table in a query. It's called an index intersection. They're a little expensive, so typically not something you'll see for smaller tables.
Whether it's something you want is another matter
http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

I've got some reading to do, I don't think I've ever seen it do that.

Thanks Gail!

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 9

Craig Farrell (11/19/2010)
Hm, I had though the key lookup would seek the index, for groups, not single row loop. Interesting. I'll have to explore more deeply into that. I had it stuck in my head that it would hash for some reason.

One row at a time, with a nested loop join.

http://sqlinthewild.co.za/index.php/2009/01/27/a-bookmark-lookup-by-any-other-name/

and maybe

http://sqlinthewild.co.za/index.php/2009/01/09/seek-or-scan/ (I don't have any blog posts just on key lookups)

When you look at an execution plan, look at the execute count of the key lookup.

I've got some reading to do, I don't think I've ever seen it do that.

Not common, but possible. Depends whether optimiser thinks that the multiple seeks + joins will be better than a table scan.

As for reading, I put the link to the blog post in for a reason. 😉

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 10

Craig Farrell (11/18/2010)
I'm rather shocked that it decided to seek off one and not on the other. It makes a twisted kind of sense though, but one I'd have to reason through and possibly ask one of the other gurus here to help out with.

Would someone like to help me in untangling the 'mystic vagary' by posting the indexes in question, the query in question and if not the execution plan, a screen shot of the properties of the temp table scan, index seek and key lookup? (the thread's a little on the long side and I don't have all that much time to spend on this)

I'd prefer the entire execution plan so that I can look at properties which are not exposed by the SQL 2005 management studio (estimated and actual execution count)

How many rows total in the table?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Paul White SSC Guru Points: 150468 More actions · Answer 11

Craig Farrell (11/19/2010)
Hm, I had though the key lookup would seek the index, for groups, not single row loop. Interesting. I'll have to explore more deeply into that. I had it stuck in my head that it would hash for some reason.

You're describing Block Nested Loops join there.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 12

jordantx (11/18/2010)
As a test...while leaving the PK clustered on ID only, I created a new nonclustered index on ID only (...) Why does having the same index twice, both clustered & nonclustered on only ID column, perform better than having only the clustered?

You can find a detailed explanation here: http://www.sqlservercentral.com/articles/paging/69892/

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 13

jordantx (11/17/2010)
The table has 5.6 million rows and is growing quickly. It's also very wide with 75 columns, many of which are VARCHARs. I'm looking into partitioning and normalization (removing some columns into sub-tables), but the table is referenced in over 1000 pieces of code, so these changes will take awhile to implement.

Jordan,

I believe your priority should be to appropriately normalize the table, and create good basic indexes on the results. Once you have a more solid design, tuning will be much, much easier. The approach I have taken in the past is:

1. Create the normalized table design, including indexes which will be useful to the majority of your existing queries.

2. Copy the data from the source table to the new normalized structure.

3. Rename the original table

4. Create a view over the normalized structure, giving the view the same name as the original table.

5. Check that the view returns exactly the same data as the original table.

Naturally, you will want to thoroughly test your new design and data migration scripts/packages on your development and test systems before making the changes to production in planned downtime. Be particularly sure to test any external processes that access the table - some older code does not like being fooled into thinking that the view is actually a table.

Once the view is in place and working correctly, you can begin the process of rework to eventually remove the need for the view.

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 14

All that said, until you can do the migration, these are my thoughts given the information provided so far:

1. My choice for unique clustered index would be (SchedDateTime, ID). This will remove the need for the temp-table thing you are doing right now. You should still have primary and candidate keys enforced separately, of course.

2. Partitioning (in 2005) will almost certainly add to your performance problems. My advice would be to give this idea a low priority, and preferably delay it until you upgrade to 2008.

3. Rather than trying to solve all performance problems at once, prioritize important queries/procedures. Examine actual execution plans to check for any wide disparity in estimated and actual row counts. There are established ways to work around most of these types of issues, which we can help you with.

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi