Index usage and RID lookup.

Question

Index usage and RID lookup.

ben.brugman

SSChampion

Points: 13369
More actions
January 7, 2013 at 6:23 am

#268919

Hello All,
An index with three fields, A,B,C
(
A selection with
SELECT * FROM A_TABLE WHERE A = 6 AND C =7
performance question:
Does this read all rows from A with a 6 in it. So RID lookup of all A = 6 rows?
Or does it use in the index the C field to 'filter' out only the C fields with the value?
(Is the filtering of the C part done on index level or on row level ?)
Some code has been attached for testing, with an example optimised for the index and an example as above. Offcourse the optimised query runs much faster, but this does not anwser my question.
The code can be tuned easely to produce more rows or another distribution of rows.
(The code as supplied produces (112669 row(s) affected))
(Resultset (30 row(s) affected))
-- Example:
-- Ben Brugman
-- 20130107
--
-- Altered from the first Posting, NOW heap is default.
-- First few anwsered to this mail had the cluster as default.
--
-- Example to test index behavior.
-- Runs as supplied. (Heap table is measured).
-- Comment out or run relevant parts.
--
/* BUILD CLUSTER */
/* CLUSTER */
/* CLUSTER */
/* To prevent any potential data loss issues, you should review this script in detail before running it outside the context of the database designer.*/
BEGIN TRANSACTION
GO
CREATE TABLE dbo.A_Table
(
A int NOT NULL,
B int NOT NULL,
C int NOT NULL,
Text_field varchar(300) NULL,
Number_off_other_fields varchar(300) NULL
) ON [PRIMARY]
GO
ALTER TABLE dbo.A_Table ADD CONSTRAINT
PK_A_Table PRIMARY KEY CLUSTERED
(
A,
B,
C
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
ALTER TABLE dbo.A_Table SET (LOCK_ESCALATION = TABLE)
GO
COMMIT
--
-- Script creates the A_table twice first as heap second as clustered table.
-- Run the appropriete parts of the script. (Select parts or comment out parts).
--
drop table A_Table
/* BUILD HEAP */
/* HEAP */
/* HEAP */
/* To prevent any potential data loss issues, you should review this script in detail before running it outside the context of the database designer.*/
BEGIN TRANSACTION
GO
CREATE TABLE dbo.A_Table
(
A int NOT NULL,
B int NOT NULL,
C int NOT NULL,
Text_field varchar(300) NULL,
Number_off_other_fields varchar(300) NULL
) ON [PRIMARY]
GO
ALTER TABLE dbo.A_Table ADD CONSTRAINT
PK_A_Table PRIMARY KEY NONCLUSTERED
(
A,
B,
C
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
ALTER TABLE dbo.A_Table SET (LOCK_ESCALATION = TABLE)
GO
COMMIT
/*GENERATE DATA*/
/*GENERATE DATA*/
/*GENERATE DATA*/
;
WITH
L0 AS(SELECT 0 AS c UNION ALL SELECT 0), -- 2
L1 AS(select 0 as x from L0 A, L0 b), -- 2 ^2 = 4
L2 AS(select 0 as x from L1 A, L1 b), -- 4 ^2 = 16
L3 AS(select 0 as x from L2 A, L2 b), -- 16 ^2 = 256
L4 AS(select 0 as x from L3 A, L3 b), -- 256 ^2 = 65536
L5 AS(select 0 as x from L4 A, L4 b), -- 65536 ^2 = ruim 4 E9
L9 AS(Select *, row_number() OVER(PARTITION BY x order by x ) as nr from L5) -- voeg rijnummers toe
insert into A_Table
SELECT
TA.NR,
TB.NR,
TC.NR,
CONVERT(VARCHAR(300),TA.NR)+' '+
CONVERT(VARCHAR(300),TB.NR)+' '+
CONVERT(VARCHAR(300),TC.NR)+' ',
'sum '+CONVERT(VARCHAR(300),TA.NR+TB.nr+TC.nr)+' product '+ CONVERT(VARCHAR(300),TA.NR*TB.nr*TC.nr)
from L9 TA, L9 TB, L9 TC where
TA.nr*TB.nr <1000 and -- change numbers to
TA.nr*TC.nr <1000 and -- change number of rows
Tb.nr*TC.nr <1000 and -- or influence distribution
TA.nr <1000 and
TB.nr <1000 and
TC.nr <1000
-- the overdetermined where clause is more efficient than 'short' where clause.
select top 3000 * from A_Table
SET STATISTICS TIME ON
SET STATISTICS IO ON
/*BENCHMARK OR TEST*/
/*BENCHMARK OR TEST*/
/*BENCHMARK OR TEST*/
dbcc dropcleanbuffers
dbcc freeproccache
--
-- selection in order of the fields.
--
select * from A_Table where A = 33 and B = 26 -- Optimised for the index.
dbcc dropcleanbuffers
dbcc freeproccache
--
-- selection of fields which are not in the order of the index
--
select * from A_Table where A = 33 and C = 26 -- Not optimised for th index
drop table A_Table -- Clean up

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 1

Actually, with the code you posted there is no difference. You index is a clustered index meaning that both queries use the clustered index. Looking at the actual execution plan, both do a clustered index scan (which is also a table scan as the clustered index is the table).

ben.brugman SSChampion Points: 13369 More actions · Answer 2

Lynn Pettis (1/7/2013)
You index is a clustered index meaning that both queries use the clustered index.

The code for a Heap situation is included in the code as wel.

I should have made the Heap the default, sorry for that.

(heap is build, dropped, cluster is build. commenting out the drop and the cluster build will give the Heap situation).

With the size's provided the cluster does do a clustered search but there is a difference in efficiency. Because there is no other index on the cluster the access will offcourse be a cluster scan. (complete or part).

But cluster should not have been de default in the script. :doze:

I included the clustering, to check for performance on that as well.

Thanks for your time and attention,

Sorry to have choosen the wrong default.

Ben Brugman

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 3

ben.brugman (1/7/2013)
Lynn Pettis (1/7/2013)
You index is a clustered index meaning that both queries use the clustered index.
The code for a Heap situation is included in the code as wel.
I should have made the Heap the default, sorry for that.
(heap is build, dropped, cluster is build. commenting out the drop and the cluster build will give the Heap situation).
With the size's provided the cluster does do a clustered search but there is a difference in efficiency. Because there is no other index on the cluster the access will offcourse be a cluster scan. (complete or part).
But cluster should not have been de default in the script. :doze:
I included the clustering, to check for performance on that as well.
Thanks for your time and attention,
Sorry to have choosen the wrong default.
Ben Brugman

Taking a closer look at your code I see that now. You tell us to run your code from further below. Sorry, I missed that. I did a copy/paste/run of your code as is. Perhaps you should put your test code where it is needed instead of relying on us to move code around for you.

ben.brugman SSChampion Points: 13369 More actions · Answer 4

Lynn Pettis (1/7/2013)
Taking a closer look at your code I see that now. You tell us to run your code from further below. Sorry, I missed that. I did a copy/paste/run of your code as is. Perhaps you should put your test code where it is needed instead of relying on us to move code around for you.

As you said I should have taken the 'other' default.

I changed it in the original posting so now the heap is default.

A copy/paste/run now works better.

Thanks for your advice,

Ben Brugman

Lynn Pettis SSC Guru Points: 442467 More actions · Answer 5

Your code still only runs the test code for one of the indexes not both. If I want the execution plans for the queries using the clustered index now I still have to copy your code around to do it.

Both queries using the nonclustered index use the index (although the second query says an index on ColA, ColC would help). The second has to read more data from the index before going to the heap for further data which is why it probably runs a bit slower than the first.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 6

ben.brugman (1/7/2013)
Does this read all rows from A with a 6 in it. So RID lookup of all A = 6 rows?

No. It reads all rows from the index where A=6, filters them as it reads for C =7. Once it has all rows that match all conditions that have columns within the index, if the index is not covering the lookups are done to fetch the rest of the columns.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

ben.brugman SSChampion Points: 13369 More actions · Answer 7

Lynn Pettis (1/7/2013)
Your code still only runs the test code for one of the indexes not both. If I want the execution plans for the queries using the clustered index now I still have to copy your code around to do it.
Both queries using the nonclustered index use the index (although the second query says an index on ColA, ColC would help). The second has to read more data from the index before going to the heap for further data which is why it probably runs a bit slower than the first.

Sorry that I am not able to tell my question more clearly,

I am not asking the difference between two types of indexes,

but my question is how with a A,B,C index an index is processed if only A and C are used in the Where clause.

Is the selection of the C done in the index tree?

Or is the selection of the C done after the pages have been read in to memory as a filter?

The example was given as a starting point, because often code is asked. The question is independend of the code, but the code is used to clearify (not succeeding) the question and providing a starting point.

It is possible that because of the B field missing in the selection the C field is not used while reading the rows into memory. And that the wrong C rows are filtered out as an 'afterthought'.

It is also possible that at a specific moment while using the index the engine can exclude rows on that they do not fit the C.

Both this may also be dependend on the size of the table or de index. The example produces over 100 000 rows (easely changed), but results might differ for differenc sizes of rows and/or tables.

Hope this does make sense,

Ben Brugman

ben.brugman SSChampion Points: 13369 More actions · Answer 8

GilaMonster (1/7/2013)
No. It reads all rows from the index where A=6, filters them as it reads for C =7. Once it has all rows that match all conditions that have columns within the index, if the index is not covering the lookups are done to fetch the rest of the columns.

Thanks this anwsers my question,

Ben,

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 9

ben.brugman (1/7/2013)
Or is the selection of the C done after the pages have been read in to memory as a filter?

All processing is done in memory, regardless of whether it's an index seek, scan, lookup, etc. The query processor has no idea what a file is or how to do disk access.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

ben.brugman SSChampion Points: 13369 More actions · Answer 10

GilaMonster (1/7/2013)
All processing is done in memory, regardless of whether it's an index seek, scan, lookup, etc. The query processor has no idea what a file is or how to do disk access.

Something has to do the disk accesses. And because the physical reads on disk is often the determining factor for throughput, this must be somehow included in the optimizer. (I am aware that the optimizer does not know what is present in the cache, but it does base it's optimizing on how many pages are needed).

I am not completely aware what the

query processor

query engine

RDBMS engine

SQL-server engine

Is.

For me the database is the part which is still there if you pull the plug out of the machine.

But do not realy understand the distinction between the above mentioned processors/engines.

For me the RDBMS engine does deliver the results and does take all the neccesary actions to get those results. (Both for select and mutation statements).

The query processor ??? processes the query (compilation and execution) I assume.

The query engine ??? invokes the query processor and gets the neccesary data ???

SQL-server engine is a RDBMS engine.

But for me the borders between the different parts are a bit :crazy: vague. Sorry.

Ben

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 11

ben.brugman (1/7/2013)
Something has to do the disk accesses.

The storage engine

And because the physical reads on disk is often the determining factor for throughput, this must be somehow included in the optimizer.

The optimiser assumes that none of the required data will be in cache.

You have (in general)

The optimiser - responsible for generating the execution plans

The query processor - executes the plans (also called the query execution engine)

The storage engine - responsible for all disk access, for giving rows to the query processor, for handling transactions and a whole bunch more.

The query processor (what actually executes queries) has no clue what a disk or file is. It asks the storage engine for the rows it needs. If the rows are on pages in cache, the storage engine just gives it rows. If they're on pages not in cache, the storage engine fetches them into cache, then reads the rows off and hands them to the query processor

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

ScottPletcher SSC Guru Points: 100946 More actions · Answer 12

With a clustered index on ( A, B, C ) and a

WHERE A = 6 AND C = 7

I would expect SQL to a clustered index seek.

That is, it will do an indexed search to go straight to the "A = 6" rows and read just those; no other (non-index) rows would have to be read. SQL will search those rows looking for "C = 7".

With only nonclustered indexes, the process is more involved.

SQL first has to determine if you have a "covering index": that is, an index with ALL columns used in the query. If so, SQL will use that; if multiple indexes with all columns, SQL uses the one with the shortest row.

If there is no fully covering index, SQL will decide if an index with all the WHERE column(s) is available, and if it is worth using. Since reads using a nonclustered index involve lots of additional I/O -- SQL must use the rows in the nonclus index to go back and do random reads on the parent row in the heap/clus index -- there is a "tipping point" where SQL won't use a nonclustered index at all and instead just scan the table itself.

There is no specific % of rows where SQL will or won't use a nonclus index. The exact % varies. For example, if only 1% of rows have "A = 6", SQL is almost certain to use the nonclus index. If 90% of the rows have "A = 6", SQL is almost certain to do a table scan. Typically the "tipping point" is somewhere between 10% and 30% of rows, but even that is just a rough guideline and could vary for specific situations.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 13

ScottPletcher (1/7/2013)
If so, SQL will use that; if multiple indexes with all columns, SQL uses the one with the shortest row.

The most selective one (the one that, based on the statistics, is going to return the fewest rows after the seek). If multiple indexes are equally selective, then the one with the fewest number of pages.

For example, if only 1% of rows have "A = 6", SQL is almost certain to use the nonclus index.

If 1% of the rows have A=6 AND C=7 (the index has both columns, therefore both conditions limit the number of rows needing a key lookup), SQL is almost certain to do a table scan.

Typically the "tipping point" is somewhere between 10% and 30% of rows, but even that is just a rough guideline and could vary for specific situations.

The tipping point is where the number of rows needing a lookup = 30% of the number of pages in the table. Usually this is somewhere around 0.1% - 0.5% of the table, but depends on the size of the rows.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

ScottPletcher SSC Guru Points: 100946 More actions · Answer 14

GilaMonster (1/7/2013)
ScottPletcher (1/7/2013)
If so, SQL will use that; if multiple indexes with all columns, SQL uses the one with the shortest row.
The most selective one (the one that, based on the statistics, is going to return the fewest rows after the seek). If multiple indexes are equally selective, then the one with the fewest number of pages.
For example, if only 1% of rows have "A = 6", SQL is almost certain to use the nonclus index.
If 1% of the rows have A=6 AND C=7 (the index has both columns, therefore both conditions limit the number of rows needing a key lookup), SQL is almost certain to do a table scan.
Typically the "tipping point" is somewhere between 10% and 30% of rows, but even that is just a rough guideline and could vary for specific situations.
The tipping point is where the number of rows needing a lookup = 30% of the number of pages in the table. Usually this is somewhere around 0.1% - 0.5% of the table, but depends on the size of the rows.

Fewest rows? The number of rows that match all the criteria will be the same regardless of the index used, right?

What is the basis for you claim that if only 1% of rows in the table match that SQL will almost certainly do a table scan? If the table has 1M rows, but only 10K match, you state SQL is almost certain to do a table scan??

Where did you get that 30% is a hard number for a tipping point? How could 0.1% of the rows in a table equal 30% of the table pages??

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.