I currently have the task of rewriting some poorly-written sprocs. I've just been to SQL Saturday, where I learned about measuring selectivity as you analyze a query. I've read elsewhere about measuring selectivity for each individual predicate
in the JOIN and WHERE clauses. From my understanding, the purpose of gathering this information is to inform which indexes you create and how you might structure the query for the best performance. But there are a few things I really don't understand about measuring selectivity. Let's say you're starting from an existing piece of SQL code with lots of JOINs and a complex WHERE clause:
1. Should I measure only individual predicates, or should I also combine all predicates for the same table and measure the overall selectivity of that composite?
2. If I'm doing inner joins, do I measure selectivity from one JOIN to the next? Each join may reduce the final result set by some amount, so it seems to matter whether you measure against the total number of rows in a table or against the combined number of rows that all your previous joins would have returned. (I realize that we don't have any guarantee from SQL about the order in which JOINs are processed.) Conceptually, I'm just trying to wrap my head around this idea of selectivity as it applies to JOINs. Also, what if the JOIN has multiple predicates in its ON clause? Do I measure against all the predicates combined, or each one individually?
3. If the answer is to combine predicates and include JOINs when calculating selectivity, then what's an easy way to do this? So far, my method is to comment out everything in the query except the INNER JOINs and the relevant parts of the WHERE clause, then replace the SELECT list with a SELECT COUNT. This lets me determine how many rows would result from each join, or even from each condition within a join. But I still need to understand which of these metrics to use, and how.
4. A sort-of unrelated question that has been on my mind lately: does SQL Server actually perform its initial lookup in a composite-key index using only the first key column? I seem to remember reading that it seeks on the first column, then brings back the other key columns to perform seek/scan/hash operations. (Note that I'm referring to the "other key columns", not "other INCLUDE columns at leaf level", in this question.) I just want to be clear on this, because it seems pertinent to how I construct my indexes.
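To make that last question concrete, here is a small T-SQL sketch (the index name is hypothetical, and it uses the #tmpB table from my example below). For a composite B-tree index, the general rule is that a seek needs a predicate on the leading key column; a predicate on a later key column alone can at best scan the index:

```sql
-- Hypothetical composite index; key order is (dt, cID)
create index IX_tmpB_dt_cID on #tmpB (dt, cID);

-- Can seek: the predicate covers the leading key column (dt)
select cID from #tmpB where dt >= '20100101';

-- Cannot seek on this index: cID is the second key column, so the
-- b-tree is not ordered by cID alone; expect an index scan instead
select dt from #tmpB where cID = 1;
```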
Okay, so here is an example of what I'm trying to ask:
set nocount on
create table #tmpA (aID int, bID int, attrib char(10))
create table #tmpB (bID int, cID int, dt datetime)
create table #tmpC (cID int, attrib char(10))
insert #tmpA values (1, 1, 'blue')
insert #tmpA values (2, 3, 'green')
insert #tmpA values (3, 4, 'red')
insert #tmpA values (4, 2, 'green')
insert #tmpA values (5, 1, 'green')
insert #tmpA values (6, 8, 'green')
insert #tmpB values (1, 1, '1/1/2010')
insert #tmpB values (2, 1, '2/1/2010')
insert #tmpB values (3, 3, '3/1/2010')
insert #tmpB values (4, 5, '4/1/2010')
insert #tmpB values (5, 1, '5/1/2010')
insert #tmpB values (6, 5, '6/1/2010')
insert #tmpB values (7, 1, '7/1/2010')
insert #tmpB values (8, 4, '8/1/2010')
insert #tmpC values (1, 'north')
insert #tmpC values (2, 'west')
insert #tmpC values (3, 'north')
insert #tmpC values (4, 'east')
insert #tmpC values (5, 'north')
select a.*,c.cID from #tmpA a
inner join #tmpB b on (b.bID = a.bID)
inner join #tmpC c on (c.cID = b.cID)
where b.dt between '1/1/2010' and '5/1/2010'
and c.attrib = 'north'
and a.attrib = 'green'
drop table #tmpA
drop table #tmpB
drop table #tmpC
Here are some of the metrics one might gather about selectivity:
1. Rows in tmpB with b.dt in range: 5
2. Rows in tmpA with a.attrib = 'green': 4
3. Rows from both tables with b.dt in range + a.attrib = 'green': 3
4. Rows from tmpB matching (b.bID = a.bID) (assuming no other predicates): 5
    select count(distinct a.bID) from #tmpA a
5. Rows from tmpC with c.attrib = 'north': 3
6. Rows from tmpC matching (c.cID = b.cID) (assuming no other predicates): 4
    select count(distinct b.cID) from #tmpB b
7. Rows from tmpC matching (c.cID = b.cID) + c.attrib = 'north': 3
    select count(distinct c.cID) from #tmpB b
    join #tmpC c on (c.cID = b.cID)
    where c.attrib = 'north'
8. Rows from all tables matching (b.bID = a.bID) + (c.cID = b.cID) (assuming no other predicates): 6
    select count(*) from #tmpA a
    inner join #tmpB b on (b.bID = a.bID)
    inner join #tmpC c on (c.cID = b.cID)
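As an alternative to commenting pieces out, the single-table counts can be gathered in one scan per table with conditional aggregation. A sketch against the example tables (adapt the predicates to the real query; the unambiguous yyyymmdd date format is used to avoid regional-setting surprises):

```sql
-- One scan of #tmpB: total rows plus rows passing the dt predicate
select count(*) as total_b,
       sum(case when dt between '20100101' and '20100501' then 1 else 0 end) as dt_in_range
from #tmpB;
-- Here: total_b = 8, dt_in_range = 5, so the predicate's selectivity is 5/8

-- Same idea for #tmpA's attrib predicate
select count(*) as total_a,
       sum(case when attrib = 'green' then 1 else 0 end) as green_rows
from #tmpA;
-- Here: total_a = 6, green_rows = 4, so the predicate's selectivity is 4/6
```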
(Obviously, selectivity metrics aren't important for these tiny example tables. But in Production, tmpB represents a table with over 5 million rows, while tmpA has about 27,000 and tmpC has over 300,000 -- and they're not actually temp tables.)
I want to be sure I understand which of these values are useful. I know that a highly selective predicate makes lookups cheaper and makes a better candidate for an index. But a lot of my system's SQL code combines several low-selectivity predicates, so I need to learn how to use this information. My guess is that one answer might be "measure individual predicates, then decide how to combine them in indexes/code based on the type of query plan you're hoping to get" -- but I don't know if that's correct.
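For illustration, here is the kind of indexing I imagine the numbers pointing toward (the index names and column choices are just my guess at applying the idea, not something I'm asserting is right): lead each index with the column of the most selective filtering predicate on that table, and carry the join columns so the plan doesn't have to go back to the base table for them.

```sql
-- Hypothetical index: dt is the filtering predicate on the big table,
-- so it leads; bID and cID ride along to feed the joins without lookups
create index IX_tmpB_dt on #tmpB (dt) include (bID, cID);

-- attrib filters tmpC to 3 of 5 rows; cID covers the join back to tmpB
create index IX_tmpC_attrib on #tmpC (attrib) include (cID);
```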
Thanks for taking a look at this!