How many ways to measure selectivity?

  • I currently have the task of rewriting some poorly-written sprocs. I've just been to SQL Saturday, where I learned about measuring selectivity as you analyze a query. I've read elsewhere about measuring selectivity for each individual predicate in the JOIN and WHERE clauses. From my understanding, the purpose of gathering this information is to inform what indexes you will create, and how you might structure the query to perform best. But there are a few things I really don't understand about measuring selectivity. Let's say you're starting from an existing piece of SQL code which has lots of JOINs and a complex WHERE clause:

    1. Should I measure only individual predicates, or should I also combine all predicates for the same table and measure the overall selectivity of that composite?

    2. If I'm doing inner joins, do I measure selectivity from one JOIN to the next? Each join may reduce the final result set, so it seems to matter whether you measure against the total number of rows in a table or against the number of rows that all your previous joins would have returned. (I realize SQL gives no guarantee about the order in which JOINs are processed.) Conceptually, I'm just trying to wrap my head around selectivity as it applies to JOINs. Also, what if the JOIN has multiple predicates in the ON clause? Do I measure all the predicates combined, or each one individually?

    3. If the answer is to combine predicates and include JOINs when calculating selectivity, then what's an easy way to do this? So far, the way I do this is to simply comment out everything in the query except the INNER JOINs and relevant parts of the WHERE clause, then add a SELECT COUNT and comment out the rest of the SELECT. This allows me to determine how many rows would result from each join, or even each condition in the join. But I still need to understand which of these metrics to use, and how.

    4. A sort-of unrelated question which has been on my mind lately: Does SQL actually do its initial lookup from an index using only the first column, even if it's a composite key index? I think I remember reading somewhere that it looks up using the first column, then brings back the other key columns to perform seek/scan/hash operations. (Note that I'm referring to "other key columns", not "other INCLUDE columns at leaf level", in this question.) I just want to be clear on this, because it seems pertinent for how I construct my indexes.
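
    To make question 4 concrete, here's the sort of thing I'd test, as a rough sketch (the table and index names are made up). I'd run each SELECT with the actual execution plan turned on and compare which ones get a seek on the composite index:

    create table #Orders (OrderID int, CustomerID int, OrderDate datetime)
    create index IX_Cust_Date on #Orders (CustomerID, OrderDate)

    -- predicate on the leading key column: eligible for an index seek
    select * from #Orders where CustomerID = 42

    -- predicates on both key columns: a narrower seek on the same index
    select * from #Orders where CustomerID = 42 and OrderDate >= '20100101'

    -- predicate on the second key column only: the leading edge isn't
    -- constrained, so I'd expect a scan rather than a seek
    select * from #Orders where OrderDate >= '20100101'

    drop table #Orders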

    Okay, so here is an example of what I'm trying to ask:

    set nocount on

    create table #tmpA (aID int, bID int, attrib char(10))
    create table #tmpB (bID int, cID int, dt datetime)
    create table #tmpC (cID int, attrib char(10))

    insert #tmpA values (1, 1, 'blue')
    insert #tmpA values (2, 3, 'green')
    insert #tmpA values (3, 4, 'red')
    insert #tmpA values (4, 2, 'green')
    insert #tmpA values (5, 1, 'green')
    insert #tmpA values (6, 8, 'green')

    -- date literals use the unambiguous yyyymmdd form
    insert #tmpB values (1, 1, '20100101')
    insert #tmpB values (2, 1, '20100201')
    insert #tmpB values (3, 3, '20100301')
    insert #tmpB values (4, 5, '20100401')
    insert #tmpB values (5, 1, '20100501')
    insert #tmpB values (6, 5, '20100601')
    insert #tmpB values (7, 1, '20100701')
    insert #tmpB values (8, 4, '20100801')

    insert #tmpC values (1, 'north')
    insert #tmpC values (2, 'west')
    insert #tmpC values (3, 'north')
    insert #tmpC values (4, 'east')
    insert #tmpC values (5, 'north')

    select a.*, c.cID
    from #tmpA a
    inner join #tmpB b on (b.bID = a.bID)
    inner join #tmpC c on (c.cID = b.cID)
    where b.dt between '20100101' and '20100501'
    and c.attrib = 'north'
    and a.attrib = 'green'

    drop table #tmpA
    drop table #tmpB
    drop table #tmpC

    Here are some of the metrics one might gather about selectivity:

    1. Rows in tmpB with b.dt in range: 5

    2. Rows in tmpA with a.attrib = 'green': 4

    3. Rows from both tables with b.dt in range + a.attrib = 'green': 3

    4. Rows from tmpB matching (b.bID = a.bID) (assuming no other predicates): 5

       select count(distinct a.bID) from #tmpA a

    5. Rows from tmpC with c.attrib = 'north': 3

    6. Rows from tmpC matching (c.cID = b.cID) (assuming no other predicates): 4

       select count(distinct b.cID) from #tmpB b

    7. Rows from tmpC matching (c.cID = b.cID) + c.attrib = 'north': 3

       select count(distinct c.cID) from #tmpB b
       join #tmpC c on (c.cID = b.cID)
       where c.attrib = 'north'

    8. Rows from all tables matching (b.bID = a.bID) + (c.cID = b.cID) (assuming no other predicates): 6

       select count(*) from #tmpA a
       inner join #tmpB b on (b.bID = a.bID)
       inner join #tmpC c on (c.cID = b.cID)

    (Obviously, selectivity metrics aren't important for these tiny example tables. But in Production, tmpB represents a table with over 5 million rows, while tmpA has about 27,000 and tmpC has over 300,000 -- and they're not actually temp tables.)
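
    (As an aside on question 3: rather than commenting pieces out, several of these counts can be folded into a single pass with CASE expressions. A rough sketch against the temp tables above:

    -- per-table predicate counts (metrics 1 and 2) in one scan each
    select count(*) as total_b,
           sum(case when dt between '20100101' and '20100501' then 1 else 0 end) as dt_in_range
    from #tmpB

    select count(*) as total_a,
           sum(case when attrib = 'green' then 1 else 0 end) as green_rows
    from #tmpA

    -- the bare join gives metric 8, and tacking the filters into a CASE
    -- gives the fully-filtered count without commenting anything out
    select count(*) as joined_rows,
           sum(case when b.dt between '20100101' and '20100501'
                     and a.attrib = 'green'
                     and c.attrib = 'north' then 1 else 0 end) as filtered_rows
    from #tmpA a
    inner join #tmpB b on (b.bID = a.bID)
    inner join #tmpC c on (c.cID = b.cID)

    Dividing each count by its total gives the selectivity fraction for that predicate or combination.)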

    I want to be sure I understand which of these values are useful. I know that a more selective predicate performs better in lookups and makes a better index candidate. But a lot of my system's SQL code combines several low-selectivity predicates, so I need to learn how to use this information. My guess is that one answer might be "measure individual predicates, then decide how to combine them in indexes/code based on the type of query plan you're hoping to get" -- but I don't know if this is correct.

    Thanks for taking a look at this!

    Jordan

  • I honestly have to say that, in 10 years of being a DBA and writing and tuning queries, I've never bothered with anything as complex as what you seem to be proposing to do.

    I think it would make an interesting academic exercise, but I have never needed it in any practical situation. That doesn't mean someone else hasn't needed it routinely, it just means I haven't. I can't speak for others.

    In query tuning, there are usually (every time I've ever looked) a series of simple actions that will accomplish all the tuning you're likely to need. None of them involve any complex analysis of the guts of the system, like selectivity of complex join statements.

    First, look at the code being used for the query and break down the elements in it. Checking these things first will usually buy more speed than you might expect:

    Does it use nested views with unnecessary joins, columns, etc?

    Cursors? Worse yet, nested cursors?

    Recursive UDFs?

    Procedural coding, either in the code or in UDFs that the code calls?

    Triangular joins?

    Execution plan:

    - Bookmark lookups?

    - Unneeded table/index scans?

    - Out-of-date stats?

    - Unexpected tables (usually from nested views or UDFs)?

    Sargability of the JOINs and WHERE clause? (A quick illustration follows.)
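
    To illustrate that last bullet, here's the classic sargability fix, sketched with made-up names:

    -- non-sargable: wrapping the column in a function hides it from any
    -- index on OrderDate, forcing a scan
    select * from dbo.Orders where year(OrderDate) = 2010

    -- sargable rewrite of the same filter: the column is left bare, so an
    -- index on OrderDate can be seeked
    select * from dbo.Orders
    where OrderDate >= '20100101' and OrderDate < '20110101'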

    I once debugged a query that used nested cursors inside a set of recursive UDFs, and took execution from 10 minutes to 10 milliseconds just by converting it to a set-based approach. That one was keeping anyone from logging into the company's website at all.

    I once debugged a whole server by replacing a procedural UDF with an inline "select max(col) from table where..." subquery, in a query that was key to over half of the web pages that were in use.
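
    The shape of that fix, with hypothetical names, was roughly:

    -- before: a scalar UDF evaluated once per row, forcing row-by-row work
    -- select o.OrderID, dbo.fnLastPaymentDate(o.CustomerID) as LastPayment
    -- from dbo.Orders o

    -- after: the same lookup inlined as a correlated subquery, which the
    -- optimizer can fold into one set-based plan
    select o.OrderID,
           (select max(p.PaymentDate)
            from dbo.Payments p
            where p.CustomerID = o.CustomerID) as LastPayment
    from dbo.Orders o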

    In 10 years of this, I haven't had to worry about selectivity to the degree you're asking about, not once. Which column is the leading edge of an index? Yes, routinely and regularly, but based on how it's used in the query, not on its selectivity.

    So, while you can certainly find answers to the questions you've just posed, do keep in mind that they may be of limited use.

    Also keep in mind that data changes. A solution built with the expectation that a certain column has certain selectivity may be less than optimum within minutes of being implemented.

    I don't want to discourage you (or anyone) from digging into selectivity this way. I just don't want you to go into it thinking it's a key skill.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I'm a little surprised by the approach too. Was it SQL Saturday in NYC? Which presenter was it? I'm curious to see the presentation now.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant,

    Yes it was in NYC! I talked with you briefly on Saturday, in fact. Nice to put a face to the message board login!

    Nobody at SQL Saturday was proposing the extensive metrics that I'm referring to in my post.

    However, in Thomas Larock's presentation, he talked about measuring selectivity for various conditions (such as JOINs) in your queries, as part of a query optimization process. I assume this is a fairly standard practice. The reason for my post is that I didn't/don't understand whether to measure each predicate individually, or group them together when measuring, or both.

    I am also not suggesting that I would ever do all these metrics. I simply want to know which ones to do. I hope this clarifies a little...

    Jordan

  • Hey! Nice talking to you as well.

    OK. That clarifies it a bit. If you really want to measure selectivity, you add all the columns together. The leading edge, the first column, is the most important in any index, because that's what the histogram is built on, but the overall selectivity of the index is an accumulation of all columns.

    This means, if you were looking at it from a query point of view, you'd be concerned about columns in the JOIN and in the WHERE clause. Generally that's not the approach taken though.
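
    If you want to see both pieces -- the leading-column histogram and the accumulated multi-column density -- DBCC SHOW_STATISTICS exposes them (the table and index names here are placeholders):

    dbcc show_statistics ('dbo.Orders', 'IX_Orders_Cust_Date')

    -- the histogram section is built on the leading column only, while the
    -- density vector reports an "All density" value for each left-based
    -- prefix of the key: (CustomerID), then (CustomerID, OrderDate)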

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    This is mainly relevant to non-clustered indexes, of course. [If you specify the first column of a clus index, naturally it does not need to be selective to use the clus index 🙂.]

    What you have to keep in mind is that SQL is very selective when deciding whether to use a non-clus index. Often the predicate needs to match as little as 3%(!) of the table's rows for SQL to use a nc index, esp. when a clus index exists.

    So knowing the selectivity will help you decide whether it's likely to be worthwhile to build an index at all.
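
    A quick back-of-the-envelope check along those lines (names are placeholders):

    -- what fraction of the table a candidate predicate returns
    select 100.0 * sum(case when SomeColumn = 'some value' then 1 else 0 end)
               / count(*) as pct_of_table
    from dbo.BigTable

    -- if pct_of_table lands well above the low single digits, a nc index
    -- on SomeColumn alone is unlikely to be chosen for this predicate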

    Of course the easiest way is still to just review the actual query plans :-). But some front-end planning and an understanding of your data's selectivity could still be useful.

    Scott Pletcher, SQL Server MVP 2008-2010

  • I might have misunderstood, but selectivity matters just as much in a clustered index. The same b-tree structure sits on top of the pages, controlling access and determining if you're getting seeks or scans, just like with a non-clustered index. I wouldn't deemphasize the importance of selectivity in the cluster. But, as I said, maybe I misunderstood.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I don't think so, unless you have extremely unselective data.

    Being the clustered index, SQL can always do a seek if you provide a specific key value(s) or range.

    Scott Pletcher, SQL Server MVP 2008-2010
