Partitioning on a computed column (int based)

  • I am trying to implement Search functionality using the partitioned table concept.

    For every search a user performs, the hits need to be persisted in the database (for future use by other apps).

    Say, I have [Searches] table with columns:

    SearchID INT, DocID INT

    where SearchID is the unique ID assigned to each user search and DocID is the ID of a document that is a hit for that SearchID.

    So I have, as an example:

    SearchID  DocID
    --------  ------
    1         1
    1         2
    1         3
    1         4
    1         5
    ...
    2059      1
    2059      2
    2059      3
    2059      40
    2059      500000
    ...

    And one SearchID could have thousands of hits, if not millions.

    My problem (at least based on my experience) is that SQL Server 2008 R2 allows only 1,000 partitions per table.

    I cannot upgrade to SQL Server 2012, where I believe the limit is 15,000 (but even there it is a hard limit).

    Given that the number of searches in the application will eventually cross the 1,000 (or even 15,000) barrier, I was thinking of not partitioning the table by SearchID but instead introducing a third column called PartitionID.

    I could SET PartitionID = SearchID % 1000, which means I will have multiple SearchIDs in one partition. In my case that is OK in terms of performance.

    My CLUSTERED INDEX will include both SearchID and PartitionID, plus a fourth IDENTITY column in the table to maintain uniqueness.

    So the final table (bare bones):

    CREATE TABLE Searches (SID INT IDENTITY(1,1), SearchID INT, DocID INT, PartitionID INT)

    I could make PartitionID a computed column or have a function generate the value from SearchID during INSERT, but I don't think that is the issue here.
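
    For concreteness, here is a scaled-down sketch of the computed-column variant: four partitions instead of 1,000, with hypothetical object names. Note that a computed column must be PERSISTED before it can be used as a partitioning column.

    CREATE PARTITION FUNCTION pfSearchMod (INT)
        AS RANGE RIGHT FOR VALUES (1, 2, 3);  -- four partitions for PartitionID 0..3

    CREATE PARTITION SCHEME psSearchMod
        AS PARTITION pfSearchMod ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Searches
    (
        SID         INT IDENTITY(1,1) NOT NULL,
        SearchID    INT NOT NULL,
        DocID       INT NOT NULL,
        -- must be PERSISTED to qualify as a partitioning column
        PartitionID AS (SearchID % 4) PERSISTED
    );

    CREATE UNIQUE CLUSTERED INDEX CIX_Searches
        ON dbo.Searches (PartitionID, SearchID, SID)
        ON psSearchMod (PartitionID);

    The real design would use % 1000 and a partition function with boundary values 1 through 999.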

    Values become:

    SID     SearchID  DocID    PartitionID
    ------  --------  -------  -----------
    1       1         1        1
    2       1         2        1
    3       1         3        1
    4       1         4        1
    5       1         5        1
    ...
    10042   2059      1        59
    10043   2059      2        59
    10044   2059      3        59
    10045   2059      40       59
    10050   2059      500000   59
    ...

    Now this seems like a poor implementation because:

    SELECT SearchID, DocID FROM Searches WHERE SearchID = 2059

    Execution plan:

    <RunTimePartitionSummary>

    <PartitionsAccessed PartitionCount="1000"> <-------- THIS IS BAD, Right????

    <PartitionRange Start="1" End="1000" /> <-------- THIS IS BAD, Right????

    </PartitionsAccessed>

    </RunTimePartitionSummary>

    even though it uses a clustered index seek.

    SELECT SearchID, DocID FROM Searches WHERE SearchID = 2059 AND PartitionID = 59

    The execution plan is able to go to the exact partition, i.e. 59:

    <RunTimePartitionSummary>

    <PartitionsAccessed PartitionCount="1"> <-------------THIS IS PROBABLY GOOD.

    <PartitionRange Start="59" End="59" /> <-------------THIS IS PROBABLY GOOD.

    </PartitionsAccessed>

    </RunTimePartitionSummary>

    and also uses a clustered index seek.

    Sorry for the long description, but I couldn't think of a better way to explain the issue.

    Does this mean I should look for other design and implementation methods, or is the above OK, or can it be tweaked?

    If I implement it as above, I and the other developers will have to know which PartitionID to specify in the WHERE clause. That might still be OK; one way to hide it is sketched below.
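
    To spare everyone from remembering the modulo, one option (a hypothetical helper, not something already in the design) would be an inline table-valued function that derives PartitionID from the SearchID itself, so callers never specify it:

    CREATE FUNCTION dbo.GetSearchHits (@SearchID INT)
    RETURNS TABLE
    AS
    RETURN
        SELECT SearchID, DocID
        FROM dbo.Searches
        WHERE SearchID = @SearchID
          -- must use the same modulo as the computed column
          -- (% 1000 in the real design, % 4 in the scaled-down sketch above)
          AND PartitionID = @SearchID % 4;
    GO

    SELECT SearchID, DocID FROM dbo.GetSearchHits(2059);

    The extra predicate on PartitionID is what lets the optimizer eliminate the other partitions, exactly as in the second execution plan above.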

    I just wanted to find out if there is a better way to implement partitions for such use cases, or whether partitioning is simply not a good idea here?

    Over time this could grow to billions of rows, which is why I was trying to implement it using partitioning.

    One single table doesn't make sense.

    Having multiple tables, one per SearchID, i.e. Searches_<SearchID>, is the alternative we are thinking of implementing, but that means we will end up with a significant amount of dynamic SQL in the future to get to the correct Searches_## table(s), plus UNIONs in the dynamic SQL if we go that route.

    Any feedback?

    Partitioning, in general, is an administrative tool, not a development or optimization tool. While you do need to make sure your partitioning is accounted for when optimizing existing structures, it's not something you do to get optimization... USUALLY.

    The cases where it can help with optimization are where tables have a 'noisy' region, an area of high throughput that is constantly searched and in flux, and a long tail that is basically nothing but forgotten archives. Getting those archives out of the way can significantly help performance if you partition for it.
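
    For contrast, this is the kind of layout that targets such a tail (a hypothetical date-based function, nothing from this thread):

    -- Everything before 2012-08-01 lands in one big, rarely-touched archive
    -- partition; each recent month gets its own small, hot partition.
    CREATE PARTITION FUNCTION pfByMonth (DATE)
        AS RANGE RIGHT FOR VALUES ('2012-08-01', '2012-09-01', '2012-10-01');

    CREATE PARTITION SCHEME psByMonth
        AS PARTITION pfByMonth ALL TO ([PRIMARY]);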

    A partition scheme based on ID % 1000 isn't going to get you that unless you can control spindle access at the filegroup level, and even then you'd have to seriously dig into the usage patterns of your users to get that organized.

    I don't see any particular reason to actually partition this data with what you're describing, and particularly not in the way you're describing. You can't separate the backup/restores of the filegroups, and you're not looking at sectioning off related searches so they stay within a single partition (for example, everything from the same three months landing in the same partition). I wouldn't pursue partitions for this unless you had a significant administrative reason to do so.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

    Great, thanks! Clearly I am not understanding partitioning all that well, then. Yes, I was also thinking of putting the filegroup with the partitioned tables on a different set of disk drives to take advantage of the I/O performance when this table is joined to tables from another filegroup (PRIMARY) on a different set of disk drives. But anyway, that is not the point here. I don't need partitions to implement filegroups.

    Even now, the system has non-partitioned tables that have grown to billions of rows. We have proper indexes on them (checked with the DTA as well), but performance sometimes is not good. The use case is similar: the SELECT is generally based on an ID and returns a set of other columns, millions of rows at a time. I am using covering indexes. These tables are not date driven.
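
    By "covering indexes" I mean something along these lines (a simplified sketch against the Searches table above):

    -- The INCLUDE list carries every column the SELECT needs, so the
    -- query is answered entirely from the index.
    CREATE NONCLUSTERED INDEX IX_Searches_SearchID
        ON dbo.Searches (SearchID)
        INCLUDE (DocID);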

    With partitioning, I thought there would be a way to SELECT based on the ID against a smaller set, within a specific partition, to help improve performance. And I also thought the INSERTs, UPDATEs and DELETEs would be better, because I would then be inserting or updating within specific partitions.

    Is there any better way to design a table that is expected to contain billions of rows in SQL Server 2008 R2?

    Thanks again!

  • charles99 (10/30/2012)


    Great, thanks! Clearly I am not understanding partitioning all that well, then.

    Don't feel bad. I've been working with it off and on for the last 2-3 years and I'm still not sure I'M understanding all of it. 😉

    Yes, I was also thinking of putting the filegroup with the partitioned tables on a different set of disk drives to take advantage of the I/O performance when this table is joined to tables from another filegroup (PRIMARY) on a different set of disk drives. But anyway, that is not the point here. I don't need partitions to implement filegroups.

    Precisely. The table itself can be set to a different spindle, and the non-clustered indexes to yet another; that can help tremendously. In general, though, if you're going to drive in on a single entry (note: not a range) on an index, in this case your IDs, you don't have to worry about first-column selectivity in either a partitioned or non-partitioned state, as long as you go in on an index with that ID as the leading edge.

    What the heck am I talking about? Right. Basically, once you 'range' (use an inequality on) a column in an index, you break going any deeper into that index to search for the right location. For more information, because I sure as heck don't feel like I'm explaining this right, go here:

    http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/
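
    A quick illustration of the point (hypothetical table and index, not from the article):

    CREATE TABLE dbo.Orders
    (
        CustomerID INT,
        OrderDate  DATE,
        Amount     MONEY
    );

    CREATE INDEX IX_Orders ON dbo.Orders (CustomerID, OrderDate);

    -- Equality on the leading column: the seek can use both key columns.
    SELECT Amount FROM dbo.Orders
    WHERE CustomerID = 42 AND OrderDate = '20121001';

    -- Inequality (a range) on the leading column: OrderDate can no longer
    -- narrow the seek and is applied only as a residual predicate.
    SELECT Amount FROM dbo.Orders
    WHERE CustomerID > 42 AND OrderDate = '20121001';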

    Even now, the system has non-partitioned tables that have grown to billions of rows. We have proper indexes on them (checked with the DTA as well), but performance sometimes is not good. The use case is similar: the SELECT is generally based on an ID and returns a set of other columns, millions of rows at a time. I am using covering indexes. These tables are not date driven.

    I'd have to see the structure of what you're dealing with, but indexed or partitioned, if you're going in with a single equality against the first piece of the clustered index, you'll get roughly equivalent performance because of the selectivity.

    With partitioning, I thought there would be a way to SELECT based on the ID against a smaller set, within a specific partition, to help improve performance.

    Believe it or not, you're actually probably better off with a direct index seek against a non-aligned set than you are doing a partition seek and then seeking within the partition.

    And I also thought the INSERTs, UPDATEs and DELETEs would be better, because I would then be inserting or updating within specific partitions.

    What you really want to look at for speeding up that process is what's called a partition switch (ALTER TABLE ... SWITCH)... but you're not going to do that with data this disparate.
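
    For reference, a minimal sketch of a switch against the scaled-down table from earlier (hypothetical names; the staging table must match the source's structure, indexes, and filegroup):

    CREATE TABLE dbo.Searches_Staging
    (
        SID         INT IDENTITY(1,1) NOT NULL,
        SearchID    INT NOT NULL,
        DocID       INT NOT NULL,
        PartitionID AS (SearchID % 4) PERSISTED
    );

    CREATE UNIQUE CLUSTERED INDEX CIX_Searches_Staging
        ON dbo.Searches_Staging (PartitionID, SearchID, SID);

    -- Metadata-only: every row in partition 2 moves to the staging table
    -- with no physical data movement.
    ALTER TABLE dbo.Searches SWITCH PARTITION 2 TO dbo.Searches_Staging;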

    Is there any better way to design a table that is expected to contain billions of rows in SQL Server 2008 R2?

    That's like asking whether there is a better way to design a really big bridge. The devil's in the details: data types, row widths, heuristics, usage patterns, etc. I know that's not what you'd hoped for, but you're talking consultant levels of research, experimentation, and getting to know your environment to really be able to answer that intelligently.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

    Thanks Craig. I know it was an open-ended question... but I had to ask. 🙂

    Hmmm... I read through the indexing article regarding equality and inequality predicates, and I do understand that. Thanks again.

    I am also thinking of creating one search table per SearchID. The tables will get created dynamically, but at least we will not be doing inserts/updates/selects/deletes against a single table with potentially billions of rows. Managing one single big table poses its own problems: index rebuilds (based on the number of pages in the table and fragmentation levels), statistics updates, etc. will be a lot quicker on smaller tables. Views per SearchID over one big table will not help me with the updates/inserts/deletes either. Unfortunately a SELECT might return a million or more rows, so we might as well create a table per SearchID (Searches_<SearchID>).

    I still haven't given up on partitioning, although selecting the correct column (and not SearchID % 1000) is still an issue, over and above your sane advice that partitioning is more to cover administrative issues of large tables than to improve performance during development (in general, you say indexes should suffice when predicates are equality based, but managing such big indexes is an issue in my mind). Anyway... as you can see I am still struggling. But thanks a ton!

  • charles99 (10/31/2012)


    Thanks Craig. I know it was an open-ended question... but I had to ask. 🙂

    LOL, fair enough. There are possibly a couple of people around here who could give you better pointers than I can, simply because their expertise allows them to make better assumptions, but I hate those foot-in-mouth moments when I try without a lot of information.

    I am also thinking of creating one search table per SearchID. The tables will get created dynamically, but at least we will not be doing inserts/updates/selects/deletes against a single table with potentially billions of rows. Managing one single big table poses its own problems: index rebuilds (based on the number of pages in the table and fragmentation levels), statistics updates, etc. will be a lot quicker on smaller tables.

    Be careful of this. Besides the fact that all your procs will now require dynamic calls to swap out the table involved in any query, even using wrapper views will require constant updates to the view whenever a new search is built, and you'll end up in a locking nightmare.

    I would highly recommend avoiding this approach. It's a maintenance nightmare.

    I still haven't given up on partitioning, although selecting the correct column (and not SearchID % 1000) is still an issue, over and above your sane advice that partitioning is more to cover administrative issues of large tables than to improve performance during development (in general, you say indexes should suffice when predicates are equality based, but managing such big indexes is an issue in my mind). Anyway... as you can see I am still struggling. But thanks a ton!

    My pleasure. Be forewarned: you're going to need aligned indexes across the partitions, so you're not really removing a lot of that index work. A good index, though, will remove 99% of the issues, but you're correct that a defrag will be painful. Partitions won't necessarily help with that, but I'd need to do a bit more research before I'll throw that out there as a fact instead of a memory.
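
    For completeness, this is what an aligned index and a per-partition rebuild look like against the scaled-down sketch from earlier (hypothetical names; single-partition rebuilds are offline-only on 2008 R2):

    -- Aligned: the nonclustered index is built on the same partition scheme.
    CREATE NONCLUSTERED INDEX IX_Searches_DocID
        ON dbo.Searches (DocID, PartitionID)
        ON psSearchMod (PartitionID);

    -- Defragment one partition at a time instead of the whole index.
    ALTER INDEX CIX_Searches ON dbo.Searches
        REBUILD PARTITION = 2;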


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
