Design suggestions for Large SQL 2008 table (several billion+ rows)

  • We have a single table of historical time series data with over 2 billion rows and a growth rate of 100 million+ rows per year.

    The table has a compound primary key based on four columns plus one value column. Two of the key columns are varchar of more than 10 characters; the other two are an int and a char(1).

    Our primary concern is query performance, although design suggestions that aid loading and maintenance as well as query performance would be helpful.

    Can anyone comment on some of the following design concepts in terms of which ones may have the most impact (primarily on performance, but space and maintenance are other considerations)?

    1) Multiple indexes to accommodate different column orderings based on how the data may be queried. At this point some queries have decent performance while others are quite poor, so SQL Server suggests an additional non-clustered index with a different column order.

    2) Changing the varchar columns to int IDs referencing lookup tables. This may reduce the size of the table, but what is the impact on performance?

    3) Changing three of the primary key columns to a compound key in a separate table, defined with a new composite ID.

    4) Utilizing index intersection by building non-clustered indexes on each column in the primary key.

    5) Table partitioning (could separate the data by year; there are 20 or so years in the table).

    Other ideas / suggestions?

    Thanks.

  • 1) SQL Server is very keen to suggest additional indexes; some are good, and you shouldn't ignore the suggestions, but make sure each one is for a query that is run a lot. You can often end up with more indexes than is really practical, and before you know it half the size of your database is taken up with indexes;

    2) Definitely don't use any character columns, especially variable-length ones, as performance is impacted. Also, I've found SQL Server treats character columns differently in queries than when evaluating them in a primary key (I think it might use a checksum when evaluating them in a primary key), so you can get primary key violations where you don't expect them. If users are using the text in their queries, try to get them to use the IDs (easy if they've got a UI with a drop-down), or just use WHERE ID IN (SELECT ID FROM lookuptable WHERE name = 'something') - that's way quicker than it looks (see the sketch after this list);

    3) See how that works - "it depends". I've had that go both ways;

    5) Partitioning is good, especially at this size, as it can reduce the amount of data the query needs to consider, covering only the range you specify, but it needs careful planning... I'm supposed to be rewriting a sliding window script (partitions) and I'm writing replies on here instead!
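    A minimal sketch of that lookup-based filter from point 2, using purely hypothetical names (dbo.SeriesLookup for the lookup table, dbo.TimeSeries for the big table, SeriesId/SeriesName for the columns):

    -- Hypothetical names; substitute your real lookup and fact tables.
    -- Filtering on the int ID via the lookup table keeps the predicate
    -- on the big table narrow instead of comparing varchar text there.
    SELECT ts.PeriodYear, ts.MeasureValue
    FROM dbo.TimeSeries AS ts
    WHERE ts.SeriesId IN (SELECT l.SeriesId
                          FROM dbo.SeriesLookup AS l
                          WHERE l.SeriesName = 'something');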

  • Also, if you're on 2008, take a look at the built-in reports on index usage, or you can write your own query against the underlying system view, sys.dm_db_index_usage_stats. It will give good insight into which indexes are actually getting used - but note that the statistics only cover the period since SQL Server was last started.
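    A sketch of the kind of query you can write against that view (run it in the database you care about; the counters reset when the instance restarts):

    -- Index usage for user tables in the current database,
    -- busiest indexes first; unused indexes show NULL counters.
    SELECT OBJECT_SCHEMA_NAME(i.object_id) AS schema_name,
           OBJECT_NAME(i.object_id)        AS table_name,
           i.name                          AS index_name,
           u.user_seeks, u.user_scans, u.user_lookups, u.user_updates
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS u
           ON u.object_id = i.object_id
          AND u.index_id = i.index_id
          AND u.database_id = DB_ID()
    WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
    ORDER BY u.user_seeks + u.user_scans + u.user_lookups DESC;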

  • You are bordering on "get a consultant"...but I'll throw in my two cents.

    pete 85875 (7/26/2011)


    We have a single table of historical time series data with over 2 billion rows and a growth rate of 100 million+ rows per year.

    The table has a compound primary key based on four columns plus one value column. Two of the key columns are varchar of more than 10 characters; the other two are an int and a char(1).

    Our primary concern is query performance, although design suggestions that aid loading and maintenance as well as query performance would be helpful.

    Can anyone comment on some of the following design concepts in terms of which ones may have the most impact (primarily on performance, but space and maintenance are other considerations)?

    1) Multiple indexes to accommodate different column orderings based on how the data may be queried. At this point some queries have decent performance while others are quite poor, so SQL Server suggests an additional non-clustered index with a different column order.

    If you're going to index this table seven ways 'til Sunday then you really should choose an ever-increasing, narrow & unique clustering key. Here is a good read on the topic. It was written in 2005 but it is as relevant today as when it was written:

    Ever-increasing clustering key - the Clustered Index Debate..........again!

    And the debate rages on into 2010:

    More considerations for the clustering key - the clustered index debate continues!
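    For illustration only (the real column names aren't known, so these are made up), that pattern looks roughly like this: a narrow, ever-increasing surrogate as the clustered primary key, with the natural four-column key kept honest by a unique non-clustered constraint:

    -- Hypothetical column names; substitute the real business-key columns.
    CREATE TABLE dbo.TimeSeries
    (
        TimeSeriesId bigint IDENTITY(1,1) NOT NULL,   -- narrow, ever-increasing
        SeriesCode   varchar(20)   NOT NULL,
        RegionCode   varchar(20)   NOT NULL,
        PeriodYear   int           NOT NULL,
        Flag         char(1)       NOT NULL,
        MeasureValue decimal(18,4) NOT NULL,
        CONSTRAINT PK_TimeSeries PRIMARY KEY CLUSTERED (TimeSeriesId),
        CONSTRAINT UQ_TimeSeries_BusinessKey
            UNIQUE NONCLUSTERED (SeriesCode, RegionCode, PeriodYear, Flag)
    );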

    2) Changing the varchar columns to int IDs referencing lookup tables. This may reduce the size of the table, but what is the impact on performance?

    It depends on how often you'll need to deliver the varchar values. It's a balancing act: JOIN expense versus the cost and space savings in indexes, memory and data. Only you will be able to determine the right answer after testing, but my default tendency is to use one of the exact numeric data types and a lookup table.
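    As a rough sketch of that trade-off (hypothetical names again, and an alternative shape of the table above in which the varchar is replaced by an int key):

    -- Small lookup table holding each distinct varchar value once.
    CREATE TABLE dbo.Series
    (
        SeriesId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SeriesCode varchar(20)       NOT NULL UNIQUE
    );

    -- The big table would then carry the 4-byte SeriesId instead of the
    -- varchar, and the JOIN is only paid when the text must be returned:
    SELECT s.SeriesCode, ts.PeriodYear, ts.MeasureValue
    FROM dbo.TimeSeries AS ts
    JOIN dbo.Series     AS s ON s.SeriesId = ts.SeriesId
    WHERE ts.PeriodYear = 2011;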

    3) Changing three of the primary key columns to a compound key in a separate table, defined with a new composite ID.

    Same response as #2.

    5) Table partitioning (could separate the data by year; there are 20 or so years in the table).

    This can be a big win depending on:

    1. Knowledge of your data: what will be accessed most often, less often, or probably not at all?

    2. Bags of gold you have to throw at this table: how many filegroup-homes will you have for isolation?
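    A minimal sketch of yearly partitioning, assuming a hypothetical int PeriodYear column and, for brevity, a single filegroup (spread the partitions across filegroups if you have them):

    -- One boundary value per year; RANGE RIGHT puts each boundary in the
    -- partition to its right, so each partition holds exactly one year.
    CREATE PARTITION FUNCTION pfYear (int)
    AS RANGE RIGHT FOR VALUES (2008, 2009, 2010, 2011);  -- extend back to the earliest year

    CREATE PARTITION SCHEME psYear
    AS PARTITION pfYear ALL TO ([PRIMARY]);

    -- The table (or its clustered index) is then built ON psYear(PeriodYear),
    -- so queries that filter on PeriodYear only touch the relevant partitions.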

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • To simplify and maybe help you narrow down what OPC is discussing here: are you joining other tables on that PK? If so, that's going to hurt, and you might want to get yourself to a surrogate key (think IDENTITY or GUID columns) in the same table. I wouldn't necessarily split that out to another table.

    You can easily make sure your business key stays intact and viable via a non-clustered unique index/constraint, while using the surrogate key for all subsequent join mechanics.

    The reason for the billion recommendations from SQL Server is most likely your filtering methods on the base data. If you can get more consistent in how you approach the data, you'll find the mechanics of dealing with the attributes of the key much easier. However, the reason for the different ordering is that the 'leading edge' selectivity of the data determines whether the optimizer wants to use an index. Each one of its suggestions needs to be analyzed and reviewed for sanity's sake. Sometimes you may just decide the query itself needs another approach.

    For particular queries that are running slow, you may want to look into divide and conquer approaches, using #tmp tables as subtables in the middle of the processing to help deal with extravagant joins and very complex logic for the optimizer. Any time you see a 'timeout' in the optimization process in the XML of the sqlplan, this is usually a good place to start.
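    A rough illustration of that divide-and-conquer shape, with hypothetical names: materialise a pre-aggregated slice of the big table into a #tmp table first, then let the complex logic run against that small set instead of the two-billion-row table:

    -- Step 1: reduce the big table to a small, indexed working set.
    SELECT ts.SeriesCode, ts.PeriodYear, SUM(ts.MeasureValue) AS YearTotal
    INTO #YearTotals
    FROM dbo.TimeSeries AS ts
    WHERE ts.PeriodYear >= 2008
    GROUP BY ts.SeriesCode, ts.PeriodYear;

    CREATE CLUSTERED INDEX CIX_YearTotals ON #YearTotals (SeriesCode, PeriodYear);

    -- Step 2: the tricky logic (here, a year-over-year comparison) now runs
    -- against the small temp table, which is far easier on the optimizer.
    SELECT cur.SeriesCode, cur.PeriodYear, cur.YearTotal
    FROM #YearTotals AS cur
    JOIN #YearTotals AS prev
      ON prev.SeriesCode = cur.SeriesCode
     AND prev.PeriodYear = cur.PeriodYear - 1
    WHERE cur.YearTotal > prev.YearTotal * 1.5;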

    Edit:

    Sorry, misclicked - instead of Preview I hit Post out of habit.

    1) Multiple indexes to accommodate different column orderings based on how the data may be queried. At this point some queries have decent performance while others are quite poor, so SQL Server suggests an additional non-clustered index with a different column order.

    2) Changing the varchar columns to int IDs referencing lookup tables. This may reduce the size of the table, but what is the impact on performance?

    3) Changing three of the primary key columns to a compound key in a separate table, defined with a new composite ID.

    4) Utilizing index intersection by building non-clustered indexes on each column in the primary key.

    5) Table partitioning (could separate the data by year; there are 20 or so years in the table).

    So to your questions in particular:

    1) Evaluate and discuss, and be very targeted when you do this. Each index is a copy of the data that needs to be maintained, and you may want to approach it with filtered indexes or another method of data inclusion (see the filtered-index sketch after this list).

    2) I might look into moving rarely re-used fields into a secondary table, but only at 5% selectivity or lower. Otherwise I usually don't see the gain unless you go to the extreme of a warehouse layout.

    3) I would definitely review this possibility, but I wouldn't take the fields out of the table, since they're obviously important business keys. Just put a unique non-clustered constraint on them and add a surrogate key column for the other tables to join on.

    4) I wouldn't approach this except under very particular circumstances where OR clauses are heavily used, and even then I'd first try to re-evaluate if there was another approach to the OR logic.

    5) Partitions are powerful structures, even if stored on the same LUN. They do require a re-analysis of all the queries involved, though, to make use of the partition splits.
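    On the filtered-index idea in 1) above, a hedged sketch (SQL Server 2008+, hypothetical table and column names): index only the slice of rows a particular class of query actually touches, instead of adding yet another full-width copy of the key:

    -- Covers queries that only look at recent years with one flag value,
    -- at a fraction of the size of a full non-clustered index.
    CREATE NONCLUSTERED INDEX IX_TimeSeries_Recent_FlagA
    ON dbo.TimeSeries (PeriodYear, SeriesCode)
    INCLUDE (MeasureValue)
    WHERE PeriodYear >= 2008 AND Flag = 'A';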


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
