Table Partitioning

,

Partitioning is typically used to break data up for either performance or

space reasons, comes in a few different flavors, and is often used with or as an

alternate to implementing a data archiving plan. Partitioning isn't terribly

hard to do, but there are enough details that it's worth working through some

scenarios to make sure you know what goes where. Over the next couple articles

I'll try to demo all the major options, and I'm going to start here with a very

high level review of our options.

Vertical Partitions

This is often done when you have a table that is much wider than it needs to

be (often with nice columns like Skill1, Skill2, etc, etc) and hurts performance

when table scans are done. By splitting lesser used columns into a parallel

table we can increase the number of rows per page in the main table which will

increase performance on table scans, and will also give us a boost when building

or rebuilding indexes on that table. Remember that it doesn't do a lot of good

to move columns that are mainly null, or to necessarily move text columns which

only take up 16 bytes in the table. We might move columns that we rarely use, or

we might split highly used columns across the two tables. From a performance

perspective it's generally a good thing to keep together columns that are

frequently updated together so that we only log one update instead of two. From

a space perspective we may need to just move the heavier columns to the other

table just to get some room to work.

Within vertical partitioning we can pursue two strategies. One is to add

triggers to the main (original) table to force the secondary table to stay in

sync. If we add a row to the main table we add a row to the secondary table even

if all we have is the primary key, and deleting a row from the main table forces

a parallel delete in the secondary table. The alternative is to do a sparse

implementation where we only add a row to the secondary table if we have data

for one or more columns. This is typically a little harder to work with because

it requires an upsert type operation, but it's doable and does decrease

space usage and that would boost performance as there would be fewer rows to

scan (if a scan is needed), and is usually done inside a stored procedure, but

could be handled by triggers.

Vertical partitioning is a do it yourself option, SQL doesn't provide any

fancy tools to help you do it. Once done you can index each table appropriately

and you'll get good or bad query plans based on indexing, statistics, etc, as

usual. There's no rule that says you are limited to splitting a main table into

two pieces, but I think beyond that it starts to get hard to manage and you

incur the increasing overhead of maintaining copies of the primary key in each

table. Vertical partitioning is much less common than horizontal partitioning.

Horizontal Partitions

The idea here is to split the rows into multiple tables, either physically or

virtually, and gain the ability to place them on separate filegroups which can

offer us room to grow and often increased speed as well due to more disks being

involved. Pretty much all versions support 'partitioned views' where we manually

create separate tables based on our partitioning criteria and then union them

all back together with a view to make it look like it's really still one table.

This is physically partitioning the data. We can also create a distributed

partitioned view which lets us spread out data across multiple servers and join

them all back together using a view on each server that uses linked servers to

access the remote tables. In SQL 2005 we have true partitions, where we can

define a table as residing on a partition and all the sub tables are managed by

SQL, there is no view. To us, it looks and feels like a table.

If we use the old 'partitioned view' approach we can add opt to bake in some

intelligence to our data access and just query the current orders table, and

fall back on the view of all orders if needed. We can also index each table

differently. For example, we might choose to add more indexes to previous years

tables since we know that data will not change. The tradeoff is that if we add

an index we're responsible for adding it to all the other tables, and it's easy

to forget that step. By contrast with the newfangled partitions in SQL 2005 we

don't have the option of defining different sets of indexes by partition, which

can be good or bad depending on your needs. SQL 2005 partitions also require the

table to be empty so that it can be created on a partition. It's not going to be

a good idea to move all the data around during the day anyway, so this isn't as

bad as it might seem.

Implementing any of these requires moving significant chunks of data around

and is not likely to be done during production hours. Each can impart some

performance gains if the column(s) that were used as the partitioning

criteria are part of your query. For example, let's say that we decide to

partition our orders table based on order date and build a separate partition

for each year. Queries that include the order date will automatically be

confined to that one partition, where as if we query on something like customer

name without adding order date we will have to search all three tables using

seeks or scans as SQL deems best.

In SQL 2000 it's all do it ourselves, in SQL 2005 we can use the partitioning

mechanism built it, or choose to do it the old way manually. Horizontal

partitions are definitely nice if you need to manage space constraints, just

push a big chunk of data to a different set(s) of drives. Performance gains

aren't always so clear. Each can impart some performance gains if the

column(s) that were used as the partitioning criteria are part of your query.

For example, let's say that we decide to partition our orders table based on

order date and build a separate partition for each year. Queries that include

the order date will automatically be confined to that one partition, where as if

we query on something like customer name without adding order date we will have

to search all three tables using seeks or scans as SQL deems best. There can be

a gain overall if the partitions are put on separate RAID sets, giving SQL more

available IO to satisfy the query.

Archiving

Archiving isn't a partitioning strategy, but it's part of the debate because

we may need to do it instead of or in addition to partitioning. My definition of

archiving is that the data is no longer visible to any production application or

report. We typically move it to another table/database in case we need to

research something, or we decide to go spelunking for some big historical trend.

The advantage to archiving is that can be easy at the table level, just create a

job along the lines of 'move anything more than 5 years old in Table A to

Archive..TableA'. It get's trickier when you have to take into account foreign

key relationships but it's not an impossible job and it one that usually only

makes sense to do on the largest tables.

As far as space savings we definitely achieve that, assuming we move the

records to a different drive or machine, or just delete them (seriously, this is

usually only a good idea for the most trivial of data points). For performance

we reduce the total number of rows which will help if we can, and to a degree

fewer rows even makes seeks faster.

Which Should I Choose?

I'm a firm believer that every table should have an archiving plan identified

when it's created, even if the plan is to never archive it, and set up jobs to

maintain the archiving process as soon as the table is live. If I'm short of

space I would usually prefer to move entire objects rather than partition, but

there might be times when it makes sense. Partitioning for performance should be

just about the last thing you try. Picking the right column to partition on

isn't always easy, the gains very depending on the queries being run, and you

add a layer of complexity that while manageable, is still one more thing to deal

with. Before implementing any of the options you'll definitely want to take them

for a test run or two, and I'll be doing that in the next couple articles.

Rate

4.2 (5)

Share

Share

Rate

4.2 (5)