Partitioning is typically used to break data up for either performance or
space reasons, comes in a few different flavors, and is often used with or as an
alternative to implementing a data archiving plan. Partitioning isn't terribly
hard to do, but there are enough details that it's worth working through some
scenarios to make sure you know what goes where. Over the next couple of
articles I'll try to demo all the major options, and I'm going to start here
with a very high-level review of the choices.
Vertical Partitioning
This is often done when you have a table that is much wider than it needs to
be (often with nice columns like Skill1, Skill2, etc.) and that hurts
performance when table scans are done. By splitting lesser-used columns into a
parallel
table we can increase the number of rows per page in the main table which will
increase performance on table scans, and will also give us a boost when building
or rebuilding indexes on that table. Remember that it doesn't do a lot of good
to move columns that are mainly null, or necessarily to move text columns,
which only take up a 16-byte pointer in the row. We might move columns that we
rarely use, or
we might split highly used columns across the two tables. From a performance
perspective it's generally a good thing to keep together columns that are
frequently updated together so that we only log one update instead of two. From
a space perspective we may need to just move the heavier columns to the other
table just to get some room to work.
Within vertical partitioning we can pursue two strategies. One is to add
triggers to the main (original) table to force the secondary table to stay in
sync. If we add a row to the main table we add a row to the secondary table even
if all we have is the primary key, and deleting a row from the main table forces
a parallel delete in the secondary table. The alternative is to do a sparse
implementation where we only add a row to the secondary table if we have data
for one or more columns. This is typically a little harder to work with because
it requires an upsert-type operation, but it's doable, and it decreases space
usage, which in turn boosts performance because there are fewer rows to scan
(if a scan is needed). The sparse approach is usually done inside a stored
procedure, but it could be handled by triggers.
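As a minimal sketch of the trigger-based approach (all table and column names here are hypothetical, and the split is deliberately simplified):

```sql
-- Hypothetical vertical split: hot columns stay in Employee, the wide,
-- rarely used columns move to EmployeeDetail, joined on the shared key.
CREATE TABLE dbo.Employee (
    EmployeeID int NOT NULL PRIMARY KEY,
    Name       varchar(100) NOT NULL,
    Active     bit NOT NULL
);

CREATE TABLE dbo.EmployeeDetail (
    EmployeeID int NOT NULL PRIMARY KEY
        REFERENCES dbo.Employee (EmployeeID)
        ON DELETE CASCADE,          -- deletes in the main table cascade
    Skill1     varchar(50) NULL,
    Skill2     varchar(50) NULL,
    Biography  text NULL
);
GO
-- Keep the secondary table in sync: every insert into the main table
-- adds a stub row holding just the primary key.
CREATE TRIGGER trEmployee_Sync ON dbo.Employee
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.EmployeeDetail (EmployeeID)
    SELECT EmployeeID FROM inserted;
END
```

Here the delete side is handled by the cascading foreign key rather than a second trigger; a sparse implementation would drop the insert trigger and upsert into dbo.EmployeeDetail only when detail data actually exists.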
Vertical partitioning is a do-it-yourself option; SQL Server doesn't provide
any fancy tools to help you do it. Once done you can index each table appropriately
and you'll get good or bad query plans based on indexing, statistics, etc, as
usual. There's no rule that says you are limited to splitting a main table into
two pieces, but I think beyond that it starts to get hard to manage and you
incur the increasing overhead of maintaining copies of the primary key in each
table. Vertical partitioning is much less common than horizontal partitioning.
Horizontal Partitioning
The idea here is to split the rows into multiple tables, either physically or
virtually, and gain the ability to place them on separate filegroups which can
offer us room to grow and often increased speed as well due to more disks being
involved. Pretty much all versions support 'partitioned views' where we manually
create separate tables based on our partitioning criteria and then union them
all back together with a view to make it look like it's really still one table.
This is physically partitioning the data. We can also create a distributed
partitioned view which lets us spread out data across multiple servers and join
them all back together using a view on each server that uses linked servers to
access the remote tables. In SQL 2005 we have true partitions, where we can
define a table as residing on a partition and all the sub tables are managed by
SQL, there is no view. To us, it looks and feels like a table.
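A local partitioned view along those lines might look like this (the table names and year ranges are made up for illustration):

```sql
-- One physical table per year, each with a CHECK constraint on the
-- partitioning column so the optimizer can skip irrelevant tables.
-- The partitioning column is part of the primary key, which is required
-- if the view is to be updatable.
CREATE TABLE dbo.Orders2003 (
    OrderID      int NOT NULL,
    OrderDate    datetime NOT NULL
        CHECK (OrderDate >= '20030101' AND OrderDate < '20040101'),
    CustomerName varchar(100) NOT NULL,
    PRIMARY KEY (OrderID, OrderDate)
);
-- ...dbo.Orders2004 and dbo.Orders2005 defined the same way...
GO
-- The view unions the yearly tables back into what looks like one table.
CREATE VIEW dbo.Orders
AS
SELECT OrderID, OrderDate, CustomerName FROM dbo.Orders2003
UNION ALL
SELECT OrderID, OrderDate, CustomerName FROM dbo.Orders2004
UNION ALL
SELECT OrderID, OrderDate, CustomerName FROM dbo.Orders2005;
```

Each yearly table can live on its own filegroup, and queries against dbo.Orders that filter on OrderDate let the optimizer eliminate the tables whose CHECK constraints rule them out.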
If we use the old 'partitioned view' approach we can opt to bake in some
intelligence to our data access and just query the current orders table, and
fall back on the view of all orders if needed. We can also index each table
differently. For example, we might choose to add more indexes to previous years'
tables since we know that data will not change. The tradeoff is that if we add
an index we're responsible for adding it to all the other tables, and it's easy
to forget that step. By contrast with the newfangled partitions in SQL 2005 we
don't have the option of defining different sets of indexes by partition, which
can be good or bad depending on your needs. SQL 2005 partitions also require
the table to be created on the partition scheme while it's empty, so converting
an existing table means rebuilding it. It's not going to be a good idea to move
all the data around during the day anyway, so this isn't as bad as it might
seem.
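In SQL 2005 the same yearly split is declared once, with a partition function and scheme (the filegroup, function, and table names below are hypothetical):

```sql
-- Boundary values split the data into one partition per year:
-- RANGE RIGHT means each boundary value starts a new partition,
-- giving us <2004, 2004, and >=2005.
CREATE PARTITION FUNCTION pfOrderYear (datetime)
AS RANGE RIGHT FOR VALUES ('20040101', '20050101');

-- Map each partition to a filegroup; three partitions need three entries.
CREATE PARTITION SCHEME psOrderYear
AS PARTITION pfOrderYear TO (FG2003, FG2004, FG2005);

-- The table is created directly on the scheme and SQL manages the
-- partitions from there; there is no view, it just looks like a table.
CREATE TABLE dbo.Orders (
    OrderID      int NOT NULL,
    OrderDate    datetime NOT NULL,
    CustomerName varchar(100) NOT NULL
) ON psOrderYear (OrderDate);
```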
Implementing any of these requires moving significant chunks of data around
and is not likely to be done during production hours. Each can impart some
performance gains if the column(s) that were used as the partitioning
criteria are part of your query. For example, let's say that we decide to
partition our orders table based on order date and build a separate partition
for each year. Queries that include the order date will automatically be
confined to that one partition, whereas if we query on something like customer
name without adding order date we will have to search every partition using
seeks or scans as SQL deems best.
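To illustrate, assuming an orders table partitioned by year on an OrderDate column (names here are hypothetical), the two cases look like:

```sql
-- Includes the partitioning column: SQL can confine the work to the
-- single partition that holds 2005 rows.
SELECT OrderID, CustomerName
FROM dbo.Orders
WHERE OrderDate >= '20050101' AND OrderDate < '20060101';

-- No partitioning column in the predicate: every partition has to be
-- searched, with seeks or scans as the optimizer sees fit.
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE CustomerName = 'Acme Corp';
```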
In SQL 2000 it's all do-it-yourself; in SQL 2005 we can use the partitioning
mechanism built in, or choose to do it the old way manually. Horizontal
partitions are definitely nice if you need to manage space constraints, just
push a big chunk of data to a different set(s) of drives. Performance gains
aren't always so clear, but there can be a gain overall if the partitions are
put on separate RAID sets, giving SQL more
available IO to satisfy the query.
Archiving
Archiving isn't a partitioning strategy, but it's part of the debate because
we may need to do it instead of or in addition to partitioning. My definition of
archiving is that the data is no longer visible to any production application or
report. We typically move it to another table/database in case we need to
research something, or we decide to go spelunking for some big historical trend.
The advantage to archiving is that it can be easy at the table level: just
create a job along the lines of 'move anything more than 5 years old in Table A
to Archive..TableA'. It gets trickier when you have to take into account
foreign key relationships, but it's not an impossible job, and it's one that
usually only makes sense to do on the largest tables.
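A minimal version of such a job might look like this (the table names, the CreatedDate column, and the retention window are all assumptions, and the two tables are assumed to have matching schemas):

```sql
-- Move rows older than 5 years into the archive database, then remove
-- them from the production table, inside one transaction so a failure
-- can't leave the row in both places or neither.
BEGIN TRANSACTION;

INSERT INTO Archive.dbo.TableA
SELECT *
FROM dbo.TableA
WHERE CreatedDate < DATEADD(year, -5, GETDATE());

DELETE FROM dbo.TableA
WHERE CreatedDate < DATEADD(year, -5, GETDATE());

COMMIT TRANSACTION;
```

On a genuinely large table you'd want to move the rows in batches rather than in one giant transaction, to keep the log and blocking under control.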
As far as space savings we definitely achieve that, assuming we move the
records to a different drive or machine, or just delete them (seriously, this is
usually only a good idea for the most trivial of data points). For performance,
we reduce the total number of rows, which helps any scans that remain, and to a
degree fewer rows even make seeks faster.
Which Should I Choose?
I'm a firm believer that every table should have an archiving plan identified
when it's created, even if the plan is to never archive it, and set up jobs to
maintain the archiving process as soon as the table is live. If I'm short of
space I would usually prefer to move entire objects rather than partition, but
there might be times when it makes sense. Partitioning for performance should be
just about the last thing you try. Picking the right column to partition on
isn't always easy, the gains vary depending on the queries being run, and you
add a layer of complexity that, while manageable, is still one more thing to deal
with. Before implementing any of the options you'll definitely want to take them
for a test run or two, and I'll be doing that in the next couple articles.