Best way to implement partitioning

Question

Post reply

Best way to implement partitioning

OmegaZero

Valued Member

Points: 72
More actions
July 6, 2012 at 2:27 pm

#258563

I have my own idea on how to handle this, but want to se if there is a different perspective on this. The problem I have with this is I have the following tables (sample tables)
-------------------------------------------------------------------------------------------------------------------------
CREATE TABLE Account
(
AccountId int identity(1,1),
CreationDate datetime,
CONSTRAINT [PK_Account] PRIMARY KEY CLUSTERED
(
[AccountId] ASC
)WITH (FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE AccountSales
(
AccountSalesId int identity(1,1),
AccountId int
CONSTRAINT [PK_AccountSales] PRIMARY KEY CLUSTERED
(
[AccountSalesId] ASC
)WITH (FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[AccountSales] WITH CHECK ADD CONSTRAINT [FK_AccountSales_Account] FOREIGN KEY([AccountId])
REFERENCES [dbo].[Account] ([AccountID])
GO
ALTER TABLE [dbo].[AccountSales] CHECK CONSTRAINT [FK_AccountSales_Account]
GO
------------------------------------------------------------------------------------------------------------------------------
Let's say the Account table is 500GB and the AccountSales is 1TB. I would like to partition these tables for obvious reasons. The way the current logic works is that data is dealt with on a day by day basis, so it makes sense to partition on the date field for the Account table, however the AccountSales table has no date field. Also the data in both tables is no sequential when it comes to the id and date as data is loaded from different servers at different times. So like accounts may have the following data
Id Date Server loaded from (not a column - just for display purposes)
--------------------------
1 1/1/2000 00:00 1
2 1/1/2000 01:00 1
3 1/1/2000 02:00 1
4 1/1/2000 00:00 2
5 1/1/2000 01:00 2
6 1/1/2000 0:300 1
I was thinking about for the Accounts table just making the clustered index on CreationDate and setting the AccountId as a PK with Unique NC index. I would then partition on the date.
I'm not 100% sure of how to handle the AccountSales table though - since if I did it by id the dates would not match up correctly if joined with the Account table, however I don't see a way to fix this.
Any ideas? What is the best way to handle this? If any more information is needed please let me know. Thanks if advance!

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply

OmegaZero Valued Member Points: 72 More actions · Answer 1

OmegaZero

Valued Member

Points: 72

July 9, 2012 at 12:10 pm

#1510488

bump

Jared SSC Guru Points: 61793 More actions · Answer 2

Add a created date column to the accountsales table.

Jared
CE - Microsoft

Gullimeel Hall of Fame Points: 3871 More actions · Answer 3

Just a small thing.Why do you have a fill factor of 80 for increasing identity field?Shouldnt it be 100%?

GulliMeel

Finding top n Worst Performing queries[/url]
Improve the performance of Merge Join(special case)
How to Post Performance Problem -Gail Shaw[/url]

Gullimeel Hall of Fame Points: 3871 More actions · Answer 4

Will you work on the date alone or you will have lots of queries using account as well?

I would suggest If you are going to query the full day's data the use the date column in accountsales table as suggested in earlier post in account sales table.

if you are going to query say like hourly or like that then you can use a derived column like diff in minutes since say Jan 01.01.1900 or some other date and put that columns in both the tables.Then use that column for join along with the accountid.if you need to do partitioning then you can have on that column as well.

GulliMeel

Finding top n Worst Performing queries[/url]
Improve the performance of Merge Join(special case)
How to Post Performance Problem -Gail Shaw[/url]

Jared SSC Guru Points: 61793 More actions · Answer 5

Gullimeel (7/9/2012)
I would suggest a derived column like diff in minutes since say Jan 01.01.1900 and put that columns in both the tables.Then use that column. for join along with the accountid.

Umm... Why? What relevance does the diff in minutes from 1900.01.01 have? To me, partitioning on date makes sense because the value is relevant and you can make a filegroup read only. Difference in minutes to an arbitrary date makes absolutely no sense to me.

Jared
CE - Microsoft

Jared SSC Guru Points: 61793 More actions · Answer 6

To the OP, what difference does it make if the ids in the accountsales table don't match up with dates in the accounts table? They will never match up, presumably. However, it seems counter-intuitive to me to not have a date in the AccountSales table (This assumes that the Account table holds a customer's account info and gives them an id when they are first created and the AccountSales table attributes a sale to that Account, don't you need the date of the sale?).

Jared
CE - Microsoft

Gullimeel Hall of Fame Points: 3871 More actions · Answer 7

Gullimeel (7/9/2012)
--------------------------------------------------------------------------------
I would suggest a derived column like diff in minutes since say Jan 01.01.1900 and put that columns in both the tables.Then use that column. for join along with the accountid.
Umm... Why? What relevance does the diff in minutes from 1900.01.01 have? To me, partitioning on date makes sense because the value is relevant and you can make a filegroup read only. Difference in minutes to an arbitrary date makes absolutely no sense to me.
Thanks,

This depends on what kind of query you would like to go against these tables. if he is going to write all the queries like date >=01/01/2012 and dat < 01/02/2012. Then just having a date column(just 3 bytes) makes much more sense.

But if he is quering say based on hours or minutes the adding that kind of column(int) makes much more sense. But it all depends on query...

GulliMeel

Finding top n Worst Performing queries[/url]
Improve the performance of Merge Join(special case)
How to Post Performance Problem -Gail Shaw[/url]

Jared SSC Guru Points: 61793 More actions · Answer 8

Gullimeel (7/9/2012)
Gullimeel (7/9/2012)
--------------------------------------------------------------------------------
I would suggest a derived column like diff in minutes since say Jan 01.01.1900 and put that columns in both the tables.Then use that column. for join along with the accountid.
Umm... Why? What relevance does the diff in minutes from 1900.01.01 have? To me, partitioning on date makes sense because the value is relevant and you can make a filegroup read only. Difference in minutes to an arbitrary date makes absolutely no sense to me.
Thanks,
This depends on what kind of query you would like to go against these tables. if he is going to write all the queries like date >=01/01/2012 and dat < 01/02/2012. Then just having a date column(just 3 bytes) makes much more sense.
But if he is quering say based on hours or minutes the adding that kind of column(int) makes much more sense. But it all depends on query...

I completely disagree. If I know I want rows where the createdDate was between '2012-05-01 13:32:00.000' AND '2012-05-01 13:35:00.000' why on earth would I want to convert those to integers first? I don't believe there is any evidence that it would be faster to scan or seek based on an integer or bigint versus a datetime. Logically, it makes no sense to me and theoretically it makes no sense to me. Based on the way that SQL Server stores this information, I would prefer not to do the calculation first.

Jared
CE - Microsoft

Gullimeel Hall of Fame Points: 3871 More actions · Answer 9

I completely disagree. If I know I want rows where the createdDate was between '2012-05-01 13:32:00.000' AND '2012-05-01 13:35:00.000' why on earth would I want to convert those to integers first? I don't believe there is any evidence that it would be faster to scan or seek based on an integer or bigint versus a datetime. Logically, it makes no sense to me and theoretically it makes no sense to me. Based on the way that SQL Server stores this information, I would prefer not to do the calculation first.

I am not trying to say that int is faster than date or like that. It is simply to save the space both on table and any index. This doesnt matter on small table but when you are talking about the 500GB table then 4 extra bytes in a table also makes diff to the size of the table as well as index..

There will be just two calculations based on start and end..

GulliMeel

Finding top n Worst Performing queries[/url]
Improve the performance of Merge Join(special case)
How to Post Performance Problem -Gail Shaw[/url]

Jared SSC Guru Points: 61793 More actions · Answer 10

Gullimeel (7/9/2012)
I completely disagree. If I know I want rows where the createdDate was between '2012-05-01 13:32:00.000' AND '2012-05-01 13:35:00.000' why on earth would I want to convert those to integers first? I don't believe there is any evidence that it would be faster to scan or seek based on an integer or bigint versus a datetime. Logically, it makes no sense to me and theoretically it makes no sense to me. Based on the way that SQL Server stores this information, I would prefer not to do the calculation first.
I am not trying to say that int is faster than date or like that. It is simply to save the space both on table and any index. This doesnt matter on small table but when you are talking about the 500GB tale then 4 extra bytes in a table also makes diff to the size of the table as well as index..
There will be just two calculations if at all needed. based on start and end..

Well, I see where your thought process is, but I disagree. I prefer to store my data as useful data. If I want to use that column, I have to convert back to datetime which is a waste of resources to me. It is also more work for my developers and my system. I suppose if space is a large issue, you may have no choice. However, I would certainly not jump to that without knowing that there was a severe constraint on space. Especially with the various forms of compression available.

Jared
CE - Microsoft

Gullimeel Hall of Fame Points: 3871 More actions · Answer 11

Yes if he needs the column's data as datetime.Then i wouldnt suggest it.As I mentioned earlier.It purely depend on your workload against these tables like type of queries and frequency of these queries..

GulliMeel

Finding top n Worst Performing queries[/url]
Improve the performance of Merge Join(special case)
How to Post Performance Problem -Gail Shaw[/url]

ScottPletcher SSC Guru Points: 100947 More actions · Answer 12

OmegaZero (7/6/2012)
I'm not 100% sure of how to handle the AccountSales table though - since if I did it by id the dates would not match up correctly if joined with the Account table, however I don't see a way to fix this.
Any ideas? What is the best way to handle this? If any more information is needed please let me know. Thanks if advance!

I know why the date was left out of the Sales table -- it can be derived.

But in this case I think denormalization is a good idea. I, too, would store the date/datetime in the Sales table, even though it's extra bytes, just to allowed it to be clustered upon.

I can also see the advantage of using minutes diff rather than datetime, as suggested, since it's shorter (4 bytes vs 8). On such a large table, it could indeed make a genuine size difference.

However, for ease of use, I would probably just use the actual datetime. Incorrect results from wrongly-coded queries could more than offset the gain of a few bytes per row.

Whether the clustered keys should be a composite of date and AccountId is a much more complex q, and answering properly would require analysis of daily loads and expected query patterns and types.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Jared SSC Guru Points: 61793 More actions · Answer 13

ScottPletcher (7/9/2012)
I know why the date was left out of the Sales table -- it can be derived.

What am I missing... How can it be derived?

Jared
CE - Microsoft

ScottPletcher SSC Guru Points: 100947 More actions · Answer 14

Gullimeel (7/9/2012)
Just a small thing.Why do you have a fill factor of 80 for increasing identity field?Shouldnt it be 100%?

I agree that the fillfactor should be carefully reviewed, but not that it's a "small thing". It could have a big impact on performance if/when you rebuild the table.

Unless you do significant increases to the row length after it is inserted, you need to increase the fillfactor considerably. [Even if that is true, I would instead consider "pre-padding" the row rather than using such a low fill factor.]

For a table this large, I usually aim for ~98%, since you typically have to allow for varchar columns to be lengthened on occasion.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.