June 2, 2008 at 6:48 am
I am currently working on a new application and consequentially the DB behind it. But I'm a developer, not a DBA, so I'm trying to learn as fast as I can.
We are designing a system that will need to support around 5 million users. Each user will be able to add new records to various tables. It would be easy to compare this to a blogging application. So say there is a User table and a Blog_Entry table. If we let all 5 million users insert blog entries into the one table, after a couple years that one table would be HUGE!
I've looked at partitioning the Blog_Entry table based on the user_ID, but partitioning a table can only be done on one server. If we ever needed to scale out to multiple DB servers, I couldn't put the partitions on to different servers.
So, instead I've thought about custom partitioning. In other words, each user record in the user table would have a field like "user_partition_ID". This field would indicate which table that user's entries are in. For example, one user_partition_id would be 0001, another would be 0002, etc. I would then have tables called Blog_Entry_0001 and Blog_Entry_0002.
When we are pulling information for customers, we are only concerned with their entries, so we won't need to join to any other common tables or anything like that.
With this custom partitioning I could easily move each partition to its own server. (My C# DAL code would determine which database/table to call the stored procedures in based on the user's partition ID.) Also note that in my example I used a Blog_Entry table. In my actual database, there are about 6 different tables that need partitioning. Each one would be partitioned and accessed as my Blog_Entry table example above. By partitioning these tables this way, one user's information might reside on 1 server while another user's information is all on another server. Theoretically this should allow me to scale horizontally quite well.
Is this a crazy design, or would it work okay? Any better ideas? (And if my explanation doesn't make sense, please let me know!)
June 2, 2008 at 7:06 am
First, your estimate of 5 million users seems a bit optimistic, but let's go with that.
Managing the table of references to dtermine where data is store based on the user would mean you have a lookup that has 5 million rows in it. I think your design would perform better if you did not have to look in such a large table to determine where data is stored. Perhaps you could do something based on geography. This would also allow you to specify a server physically close to where users are.
Now, as far as holes in the plan, your biggest problem will be that you will either need to write your own OLEDB or ODBC drivers, or you will have problems with any third party tools accessing your data. This is always the problem with custom partitioning plans like yours. I would assume you will end up with a data warehouse or some other DSS that will handle reporting, but you will have to create custom processes for combining the data. That is why sticking to a provided partitioning solution is usually much easier.
SQL Server 2005 has two types of partitioning available. You can use distributed partitioned views as a scale-out solution - here is an article:
http://articles.techrepublic.com.com/5100-10878_11-6172535.html
Then, I would go back and look very carefully at standard table partitioning. With a SAN configuration and active/active clustering, you can scale disk access and processors out pretty reasonably and with the hardware growth going at it's current rate, it is likely to keep up with your increasing data.
June 2, 2008 at 7:26 am
Thanks for the info.
Yes, 5 million users is VERY optimistic, but I'm going with what the boss says. 🙂 (I think it will be around 500k, but I'm just a lowly developer!)
Managing the table references wouldn't be difficult in my case. The reason for that is that users can only view one other user's posts at a time. When they click on a user, we load that user's settings (including their partition ID) into a session state variable. (This will be an ASP.NET application) So there won't be any additional lookups other than the first.
As far as third party access, I don't think there will be any. My C# DAL (data access layer) will be the only means of access to the data. If any access is needed, it will be through web services that I wrote there.
I will definitely look at that distributed partitioned views. I hadn't ran across that yet.
And lastly, regarding standard table partitioning, my biggest concern with it is performance. There are a lot of very complex stored procedures I'm working on that will require significant processing time. Also, they will be called very frequently. (multiple times a second) I'm trying to work out the design to be less intense, but I just don't see that as a possibility. That is why I'm really leaning toward scaling out.
See my post regarding that issue here: http://www.sqlservercentral.com/Forums/Topic509826-338-1.aspx
Any other thoughts?
June 2, 2008 at 11:03 am
Check out the views. The writes are easy since you could always Q them (think Service Broker) and have the receiver determine where to write things. I'd also consider moving older stuff, less accessed, to other servers as well. Perhaps even a web service type access that could be handled if there was a delay.
Custom partitioning means that you have lots of places for things to go wrong. Better to use what's there, distributed partitioned views, if you can. It's easy to come up with a scheme and write the code, but easy to also make mistakes.
If you can somehow balance the load with your DAL code, that makes some sense. Be sure you consider failures and how you deal with this. Don't assume that multiple servers will always be up.
June 2, 2008 at 11:42 am
I'll just add that you really should have an experienced DBA or DB developer involved at this point.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
June 2, 2008 at 11:47 am
I couldn't agree with you more!
But we are a small startup company and there just isn't any $$ for a DBA. 🙁
June 2, 2008 at 12:11 pm
ammon (6/2/2008)
I couldn't agree with you more!But we are a small startup company and there just isn't any $$ for a DBA. 🙁
Right. In that case I would recommend that you contract the services of an experienced professional DBA consulatant to do Architecture and Design reviews and some mentoring.
Yeah, I know, there's no money for that either.
The problem is that mistakes and oversights at this stage can prove to be very costly or limiting down the road. And the things that are going to bite you are not the things that you are thinking of now, because you are smart enough to either figure them out or to come to a place like this and ask about them.
Rather, the real problem will be the things that don't even occur to you because you do not yet know enough to realize how important they are (or will be). And we cannot help you much with those now because you will not know to ask us and we do not have enough information (nor is it pratical in a forum) to warn you yet, it's just too general.
[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]
June 2, 2008 at 12:13 pm
I think the most benefit you will get out of a good archiving strategy that separates current data from old data.
Since old data is typically queried less often, the requirements with regards to performance should be much more relaxed.
By physically separating the data, you can ensure that people working with current data will experience best performance.
Best Regards,
Chris Büttner
June 2, 2008 at 12:24 pm
Christian Buettner (6/2/2008)
I think the most benefit you will get out of a good archiving strategy that separates current data from old data.
Unfortunately that isn't possible in our system. The users will post their own content and even data that is really old might still be accessed regularly. 🙁
This is the main reason I am coming here now is to run my ideas by the experts to make sure there isn't anything HUGE I am missing. I know things will come out and bite me later on. I'm just hoping that they aren't too big. 😉
I'm trying to design everything to be as simple as possible while still allowing us to scale horizontally in the future. I'd rather think about scaling issues now as opposed to later.
I did just buy a couple books of Amazon about designing SQL Server DBs. Hopefully they will give me the extra help I need to do it right. I'm sure once we get more $$ we can get a true DBA. I just hope I don't mess things up too much before then!
June 2, 2008 at 1:13 pm
This field would indicate which table that user's entries are in.
For more reasons than I could ever put in this tiny space, that would open a can of serious worms that will eat your face off in the years to come. There is no way that I would have one table per user... it would require dynamic SQL in the future and any changes made to 1 user table would have to be made to 5 million similar tables.
This is a huge mistake... a 5 million row user table and a table of blog entries for several years for those users isn't that big. In the presence of correct indexing, it can be made to run very smartly for its size.
Lemme say it again... one table per user is a serious mistake and will lead to an unending world of hurt.
--Jeff Moden
Change is inevitable... Change for the better is not.
June 2, 2008 at 1:19 pm
Jeff Moden (6/2/2008)
that would open a can of serious worms that will eat your face off in the years to come
Sorry, maybe I didn't explain that well. What I meant is that one field in the User's table would indicate which group of tables they would use. For example, customers 0-100 would use tables that end in _000 and customer 101-200 would use tables that end in _001, etc.
Sorry again for the confusion!
June 2, 2008 at 1:28 pm
ammon (6/2/2008)
Christian Buettner (6/2/2008)
I think the most benefit you will get out of a good archiving strategy that separates current data from old data.Unfortunately that isn't possible in our system. The users will post their own content and even data that is really old might still be accessed regularly. 🙁
Hi ammon,
It is OK if old data is accessed "regularly". The question is whether or not it is accessed often.
Old data is "per definition" old, and therefore will not be accessed that often.
The majority of users will only be interested in current blogs.
Best Regards,
Chris Büttner
June 2, 2008 at 1:36 pm
Good point. Unfortunately, its hard to know the answer to that question now. I just wonder how hard it would be for me to determine what I thought was old data and if just partitioning the data into old vs new would really give me the horizontal scalability I'm aiming for...
June 2, 2008 at 1:40 pm
ammon (6/2/2008)
Good point. Unfortunately, its hard to know the answer to that question now. I just wonder how hard it would be for me to determine what I thought was old data and if just partitioning the data into old vs new would really give me the horizontal scalability I'm aiming for...
Sorry, I did not mean this to be a replacement for partitioning.
It should just be considered as part of the overall solution.
Best Regards,
Chris Büttner
June 2, 2008 at 1:43 pm
Oh, that makes more sense! Thanks for clearing that one up!
Viewing 15 posts - 1 through 15 (of 34 total)
You must be logged in to reply to this topic. Login to reply