Architecture Question - 2 million records

  • This might be a dumb question.......

    I have a question regarding the initial best practice for a large DB structure.

    I'm developing a new application for a client. The record count could reach 2 million records by the end of the year.

    Will it make a difference in access and reporting performance if I store the data in 1 table with 2M records, or 2000 tables with 1000 records each? I can create this separation programmatically when the new customers are created.

    Is this separation of data into tables really just virtual and all the data exists in the same space? Obviously, it all exists in the same big data file.

    Malcolm

     

  •  Hi Malcolm

    My take is inline below, marked with >>>:

    This might be a dumb question.......

    >>> We all learn on a daily basis.

    I have a question regarding the initial best practice for a large DB structure.

    I'm developing a new application for a client. The record count could reach 2 million records by the end of the year.

    >>> 2 million rows is common these days. We have tables that cross 100 million rows with no issues. Of course, we have a server that is big enough: 12 GB RAM, SAN storage, 4 CPUs.

    Will it make a difference in access and reporting performance if I store the data in 1 table with 2M records, or 2000 tables with 1000 records each? I can create this separation programmatically when the new customers are created.

    >>> While it can be done, it is not advisable: SQL Server will have no problem with such a row count if the database/table is tuned with the necessary indexing, and considering the hardware capabilities. Since you did not mention any details about the server/machine where SQL Server will be hosted, it is hard to say for certain, but any server with 2-4 GB RAM and good enough storage will host a table of that size without any problem.

    Is this separation of data into tables really just virtual and all the data exists in the same space? Obviously, it all exists in the same big data file.

    >>> SQL Server will perform better with 2 tables than 1 table, but programming the access to select which table to query is hard enough. One thing: I assume you will not JOIN the 2 tables! Otherwise 1 table is better. You can have multiple files in SQL Server for the same database, storing different tables in different files. The BASIC guideline (I am not an authority on this, but I have read some MS documents to that effect) is: have one file per CPU, so if you have a 4-CPU server, you can have 3-4 files and gain better performance.
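
    As a rough illustration of the multiple-files point (the database name, file path, and table below are all made up):

        -- Add a second filegroup and file to a hypothetical database,
        -- then place a table on that filegroup.
        ALTER DATABASE MyAppDb ADD FILEGROUP FG_Data2;

        ALTER DATABASE MyAppDb
        ADD FILE (
            NAME = MyAppDb_Data2,
            FILENAME = 'D:\SQLData\MyAppDb_Data2.ndf',
            SIZE = 500MB,
            FILEGROWTH = 100MB
        ) TO FILEGROUP FG_Data2;

        CREATE TABLE dbo.ReportingData (
            RecordID   int   NOT NULL PRIMARY KEY,
            CustomerID int   NOT NULL,
            Amount     money NOT NULL
        ) ON FG_Data2;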

    My advice: use one table and do a lot of stress testing first. Make sure you have a covering index to help the application queries.

    If your SQL Server host machine has 2 CPUs, 2 GB RAM or more, and a RAID disk subsystem, then go with 1 table.
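
    To illustrate the covering-index advice: SQL 2000 has no INCLUDE clause, so a covering index is simply a composite index that carries every column the query touches, letting the query be answered from the index alone. A sketch with placeholder table, column and value names:

        -- Hypothetical covering index for a "one customer's records by date" query.
        CREATE INDEX IX_tbl_Master_Customer_Date
        ON dbo.tbl_Master (CustomerID, RecordDate, Amount);

        -- Covered query: an index seek satisfies it without touching the base table.
        SELECT RecordDate, Amount
        FROM dbo.tbl_Master
        WHERE CustomerID = 205
          AND RecordDate >= '20060101';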

     

  • I fail to see any real correlation between hardware and this specific question.

    Data partitioning is good if it aids ageing out of data or if there are distinct reasons why the partitioning will aid performance. With SQL 2000 you have to use partitioned views; these have issues of their own and can be problematic. That said, I've used them very effectively in both OLTP and DW databases.
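
    For anyone unfamiliar with them, a SQL 2000 local partitioned view is just a UNION ALL view over member tables that carry CHECK constraints on the partitioning column; those constraints let the optimiser skip member tables a query cannot need. A minimal sketch with made-up names:

        -- Two member tables, partitioned by year.
        CREATE TABLE dbo.Orders_2005 (
            OrderYear  int   NOT NULL CHECK (OrderYear = 2005),
            OrderID    int   NOT NULL,
            CustomerID int   NOT NULL,
            Amount     money NOT NULL,
            CONSTRAINT PK_Orders_2005 PRIMARY KEY (OrderYear, OrderID)
        );
        CREATE TABLE dbo.Orders_2006 (
            OrderYear  int   NOT NULL CHECK (OrderYear = 2006),
            OrderID    int   NOT NULL,
            CustomerID int   NOT NULL,
            Amount     money NOT NULL,
            CONSTRAINT PK_Orders_2006 PRIMARY KEY (OrderYear, OrderID)
        );
        GO

        CREATE VIEW dbo.Orders
        AS
        SELECT OrderYear, OrderID, CustomerID, Amount FROM dbo.Orders_2005
        UNION ALL
        SELECT OrderYear, OrderID, CustomerID, Amount FROM dbo.Orders_2006;
        GO

        -- A filter on the partitioning column lets the other member table be eliminated.
        SELECT CustomerID, SUM(Amount) AS Total
        FROM dbo.Orders
        WHERE OrderYear = 2006
        GROUP BY CustomerID;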

    You really have to consider your schema and how you would handle partitioned tables. As to 2 million rows, it's a matter of scale: if the table is narrow and doesn't contain text columns then it's probably not a problem, but does this mean that in 5 years it's going to be over 10 million rows, and will that give you a problem? If you're likely to be updating heavily and searches will require scans (table or partial table) then a large table may cause issues.

    It's really difficult to be precise, but I'd probably consider 2000 tables vs 1 table a really bad idea; as I say, it's how you access all these tables that counts. We can all do the "I've got a table of x million rows and it works fine"; I've got VERY problematic tables with less than half a million rows, and yes, I've got 8 procs, 24 GB RAM and a SAN. So the real answer to your question, which is very far from dumb, is that it just depends. You might like to check out articles and white papers by Kimberly Tripp and Kalen Delaney.

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Thanks for the responses.

    While I realize that throwing more money (hardware) at the problem will improve performance, I am in the development/proposal stage and wanted advice on the best schema approach.

    The MSSQL 2000 structure will be quite simple. They plan on only a 1.1% market penetration into a particular field per year. This will result in ~1600 distinct customers. Each of these customers can have 750–1500 records associated with them. (Of course, they consider this very conservative, and the actual count of customers could be much higher. More is good.)

    I have no problem creating code to create distinct tables for each customer.

    I was considering that this partitioning might increase online data access performance for each customer's use of online forms and reports on their own records, as opposed to a single table where the queries would always run through all the records.

    Or is all this irrelevant because the separation is only virtual?

    Administration and financial reports that run database-wide will be less frequent, and would only be run by internal managers and not the customers.
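
    To make the comparison concrete, the two query shapes I'm weighing look roughly like this (the table and column names are only placeholders, not the final schema):

        -- Single-table design: every online query filters on the customer key.
        SELECT RecordID, RecordDate, Amount
        FROM dbo.tbl_Master
        WHERE CustomerID = 205;

        -- Table-per-customer design: the table name itself identifies the customer.
        SELECT RecordID, RecordDate, Amount
        FROM dbo.tbl_Cust_205;

        -- Database-wide management report in the single-table design.
        SELECT CustomerID, COUNT(*) AS RecordCount
        FROM dbo.tbl_Master
        GROUP BY CustomerID;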

    Again, thanks for the responses…..

    Malcolm

     

  • Cheers

    I agree with Colin when he states: "It's really difficult to be precise, but I'd probably consider 2000 tables vs 1 table a really bad idea; as I say, it's how you access all these tables that counts."

    But I disagree about the hardware issue. The question was about performance. Both hardware and data partitioning decisions are performance-oriented and will affect the end result. Hardware is not the REAL issue, but it helps in an easy way if money is not an issue.

    One more thing: if SQL 2005 is used, then you can rely on data partitioning, which should help in such scenarios.
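
    For completeness, here is a rough sketch of what SQL 2005 table partitioning looks like (Enterprise Edition only; every name and boundary value below is hypothetical):

        -- Spread ~1,600 customers across ranges of CustomerID.
        CREATE PARTITION FUNCTION pfCustomerRange (int)
        AS RANGE RIGHT FOR VALUES (400, 800, 1200, 1600);

        CREATE PARTITION SCHEME psCustomerRange
        AS PARTITION pfCustomerRange ALL TO ([PRIMARY]);

        -- The table (and its aligned primary key) is created on the partition scheme.
        CREATE TABLE dbo.CustomerRecords (
            CustomerID int      NOT NULL,
            RecordID   int      NOT NULL,
            RecordDate datetime NOT NULL,
            Amount     money    NOT NULL,
            CONSTRAINT PK_CustomerRecords PRIMARY KEY (CustomerID, RecordID)
        ) ON psCustomerRange (CustomerID);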

     

  • Guys,

    I woke up at 3:07 am and wrote this down (a rough T-SQL sketch of the flow follows the steps):

    Customer hits login page

    Successful login

        Stored procedure builds a table with just this customer's records from tbl_Master:

        tbl_Cust_205

    Customer interacts with just their records. (sort/search/edit/add/delete)

    Session ends

    Delete original records in tbl_Master

    Update tbl_Master with the tbl_Cust_205 records

    Verify successful update

    Delete tbl_Cust_205
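
    A rough T-SQL sketch of the idea (every name is hypothetical, and it assumes tbl_Master has a CustomerID column and no IDENTITY column):

        -- 1. After a successful login: build a private work table for this customer.
        --    Dynamic SQL is needed because the work-table name varies per customer.
        CREATE PROCEDURE dbo.usp_BuildCustomerWorkTable
            @CustomerID int
        AS
        BEGIN
            DECLARE @sql nvarchar(4000);
            SET @sql = N'SELECT * INTO dbo.tbl_Cust_' + CAST(@CustomerID AS nvarchar(10))
                     + N' FROM dbo.tbl_Master WHERE CustomerID = ' + CAST(@CustomerID AS nvarchar(10));
            EXEC (@sql);
        END
        GO

        -- 2. The customer sorts/searches/edits dbo.tbl_Cust_205 during the session.

        -- 3. At session end: swap the edited rows back into tbl_Master, then drop the work table.
        BEGIN TRANSACTION;
            DELETE dbo.tbl_Master WHERE CustomerID = 205;
            INSERT dbo.tbl_Master SELECT * FROM dbo.tbl_Cust_205;
        COMMIT TRANSACTION;
        DROP TABLE dbo.tbl_Cust_205;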

     

  • A table with 2 million rows is fairly trivial and commonplace.

    The other design, with a table per customer, will be a source of endless problems.

     

    Agree with Dclark, it's not that many records. Design the table well, figure out which columns to index, and build from there. If you need an archiving strategy, build that separately. If performance suffers at some point, then you can look at all your options (though 2000 tables probably won't be one of them). Separate tables would work, but you'll make life more complicated for yourself and your developers. For you, you'll have to do schema changes against 2000 objects instead of one; for your developers, using stored procedures will be a headache, probably forcing them into dynamic SQL and probably hurting performance to boot.
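
    To illustrate that last point (hypothetical names throughout): with one table, a single parameterized procedure serves every customer and its plan can be reused; with a table per customer, the table name itself becomes variable, so the procedure has to build dynamic SQL instead:

        -- Single table: one plain procedure for all customers.
        CREATE PROCEDURE dbo.usp_GetCustomerRecords
            @CustomerID int
        AS
            SELECT RecordID, RecordDate, Amount
            FROM dbo.tbl_Master
            WHERE CustomerID = @CustomerID;
        GO

        -- Table per customer: the same lookup needs dynamic SQL.
        CREATE PROCEDURE dbo.usp_GetCustomerRecords_PerTable
            @CustomerID int
        AS
            DECLARE @sql nvarchar(4000);
            SET @sql = N'SELECT RecordID, RecordDate, Amount FROM dbo.tbl_Cust_'
                     + CAST(@CustomerID AS nvarchar(10));
            EXEC (@sql);
        GO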

  • Hi Malcolm,

    Guess mine might be a dumb question, but have you considered the normalisation of your schema (and any de-normalisation as appropriate)? For instance, if you go down the route of 2,000 tables, will the schema be the same for each table (which would be against Codd)? As Colin has said, it depends on the width of your data before anyone can say whether 2,000,000 rows will be an issue or not.

    [salim] I'm not sure how you arrive at the h/w spec you do. How do you know 2 GB or more of RAM, 2 or more CPUs, and RAID will be adequate for 1 table? Are you saying that less than this means Malcolm should go for the 2,000-table option? What RAID should Malcolm use? RAID 5? RAID 1? RAID 10? How should he split the RAID arrays up? I'm not trying to challenge you here, just trying to understand how you arrived at the scenario you did. SQL 2005 is a good option for partitioning, but this can only be done with Ent Ed, and that is big bucks. If money is no object then no problem; I guess you can get the h/w to go with it. But this is where I do agree with Colin, too. You need to understand your data, schema and application, and then from there match the appropriate h/w based on I/Os, throughput, DB size(s) and growth, etc. Throwing h/w at a DB is not always the answer and doesn't always give you the improvement in performance you'd hoped for. Good design of the DB schema is essential if you're to get good performance.

    Certainly, if you create, say, the 2,000 tables, one for each customer, how would you do reporting where you wanted to find, say, the most profitable customer, or the one who has put in the most orders for a particular month? Surely to find such answers you would have to look the values up in each table to compare, and this would be a killer for your box (IMHO). Of course, you may not want to find the info I've just mentioned, but I'm guessing your client will want to do some sort of customer comparison reporting.
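
    As a rough illustration of that reporting point (all names invented): with one table, "most profitable customer" is a single aggregate; with a table per customer, the same question means touching every one of the ~2,000 tables:

        -- One table: a single aggregate query.
        SELECT TOP 1 CustomerID, SUM(Amount) AS Total
        FROM dbo.tbl_Master
        GROUP BY CustomerID
        ORDER BY Total DESC;

        -- Table per customer: a generated UNION ALL over every customer table.
        SELECT TOP 1 CustomerID, SUM(Amount) AS Total
        FROM (
            SELECT 1 AS CustomerID, Amount FROM dbo.tbl_Cust_1
            UNION ALL
            SELECT 2 AS CustomerID, Amount FROM dbo.tbl_Cust_2
            -- ...repeated for each remaining customer table
        ) AS AllCustomers
        GROUP BY CustomerID
        ORDER BY Total DESC;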

    Malcolm, to give you an idea, we're running a reporting database which has one table with approx 30,000,000 rows (over a 2-week period); its width is approx 800 bytes. This is running on SQL 2000 Ent with 8 GB RAM (6.5 GB dedicated to SQL Server), 8 CPUs, data files on a 16-disk RAID 10 array, TLogs on a 10-disk RAID 10 array, OS and SQL binaries on their own RAID 1 arrays, and separate arrays for the backup files. There are another 15 tables with similar or more rows, and then there are 5 databases, each with this sort of setup. This data is streaming in all the time, rather than bulk loaded at intervals. The box is hardly stressed; SQL Server is coping with it admirably. So, the best thing would be to test. Also, if there is a lot of reporting, hashing, etc., you may want to take a look at how tempdb is set up, as it may get used more heavily and become your bottleneck (esp. so with SQL 2005).

    HTH.

    rgds  iwg
