advanced tuning on 876,000,000 rows

  • Hi Guys,

     

    Just developing a new system, and I've just done some calculations for the amount of data I'm going to end up with....

    The worst-case scenario is that it will grow each year at a rate of 29,200 rows per user, with each server being specified to handle 10,000 users at the very most.

    So after a 3-year life span, each server could have a maximum of 876,000,000 rows in one of the tables.

    The table is very basic:

    [lookup key field]         int

    [text_item]                  varchar(255)

    [text_item2]                varchar(255)

    expiry_date                  datetime

    We need very fast INSERT and SELECT against this table. There will be 2 queries that need to be as quick as possible:

    1. Querying the [lookup key field] and an exact (=) match on [text_item]

    and

    2. Querying the [lookup key field] and partial (LIKE '%query') on [text_item2]
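
    To make that concrete, the two queries would be roughly along these lines (the table name here is just a placeholder):

    -- Query 1: exact match on [text_item]
    SELECT [text_item], [text_item2], expiry_date
    FROM tbl_store
    WHERE [lookup key field] = @key
      AND [text_item] = @value

    -- Query 2: leading-wildcard match on [text_item2]
    SELECT [text_item], [text_item2], expiry_date
    FROM tbl_store
    WHERE [lookup key field] = @key
      AND [text_item2] LIKE '%' + @value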

    Now, I can go for a decent-spec machine (but nothing too major).

    I was thinking of splitting the table into different tables used to store items alphabetically, so I could have, for example:

    tbl_store_A

    tbl_store_B

    tbl_store_C

    tbl_store_D

    .....

    Where tbl_store_A stores items that begin with A etc.

    And then use a partitioned view to access them; I think this would improve access times.
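
    As a rough sketch (with made-up names), the view over the letter tables would be something like:

    CREATE VIEW vw_store
    AS
    SELECT [lookup key field], [text_item], [text_item2], expiry_date FROM tbl_store_A
    UNION ALL
    SELECT [lookup key field], [text_item], [text_item2], expiry_date FROM tbl_store_B
    UNION ALL
    SELECT [lookup key field], [text_item], [text_item2], expiry_date FROM tbl_store_C
    -- ...and so on, with each underlying table carrying a CHECK constraint such as
    -- CHECK ([text_item] >= 'A' AND [text_item] < 'B')
    -- so the optimizer can skip the tables that cannot contain the value being searched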

    Has anyone had any experience with this kind of volume, and does anyone have any good pointers?

    Thanks,

    Dan

  • You are looking at about 185GB just for this table.  It is worth reviewing all your space estimates to get both a worst-case and a realistic estimate of the space required.  If your total database size gets near 1TB, then you should consider SAN storage, as the management of this amount of storage becomes much cheaper using a SAN rather than DAS. 

    The key aspect of server design for this data volume is I/O capacity.  You should look at having a minimum of 4 SCSI channels to your data, otherwise you could get bad disk queueing.

    The next stage would be to check with your management if they have the cash to buy the kit needed. 

    Assuming the costs are OK, you can then get into table design.  The bad news here is that the relative performance of different models of Windows hardware is such that you need to try things out at your own site.

    The key to performance is spreading the I/O load and reducing the amount of data that has to be accessed to answer a query.  You may find that a partitioned view (with underlying tables on different disks) gives better performance than a single table in its own filegroup (with multiple files, each on a separate disk).

    The main potential advantage of a partitioned view is that if some of the underlying tables contain only old, semi-obsolete data, you may need fewer indexes on these than on the tables holding current data that are accessed more often.  You can also put your old-data tables on a single disk and spread your current-data tables over many more disks than would otherwise be possible.

    In your case, consider the effect of partitioning by expiry date, and having extra indexes to answer the business queries.  If this causes the business queries to focus only on the tables with the most recent 50M rows, you could get a good performance gain.  Unfortunately you will have to test this yourself.
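
    As an illustration of the expiry-date approach (the table, constraint and filegroup names here are invented), each underlying table would carry a date-range CHECK constraint and could sit in its own filegroup:

    CREATE TABLE dbo.store_2004
    (
        lookup_key   int          NOT NULL,
        text_item    varchar(255) NULL,
        text_item2   varchar(255) NULL,
        expiry_date  datetime     NOT NULL,
        CONSTRAINT ck_store_2004
            CHECK (expiry_date >= '20040101' AND expiry_date < '20050101')
    ) ON fg_2004   -- a filegroup whose files sit on their own disks

    -- One table per year, combined with a UNION ALL partitioned view; the old-year
    -- tables can then sit on a single slower disk and carry fewer indexes than the
    -- tables holding current data.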

     

     

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • mmmm.....

    Yes, I hadn't thought through the storage size issues yet! I think I shall have to re-think the structure. A lot of the data will be duplicated, so I'm going to have to design a more relational model that is quick to access. This should cut the size down a great deal.

    Thanks for the pointers!

     

  • SQL Server will not use an index on a (LIKE '%query')  filter so you might want to reconsider this type of query. 

  • Consider setting up a full-text index catalog on this field, and do a CONTAINS query.  Much faster.
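
    A rough sketch of the setup (object names here are only examples, and the exact syntax depends on your SQL Server version; note also that CONTAINS matches on words and prefixes rather than arbitrary substrings, so check it really covers the requirement):

    CREATE FULLTEXT CATALOG ft_catalog AS DEFAULT

    CREATE FULLTEXT INDEX ON dbo.tbl_store ([text_item2])
        KEY INDEX pk_tbl_store   -- requires a unique, non-nullable key index on the table
        ON ft_catalog

    -- then, instead of LIKE '%query':
    SELECT [text_item], [text_item2], expiry_date
    FROM dbo.tbl_store
    WHERE [lookup key field] = @key
      AND CONTAINS([text_item2], @searchTerm)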



    Mark

  • You will get better performance from Fibre Channel than SCSI.  2 x 2Gb Fibre Channel cards, load balanced with something like EMC's PowerPath software, will give good results and provide you with redundancy.

    You can get excellent performance from a disk array over a SAN. I have had the best performance on an HP disk array compared with a late-model EMC SAN (160MB/sec versus 120MB/sec). Cost of the HP disk array: $125K US; cost of the EMC SAN: $2.5 million US.

    Spend 50% of your time sorting out your disk array.

    You will need to stripe the database file over a minimum of 4 disks, preferably 6, but no more than 10. Each of our databases has only one data file.

    We partition a similar table to yours into yearly databases.

    ie.

    create view AllData

    as

    select a,b,c from DB2002.dbo.table2002

    union all

    select a,b,c from DB2003.dbo.table2003

    union all

    select a,b,c from DB2004.dbo.table2004

    This makes it easier to manage the databases.  We can perform changes on any database while maintaining full availability by restoring a database and changing the view to include the newly restored database. It's simpler than filegroup backups.

    It also makes it easier to back up and restore, as you are generally backing up and restoring a smaller database and you are likely to have the disk space to do it.  We had a 400GB database that we never tested a restore on, as we didn't have the disk space available on local disk, let alone on another server, and we didn't want to restore over the top of the existing database in case it failed.  It used to take 15 hours for the backup over the network; a restore would probably have taken 30.  Not good availability.

    We use LiteSpeed, a backup and compression utility. It takes about 18 minutes to back up and 35 minutes to restore a 100GB database on a server with 4 x 1.5GHz CPUs with hyperthreading, and it compresses the backup to 25GB.  It makes things so much more manageable. We bulk load the data once a day, so we back up each day.

    In our case each of the tables in the yearly databases has a composite primary key and two other indexes. With 1.4 billion rows we can get any one record within 20 milliseconds or less. Range queries provide similar performance.  You must partition, not only from a manageability perspective but also from a performance perspective. SQL Server will not use the other tables in the partition if they don't contain data that meets the partition constraint, which ultimately means a smaller number of index pages to be traversed. Pay close attention to the query plan: you need to see index seeks, not index scans.
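
    The part that lets SQL Server skip the other yearly tables is a CHECK constraint on the partitioning column in each yearly database, roughly like this (the column and constraint names are only for illustration, and the ranges must not overlap between years):

    -- run in DB2002
    ALTER TABLE dbo.table2002
        ADD CONSTRAINT ck_table2002_range
        CHECK (record_date >= '20020101' AND record_date < '20030101')

    With the constraints in place, a query that filters on the partitioning column only touches the matching year's table, which is where the index seeks come from.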

    Good Luck.

  • Thanks for that, that's very helpful!!

    It's very good to see that good response times are possible for that amount of data.

    I am going to be restructuring my database.

     

  • What table design did you finally come up with?

    GaryA

     
