Splitting a huge table in two or three

  • Ok,

    I've seen big tables and databases before, but this one is really big. I recently discovered that the reason is that the customer wants to keep more than 2 years of data in it.

    Anyway, I have some ideas and would like to get input on them and on how I would translate them into an actual physical implementation.

    Idea #1

    Split the table in two. Keep table #1 on the current production database server, but only with the current year's data. Then, based on the PK or datetime column, move everything older to a different table on a different MS-SQL server, as a kind of data archiving. Once that's done, replace every table reference in the existing stored procedures with a view that points to the two physical tables.
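
    Something along these lines is what I have in mind, assuming a linked server to the archive instance (all table, column and server names below are just placeholders):

        CREATE VIEW dbo.vw_BigTable
        AS
        -- current year's rows stay in the production table
        SELECT RowID, EventDate, CustomerID, Amount
        FROM dbo.BigTable_Current
        UNION ALL
        -- older rows live on the archive instance, reached through a linked server
        SELECT RowID, EventDate, CustomerID, Amount
        FROM ArchiveServer.ArchiveDB.dbo.BigTable_History;

    With CHECK constraints on the date column of each underlying table, the optimizer can skip the remote table when a query only touches current data (a distributed partitioned view), and the stored procedures keep working against the view unchanged.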

    Idea #2.

    This one may also improve performance, but keeps the information on the same MS-SQL instance. It uses partitioning: create two more filegroups and divide the workload and storage accordingly. What worries me about this one is that it may incur downtime, especially if I need to drop and recreate the existing clustered index when configuring the partitions.
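
    For reference, a minimal sketch of how I picture #2, with one boundary and two filegroups (database, filegroup, file and column names are placeholders):

        -- add the extra filegroups and one file in each (paths are examples only)
        ALTER DATABASE MyDW ADD FILEGROUP FG_Archive;
        ALTER DATABASE MyDW ADD FILEGROUP FG_Current;
        ALTER DATABASE MyDW ADD FILE
            (NAME = 'MyDW_Archive1', FILENAME = 'E:\Data\MyDW_Archive1.ndf') TO FILEGROUP FG_Archive;
        ALTER DATABASE MyDW ADD FILE
            (NAME = 'MyDW_Current1', FILENAME = 'F:\Data\MyDW_Current1.ndf') TO FILEGROUP FG_Current;

        -- partition on the datetime column: everything before 2012 goes to the archive filegroup
        CREATE PARTITION FUNCTION pfEventDate (datetime)
            AS RANGE RIGHT FOR VALUES ('2012-01-01');

        CREATE PARTITION SCHEME psEventDate
            AS PARTITION pfEventDate TO (FG_Archive, FG_Current);

        -- rebuilding the clustered index on the partition scheme is what physically moves the rows
        CREATE UNIQUE CLUSTERED INDEX CIX_BigTable
            ON dbo.BigTable (EventDate, RowID)
            WITH (DROP_EXISTING = ON)
            ON psEventDate (EventDate);

    The index rebuild is the long-running step, which is where the downtime worry comes from.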

    I have done partitioning before, so I think I have a rough idea of how to implement #2. But what about #1? I would appreciate input on these two, or suggestions for different strategies.

    In a nutshell, my main goal, regardless of the approach, is to make that huge table smaller or to divide the workload.

    Thanks in advance.

  • how big is "big" ??

    number of rows

    number of columns

    file size

    index size

    ??

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Hmmm...

    I am not working today, but several GBs ... over 10 million rows, off the top of my head.

    I do not remember the numbers exactly, but it is huge! I have already checked, and I do not have approval to delete historical data, so those two are pretty much my only options.

    The historical data MUST be available via T-SQL query all the time.

  • sql-lover (8/18/2012)


    Hmmm...

    I am not working today, but several GBs ... over 10 million rows, off the top of my head.

    I do not remember the numbers exactly, but it is huge! I have already checked, and I do not have approval to delete historical data, so those two are pretty much my only options.

    The historical data MUST be available via T-SQL query all the time.

    hmm...10 million rows is not that big....I have tables of that size on my laptop.

    In a nutshell, my main goal, regardless of the approach, is to make that huge table smaller or to divide the workload.

    what are you trying to achieve by reducing table size...?

    are you experiencing poor performance in your application ?

    perhaps if you explain the reasons behind your request....someone will chime in.

    good luck

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • -Performance

    -Disk space (the whole DB is over 1TB now)

    Reducing the table will allow me to save a lot of space and RAM. The reports that take almost an hour will run faster.

    Honestly, the reason for doing this is clear to me 🙂 ... I just want help with how to physically apply these ideas, especially splitting the table and creating a view over it.

    Any suggestions on how to implement what I explained?

    Thanks.

  • sql-lover (8/18/2012)


    -Performance

    -Disk space (the whole DB is over 1TB now)

    Reducing the table will allow me to save a lot of space and RAM. The reports that take almost an hour will run faster.

    Honestly, the reason for doing this is clear to me 🙂 ... I just want help with how to physically apply these ideas, especially splitting the table and creating a view over it.

    Any suggestions on how to implement what I explained?

    Thanks.

    thanks for clarifying...appreciated.

    I am not the person to advise...but I expect other forum members with the relevant experience will come along soon to assist.

    good luck....

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • sql-lover (8/18/2012)


    Ok,

    Idea #1

    Split the table in two. Keep table #1 on the current production database server, but only with the current year's data. Then, based on the PK or datetime column, move everything older to a different table on a different MS-SQL server, as a kind of data archiving. Once that's done, replace every table reference in the existing stored procedures with a view that points to the two physical tables.

    Not ideal, but it could work. This could be done the way you used to partition before SQL Server 2005. Research partitioned views.

    sql-lover (8/18/2012)


    Ok,

    Idea #2.

    This one may also improve performance, but keeps the information on the same MS-SQL instance. It uses partitioning: create two more filegroups and divide the workload and storage accordingly. What worries me about this one is that it may incur downtime, especially if I need to drop and recreate the existing clustered index when configuring the partitions.

    Partitioning is not really a solution for performance; there are performance problems with aggregations, etc. If they clarify that they only want x years of data, then you should look at this and create a sliding-window process to switch old partitions out to an archive table (or abandon them altogether).
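
    A sliding-window switch-out, in very rough form, would run something like this (placeholder names; the archive table must match the partitioned table's structure and live on the same filegroup as the partition being switched out):

        -- metadata-only move of the oldest partition into an archive table
        ALTER TABLE dbo.BigTable
            SWITCH PARTITION 1 TO dbo.BigTable_Archive;

        -- then slide the window: remove the now-empty boundary and add next year's
        ALTER PARTITION FUNCTION pfEventDate() MERGE RANGE ('2012-01-01');
        ALTER PARTITION SCHEME psEventDate NEXT USED FG_Current;
        ALTER PARTITION FUNCTION pfEventDate() SPLIT RANGE ('2013-01-01');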

    Personally, based on the information given, and assuming there is indefinite data retention, I would create a single filegroup with multiple equally sized files on different LUNs. SQL Server can spread the IO over multiple threads and increase IO throughput to a degree.
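
    To illustrate that layout (file names, sizes and drive letters are examples only):

        -- several equally sized files in one filegroup, ideally each on its own LUN;
        -- SQL Server stripes new allocations across them using proportional fill
        ALTER DATABASE MyDW ADD FILEGROUP FG_Data;
        ALTER DATABASE MyDW ADD FILE
            (NAME = 'MyDW_Data1', FILENAME = 'E:\Data\MyDW_Data1.ndf', SIZE = 200GB) TO FILEGROUP FG_Data;
        ALTER DATABASE MyDW ADD FILE
            (NAME = 'MyDW_Data2', FILENAME = 'F:\Data\MyDW_Data2.ndf', SIZE = 200GB) TO FILEGROUP FG_Data;
        ALTER DATABASE MyDW ADD FILE
            (NAME = 'MyDW_Data3', FILENAME = 'G:\Data\MyDW_Data3.ndf', SIZE = 200GB) TO FILEGROUP FG_Data;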

    If your customer intends to keep a lot of data then you should really be looking at carefully selected indexes on the table.

  • MysteryJimbo (8/19/2012)


    Partitioning is not really a solution for performance; there are performance problems with aggregations, etc. If they clarify that they only want x years of data, then you should look at this and create a sliding-window process to switch old partitions out to an archive table (or abandon them altogether).

    Why?

    It is my understanding that with partitioning, a T-SQL query may skip some partitions altogether, so the query won't scan the whole table.
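
    For example, if the table were partitioned on the datetime column, a query like this should only read the matching partition (placeholder names again):

        -- with BigTable partitioned on EventDate, this should only touch the 2012 partition
        SELECT COUNT(*)
        FROM dbo.BigTable
        WHERE EventDate >= '2012-01-01' AND EventDate < '2013-01-01';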

    MysteryJimbo (8/19/2012)


    Personally, based on the information given, and assuming there is indefinite data retention, I would create a single filegroup with multiple equally sized files on different LUNs. SQL Server can spread the IO over multiple threads and increase IO throughput to a degree.

    If your customer intends to keep a lot of data then you should really be looking at carefully selected indexes on the table.

    That is correct. There is indefinite data retention. And while I am trying to change this business decision, it is out of my control and not a quick or short term solution.

    I cannot separate or split the workload on the same server, because I have no more LUNs and I cannot redo the disks; that is a "no-no". While ideal, it's not possible at this time. My only choice is to migrate or move chunks of data from the same table to a different server that resides on the same network and in the same datacenter. Hence my idea of a view and moving part of the table off the existing server.

  • sql-lover (8/20/2012)


    MysteryJimbo (8/19/2012)


    Partitioning is not really a solution for performance; there are performance problems with aggregations, etc. If they clarify that they only want x years of data, then you should look at this and create a sliding-window process to switch old partitions out to an archive table (or abandon them altogether).

    Why?

    It is my understanding that with partitioning, a T-SQL query may skip some partitions altogether, so the query won't scan the whole table.

    worth a read....http://www.simple-talk.com/sql/database-administration/gail-shaws-sql-server-howlers/

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • J Livingston SQL (8/20/2012)


    sql-lover (8/20/2012)


    MysteryJimbo (8/19/2012)


    Partitioning is not really a solution for performance; there are performance problems with aggregations, etc. If they clarify that they only want x years of data, then you should look at this and create a sliding-window process to switch old partitions out to an archive table (or abandon them altogether).

    Why?

    It is my understanding that with partitioning, a T-SQL query may skip some partitions altogether, so the query won't scan the whole table.

    worth a read....http://www.simple-talk.com/sql/database-administration/gail-shaws-sql-server-howlers/

    Thanks!

    Really interesting, as are most, if not all, of Gail's articles. 😉 But I guess it does not answer my question, or it's a little bit off topic.

    What I really want to know is which one is the best approach for my particular problem. Based on the comments so far, it looks like moving some of the old data may be the best option.

    Also, I think I missed an important point: I cannot optimize this further with indexes. This is a data warehouse where some queries are ad hoc. I can't be more specific because some queries are created on the fly.

    I just need to divide or split the table while keeping the logic of the existing stored procedures intact.

  • By the way,

    The table in question has a little more than 20 million rows. It's 2.7GB of data, with 4GB in indexes.

    🙂

  • To answer the immediate question, I would recommend partitioning. We have data warehouses for several clients, and this solution works great for us.

    In my opinion, the bigger question is why 20 million records are consuming 2.7GB. You might want to review the data types of the columns used and/or consider using compression for your indexes.
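
    For example, page compression on the existing indexes is sometimes a quick win on warehouse-style tables (table and index names are placeholders; note that data compression requires Enterprise Edition on the SQL Server versions of this era):

        -- estimate the saving first
        EXEC sp_estimate_data_compression_savings
            @schema_name = 'dbo',
            @object_name = 'BigTable',
            @index_id = NULL,
            @partition_number = NULL,
            @data_compression = 'PAGE';

        -- then rebuild the index with compression enabled
        ALTER INDEX CIX_BigTable ON dbo.BigTable
            REBUILD WITH (DATA_COMPRESSION = PAGE);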

  • john 50727 (8/20/2012)


    To answer the immediate question, I would recommend partitioning. We have data warehouses for several clients, and this solution works great for us.

    In my opinion, the bigger question is why 20 million records are consuming 2.7GB. You might want to review the data types of the columns used and/or consider using compression for your indexes.

    Thanks, I'll think about it.

    The main reason for a table this big is the lack of normalization. But like I said before, in the short term, normalizing it or dividing the workload across LUNs is not possible. I fully understand that I may not be addressing the root cause, but it is what it is.
