Partitioning a Huge Table (about 650 GB)

  • Hi all,

    In my data warehouse (SQL Server 2016 Std) I need to maintain 36 (!!) versions of a table so that data can be queried for different periods of the year.
    Currently I have only one version, and the table is about 18 GB (so 18 * 36 = 648 GB).
    To achieve this, I'm thinking of creating 36 filegroups and 36 partitions (one partition per filegroup), using the table's version number in my partition function.
    On top of that, I plan to develop a FIFO procedure to move data from one partition to another each time I load a new version and delete the oldest one (a rough sketch of what I have in mind is below).
    What do you think of this strategy, and what should I do, in terms of partitions, filegroups, etc., to keep queries against this huge table performing well?
    At the moment the data warehouse is about 60 GB on a single filegroup, on a SAS disk with SSD cache, and performance has been quite satisfactory.
    I would appreciate any advice.
    Many thanks in advance
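
    To make the idea concrete, this is roughly the layout I have in mind (made-up names; shown for 4 versions instead of 36 and kept on PRIMARY rather than 36 separate filegroups, just to keep the sketch short):

    -- Rough sketch (made-up names; 4 versions instead of 36, PRIMARY instead of
    -- one filegroup per partition).
    CREATE PARTITION FUNCTION pf_TableVersion (int)
    AS RANGE RIGHT FOR VALUES (2, 3, 4);   -- 4 partitions: <2, [2,3), [3,4), >=4

    CREATE PARTITION SCHEME ps_TableVersion
    AS PARTITION pf_TableVersion ALL TO ([PRIMARY]);

    CREATE TABLE dbo.FactSnapshot
    (
        VersionNumber int           NOT NULL,
        MeasureKey    bigint        NOT NULL,
        MeasureValue  decimal(18,2) NULL,
        CONSTRAINT PK_FactSnapshot
            PRIMARY KEY CLUSTERED (VersionNumber, MeasureKey)
    ) ON ps_TableVersion (VersionNumber);

    -- FIFO rotation: when a new version is loaded, empty the oldest partition.
    -- SQL Server 2016 can truncate a single partition.
    TRUNCATE TABLE dbo.FactSnapshot WITH (PARTITIONS (1));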

  • If I understand you correctly, you have one version of an 18GB table, but in some fashion need 36 versions of it?   As that's a pretty sizable increase in disk usage, and you haven't said anything about what the different versions of this table are going to be based on, we're going to need a LOT more detail to be able to help much.   First, what will be different between the various versions of your table?   Why do you need 36 different versions?   Does date play a role in any of this?   What does this table contain that you need 36 different versions of it?   I could probably ask a LOT more questions, but if we start with these, we can move on once these have been answered.   Try to describe your overall objective in more detail.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If you have 36 tables all with the same schema, instead of moving all that data into another table, why don't you look at using a partitioned view (a view over all the tables)?  Done properly it could provide you the same benefits as a partitioned table.
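
    Roughly like this (made-up names, and only 2 of the 36 member tables shown); the CHECK constraint on each member table is what lets the optimizer skip the tables that can't contain the requested version:

    -- Sketch only: one member table per snapshot version.
    CREATE TABLE dbo.FactSnapshot_V01
    (
        VersionNumber int           NOT NULL CONSTRAINT CK_FactSnapshot_V01 CHECK (VersionNumber = 1),
        MeasureKey    bigint        NOT NULL,
        MeasureValue  decimal(18,2) NULL,
        CONSTRAINT PK_FactSnapshot_V01 PRIMARY KEY CLUSTERED (VersionNumber, MeasureKey)
    );

    CREATE TABLE dbo.FactSnapshot_V02
    (
        VersionNumber int           NOT NULL CONSTRAINT CK_FactSnapshot_V02 CHECK (VersionNumber = 2),
        MeasureKey    bigint        NOT NULL,
        MeasureValue  decimal(18,2) NULL,
        CONSTRAINT PK_FactSnapshot_V02 PRIMARY KEY CLUSTERED (VersionNumber, MeasureKey)
    );
    GO

    CREATE VIEW dbo.vFactSnapshot
    AS
    SELECT VersionNumber, MeasureKey, MeasureValue FROM dbo.FactSnapshot_V01
    UNION ALL
    SELECT VersionNumber, MeasureKey, MeasureValue FROM dbo.FactSnapshot_V02;
    GO

    -- Thanks to the CHECK constraints, this reads only FactSnapshot_V02.
    SELECT MeasureKey, MeasureValue
    FROM dbo.vFactSnapshot
    WHERE VersionNumber = 2;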

  • sgmunson - Tuesday, August 7, 2018 7:29 AM

    If I understand you correctly, you have one version of an 18GB table, but in some fashion need 36 versions of it?   As that's a pretty sizable increase in disk usage, and you haven't said anything about what the different versions of this table are going to be based on, we're going to need a LOT more detail to be able to help much.   First, what will be different between the various versions of your table?   Why do you need 36 different versions?   Does date play a role in any of this?   What does this table contain that you need 36 different versions of it?   I could probably ask a LOT more questions, but if we start with these, we can move on once these have been answered.   Try to describe your overall objective in more detail.

    Hi,
    each version is a snapshot of the table taken every Monday over the last 9 months.
    Why? It's a business requirement. I share your doubts, but the business requires these 36 snapshots so it can compare the same measures across different periods of the year.

  • Lynn Pettis - Tuesday, August 7, 2018 8:02 AM

    If you have 36 tables all with the same schema, instead of moving all that data into another table, why don't you look at using a partitioned view (a view over all the tables)?  Done properly it could provide you the same benefits as a partitioned table.

    Hi,

    I'm also considering this solution, which is why I'm writing here 😀
    I'm looking for some experience in this area to identify the best (or the least bad) solution.

  • Rosanero4Ever - Tuesday, August 7, 2018 8:21 AM

    Hi,
    each version is a snapshot of the table taken every Monday over the last 9 months.
    Why? It's a business requirement. I share your doubts, but the business requires these 36 snapshots so it can compare the same measures across different periods of the year.

    Rosanero4Ever - Tuesday, August 7, 2018 8:23 AM

    Hi,

    I'm also considering this solution, which is why I'm writing here 😀
    I'm looking for some experience in this area to identify the best (or the least bad) solution.

    If all you are doing is capturing a snapshot of the table at a particular point in time, why not have a trigger that copies every change as a new record into a new data warehouse table?   That table would initially be populated with the existing data, and every change would arrive stamped with a date/time changed, so that the new table's PK is the existing PK plus the date/time-changed column (which may or may not need to be new).   You would also add a "change type" column to delineate what kind of change was made to the row.   Overall, this would slow growth down to only those rows that actually change, and would give you a relatively easy way to determine what the table looked like at any given point in time.   Mind you, archiving is more complicated, because you can only purge rows once they have been deleted for at least as long as your overall maximum retention period from the snapshot point of view.
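
    Loosely sketched (made-up names and a single measure column for brevity), the idea would look something like this:

    -- Sketch only: history table keyed on the source PK plus the change timestamp.
    CREATE TABLE dbo.SourceTable
    (
        SourceId    bigint        NOT NULL PRIMARY KEY,
        SomeMeasure decimal(18,2) NULL
    );

    CREATE TABLE dbo.SourceTable_History
    (
        SourceId    bigint        NOT NULL,
        ChangedAt   datetime2(7)  NOT NULL,
        ChangeType  char(1)       NOT NULL,   -- I = insert, U = update, D = delete
        SomeMeasure decimal(18,2) NULL,
        CONSTRAINT PK_SourceTable_History PRIMARY KEY CLUSTERED (SourceId, ChangedAt)
    );
    GO

    CREATE TRIGGER trg_SourceTable_Audit
    ON dbo.SourceTable
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Inserts and updates: capture the new row image.
        INSERT dbo.SourceTable_History (SourceId, ChangedAt, ChangeType, SomeMeasure)
        SELECT i.SourceId, SYSDATETIME(),
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END,
               i.SomeMeasure
        FROM inserted AS i;

        -- Deletes: capture the removed row image.
        INSERT dbo.SourceTable_History (SourceId, ChangedAt, ChangeType, SomeMeasure)
        SELECT d.SourceId, SYSDATETIME(), 'D', d.SomeMeasure
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.SourceId = d.SourceId);
    END;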

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Tuesday, August 7, 2018 9:48 AM

    If all you are doing is capturing a snapshot of the table at a particular point in time, why not have a trigger that copies every change as a new record into a new data warehouse table?   That table would initially be populated with the existing data, and every change would arrive stamped with a date/time changed, so that the new table's PK is the existing PK plus the date/time-changed column (which may or may not need to be new).   You would also add a "change type" column to delineate what kind of change was made to the row.   Overall, this would slow growth down to only those rows that actually change, and would give you a relatively easy way to determine what the table looked like at any given point in time.   Mind you, archiving is more complicated, because you can only purge rows once they have been deleted for at least as long as your overall maximum retention period from the snapshot point of view.

    I don't want a "changes table" but 36 distinct snapshots, even if the values are the same.
    Moreover, I cannot add a trigger to the operational database to catch changes to each record. :doze:

  • Rosanero4Ever - Tuesday, August 7, 2018 9:57 AM

    I don't want a "changes table" but 36 distinct snapshots, even if the values are the same.
    Moreover, I cannot add a trigger to the operational database to catch changes to each record. :doze:

    Good grief, Charlie Brown!   Why don't they just tie you to a post, with one hand behind your back, and leave the keyboard and computer 500 feet away...   and then ask you to do your work.    How, exactly, were you planning on taking your snapshots?   If the table is 18GB, then surely that would take considerably more resources than having a trigger on the table?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Actually, your ETL process should be handling the insertion of data into the data warehouse so that you have the historical data.  In fact, with SQL Server 2016, Microsoft introduced temporal tables that could help you with this.  Of course moving to this type of table may be problematic with the current data, but might be worth the effort.
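
    For reference, a minimal system-versioned (temporal) table looks like this (made-up names); once it's in place, SQL Server maintains the history rows automatically on every update and delete:

    CREATE TABLE dbo.FactCurrent
    (
        MeasureKey   bigint        NOT NULL PRIMARY KEY,
        MeasureValue decimal(18,2) NULL,
        ValidFrom    datetime2(3) GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo      datetime2(3) GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.FactCurrent_History));

    -- The table as it looked at any past point in time, e.g. a previous Monday:
    SELECT MeasureKey, MeasureValue
    FROM dbo.FactCurrent
    FOR SYSTEM_TIME AS OF '2018-06-04T00:00:00';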

  • +1 for Temporal Tables as a great solution to this kind of problem, and +10 if you're on 2017 and can just set a retention policy. It's a lot easier to work with and much more efficient in terms of storage, unless there are high volumes of changes between the periods when you would normally take snapshots (in which case snapshots are usually missing too much anyway).
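
    On 2017 the retention policy is just a table option; assuming the hypothetical dbo.FactCurrent temporal table sketched above, and the 9 months of history mentioned earlier:

    -- SQL Server 2017+ (and Azure SQL DB): history older than 9 months is
    -- cleaned up automatically.
    ALTER TABLE dbo.FactCurrent
    SET (SYSTEM_VERSIONING = ON (HISTORY_RETENTION_PERIOD = 9 MONTHS));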

  • andycadley - Tuesday, August 7, 2018 11:30 AM

    +1 for Temporal Tables as a great solution to this kind of problem, and +10 if you're on 2017 and can just set a retention policy. It's a lot easier to work with and much more efficient in terms of storage, unless there are high volumes of changes between the periods when you would normally take snapshots (in which case snapshots are usually missing too much anyway).

    Just curious, Andy... what's your experience with wide tables that have about 1 row per page (more than 7k bytes per row... I know... not my idea) and more than 100 columns per row with temporal tables and, after the original insert, you only ever update just 4 columns?

    I've not used temporal tables because they do full row audits and they add columns.  For such wide rows with so few columns being updated, I would think that would be a huge waste of disk space.

    As a bit of a sidebar, I built a nasty fast audit-trigger generator to do columnar audits rather than row audits and, no... I didn't use SQLCLR to do it.  In fact, I replaced the SQLCLR triggers that were there because they were terribly slow. 😀
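
    Just to show the shape of a columnar audit (made-up names, a single audited column, and nothing like the real generated code):

    -- Made-up example: only the columns that can actually change are compared,
    -- and only real changes are written, one narrow audit row per column.
    CREATE TABLE dbo.WideTable
    (
        SourceId bigint       NOT NULL PRIMARY KEY,
        Status   nvarchar(50) NULL
        -- ...plus the ~100 mostly static, wide columns
    );

    CREATE TABLE dbo.ColumnAudit
    (
        AuditId    bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        SourceId   bigint         NOT NULL,
        ColumnName sysname        NOT NULL,
        OldValue   nvarchar(4000) NULL,
        NewValue   nvarchar(4000) NULL,
        ChangedAt  datetime2(3)   NOT NULL DEFAULT SYSDATETIME()
    );
    GO

    CREATE TRIGGER trg_WideTable_ColumnAudit
    ON dbo.WideTable
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- One block like this per audited column (what a generator would emit).
        INSERT dbo.ColumnAudit (SourceId, ColumnName, OldValue, NewValue)
        SELECT i.SourceId, N'Status', d.Status, i.Status
        FROM inserted AS i
        JOIN deleted  AS d ON d.SourceId = i.SourceId
        WHERE EXISTS (SELECT i.Status EXCEPT SELECT d.Status);  -- NULL-safe "changed" test
    END;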

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Tuesday, August 7, 2018 7:35 PM

    Just curious, Andy... what's your experience with wide tables that have about 1 row per page (more than 7k bytes per row... I know... not my idea) and more than 100 columns per row with temporal tables and, after the original insert, you only ever update just 4 columns?

    I've not used temporal tables because they do full row audits and they add columns.  For such wide rows with so few columns being updated, I would think that would be a huge waste of disk space.

    As a bit of a sidebar, I built a nasty fast audit-trigger generator to do columnar audits rather than row audits and, no... I didn't use SQLCLR to do it.  In fact, I replaced the SQLCLR triggers that were there because they were terribly slow. 😀

    We have some stupidly wide tables that lean heavily on row overflow to get round the 8k limit (definitely not my idea), but they live in older, darker corners of the system, so historical records there are handled as just literal copies of the whole table contents at a point in time that are post-processed into change records (the data is wide, but relatively short, so it's not too onerous). I think in a situation like the one you're describing I'd be inclined to just split the table vertically, so I end up with one table containing the relatively static data and another containing the stuff that actually changes, kept separate with a 1:1 relationship between the two (see the sketch at the end of this post).

    We do have some columnar audits but they're handled by an ETL process that lives between front and back end systems, where being quick has never been the highest priority. And the resultant data ends up all EAV-like (in an apparently rare moment of somebody deciding to minimize storage) so, while it works for the reporting it is used for, it's fairly horrible for anything else.

    Well written triggers are probably a good solution, but I will openly admit to having a fundamental dislike of triggers in general and they're something I tend to avoid if at all possible - they just have a little too much surprise factor when you're reading other code.
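
    For what it's worth, the 1:1 split I mean is nothing more exotic than this (made-up names):

    -- Made-up sketch of the 1:1 split: the wide, rarely-changing columns stay in
    -- one table; the handful of volatile columns live in a narrow sister table
    -- that is far cheaper to snapshot, audit or system-version.
    CREATE TABLE dbo.OrderStatic
    (
        OrderId      bigint        NOT NULL PRIMARY KEY,
        CustomerName nvarchar(200) NULL
        -- ...plus the rest of the wide, static columns
    );

    CREATE TABLE dbo.OrderVolatile
    (
        OrderId  bigint        NOT NULL PRIMARY KEY
            CONSTRAINT FK_OrderVolatile_OrderStatic
            REFERENCES dbo.OrderStatic (OrderId),
        Status   nvarchar(50)  NULL,
        Quantity decimal(18,2) NULL
    );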

  • Thanks and, yeah, joined at the hip with you on vertically splitting such wide tables into sister tables but, like you say, not my idea on these tables, and common sense and logic have not prevailed on them or on the many totally null fixed-width datatype columns in these tables.  The use of Temporal Tables to audit these tables would kill us for space because the base tables are updated frequently.

    Understood on the feelings about triggers and I'm not one to use them unless they're absolutely necessary, which is usually when nothing else will do.  Then, you've gotta be careful about performance.  Of course, you already know that.

    The people before me installed some easy-to-copy-to-every-table CLR triggers.  OMG they were slow because, in order to determine whether a column had actually changed (one of their requirements was to write only real changes to the audit table), they had to materialize both of the trigger's logical tables.  So updating just 4 columns on 10,000 rows of these wide monsters was taking 4 minutes.  I wrote code to generate hardcoded triggers whenever a column was added to or deleted from the table, and the same amount of work by the hardcoded triggers went down to between 600 and 800 ms.  Still a bit slow because of the number of pages affected by the width of the rows, but not much slower than if the trigger didn't exist (we tested that, too).  Most of the slowness came from FKs, which were an absolute necessity on the base tables.

    Anyway, I'm not a big fan of Temporal Tables mostly because of the wide table problem I currently enjoy.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yeah, I can see how in that situation there's a lot more incentive to work with something that's a lot more granular in terms of storage, and undoing poor design decisions that have become ingrained in a system is often a difficult case to make when it's working as-is. I have one database where one-to-many relationships are implemented as varchar columns containing CSVs of foreign keys - an abomination that I'd dearly love to pick apart - but as terrible as it is to work with, it does what it needs to (mostly).

  • andycadley - Tuesday, August 7, 2018 10:59 PM

    Yeah, I can see how in that situation there's a lot more incentive to work with something that's a lot more granular in terms of storage, and undoing poor design decisions that have become ingrained in a system is often a difficult case to make when it's working as-is. I have one database where one-to-many relationships are implemented as varchar columns containing CSVs of foreign keys - an abomination that I'd dearly love to pick apart - but as terrible as it is to work with, it does what it needs to (mostly).

    So, do you think temporal tables could be useful for my requirements? In summary, I need to build a periodic snapshot fact table (see https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/periodic-snapshot-fact-table/); a quick sketch of what I have in mind follows below.
    Many thanks again
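
    If I understand temporal tables correctly, each Monday's snapshot could then be rebuilt on demand instead of being stored 36 times over, something like this (reusing the made-up dbo.FactCurrent from the earlier sketch):

    DECLARE @Monday datetime2(3) = '2018-06-04';

    -- One Monday's "snapshot", reconstructed from the history instead of
    -- being stored as another full physical copy of the table.
    SELECT @Monday AS SnapshotDate, MeasureKey, MeasureValue
    FROM dbo.FactCurrent
    FOR SYSTEM_TIME AS OF @Monday;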
