Why such a big log size?

  • I have a datamart with one main fact table that holds about 45 GB of data and 9 GB of indexes.

    It contains 220 million rows.

    Within one MERGE statement, we inserted/updated/deleted 80 million rows, resulting in a net gain of only one million rows.

    This single statement used 135 GB of log space???
    I understand the log keeps the pages before and after an update, so I could understand the log being at most twice the size of that table (with its indexes)?

    While I watched the log grow, I noticed that it went steadily up to 84 GB, stayed there for quite a while, and finally reached 135 GB.

    Is there a rational explanation for this?

    Thanks

    Eric

    PS: Further details if that is relevant
    - the MERGE compares a "Staging" version of that fact table to the final version (about 220 million rows on each side)
    - there is no trigger on the fact table
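
    For reference, a minimal sketch of one way to watch the log usage while such a statement runs (DBCC SQLPERF works on any version; the DMV needs SQL Server 2012 or later):

    DBCC SQLPERF(LOGSPACE);                        -- log size (MB) and percent used, per database

    SELECT total_log_size_in_bytes / 1048576.0 AS TotalLogMB,
           used_log_space_in_bytes / 1048576.0 AS UsedLogMB,
           used_log_space_in_percent
    FROM   sys.dm_db_log_space_usage;              -- current database only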

  • How many indexes and how big are they?  All the updates/inserts/deletes also have to happen to each of the indexes as well as the table itself.  These are also logged.
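
    For example, something along these lines would list the size of each index (a sketch only; the table name is taken from the DDL posted further down in this thread):

    SELECT i.name                               AS IndexName,
           i.type_desc                          AS IndexType,
           SUM(ps.used_page_count) * 8 / 1024.0 AS UsedMB
    FROM   sys.dm_db_partition_stats AS ps
    JOIN   sys.indexes               AS i
           ON  i.object_id = ps.object_id
           AND i.index_id  = ps.index_id
    WHERE  ps.object_id = OBJECT_ID(N'dpretailer.factDPRetailerMonthly')
    GROUP BY i.name, i.type_desc;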

  • To add to what Lynn correctly stated, there's a hidden overhead that most people don't even know about: the effect of "Bad" page splits on the log file.  When one of those occurs, roughly half (normally) of the data on the page is moved to a new page.  Every single row that is moved is fully logged.  The smaller the rows, the more rows there are per page, and the more logging must be done.

    There's also a super nasty kind of "Bad" page split that isn't as rare as people may think, and those are "skewed" page splits.  This is where the initial page split doesn't satisfy the need to add a row to either page, and more pages are split in the process.  Normal "Bad" page splits can easily take 4.5 times or more the number of bytes logged, and "skewed" page splits can easily take 44 times more log file space.  Paul Randal over at SQLSkills.com has a wonderful article on the hidden cost of "Bad" page splits.  I don't have the link handy, though, so you'll need to look for it.
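
    A rough way to see how many of those mid-page splits the log has recorded is something along these lines (a sketch based on the technique Paul Randal describes; fn_dblog is undocumented, so names may vary by version, and it only sees the active portion of the log):

    SELECT [AllocUnitName],
           COUNT(*) AS MidPageSplits            -- LOP_DELETE_SPLIT is logged when rows are removed from the splitting page
    FROM   fn_dblog(NULL, NULL)
    WHERE  [Operation] = N'LOP_DELETE_SPLIT'
    GROUP BY [AllocUnitName]
    ORDER BY MidPageSplits DESC;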

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There is a Clustered Index on a composite Primary Key, plus a non-unique index on a timestamp (DATETIME) column

  • Eric Mamet - Friday, September 14, 2018 7:41 AM

    There is a Clustered Index on a composite Primary Key, plus a non-unique index on a timestamp (DATETIME) column

    Mind telling us what the columns are for that Primary Key?  It's a bit important to this process. 😉

    --Jeff Moden



  • Also, the information in the transaction log is not stored the way it is in the data files.  Comparing the size of the transaction log to the database files themselves is really an apples-and-oranges comparison.

  • I was "assuming" pages are entirely copied to the log (before and after)

    For the columns, here it is

    CREATE TABLE ... (
        [FactDPRetailerMonthlyId]  INT              NOT NULL,
        [CountryKey]               INT              NOT NULL,
        [ReportingProductGroupKey] INT              NOT NULL,
        [ReportingChannelKey]      SMALLINT         NOT NULL,
        [DistributionTypeKey]      SMALLINT         NOT NULL,
        [BrandKey]                 INT              NOT NULL,
        [PeriodKey]                INT              NOT NULL,
        [RawUnitsSold]             DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueEUR]         DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueNSEEUR]      DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueUSD]         DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueNSEUSD]      DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueLocal]       DECIMAL (19, 4)  NOT NULL,
        [RawSalesValueNSELocal]    DECIMAL (19, 4)  NOT NULL,
        [POSDataQuality]           TINYINT          NOT NULL,
        [Coverage]                 DECIMAL (18, 15) NOT NULL,
        [CoverageDataQuality]      TINYINT          NOT NULL,
        [PriceClassLocalKey]       INT              NOT NULL,
        [SmallTransactionId]       INT              NOT NULL,
        [CreatedOn]                DATETIME         CONSTRAINT [DF_factDPRetailerMonthly_CreatedOn] DEFAULT (getdate()) NOT NULL,
        [CreatedBy]                NVARCHAR (100)   CONSTRAINT [DF_factDPRetailerMonthly_CreatedBy] DEFAULT (suser_sname()) NOT NULL,
        [ForcedBySAFE]             BIT              NULL,
        CONSTRAINT [PK_DPRetailer.DPRetailerDatastoreMonthly] PRIMARY KEY CLUSTERED
            ([CountryKey] ASC, [ReportingProductGroupKey] ASC, [PeriodKey] ASC,
             [ReportingChannelKey] ASC, [DistributionTypeKey] ASC, [BrandKey] ASC,
             [PriceClassLocalKey] ASC),
        CONSTRAINT [FK_factDPRetailerMonthly_BrandKey] FOREIGN KEY ([BrandKey]) REFERENCES [dpretailer].[dimDPRetailerBrand] ([BrandId]),
        CONSTRAINT [FK_factDPRetailerMonthly_CountryKey] FOREIGN KEY ([CountryKey]) REFERENCES [dpretailer].[dimDPRetailerCountry] ([CountryId]),
        CONSTRAINT [FK_factDPRetailerMonthly_DistributionTypeKey] FOREIGN KEY ([DistributionTypeKey]) REFERENCES [shared].[dimDistributionType] ([DistributionTypeId]),
        CONSTRAINT [FK_factDPRetailerMonthly_PeriodKey] FOREIGN KEY ([PeriodKey]) REFERENCES [shared].[dimPeriod] ([PeriodId]),
        CONSTRAINT [FK_factDPRetailerMonthly_ReportingChannelKey] FOREIGN KEY ([ReportingChannelKey]) REFERENCES [shared].[dimReportingChannel] ([ReportingChannelId]),
        CONSTRAINT [FK_factDPRetailerMonthly_ReportingProductGroupKey] FOREIGN KEY ([ReportingProductGroupKey]) REFERENCES [dpretailer].[dimDPRetailerReportingProductGroup] ([ReportingProductGroupId])
    );

    GO
    CREATE INDEX IX_DPRetailerDatastoreMonthly_CreatedOn ON dpretailer.factDPRetailerMonthly(CreatedOn);

  • You could try running this in the database:


    select * from fn_dblog(null,null);
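
    If the raw output is too much to wade through, a hedged variation that summarises it per operation and allocation unit can be easier to read (fn_dblog is undocumented, so the column names may differ between versions):

    SELECT [Operation],
           [AllocUnitName],
           COUNT(*)                 AS LogRecords,
           SUM([Log Record Length]) AS TotalLogBytes
    FROM   fn_dblog(NULL, NULL)
    GROUP BY [Operation], [AllocUnitName]
    ORDER BY TotalLogBytes DESC;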

  • I'll do that next week because I have just created another log file (on a different drive) and shrank the original (which I'll delete when it becomes inactive).

  • "Unfortunately", we don't seem to have much updates going through the system now...

    Would the output still be useful?

    Lynn Pettis - Friday, September 14, 2018 9:07 AM

    You could try running this in the database:


    select * from fn_dblog(null,null);

    Regards

    Eric

  • Eric Mamet - Monday, September 17, 2018 5:03 AM

    "Unfortunately", we don't seem to have much updates going through the system now...

    Would the output still be useful?

    Lynn Pettis - Friday, September 14, 2018 9:07 AM

    You could try running this in the database:


    select * from fn_dblog(null,null);

    Regards

    Eric

    I just wanted you to see what information is in the transaction log.  It gives you an idea of why it may get bigger than you expect.

  • MERGE is notorious for some performance issues. You might want to try "UPSERT" -- an equivalent UPDATE followed by an INSERT.

    For overall performance, make sure you have the best clustering key.  Nothing at all wrong with a multi-column PK, but that does not necessarily mean it should be the clustering key.

    What column(s) do you (almost) always specify when querying the table?
    CountryKey or ( CountryKey, ReportingProductGroupKey ) or
      ( CountryKey, ReportingProductGroupKey, [PeriodKey] )
    OR
    CreatedOn?

    If it's CreatedOn, cluster on that, followed by the PK column(s) frequently used for lookups.  You could still keep a separate non-clustered PK with the same columns if you wanted/needed to.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Monday, September 17, 2018 3:41 PM

    MERGE is notorious for some performance issues. You might want to try "UPSERT" -- an equivalent UPDATE followed by an INSERT.

    For overall performance, make sure you have the best clustering key.  Nothing at all wrong with a multi-column PK, but that does not necessarily mean it should be the clustering key.

    What column(s) do you (almost) always specify when querying the table?
    CountryKey or ( CountryKey, ReportingProductGroupKey ) or
      ( CountryKey, ReportingProductGroupKey, [PeriodKey] )
    OR
    CreatedOn?

    If it's CreatedOn, cluster on that, followed by the PK column(s) frequently used for lookups.  You could still keep a separate non-clustered PK with the same columns if you wanted/needed to.

    The growth of the log usage in two distinct steps made me think that it was indeed the UPSERT aspect.

    I think the Clustered index is fine with regard to our requirements, and yes, we could split the merge into its constituent statements (including the DELETE), as sketched at the end of this post.
    Not only could it help with speed, but it would also mean three transactions instead of one (which is good in this case, since we use the Simple recovery model and can take care of integrity ourselves).

    Thanks

    Eric

    PS: for us, the speed is less of an issue than the potential size of the log
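
    For reference, a minimal sketch of that split, assuming a staging table called dpretailer.Staging_factDPRetailerMonthly with the same columns as the fact table (the staging table name is an assumption, and only two measure columns are shown in the UPDATE; the others follow the same pattern):

    -- 1) UPDATE rows present on both sides (remaining measure columns follow the same pattern)
    UPDATE f
    SET    f.RawUnitsSold     = s.RawUnitsSold,
           f.RawSalesValueEUR = s.RawSalesValueEUR
    FROM   dpretailer.factDPRetailerMonthly         AS f
    JOIN   dpretailer.Staging_factDPRetailerMonthly AS s
           ON  s.CountryKey               = f.CountryKey
           AND s.ReportingProductGroupKey = f.ReportingProductGroupKey
           AND s.PeriodKey                = f.PeriodKey
           AND s.ReportingChannelKey      = f.ReportingChannelKey
           AND s.DistributionTypeKey      = f.DistributionTypeKey
           AND s.BrandKey                 = f.BrandKey
           AND s.PriceClassLocalKey       = f.PriceClassLocalKey;

    -- 2) INSERT rows present only in Staging (CreatedOn/CreatedBy rely on their defaults)
    INSERT dpretailer.factDPRetailerMonthly
           (FactDPRetailerMonthlyId, CountryKey, ReportingProductGroupKey, ReportingChannelKey,
            DistributionTypeKey, BrandKey, PeriodKey, RawUnitsSold, RawSalesValueEUR,
            RawSalesValueNSEEUR, RawSalesValueUSD, RawSalesValueNSEUSD, RawSalesValueLocal,
            RawSalesValueNSELocal, POSDataQuality, Coverage, CoverageDataQuality,
            PriceClassLocalKey, SmallTransactionId, ForcedBySAFE)
    SELECT s.FactDPRetailerMonthlyId, s.CountryKey, s.ReportingProductGroupKey, s.ReportingChannelKey,
           s.DistributionTypeKey, s.BrandKey, s.PeriodKey, s.RawUnitsSold, s.RawSalesValueEUR,
           s.RawSalesValueNSEEUR, s.RawSalesValueUSD, s.RawSalesValueNSEUSD, s.RawSalesValueLocal,
           s.RawSalesValueNSELocal, s.POSDataQuality, s.Coverage, s.CoverageDataQuality,
           s.PriceClassLocalKey, s.SmallTransactionId, s.ForcedBySAFE
    FROM   dpretailer.Staging_factDPRetailerMonthly AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dpretailer.factDPRetailerMonthly AS f
                       WHERE  f.CountryKey               = s.CountryKey
                       AND    f.ReportingProductGroupKey = s.ReportingProductGroupKey
                       AND    f.PeriodKey                = s.PeriodKey
                       AND    f.ReportingChannelKey      = s.ReportingChannelKey
                       AND    f.DistributionTypeKey      = s.DistributionTypeKey
                       AND    f.BrandKey                 = s.BrandKey
                       AND    f.PriceClassLocalKey       = s.PriceClassLocalKey);

    -- 3) DELETE rows that no longer exist in Staging
    DELETE f
    FROM   dpretailer.factDPRetailerMonthly AS f
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dpretailer.Staging_factDPRetailerMonthly AS s
                       WHERE  s.CountryKey               = f.CountryKey
                       AND    s.ReportingProductGroupKey = f.ReportingProductGroupKey
                       AND    s.PeriodKey                = f.PeriodKey
                       AND    s.ReportingChannelKey      = f.ReportingChannelKey
                       AND    s.DistributionTypeKey      = f.DistributionTypeKey
                       AND    s.BrandKey                 = f.BrandKey
                       AND    s.PriceClassLocalKey       = f.PriceClassLocalKey);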
