Is there any obvious performance difference between updating 1 column in a 4 column wide table vs 1 columns in a 1000 column wide table ?

  • I have a table with 4 columns and 10 million rows in which I want to update 1 column
    I have another table with 100 columns and 10 million rows in which again, I want to update 1 column.

    Which one among these 2 would perform better.

  • How many pages are in each table?

  • koustav_1982 - Thursday, December 14, 2017 12:53 PM

    I have a table with 4 columns and 10 million rows in which I want to update 1 column
    I have another table with 100 columns and 10 million rows in which again, I want to update 1 column.

    Which one among these 2 would perform better.

    Are you updating the 10 million rows?
    If the update is only to a few rows, the performance shouldn't be much different, but I suggest that you still test.
    One thing that might affect and it's more probable in a wider table is the presence of multiple indexes that contain the updated column.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • koustav_1982 - Thursday, December 14, 2017 12:53 PM

    I have a table with 4 columns and 10 million rows in which I want to update 1 column
    I have another table with 100 columns and 10 million rows in which again, I want to update 1 column.

    Which one among these 2 would perform better.

    The question I have is why does it matter?  Presumably you're going to update both columns regardless, so does it really matter which one will perform better?

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Pure speculation and more than a couple of assumptions... The chance for a page split is higher when updating the wider table, so over time, accumulated, the update cost is going to be higher for the wider table than for the narrow one.

    However, 100 column table? Are we talking fact table or a "normalized" system? If the latter, U R DOIN IT RONG.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Impossible to say.  It depends primarily on (1) how much free space is available in each page already, (2) as noted already, how many affected indexes there are on each table.

    If the first table had no free space per page, and the second table had a lot, the first would likely be more overhead.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • I believe the answer will be that it takes less time and fewer resources to update X rows in a narrow table than it does to update the same X number of rows in a wide table because the pages being updated do have to be read and a page is the smallest amount of data that can actually be read.  So, the more narrow the table, the more rows you can fit on a page, the fewer the number of pages that will need to be read.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • There's too much speculation and little details with no further replies from the OP.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares - Saturday, December 16, 2017 12:23 PM

    There's too much speculation and little details with no further replies from the OP.

    I don't know about that, ol' friend.  If you stick only with the basics mentioned in the original post, there's no speculation required.

    I DO agree that the OP appears to have abandoned this post, though.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, December 16, 2017 12:35 PM

    Luis Cazares - Saturday, December 16, 2017 12:23 PM

    There's too much speculation and little details with no further replies from the OP.

    I don't know about that, ol' friend.  If you stick only with the basics mentioned in the original post, there's no speculation required.

    I DO agree that the OP appears to have abandoned this post, though.

    I would still like to know how many rows would be updated, if there are indexes using that column, if it's a varying length type, etc.
    I agree with the replies, but there's nothing conclusive. Also, I don't have the intention to create a 100 column table to test. At least, not for someone else.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • All other things being equal it will take longer to update the table with more columns. If only because there is going to be more disk IO in reading the data. Also if a table has 1000 rows it's quite likely that your updates will cause page splits, which would also slow down updates.

  • Jonathan AC Roberts - Sunday, December 17, 2017 10:12 AM

    All other things being equal it will take longer to update the table with more columns. If only because there is going to be more disk IO in reading the data. Also if a table has 1000 rows it's quite likely that your updates will cause page splits, which would also slow down updates.

    It's not just the reads from disk, which will only occur if the data isn't already cached, and testing makes it look like both a narrow and wide table will perform the same when that occurs because people forget about the eventual writes that must occur.  It's the eventual writes to disk that are the real killer even after the table is cached and most people don't measure that or take it into account in their testing.

    I absolutely agree on the page splits becoming a potentially major performance and memory issue if the update causes expansion of data in variable width columns.

    That being said, one good test is worth a thousand of our expert opinions and I'm working on a demonstration where everything is identical in two tables except one table has a CHAR(1000) column to simulate 100 additional columns.  I'll post it all soon.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • To answer the original question... no... there is no OBVIOUS difference.  You have to look for it and then you'll find that there's a huge difference. 😉

    First, here's some test data.  It makes two tables; the first (NarrowTable) only has 3 columns  and the second (WideTable) has those same 3 columns and a 4th CHAR(1000) column meant to simulate having an additional "100 columns".  The 3 similar columns in both tables are populated with exactly the same data.  There is a Clustered Index on one of the columns and it will not be the column that is updated.


    -- DROP TABLE dbo.NarrowTable, dbo.WideTable
    GO
    --=============================================================================
    --      Create the test tables and populate them with identical data.
    --      This takes about 00:01:18 on my laptop to complete.
    --=============================================================================
    --===== We don't need to measure performance for this.
        SET STATISTICS TIME,IO OFF
    ;
    --===== The "NarrowTable" contains only 3 columns and a Clustered Index
         -- to keep things as simple as the original post.
     CREATE TABLE dbo.NarrowTable
            (
             SomeID         BIGINT PRIMARY KEY CLUSTERED
            ,SomeGuid       UNIQUEIDENTIFIER
            ,SomeDT         DATETIME
            )
    ;
    --===== The "WideTable" is schematically identical to the NarrowTable except
         -- it has a CHAR(1000) column to simulate "an additional 100 columns".
     CREATE TABLE dbo.WideTable
            (
             SomeID         BIGINT PRIMARY KEY CLUSTERED
            ,SomeGuid       UNIQUEIDENTIFIER
            ,SomeDT         DATETIME
            ,SomeWideCol    CHAR(1000) DEFAULT 'X'
            )
    ;
    --===== This populates the NarrowTable with some simple data.
     INSERT INTO dbo.NarrowTable
            (SomeID, SomeGuid, SomeDT)
     SELECT TOP 1000000
             SomeID     = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            ,SomeGuid   = NEWID()
            ,SomeDT     = GETDATE()
       FROM      sys.all_columns ac1
      CROSS JOIN sys.all_columns ac2
    ;
    --===== This copies the exact same data from the NarrowTable to the WideTable
         -- so that we're doing the exact same test. The only difference here is
         -- the presence of the "additional 100 columns" simulated by the
         -- SomeWideCol column.
     INSERT INTO dbo.WideTable WITH (TABLOCK)
            (SomeID, SomeGuid, SomeDT)
     SELECT SomeID, SomeGuid, SomeDT
       FROM dbo.NarrowTable
      ORDER BY SomeID
     OPTION (RECOMPILE)
    ;
    --===== Flush everything to disk.
    CHECKPOINT
    ;

    Next, we have some test code.  Each table will be tested 3 times, the first of which is absolutely "cold cache" because of all the nasty DBCC stuff used just prior to starting the trilogy run for each table.  DO NOT RUN THIS CODE ON A PRODUCTION BOX BECAUSE IT DROPS ALL DATA FROM MEMORY!!!


    /*****************************************************************************/ 
    /********** DANGER!!! DO NOT RUN THIS CODE ON A PRODUCTION SYSTEM!!! *********/
    /*****************************************************************************/
    --=============================================================================
    --      Test 1: Update the NarrowTable
    --=============================================================================
    --===== Identify the test in the Messages tab.
        SET STATISTICS TIME,IO OFF;
      PRINT REPLICATE('=',80);
      PRINT SPACE(8)+'Test 1: Update NarrowTable';
      PRINT REPLICATE('=',80);

    --===== Clear all data and cache from the system.
         -- DANGER!!! DON'T EVEN THINK OF RUNNING THIS ON A PRODUCTION BOX!!!!
        SET STATISTICS TIME,IO OFF;
            CHECKPOINT;
       DBCC FREESYSTEMCACHE ('ALL');
       DBCC FREESESSIONCACHE;
       DBCC FREEPROCCACHE;
       DBCC DROPCLEANBUFFERS;
    WAITFOR DELAY '0:00:10' -- Wait for everything to "calm down".
        SET STATISTICS TIME,IO ON;
    GO
     PRINT REPLICATE('-',80)   
    DECLARE @Now DATETIME;
     SELECT @Now = GETDATE();
     UPDATE dbo.NarrowTable
        SET SomeDT = @Now;
    CHECKPOINT;
    GO 3
    /*****************************************************************************/ 
    /********** DANGER!!! DO NOT RUN THIS CODE ON A PRODUCTION SYSTEM!!! *********/
    /*****************************************************************************/
    --=============================================================================
    --      Test 2: Update the WideTable
    --=============================================================================
    --===== Identify the test in the Messages tab.
        SET STATISTICS TIME,IO OFF;
      PRINT REPLICATE('=',80);
      PRINT SPACE(8)+'Test 2: Update WideTable';
      PRINT REPLICATE('=',80);

    --===== Clear all data and cache from the system.
         -- DANGER!!! DON'T EVEN THINK OF RUNNING THIS ON A PRODUCTION BOX!!!!
        SET STATISTICS TIME,IO OFF;
            CHECKPOINT;
       DBCC FREESYSTEMCACHE ('ALL');
       DBCC FREESESSIONCACHE;
       DBCC FREEPROCCACHE;
       DBCC DROPCLEANBUFFERS;
    WAITFOR DELAY '0:00:10' -- Wait for everything to "calm down".
        SET STATISTICS TIME,IO ON;
    GO
     PRINT REPLICATE('-',80)   
    DECLARE @Now DATETIME;
     SELECT @Now = GETDATE();
     UPDATE dbo.WideTable
        SET SomeDT = @Now;
    CHECKPOINT;
    GO 3
    --===== End of test
        SET STATISTICS TIME,IO OFF;

    I've summarized the results in the following Excel spreadsheet.

    

    If you look at the Light Blue cells, you can see there is some little difference on the million row update for CPU usage even if you look even if you look at the first row.  Nothing obvious there.

    If you look at the Light Green cells, the two tables take turns slightly beating each other on the 2nd and 3rd run after the data has been cached but the first run takes significantly more time on the first run for the wide table because it has to load so much data.  Remember that for "Row Store" data, whole pages must be read, not just the column data you want to work with.  Still, nothing real obvious because you expect the data to be loaded in cache and used many times.

    That's were most people stop looking.

    The most obvious thing that a lot of people forget is that whole pages must be read.  Since we're updating the whole table with the test code, look at the (MB) that must be loaded into memory.  40MB compared to 1.118GB is a huge difference.  So there's difference #1.

    Because it usually occurs in the background and isn't usually measured in these types of tests, most people totally forget another measurement.  They forget that all those pages you modified, even the 100+ columns that you didn't touch, still have to be written to disk (Flushed) sometime.  If you look at the Orange cells, you'll see there's quite a difference between the amount of CPU time it takes to Flush the Narrow and Wide tables but it's still relatively trivial considering that we're working with a million rows here.  The biggest difference can be found in the Yellow cells... that the duration it takes to write all those pages to disk.  And, not listed on the spreadsheet is that all those "Reads" on the first run become "Writes" on the update runs.  That makes for a couple more no-so-obvious differences.

    Overall, it means that updating all of the rows in the WideTable takes about 9 or 10 times longer on "warm cache" runs and about 12 times longer on a "cold cache" run.

    That doesn't even include any transaction log file measurements. 😉

    Running the same tests without the Clustered Index PKs resulted in very similar run times and resource usage.

    Despite the fact that we've not explored every possibility, I'm willing to suggest that the more columns you have in your table, even for a single column update in a row store, it will make a huge difference in resource usage and overall duration virtually all the time.  I'm also not one to justify things based on possibly low row counts alone.  The same ratio of resource usage and duration will exist and I can't see wasting either just because there's a low row count if I don't need to.  As Granny used to say, "Mind the pennies and the dollars will take care of themselves". 😉 

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Stupid new forum software doesn't allow expansion of the graphic I posted... at least not for me.  Here's something a little larger that I copied from Excel to Word and then copied from that to here. 

    Measurement

    Logical Reads

    Physical Reads

    Read Aheads

    Update
    CPU ms

    Update Duration ms

    Flush
    CPU ms

    Flush
    Duration ms

    Sum
    CPU ms

    Sum
    Duration ms

    Table

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    Narrow

    Wide

    1st Run

        5,102

       143,161

    21

    363

    5,093

    143,154

    967

    1,186

    2,356

    16,925

    0

    281

    776

    19,837

    967

    1,467

    3,132

    36,762

    2nd Run

        5,102

       143,161

    0

    0

    0

    0

    905

    1,030

    1,472

    1,273

    16

    265

    642

    18,969

    921

    1,295

    2,114

    20,242

    3rd Run

        5,102

       143,161

    0

    0

    0

    0

    920

    1,076

    1,255

    1,374

    16

    203

    729

    20,969

    936

    1,279

    1,984

    22,343

    Average

        5,102

       143,161

     

     

     

     

    931

    1,097

    1,694

    6,524

    11

    250

    716

    19,925

    941

    1,347

    2,410

    26,449

    (MB)

              40

           1,118

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi All

    Thanks for your replies and thanks Jeff Moden for the extensive research.

    Yes the question was related to a fact table design. 

    The background :
    In Data warehousing , in the fact table, revisions of the same record keeps getting posted , with minor changes in a few columns. Say , an order fact record has a quantity of 10 initially and then with the next revision it was updated to 20.
    We have 2 options here , 1 . to  update the same record , and 2 . to insert a new record which will save the update cost , but adds an additional cost to update the older revision with a flag - marking the record is no longer valid or to put an expiration date. 

    The first approach :
    To update , we can update the column that has changed - this adds to the cost of comparing all the columns in a fact table with the incoming record and find out what changed or update all the columns using the incoming record - both are costly operations

    The second approach : 

    To update a flag column, we have to search for an existing record in the fact table , and then update the flag / expiration date for that column. - This is what we are doing now and which is where the question comes.

    If we separate the PK and the flag to a separate table, we still have to go through the searching section to see if a record already exists, but then we come back and update a smaller table with the flag value.
    The downside is that queries on the fact table also has to join this outrigger table at run time.

Viewing 15 posts - 1 through 15 (of 27 total)

You must be logged in to reply to this topic. Login to reply