New Column Updates

  • Leo Peysakhovich

    Hall of Fame

    Points: 3880

    Lets start from the beginning. [NameKey] [nvarchar](341) NOT NULL.  Looks like the choice of column for the clustered index is very bad. Looks like update/changes to the table are never done by the application but only by the back end processes. It means that the initial physical table creation by DBA brought the problem and made clustered index useless not only for this task but for the many other tasks as well.

  • noeld

    SSC Guru

    Points: 96590

    Am I missing something? ...

    ALTER TABLE MyTable

    ADD NewZip char(10) NULL DEFAULT 'Zipcode' WITH VALUES


    * Noel

  • Charles Kincaid

    SSChampion

    Points: 13593

    Agreed, an alpha field for clustered seems strange.  But then so does 341 for a field length. Sometimes we are not in toal control of the design.  I just hope that this field is generated some way in alpha order.  Otherwise INSERT performance will suck.  And let's not even talk about page splits.

    ATBCharles Kincaid

  • MudLuck

    SSCrazy

    Points: 2271

    Well done author.

    Working with VLDB requires one to step back and take the time to think it out.  While there may be other ways to go about the operation and get simullar times yours is nice for the two reasons you state in your article. 

    Rollbacks are small by using batches, and by following the order of the clustered index you assist SQL in a orderly writing to the data file. 

    One 125 hour roll back can only force us all to start working and thinking about SQL server in this way.

    Thanks loved it.

     

  • James Rimell

    Right there with Babe

    Points: 731

    Erm, I am assuming the article is incomplete, or written incorrectly.  It states initially that the batched zipcode field update takes 20 hours, it does not state that this was then re-run after AlphaName_CS field was populated and THEN gave a result of 35 mins, in fact the way it reads it suggests that the population of the AlphaName_CS field takes 35 mins.   That is a completely different scenerio, firstly since it is not having to cycle through looking for nulls that may have been inserted after the batch update, but mainly since for the population it would have been using the clustered index that exists on the NameKey field anyway.

  • Vic Kirkpatrick-173212

    SSCommitted

    Points: 1801

    Why would anyone have a clustered index on an NVARCHAR(341) column? Just curious because this is a main part of the problem to begin with. Hopefully this does not get inserted by a user on a table this size.

  • Leo Peysakhovich

    Hall of Fame

    Points: 3880

    If DBA is not in total control of table indexes then .......   I can understand that dba is not always in control with database design but nobody tells dba where and how to use indexes but dba. It means that smart way to do the task was 1. analyze indexes  2. use standard way for the update.

  • Bimal Fernando

    SSCommitted

    Points: 1945

    Remeber very carefully. Don't ever set a default value when you add a new column to a table with such a large number of rows. Becasue it is no different to addding the column as null and then issue one transaction to update all the 35 million records with the default value. It's the worst way of doing this. That'll load the whole server and block the table for a very long time as well.

  • Bimal Fernando

    SSCommitted

    Points: 1945

    Yes, I agree with you guys. the Design is crap here, but my solution is for people who cannot change the whole design to add a new column to a table like this and to explain how the best way to traverse through a table to do this type of an update.

    I am sure there are lot of people who were trying to do this type of thing and waited so long it to finish saying to themlesves "This is a very large table, no wonder it takes so long". This will help them to understand the underline operation bit better and approach it more efficiently next time.

    And others who are designing the tables, now you know the troubles ahead if you didn't design it properly at the first time.

  • Bimal Fernando

    SSCommitted

    Points: 1945

    OK, firstly sorry if my writing is bit confusing. i'll try to clear it up with this explanation.

    My first example which updates the newzip column, is actually to demonstrate the method widely suggested to follow when you add a new column to a very large table. That's just an example, and i first used that method to do the update on my own table called [LargeTable] and that's the one took 20+ hours. And i haven't posted it in the article because the provided example (with NewZip column) is good enough to understand what it does.

    But for the solution i am suggesting, i have given the exact code i ran on my table [LargeTable].

    Hope this clears the confusion.

  • SDM

    SSCommitted

    Points: 1560

    I was going to keep out of this until the statement about the design being crap. 

    The clustered index is there - on that specific column - because it was designed that way for the critical operations on that DB.  The performance-critical activities are searches on data, where the search criteria (usually) translate quite nicely into range searches on the column implemented by the clustered index.  There is no other column on that table which would give the slightest benefit in a range search. 

    As to length... yes, that does look a bit silly.  It is in fact accommodating the largest value that could be constructed as the search key without truncation.  Does it really need to be that big?  No.  Nearly 98% of values are  25 characters or less, and the maximum is 56. 

    As to inserting / page splits.  Yep, the effect would be rather worse than if (say) the identity column was used, but (a) the column is never updated, (b) inserts are never done online and (c) the usual insert rate is very low (well, there is one occasional batch process that's a bit heavy, and a bit of a dog, but that's a separate thread entirely).

    What are my motives in responding?  Well, yes, I had some involvement in this design, but more important is the principle.  Don't automatically assume character columns are bad in clustered indexes.  A proper design needs to understand ALL the intended uses of the table, and make indexing decisions accordingly.

    An interesting article from Chris Hedgate is at: http://www.sqlservercentral.com/columnists/chedgate/clusterthatindex.asp

  • GSquared

    SSC Guru

    Points: 260824

    I have to agree with SDM. Just because, in general, a clustered index on an alpha field (char, nchar, varchar, nvarchar) is usually a bad idea, doesn't mean it's always a bad idea. Keep in mind that these things are guidelines, not laws.

    A simple example is a list of US states. I have the clustered index (and primary key) on the 2-letter abbreviation in my databases. Why wouldn't I?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • sqlJunkies2

    SSC Veteran

    Points: 225

    I saw the range table creation script and think that we can create it with another twist by utilizing Izik-Ben-Gan "Sequential Number table function" without using loop.

    Result expected

    RangeStart RangeEnd

    AA AB

    AB AC

    ....

    The final solution be like one line of query.

    WITH

    L0 AS (

    SELECT CHAR(n) AS Alphabet FROM dbo.[fn_nums](90) WHERE n >=65)

    SELECT

    A.Alphabet + B.Alphabet AS RangeStart,

    A.Alphabet + char(ascii(B.Alphabet)+1) AS RangeEnd

    FROM L0 AS A, L0 AS B

    WHERE

    A.Alphabet + B.Alphabet <> 'ZZ'

    ORDER BY

    A.Alphabet + B.Alphabet

    Basically the first CTE

    L0 AS (SELECT CHAR(n) AS Alphabet FROM dbo.[fn_nums](90) WHERE n >=65)

    Will give us 'A' to 'Z' and then we cross join itself with the second column is the rangeEnd with second alphabet is the next alphabet of the first column. (It's easier just try to match the result to the query)

    But before that can run we'll need to create THE function

    CREATE FUNCTION [dbo].[fn_nums](@n AS BIGINT) RETURNS TABLE

    AS

    RETURN

    WITH

    L0 AS(SELECT 1 AS c UNION ALL SELECT 1),

    L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),

    L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),

    L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),

    L4 AS(SELECT 1 AS c FROM L3 AS A, L3 AS B),

    L5 AS(SELECT 1 AS c FROM L4 AS A, L4 AS B),

    Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L5)

    SELECT n FROM Nums WHERE n <= @n;

    This function will return a table result with a single column n and start from 1 to the parameter @n super fast. Very useful function check out Inside TSQL querying book by Izik Ben Gan.

    James

  • Bimal Fernando

    SSCommitted

    Points: 1945

    GSquared (4/7/2008)


    I have to agree with SDM. Just because, in general, a clustered index on an alpha field (char, nchar, varchar, nvarchar) is usually a bad idea, doesn't mean it's always a bad idea. Keep in mind that these things are guidelines, not laws.

    A simple example is a list of US states. I have the clustered index (and primary key) on the 2-letter abbreviation in my databases. Why wouldn't I?

    I agree. But please note that I meant to say NVARCHAR(341) is not a good choice for a clustered key and I was talking about very large table having around 30-40 million rows where defragmentation and page splits (reorganizing of pages) can be a real issue.

Viewing 14 posts - 16 through 29 (of 29 total)

You must be logged in to reply to this topic. Login to reply