Changing column datatype performance question

  • Hello All,

    I have a table with hundreds of millions of rows. One particular column (varchar(250)) has only about 12 million unique entries, which means there's a lot of repeated values. The values are typically 50 characters, give or take.

    I got the idea to save space, and presumably increase performance, by making a new table, inserting the 12 million unique entries into it along with an identity column, and then referencing that identity value from the original table. I would then change the original column's datatype from varchar to int.

    There are a few other columns like this: varchars with repeating data.

    I have two questions:

    1. The UPDATE that replaces the varchar values with the new int keys is taking days. I've created an index on both tables, which helped, but it's still only doing about 30 records a second. Is there a better method? I've read mixed things about SSIS packages; it sounds like they would take just as long.
    2. Am I really going to see much of a performance gain? My assumption is that because an int takes up less space, reads should be faster. Is that just a bad assumption?

    PS: One other question. I have quite a few stored procedures reading the original table. If I create a view that essentially presents the same "old" table structure by joining the new table to the old one, would the SPs only need to have the original table name changed to the new view?

  • I'd create completely new tables.

    One table to hold all the varchar values and their new int values, as you've already done, I'm sure. Load it with all the distinct values from all columns that you want to "encode" (assign an int value to represent the varchar value).

    A second new table, with the new table structure, that will eventually replace the old table. Insert rows from the old table into the new one in batches, in clustered key order, encoding all the varchar columns you want to encode.
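
    Something roughly like this for the batched copy; every name below is a placeholder, since I don't know your actual structure yet:

    --Rough sketch only: all table/column names are placeholders.
    --Copies in clustered-key ranges so each batch is a sequential range scan rather than one huge transaction.
    DECLARE @BatchSize int = 100000;
    DECLARE @FromKey int, @MaxKey int;

    SELECT @FromKey = MIN(KeyCol), @MaxKey = MAX(KeyCol) FROM dbo.OldTable;

    WHILE @FromKey <= @MaxKey
    BEGIN
        INSERT INTO dbo.NewMainTable ( KeyCol, EncodedCol, OtherCol ) --if KeyCol is an identity column, SET IDENTITY_INSERT dbo.NewMainTable ON first
        SELECT o.KeyCol, lk.varchar_code, o.OtherCol
        FROM dbo.OldTable o
        LEFT OUTER JOIN dbo.LookupTable lk ON lk.varchar_value = o.VarcharCol --LEFT JOIN so rows with a NULL varchar still copy
        WHERE o.KeyCol >= @FromKey AND o.KeyCol < @FromKey + @BatchSize;

        SET @FromKey = @FromKey + @BatchSize;
    END;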

    Eventually you'll have all rows copied from the old table to the new table.  Then, create a new view that simulates the original table based on the new table.  Confirm it's working correctly.

    Finally, at that point, you're ready to:

    rename the original table to "table_name_original" (or whatever)

    rename the view to the original table name ("table_name" or whatever)
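
    Both renames are just sp_rename calls, something like this (names are placeholders):

    EXEC sp_rename 'dbo.table_name', 'table_name_original'; --the old table keeps its data, just out of the way
    EXEC sp_rename 'dbo.table_name_vw', 'table_name';       --the view takes over the original table name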

    The 100Ms of rows are not an issue, because you have as long as you need to prep the new table to replace the old one. And you can see the size savings and performance characteristics before going live with it.
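
    Checking the size difference ahead of time is easy enough, e.g. (placeholder names again):

    EXEC sp_spaceused 'dbo.table_name';     --current table
    EXEC sp_spaceused 'dbo.table_name_new'; --the new, encoded copy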

    Btw, it would be really helpful to know the clustering key(s) of the existing table.
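
    You can pull those with something like this (swap in the real table name):

    SELECT i.name AS index_name, c.name AS column_name, ic.key_ordinal
    FROM sys.indexes i
    INNER JOIN sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
    INNER JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
    WHERE i.object_id = OBJECT_ID('dbo.table_name') --placeholder name
        AND i.index_id = 1                          --index_id 1 = the clustered index
        AND ic.is_included_column = 0
    ORDER BY ic.key_ordinal;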

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks for the reply, ScottPletcher. That is indeed what I plan to do for the most part: make my changes in a secondary copy of the original table. It sounds like that is still the fastest way to make this change. You also answered my question about the view. Thank you.

    I've included the structure of the tables below. The OriginalTable has a lot more columns, but I've trimmed them for brevity.

    Is it far-fetched to believe I should see performance gains through a view once the varchar(250) is converted to int?

    CREATE TABLE OriginalTable(
        Column1 [INT] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
        Column2 [VARCHAR](250) NULL, --After populating NewTable with this data, it is replaced with ColumnA data and then the datatype is changed to int. Also becomes a foreign key.
        Column3 [INT] NULL,
        Column4 [INT] NULL,
        Column5 [INT] NULL,
        --etc
        CONSTRAINT [PK_A_TL] PRIMARY KEY CLUSTERED
        (
            Column1 ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = ?) ON Column33
    ) ON Column33
    GO

    CREATE TABLE NewTable(
        ColumnA [INT] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
        ColumnB [VARCHAR](100) NULL, --This contains the original varchar data from Column2 above.
        CONSTRAINT [PK_A_SourceTransaction] PRIMARY KEY CLUSTERED
        (
            [ColumnA] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
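
    In case it matters, the populate and update steps I'm running are essentially this (simplified):

    --Load the distinct values into the lookup table.
    INSERT INTO NewTable ( ColumnB )
    SELECT DISTINCT Column2
    FROM OriginalTable
    WHERE Column2 IS NOT NULL;

    --The slow part: stamping the new key back into Column2 (still varchar at this point),
    --before altering the column to int.
    UPDATE OT
    SET OT.Column2 = CAST(NT.ColumnA AS varchar(250))
    FROM OriginalTable OT
    INNER JOIN NewTable NT ON NT.ColumnB = OT.Column2;
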
  • I think it's a bit more complex than that, at least if you want to keep the existing code. Personally, I would not try to rewrite all the existing INSERTs / UPDATEs on the table.

    I'd create a new main table -- with the int lookup code added as a new column, not replacing the original column.

    And create the new lookup table also, for converting varchars to ints.

    Finally, you'll need INSERT / UPDATE triggers on the view (that mimics the original table), to handle automatically converting the varchar value(s) to their corresponding int code, including adding any new values to the conversion table ahead of time.

    Something generally like this, although with additional safeguards to make sure the original Column2 value is not lost no matter what:

    /* The table you have now, let's call it dbo.OriginalTable. */

    --The new main table would be like this:
    CREATE TABLE dbo.NewTable (
        Column1 [INT] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
        Column2 [VARCHAR](100) NULL,
        Column2_code int NULL, --<<--!!
        --...

    --The lookup/conversion table would be like this:
    CREATE TABLE dbo.LookupTable (
        varchar_code int IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL,
        varchar_value varchar(250) NOT NULL
        --...

    --Steps:
    --1) Copy all existing data from OriginalTable to NewTable.
    --2) Rename OriginalTable to OriginalTable_backup (or whatever).
    --3) Create a view from NewTable that exactly matches OriginalTable.
    --4) Create the triggers necessary to support INSERTs / UPDATEs of the original table view.

    GO
    --3)
    CREATE VIEW dbo.OriginalTable
    WITH SCHEMABINDING
    AS
    SELECT NT.Column1, LT.varchar_value AS Column2, NT.Column3 --, ...
    FROM dbo.NewTable NT
    LEFT OUTER JOIN dbo.LookupTable LT ON LT.varchar_code = NT.Column2_code
    GO

    --4)
    SET ANSI_NULLS ON;
    SET QUOTED_IDENTIFIER ON;
    GO
    CREATE TRIGGER OriginalTable__TRG_INS
    ON dbo.OriginalTable
    INSTEAD OF INSERT
    AS
    SET NOCOUNT ON;

    --Add any brand-new varchar values to the lookup table first.
    INSERT INTO dbo.LookupTable ( varchar_value )
    SELECT DISTINCT i.Column2
    FROM inserted i
    WHERE i.Column2 IS NOT NULL
        AND NOT EXISTS(SELECT 1 FROM dbo.LookupTable LT WHERE LT.varchar_value = i.Column2);

    --Then insert the rows themselves, storing only the int code.
    --Column1 is omitted so that NewTable generates its own identity value.
    INSERT INTO dbo.NewTable ( Column2, Column2_code, Column3 ) -- , ...
    SELECT NULL AS Column2, LT.varchar_code AS Column2_code, i.Column3 --, ...
    FROM inserted i
    LEFT OUTER JOIN dbo.LookupTable LT ON LT.varchar_value = i.Column2;
    GO

    SET ANSI_NULLS ON;
    SET QUOTED_IDENTIFIER ON;
    GO
    CREATE TRIGGER OriginalTable__TRG_UPD
    ON dbo.OriginalTable
    INSTEAD OF UPDATE
    AS
    SET NOCOUNT ON;

    --Add any brand-new varchar values to the lookup table first.
    INSERT INTO dbo.LookupTable ( varchar_value )
    SELECT DISTINCT i.Column2
    FROM inserted i
    WHERE i.Column2 IS NOT NULL
        AND NOT EXISTS(SELECT 1 FROM dbo.LookupTable LT WHERE LT.varchar_value = i.Column2);

    --Then apply the update, matching on the key. Column1 is an identity column and can't be updated, so it isn't set here.
    UPDATE NT
    SET Column2 = NULL, Column2_code = LT.varchar_code,
        Column3 = i.Column3 --, ...
    FROM inserted i
    INNER JOIN dbo.NewTable NT ON NT.Column1 = i.Column1
    LEFT OUTER JOIN dbo.LookupTable LT ON LT.varchar_value = i.Column2;
    GO
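
    A quick smoke test once the view and triggers are in place (the test values here are obviously made up):

    --Throwaway test: insert through the view, then check that the value got encoded.
    --Note: a view with an INSTEAD OF INSERT trigger will generally insist on a value for every
    --not-null column, including Column1, even though the trigger ignores it.
    INSERT INTO dbo.OriginalTable ( Column1, Column2, Column3 )
    VALUES ( 0, 'some test varchar value', 1 );

    SELECT TOP (10) NT.Column1, NT.Column2_code, LT.varchar_value
    FROM dbo.NewTable NT
    LEFT OUTER JOIN dbo.LookupTable LT ON LT.varchar_code = NT.Column2_code
    ORDER BY NT.Column1 DESC;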

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • I'm not sure I'd say this is going to help performance. Not because I'm against normalization (I'm very much in favor of it), but because I'm unsure of the root cause of the performance issues here. Do you have wait statistics, execution plans, or some other evidence that suggests this is your performance bottleneck? I only ask because doing this, however correct it is (and it is correct), is a ton of work. I'd want to know that all that work was definitely going to help before I did it.
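
    Even a quick look at the aggregate server waits would tell you something, e.g.:

    --Top waits since the last restart / stats clear; a rough first pass, not a full diagnosis.
    SELECT TOP (10)
        wait_type,
        wait_time_ms / 1000.0 AS wait_time_sec,
        waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'SLEEP_TASK', 'SQLTRACE_BUFFER_FLUSH',
        'REQUEST_FOR_DEADLOCK_SEARCH', 'BROKER_TO_FLUSH') --filter out a few benign waits; this list isn't exhaustive
    ORDER BY wait_time_ms DESC;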

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Occasionally in this scenario I've used a computed column (normally I persist it - a PCC) that mimics your lookup values. Once you get all of your code sorted (joins, views, etc.), you can drop the original field.

    The one obstacle you will hit with this approach is that computed columns can't reference other objects, so you might have to get inventive, maybe using a hash of the text (I know... hash collisions across that many rows are almost inevitable, but that's why we test).

    Dropping the original column might be difficult if you do use a PCC, but it does allow you to test your code before you go live.
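
    Just to illustrate the idea (the column and index names below are made up, and HASHBYTES with SHA2_256 needs SQL 2012+; SHA1 works on older builds):

    --Rough illustration only: a persisted computed column holding a hash of the text,
    --so code can join/filter on a short, fixed-width value while the original column still exists.
    ALTER TABLE dbo.OriginalTable
        ADD Column2_hash AS CONVERT(binary(32), HASHBYTES('SHA2_256', Column2)) PERSISTED;

    CREATE INDEX IX_OriginalTable_Column2_hash ON dbo.OriginalTable (Column2_hash);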

    I think you will find a performance improvement purely from writing less data, which leaves your disks more available for reads.

    But I do like a nicely formed third normal form database; if you don't take the normalisation path, it might bite you in a few years.

    MVDBA
