any way to reindex gigantic table efficiently?

  • so I have a time series table consisting of 3 columns: a key, a timestamp and a value. The primary key is of course on the key and timestamp. The table is about 80 billion rows. Certain data is deleted from the table every month, and this has caused massive fragmentation. Partitioning the table based on, say, time doesn't do much good because many of the keys need to remain for all time periods, and the key is nothing more than a generated identity.

    I do have Enterprise Edition (SQL Server 2008 R2), so I can do an online index rebuild, but if I'm not mistaken, an online rebuild makes a full copy of the primary key index. The table itself is about 1.6 terabytes and I only have about 200 GB of free disk space. I can't span to other drives for now. Anyone have any suggestions on how I can reindex this table?

    The table is used 24x7 - pretty much no maintenance time is available. However, I am possibly going to switch to a new server on new storage, so perhaps there's a way to redo the table on the new server before switching. That won't be easy, though, as 250 million rows are written to this table on a daily basis.

  • cmoy (7/20/2014)



    You cannot rebuild that table with the free space you have. You could add some disk space and extra files to tempdb, then rebuild the index online with SORT_IN_TEMPDB so the sort work is done in tempdb rather than in the user database.
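    A minimal sketch of such a rebuild (index and table names here are placeholders, not the poster's actual names):

        -- sort work goes to tempdb; note the rebuilt index itself is
        -- still written in the user database's filegroup
        ALTER INDEX PK_YourTable ON dbo.YourTable
            REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);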

    Tables with over 100 million rows are usually candidates for partitioning.

    Because you delete data every month, the table should have been partitioned. With a single partition switch you delete a whole sub-table's worth of rows (by switching the partition out to a staging table and truncating that table), without introducing fragmentation to the main table.
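    Assuming the table were partitioned on the timestamp, the monthly purge could look something like this (partition number, table and staging-table names are hypothetical):

        -- switch the oldest month out to an empty staging table with
        -- identical schema, then truncate the staging table; both are
        -- metadata-only operations
        ALTER TABLE dbo.YourTable SWITCH PARTITION 1 TO dbo.YourTable_Stage;
        TRUNCATE TABLE dbo.YourTable_Stage;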

    Maintenance of a partitioned table also takes less time. Beyond that, you can set up maintenance that rebuilds only the partitions that need it rather than the whole index, which is an advantage.
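    A sketch of a per-partition rebuild (names and partition number are placeholders; note that on SQL Server 2008 R2 a single-partition rebuild is offline-only - ONLINE = ON for individual partitions arrived in SQL Server 2014):

        -- rebuild only partition 3 instead of the whole index
        ALTER INDEX PK_YourTable ON dbo.YourTable
            REBUILD PARTITION = 3;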

    You can also lower the fill factor to 99 or 98 to slow down fragmentation.
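    The fill factor only takes effect when the index is (re)built, e.g. (placeholder names again):

        -- leave ~2% free space on each leaf page to absorb inserts
        -- and reduce page splits between rebuilds
        ALTER INDEX PK_YourTable ON dbo.YourTable
            REBUILD WITH (FILLFACTOR = 98, ONLINE = ON);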

    Why is your key a composite of the identity Id and the timestamp? You know the advantages of a key on an identity column alone.

    Igor Micev,My blog: www.igormicev.com

  • this was all done way before my time.

    Partitioning is not feasible at this point - what would you partition on (what would the partition function be)? I would probably wind up with thousands of partitions, and I'm not about to try to figure out how to dynamically build a RESTORE DATABASE command to pick up all those partitions (we do backup/restore onto other machines). It would have been nice to partition on just the timestamp and delete partitions after a certain age, but alas, that's not the case.

    One of these years we'll go to some big data solution such as Cassandra but until that time...

  • Would have to see your typical queries to be sure, but it's extremely likely the clustering keys are backwards, i.e., they should be ( timestamp, ident ) rather than vice versa. But that's almost impossible to fix/change now. Maybe you can get downtime on Christmas Day? (Not a joke, that's when I do a lot of my "table can't be down" maintenance.)

    You can REORGANIZE the index, as that's always done online and works in relatively small chunks. Still, make sure you have several GB of pre-allocated but unused log space available.
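    A reorganize is a one-liner (index and table names are placeholders for your own):

        -- REORGANIZE is always online, compacts leaf pages in place,
        -- and can be stopped at any time without losing completed work
        ALTER INDEX PK_YourTable ON dbo.YourTable REORGANIZE;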

    Also, you can put your own names into the bold parts of the script below and then run it to check existing index usage.

    So the DELETEs are not of contiguous blocks of old rows, but instead some rows are deleted and some are not?

    USE [<your_db_name>]

    SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

    DECLARE @table_name_pattern sysname
    SET @table_name_pattern = 'your_table_name'

    PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

    -- list index usage stats (seeks, scans, etc.)
    SELECT
        ius2.row_num, DB_NAME() AS db_name,
        CASE WHEN i.name LIKE ca2.table_name + '%'
            THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1 +
                CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 1, 1) = '_' THEN
                    CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 2, 1) = '_' THEN 2 ELSE 1 END
                ELSE 0 END, 200)
            ELSE i.name END AS index_name,
        CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
        CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
        ca2.table_name,
        i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
        dps.row_count,
        SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
        ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
        ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
        fk.Reference_Count AS fk_ref_count,
        DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
            ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
        FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
        ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
        ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
    FROM sys.indexes i WITH (NOLOCK)
    INNER JOIN sys.objects o WITH (NOLOCK) ON
        o.object_id = i.object_id
    CROSS JOIN (
        SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
    ) AS ca1
    CROSS APPLY (
        SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
    ) AS ca2
    OUTER APPLY (
        SELECT
            ', ' + COL_NAME(object_id, ic.column_id)
        FROM sys.index_columns ic
        WHERE
            ic.key_ordinal > 0 AND
            ic.object_id = i.object_id AND
            ic.index_id = i.index_id
        ORDER BY
            ic.key_ordinal
        FOR XML PATH('')
    ) AS key_cols (key_cols)
    OUTER APPLY (
        SELECT
            ', ' + COL_NAME(object_id, ic.column_id)
        FROM sys.index_columns ic
        WHERE
            ic.key_ordinal = 0 AND
            ic.object_id = i.object_id AND
            ic.index_id = i.index_id
        ORDER BY
            COL_NAME(object_id, ic.column_id)
        FOR XML PATH('')
    ) AS nonkey_cols (nonkey_cols)
    LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
        dps.object_id = i.object_id AND
        dps.index_id = i.index_id
    LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
        ius.database_id = DB_ID() AND
        ius.object_id = i.object_id AND
        ius.index_id = i.index_id
    LEFT OUTER JOIN (
        SELECT
            database_id, object_id, MAX(user_scans) AS user_scans,
            ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
        FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
        WHERE
            database_id = DB_ID()
            --AND index_id > 0
        GROUP BY
            database_id, object_id
    ) AS ius2 ON
        ius2.database_id = DB_ID() AND
        ius2.object_id = i.object_id
    LEFT OUTER JOIN (
        SELECT
            referenced_object_id, COUNT(*) AS Reference_Count
        FROM sys.foreign_keys
        WHERE
            is_disabled = 0
        GROUP BY
            referenced_object_id
    ) AS fk ON
        fk.referenced_object_id = i.object_id
    WHERE
        i.object_id > 100 AND
        i.is_hypothetical = 0 AND
        i.type IN (0, 1, 2) AND
        o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
        (
            o.name LIKE @table_name_pattern AND
            o.name NOT LIKE 'dtprop%' AND
            o.name NOT LIKE 'filestream[_]' AND
            o.name NOT LIKE 'MSpeer%' AND
            o.name NOT LIKE 'MSpub%' AND
            --o.name NOT LIKE 'queue[_]%' AND
            o.name NOT LIKE 'sys%'
        )
        --AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
    ORDER BY
        --row_count DESC,
        --ius.user_scans DESC,
        --ius2.row_num, --user_scans&|user_seeks
        db_name, table_name,
        -- list clustered index first, if any, then other index(es)
        CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
        key_cols

    PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.
