any way to reindex gigantic table efficiently?
Posted Sunday, July 20, 2014 10:31 PM
Grasshopper

So I have a time series table consisting of 3 columns: a key, a timestamp and a value. The primary key is, of course, on the key and timestamp. The table is about 80 billion rows. Certain data is deleted from the table every month, which has caused massive fragmentation. Partitioning the table on, say, the timestamp doesn't do much good because many of the keys need to remain for all time periods, and the key is nothing more than a generated identity.

I do have Enterprise Edition (SQL Server 2008 R2), so I can do an online index rebuild, but if I'm not mistaken, an online rebuild makes a full copy of the primary key. The table itself is about 1.6 TB and I only have about 200 GB of free disk space. I can't span to other drives for now. Anyone have suggestions on how I can reindex this table?

The table is used 24x7 - pretty much no maintenance time is available. However, I may be switching to a new server on new storage, so perhaps there's a way to rebuild the table on the new server before switching. That won't be easy, though, as 250 million rows are written to this table daily.
Post #1594435
Posted Monday, July 21, 2014 12:59 AM


Hall of Fame

cmoy (7/20/2014)
[original post quoted in full above]


You cannot rebuild that index with the free space you have. You could add some disk space and some extra tempdb files, then rebuild the index online with the sort work directed to tempdb.
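For reference, an online rebuild that does its sort work in tempdb looks roughly like this (the table and index names are placeholders, not the poster's actual names):

```sql
-- Online rebuild of the clustered PK, sorting in tempdb instead of the
-- user database. Requires Enterprise Edition. Note the caveat: tempdb
-- absorbs only the sort runs; the rebuilt copy of the index itself is
-- still written into the user database, so this does not eliminate the
-- need for free space roughly equal to the index size there.
ALTER INDEX PK_YourTimeSeriesTable
    ON dbo.YourTimeSeriesTable
    REBUILD WITH (ONLINE = ON, SORT_IN_TEMPDB = ON);
```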

Tables with over 100 million rows are usually candidates for partitioning. Because you delete data every month, this table should have been partitioned: a single partition SWITCH (followed by truncating the staging table) deletes a whole sub-table without introducing any fragmentation into the main table. Maintenance of a partitioned table also takes less time, and you can rebuild only selected partitions rather than the whole index, which is an advantage.
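The monthly delete on a table partitioned by month would then be a metadata-only operation, along these lines (table names and the partition number are illustrative):

```sql
-- Move the oldest partition out to an empty staging table that has the
-- same columns, indexes and filegroup as the main table, then truncate
-- the staging table. Both statements are metadata operations, so they
-- complete almost instantly and introduce no fragmentation.
ALTER TABLE dbo.YourTimeSeriesTable
    SWITCH PARTITION 1 TO dbo.YourTimeSeriesTable_Stage;

TRUNCATE TABLE dbo.YourTimeSeriesTable_Stage;
```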
You can also decrease the fill factor to 99 or 98 to slow down fragmentation.
Why is your key a composite of the identity and the timestamp? You know the advantages of a key on the identity column alone.




Igor Micev,
SQL Server developer at Seavus
www.seavus.com
Post #1594453
Posted Monday, July 21, 2014 1:11 AM
Grasshopper

This was all done way before my time.

Partitioning is not feasible at this point - what would you partition on (what would the partition function be)? I would probably wind up with thousands of partitions, and I'm not about to try to figure out how to dynamically build a restore-database command to pick up those partitions (we do backup/restore onto other machines). It would have been nice to partition on just the timestamp and delete partitions after a certain age, but alas that's not the case.

One of these years we'll go to some big data solution such as Cassandra but until that time...


Post #1594460
Posted Monday, July 21, 2014 11:35 AM
SSCrazy

Would have to see your typical queries to be sure, but it's extremely likely the clustering keys are backwards, i.e., they should be ( timestamp, ident ) rather than vice versa. But that's almost impossible to fix/change now. Maybe you can get downtime on Christmas Day? (Not a joke, that's when I do a lot of my "table can't be down" maintenance.)

You can REORGANIZE the index, as that's always done online and in relatively small chunks. Still, make sure you have several GB of pre-allocated but unused log space available.
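For the 1.6 TB / 200 GB situation described above, a reorganize is the option that needs no large free-space reservation; a sketch (index and table names are placeholders):

```sql
-- REORGANIZE compacts and reorders the leaf-level pages in place, in
-- many small transactions, so it is always an online operation and
-- never needs a full working copy of the index. If it is cancelled,
-- the work completed so far is kept, so it can be run in maintenance
-- slices night after night.
ALTER INDEX PK_YourTimeSeriesTable
    ON dbo.YourTimeSeriesTable
    REORGANIZE;
```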

Also, you can put your database and table names into the placeholders in the script below and then run it to check existing index usage.

So the DELETEs are not of contiguous blocks of old rows, but instead some rows are deleted and some are not?

USE [<your_db_name>]

SET DEADLOCK_PRIORITY LOW --probably irrelevant, but just in case

DECLARE @table_name_pattern sysname
SET @table_name_pattern = 'your_table_name'

PRINT 'Started @ ' + CONVERT(varchar(30), GETDATE(), 120)

-- list index usage stats (seeks, scans, etc.)
SELECT
ius2.row_num, DB_NAME() AS db_name,
CASE WHEN i.name LIKE ca2.table_name + '%'
THEN '~' + SUBSTRING(i.name, LEN(ca2.table_name) + 1 +
CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 1, 1) = '_' THEN
CASE WHEN SUBSTRING(i.name, LEN(ca2.table_name) + 2, 1) = '_' THEN 2 ELSE 1 END
ELSE 0 END, 200)
ELSE i.name END AS index_name,
CASE WHEN i.is_unique = 0 THEN 'N' ELSE 'Y' END + '.' +
CASE WHEN i.is_primary_key = 0 AND i.is_unique_constraint = 0 THEN 'N' ELSE 'Y' END AS [uniq?],
ca2.table_name,
i.index_id, --ius.user_seeks + ius.user_scans AS total_reads,
dps.row_count,
SUBSTRING(key_cols, 3, 8000) AS key_cols, SUBSTRING(nonkey_cols, 3, 8000) AS nonkey_cols,
ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
ius.last_user_seek, ius.last_user_scan, ius.last_user_lookup, ius.last_user_update,
fk.Reference_Count AS fk_ref_count,
DATEDIFF(DAY, CASE WHEN o.create_date > ca1.sql_startup_date THEN o.create_date
ELSE ca1.sql_startup_date END, GETDATE()) AS max_days_active,
FILEGROUP_NAME(i.data_space_id) AS filegroup_name,
ius.system_seeks, ius.system_scans, ius.system_lookups, ius.system_updates,
ius.last_system_seek, ius.last_system_scan, ius.last_system_lookup, ius.last_system_update
FROM sys.indexes i WITH (NOLOCK)
INNER JOIN sys.objects o WITH (NOLOCK) ON
o.object_id = i.object_id
CROSS JOIN (
SELECT create_date AS sql_startup_date FROM sys.databases WHERE name = 'tempdb'
) AS ca1
CROSS APPLY (
SELECT OBJECT_NAME(i.object_id/*, DB_ID()*/) AS table_name
) AS ca2
OUTER APPLY (
SELECT
', ' + COL_NAME(object_id, ic.column_id)
FROM sys.index_columns ic
WHERE
ic.key_ordinal > 0 AND
ic.object_id = i.object_id AND
ic.index_id = i.index_id
ORDER BY
ic.key_ordinal
FOR XML PATH('')
) AS key_cols (key_cols)
OUTER APPLY (
SELECT
', ' + COL_NAME(object_id, ic.column_id)
FROM sys.index_columns ic
WHERE
ic.key_ordinal = 0 AND
ic.object_id = i.object_id AND
ic.index_id = i.index_id
ORDER BY
COL_NAME(object_id, ic.column_id)
FOR XML PATH('')
) AS nonkey_cols (nonkey_cols)
LEFT OUTER JOIN sys.dm_db_partition_stats dps WITH (NOLOCK) ON
dps.object_id = i.object_id AND
dps.index_id = i.index_id
LEFT OUTER JOIN sys.dm_db_index_usage_stats ius WITH (NOLOCK) ON
ius.database_id = DB_ID() AND
ius.object_id = i.object_id AND
ius.index_id = i.index_id
LEFT OUTER JOIN (
SELECT
database_id, object_id, MAX(user_scans) AS user_scans,
ROW_NUMBER() OVER (ORDER BY MAX(user_scans) DESC) AS row_num --user_scans|user_seeks+user_scans
FROM sys.dm_db_index_usage_stats WITH (NOLOCK)
WHERE
database_id = DB_ID()
--AND index_id > 0
GROUP BY
database_id, object_id
) AS ius2 ON
ius2.database_id = DB_ID() AND
ius2.object_id = i.object_id
LEFT OUTER JOIN (
SELECT
referenced_object_id, COUNT(*) AS Reference_Count
FROM sys.foreign_keys
WHERE
is_disabled = 0
GROUP BY
referenced_object_id
) AS fk ON
fk.referenced_object_id = i.object_id
WHERE
i.object_id > 100 AND
i.is_hypothetical = 0 AND
i.type IN (0, 1, 2) AND
o.type NOT IN ( 'IF', 'IT', 'TF', 'TT' ) AND
(
o.name LIKE @table_name_pattern AND
o.name NOT LIKE 'dtprop%' AND
o.name NOT LIKE 'filestream[_]%' AND
o.name NOT LIKE 'MSpeer%' AND
o.name NOT LIKE 'MSpub%' AND
--o.name NOT LIKE 'queue[_]%' AND
o.name NOT LIKE 'sys%'
)
--AND OBJECT_NAME(i.object_id /*, DB_ID()*/) IN ('tbl1', 'tbl2', 'tbl3')
ORDER BY
--row_count DESC,
--ius.user_scans DESC,
--ius2.row_num, --user_scans&|user_seeks
db_name, table_name,
-- list clustered index first, if any, then other index(es)
CASE WHEN i.index_id IN (0, 1) THEN 1 ELSE 2 END,
key_cols

PRINT 'Ended @ ' + CONVERT(varchar(30), GETDATE(), 120)




SQL DBA, SQL Server MVP ('07, '08, '09)

Carl Sagan said: "There is no such thing as a dumb question." Sagan obviously never watched a congressional hearing!
Post #1594727