In volatile database environments, data removal is often required during recurring full load ETL processes that populate these databases. This presents challenges especially in very large databases that utilize referential integrity (RI) and indexed views. When RI is turned on, data removal must follow a certain order, making regular truncations and deletion frustrating, and nearly impossible. Also, SQL Server does not allow you to truncate tables that have schema binding turned on (a requirement for indexed views), so these tables must be identified and subjected to table deletion processes instead. As a result of these constraints, most data removal solutions require creation and maintenance of different DDL scripts.
In this article I have tried to address data removal issues using one SSIS package with a database connection that can be pointed to any SQL Server database either to remove data whether RI is utilized or not. It can remove data in all the tables in the entire database or only in selected tables in the database. The package also identifies and deletes data from tables that have schema binding turned on eliminating the need to create and maintain any scripts.
Data Removal Bottlenecks
SQL Server data deletion can be very expensive where data from all tables or only selected tables needs to be removed during each ETL run. The process is expensive because the SQL Server DELETE statement removes records one row at a time and also logs each of these actions in the transaction log. Table truncation provides a method of efficiently removing all data from a given table because the TRUNCATE statement removes data by deallocating data pages resulting in the use of fewer of system resource and minimal logging. However, tables referenced by indexed views and foreign key constraints cannot be truncated so data in these tables must be deleted. Such data removal processes may often require scripts of some sort, which must be kept and maintained as new objects are added or removed from the database.
In very large databases, a likely effective scenario is to create and maintain separate scripts to disable RI constraints and drop indexed views before data removal operations and another set of scripts to restore the RI constraints and recreate the index views after the data deletion process completes. Besides maintaining scripts these processes often require additional coordinated effort in enterprise projects where many developers are developing separate ETL streams to populate a single database.
Package Design and Usage
The package has been designed to dynamically generate, store and execute all necessary DDL statement using system views from the database you point it to. The package efficiently uses the TRUNCATE statements as the first option to remove data; it only uses DELETE statements where a table is referenced by an indexed view. Since truncation and data deletion are very big commitments, I have made the process very rigid, requiring the list of all tables to be truncated or deleted by the process inserted into a table. The package can be modified to bypass this enforcement, but only do so if you understand how the process works.
The design also allows you to executed the package separately, especially during development and testing of your ETL process. It can also be used in a production environment as a child package that can be called as an Execute Package task when using the standard Parent-Child ETL package design pattern. You can also simply copy and paste to make the one included Sequence Container in this package the first processing step in any package if you are not using the Parent-Child package design.
Outlined below are the required steps needed to use the package without making any changes except changing the connection to point to the correct database.
Before you use the package on any database, the first step is to execute the SQL statements under Listings 1 and 2 below, once in the database in which you run the package. These statements create the two tables required for the use of the package, namely dbo.TableList and dbo.ForeignKeyList. The roles of the two tables are explained below.
Listing 1. Create required constraints storage tables
CREATE TABLE [dbo].[ForeignKeyList]( [CreateFKStmt] [nvarchar](4000) NULL, [DropFKStmnt] [nvarchar](4000) NULL ); GO
Listing 2. Create the TableList table and insert required list of tables
ALTER TABLE [dbo].[TableList] ADD DEFAULT ((1)) FOR [Truncate] GO CREATE TABLE [dbo].[TableList]( [TableID] [int] IDENTITY(1,1) NOT NULL, [SchemaName] [sysname] NOT NULL, [TableName] [sysname] NOT NULL, [Truncate] [bit] NOT NULL DEFAULT(1), CONSTRAINT [PK_TableList] PRIMARY KEY CLUSTERED ( [SchemaName],[TableName] ASC ) ); GO
Secondly, you have to update and execute the sample SQL statements under Listing 3 below to insert all the tables in the database that you want to include in your truncation process. Update this statement by replacing the tables in the statement with your tables names.
Listing3. Insert tables to be exempt from truncation process into the TableList Table
--insert tables to be Truncated INSERT INTO [dbo].[TableList] ([SchemaName], [TableName], [Truncate]) VALUES (N'schema', N'table1', 1), (N'schema', N'table2', 1), (N'schema', N'table3', 1); GO
For instance, in the example below I want to truncate the Dim.Time and Dim.Geography tables from my database, so I will update and execute the SQL statements below in my database before running the package. Remember, failure to run this statement will result in the two table Dim.Time and Dim.Geography not being truncated with all the other user tables that are not included in the insert statement.
--insert tables to be exempt from Truncation INSERT INTO [dbo].[TableList] ([SchemaName], [TableName], [Truncate]) VALUES (N'dim', N'Time', 1), (N'dim', N'Geography', 1); GO
The dbo.TableList table is used to track all user-tables you may want to truncate in the process. Any table listed in the dbo.TableList with a status of "1" in the truncate column will be truncated in the database. Each table-name listed in this table has a flag that can easily be turned on or off to include or exclude that particular table in the truncation process. For instance, if you later decide not to truncate a table in this list, you can update the [Truncate] field to "0" and the table will not be truncated.
Fig 2. Showing sample content of dbo.TableList with two required tables and two other tables
Remember to insert the names of all tables that you do want to truncate in this table. Figure 2 above shows the two tables that I want to included in the truncation process, a time dimension table named [Dim].[Time] and a Geography dimension table [Dim.[Geography].
The the first column in the dbo.ForeignKeyList table is used to retain a list of the statements executed to restore the foreign key. The the second column is used to retain a list of the statements you can use drop the foreign keys in the database. Populating this table serves as backup should the final step fail after dropping the foreign keys constraints. You can always grab and execute the statements in this table to drop or restore all your foreign keys constraints. Note, it is advisable to copy and save these statements as SQL script files after you run the package the first time if you do not have copies of foreign key constraints scripts saved.
Running Package as a Transactional Unit
To enable this option, right click an empty space in the Control flow of the package, select Properties and change the TransactionOption to Required. Again, right click the Sequence Container in the package, select Properties and change the TransactionOption to Required as shown below.
The package was originally designed to run as a transactional unit, but this requires the Distributed Transaction Coordinator (DTC) service running on both the database server and the machine running the package. This can present some bottlenecks in various environments due to permission issues, so I have disabled that option in this version. This means you may have to copy and restore RI in your database from the dbo.ForeignKeyList if the package runs the first step but for some reason fails to finish the last step in the package.
Fig 2. Fig showing the TransactionOption of the sequence container set to Required
Remember to set the package to this option if you are sure the DTC service is running on both the database server and the machine running the package.
SSIS Tasks Explained
There are six SQL tasks and two ForEachLoop containers in the package, the role of each is explained below.
This SQL task generates and store the statements executed by the package to recreate the foreign keys after the tables are truncate. In this task, the first column in result set from the SQL statement below is passed as a variable to the ForEachLoop Container FELC-CreateFKs downstream. The complete result set is also stored in the dbo.ForeignKeyList table as backup statements that can be used to drop or recreate the foreign keys in the database.
Listing 4. Creating and storing RI constraints
truncate table dbo.ForeignKeyList; with FKs as ( select distinct constraint_object_id as id, fk.delete_referential_action as DeleteAction, fk.update_referential_action as UpdateAction from sys.foreign_key_columns fkc inner join sys.foreign_keys fk on fkc.constraint_object_id = fk.object_id ), FKCs as ( select fkc.constraint_object_id as id, OBJECT_NAME(fkc.constraint_object_id) as FK_Name, N'[' +OBJECT_SCHEMA_NAME(fkc.parent_object_id)+ N']' as Prnt_TblSchema, N'[' +OBJECT_NAME(fkc.parent_object_id)+ N']' as Prnt_TableName, N'[' +OBJECT_SCHEMA_NAME(fkc.referenced_object_id)+ N']' as Ref_TblSchema, N'[' +OBJECT_NAME(fkc.referenced_object_id)+ N']' as Ref_TableName, N'[' +COL_NAME(fkc.parent_object_id, fkc.parent_column_id)+ N']' as Prnt_Column_Name, N'[' +COL_NAME(fkc.referenced_object_id, fkc.referenced_column_id)+ N']' as ref_Column_Name from sys.foreign_key_columns fkc ), FKStmnt as ( select FKs.id, 'IF (NOT EXISTS(SELECT * FROM sys.foreign_keys WHERE name =''' + FKCs.FK_Name +'''))' +' Alter Table ' + FKCs.Prnt_TblSchema + '.'+ FKCs.Prnt_TableName + ' Add Constraint ' + FKCs.FK_Name + ' Foreign Key (' + STUFF( (select ',' + Prnt_Column_Name from FKCs a where a.id =FKs.id FOR XML PATH('') ), 1, 1, '' ) + ') References ' + FKCs.Ref_TblSchema + '.'+ FKCs.Ref_TableName + '(' + STUFF( (select ',' + ref_Column_Name from FKCs a where a.id =FKs.id FOR XML PATH('') ), 1, 1, '' ) + ')' + (CASE WHEN FKs.DeleteAction = 1 THEN ' On Delete Cascade ' ELSE '' END) + (CASE WHEN FKs.UpdateAction = 1 THEN ' On Update Cascade ' ELSE '' END) + ';' as createFKStmt, 'IF (EXISTS(SELECT * FROM sys.foreign_keys WHERE name =''' + FKCs.FK_Name +'''))' + ' alter table ' + FKCs.Prnt_TblSchema + '.'+ FKCs.Prnt_TableName + ' drop constraint ' + FKCs.FK_Name + ';' as DropFKStmnt from FKs inner join FKCs on FKs.id=FKCs.id ) Insert into dbo.ForeignKeyList select createFKStmt, DropFKStmnt from FKStmnt; select createFKStmt from dbo.ForeignKeyList;
The SQL statement in this SQL task shown below generates two types of statements; it generates truncation statement for tables without schema binding and deletion statements for tables with schema binding. The entire result set from the SQL statement is passed as a variable to the FELC-Truncate Tables For Each Loop Container downstream where they are executed to either truncate or delete tables.
Listing 5. Generating truncate and delete statements
With CTE1 AS ( select Distinct v.name, v.object_id, d.depid as DepTble_object_id, d.deptype, OBJECTPROPERTY(v.object_id ,'IsSchemaBound') as schemaBound from sys.views v inner join sysdepends d on v.object_id=d.id where d.deptype=1 ), CTE2 AS ( Select distinct coalesce (CTE1.schemaBound,0) as isSchemaBound, ('['+ s.name+ ']' + '.'+ '['+ t.name+ ']' ) AS TableName FROM sys.schemas s INNER JOIN sys.tables t ON s.schema_id = t.schema_id LEFT JOIN CTE1 ON CTE1.DepTble_object_id=t.object_id Where t.type='U' and t.name in (select Distinct st.TableName From dbo.TableList st Where st.[Truncate]=1 ) and t.name not in (N'TableList',N'ForeignKeyList') ) select Distinct case when CTE2.isSchemaBound=1 then 'Delete ' + TableName else 'truncate table ' + TableName end as Stmnt From CTE2;
This SQL task generates the statements executed by the package to drops the foreign key constraints in the database. In this task, the result set from the SQL statement below is passed as a variable to the FELC-CreateDropFKS For Each Loop Container downstream.
Listing 6. Generating statements to drop foreign key constrains
select 'IF ( EXISTS ( SELECT * FROM sys.foreign_keys WHERE name =''' + object_name( f.object_id ) +'''))' + ' alter table [' + s.Name +'].' + '[' + object_name( f.parent_object_id ) + '] drop constraint [' + f.name + ']' from sys.foreign_keys f INNER JOIN sys.schemas s ON f.schema_id=s.schema_id;
This For Each Loop container loops through drop foreign key statements passed as a result set from the SQL statement in the SQL-CreateDropFKStmnt SQL task upstream. During each loop it passes a drop foreignkey statement row as a variable to the SQL-DropFKs SQL task in the container.
This SQL task executes the drop foreignkey statements that are passed and as variables as the FELC-DropFKs ForEachLoop container loops through the drop foreignkey statements rows in its ResultSet.
This For Each Loop container loops through TRUNCATE or DELETE statements passed as a result set from the SQL statement in the SQL-truncateTables SQL task. During each a loop it passes either of these statement rows as a variable to the SQL-TruncateTables SQL task in the container.
This SQL task executes the TRUNCATE or DELETE table statements that are passed and as variables as the FELC-TruncateTables For Each Loop Container loops through the truncate statement rows in its result set.
This For Each Loop container loops through create foreignkey statements passed as a result set from the SQL statement in the SQL-CreateFKs SQL task. During each a loop it passes a Create foreign key statement row as a variable to the SQL-CreateFKs SQL task in the container.
This SQL task executes the create foreignkey statements that are passed and as variables as the FELC-CreateFKs ForEachLoop Container loops through the truncate statement rows in its ResultSet.
The package has been designed as versatile all-purpose tool that can be used in one-off tasks, but it can be a time saver especially in enterprise level end-to-end BI projects. In one such project it was easily incorporated to truncate four staging areas, each with over 200 hundred entities feeding a global data warehouse and further used to truncate some entities in data marts downstream without the need for a single script.