Best Way to Accomplish a Purging Task

  • I am working on a solution to purge data older than a specific time frame. I'm looking to see if there's a high performance solution to this problem.

    This is not the actual data and tables I'm using, but it is the same basic concept.

    Here's the sample schema & data

    CREATE TABLE Orders

    (

    OrderID int NOT NULL,

    CustID int NOT NULL,

    OrderDate datetime NOT NULL

    );

    CREATE TABLE Customer

    (

    CustID int NOT NULL,

    CustName varchar(100) NOT NULL

    );

    INSERT INTO Customer(CustID, CustName) VALUES (1,'Customer1');

    INSERT INTO Customer(CustID, CustName) VALUES (2,'Customer2');

    INSERT INTO Customer(CustID, CustName) VALUES (3,'Customer3');

    INSERT INTO Customer(CustID, CustName) VALUES (4,'Customer4');

    INSERT INTO Orders (OrderID, CustID, OrderDate) VALUES (1, 1, '1/1/2000');

    INSERT INTO Orders (OrderID, CustID, OrderDate) VALUES (2, 2, '1/1/2000');

    INSERT INTO Orders (OrderID, CustID, OrderDate) VALUES (3, 3, '1/1/2007');

    INSERT INTO Orders (OrderID, CustID, OrderDate) VALUES (4, 4, '1/1/2008');

    INSERT INTO Orders (OrderID, CustID, OrderDate) VALUES (5, 1, '1/1/2010');

    The scenario would be to delete orders older than 2005 and to also delete customers if they hadn't placed any orders newer than 2005.

    So for this sample data, I would want to delete Order #'s 1 and 2 as well as Customer # 2, but not Customer # 1 (since they had placed a newer order)

    Deleting the orders is easy and isn't presented here. Deleting the customers efficiently is what I'm concerned about.

    Both real tables have millions of rows, so the solution has to scale.

    Here's what I have so far:

    DECLARE @purged TABLE (OrderID int NOT NULL)

    -- there will be a query that populates this table, but this is just for the example

    INSERT INTO @purged (OrderID) VALUES (1)

    INSERT INTO @purged (OrderID) VALUES (2)

    DELETE c

    FROM Customer c

    WHERE EXISTS(

    SELECT *

    FROM Orders o1

    INNER JOIN @purged p1

    ON o1.OrderID = p1.OrderID

    AND c.CustID = o1.CustID

    )

    AND NOT EXISTS(

    SELECT *

    FROM Orders o2

    LEFT OUTER JOIN @purged p2

    ON o2.OrderID = p2.OrderID

    WHERE p2.OrderID IS NULL

    AND o2.CustID = c.CustID

    )

    This works, but I'm wondering if I'm missing something that would make this more efficient.


    David

  • What version of SQL, what edition?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • SQL 2008 Enterprise


    David

  • Then for the orders at least, consider table partitioning. Partition on order date (or similar), decide on a suitable partition function, maybe one partition per year, then you can delete old orders as a metadata operation, similar to a drop table. It really doesn't get any more efficient than that.

    Customers are a fair bit harder though. Perhaps a batched delete WHERE NOT EXISTS (SELECT 1 FROM Orders WHERE Orders.CustomerID = Customers.CustomerID) (once you've deleted the orders). Delete in chunks of a few thousand rows until the delete deletes no more rows. Often quicker than doing a single delete, less harsh on log and locks.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply