RE: List Transactions that Meet Greater Than Criteria

SSCertifiable

Points: 5956

September 19, 2012 at 9:08 pm

Very interesting problem! Thank you for posting it. There are some subtle "real-world" quirks in here that make this potentially quite complex.

Dwain is correct, posting some SQL Code to create the tables and populate them with test data is very helpful. I've done that here for anyone else who wants to try the problem. I changed the names of two fields for my own sanity: Trans# became TransID (I don't like special characters in database object names) and Date became TransDate because Date is a keyword in SQL Server. Although, as I think about it, PostDate might be a better name, since that seems to better reflect what the data represents: the date that the credit card transaction posted.

Oh, I also put a creation script for a tally table in here. I thought I would need one, but I ended up not using it. I left it for anyone else who might want to use one. I developed this solution on SQL Server 2008 R2, but I believe it would run on version 2005 or later.

CREATE TABLE Customer

( CustKey INT PRIMARY KEY

,SSN NCHAR(9) NOT NULL

);

INSERT Customer (CustKey, SSN)

SELECT 12345, '123456789'

UNION ALL

SELECT 67890, '123456789'

UNION ALL

SELECT 98765, '987654321'

UNION ALL

SELECT 43210, '987654321'

UNION ALL

SELECT 13579, '246801357'

UNION ALL

SELECT 24568, '246801357';

CREATE TABLE TransDtl

( TransID INT PRIMARY KEY

,CustKey NCHAR(9) NOT NULL

,TransDate DATETIME NOT NULL

,Amount MONEY NOT NULL

);

INSERT TransDtl (TransID,CustKey,TransDate,Amount)

SELECT 1, 12345, '01/01/12', 600.00

UNION ALL

SELECT 2, 12345, '01/02/12', 500.00

UNION ALL

SELECT 3, 67890, '01/03/12', 10.00

UNION ALL

SELECT 4, 98765, '04/01/12', 600.00

UNION ALL

SELECT 5, 43210, '04/02/12', 600.00

UNION ALL

SELECT 6, 43210, '04/03/12', 100.00

UNION ALL

SELECT 7, 13579, '04/02/12', 600.00

UNION ALL

SELECT 8, 24568, '04/03/12', 100.00;

SELECT TOP 1000001

IDENTITY(INT,0,1) AS [N]

INTO dbo.Tally

FROM master.sys.all_columns AS AC1

CROSS JOIN master.sys.all_columns AS AC2;

/* Add a tightly packed clustered Primary Key to maximize

* performance and minimize space used.

*/

ALTER TABLE dbo.Tally

ADD CONSTRAINT PK_Tally_N

PRIMARY KEY CLUSTERED (N)

WITH FILLFACTOR = 100;

I'd like to make one quick comment about the data before I dive into my solution. I noticed in the Customer table that some of the SSN's are associated with more than one CustKey. Since you said that you work for a credit card company, I assume that the SSN field represents individual customers and the CustKey field represents individual credit cards. In other words, a customer may have more than one card, but you want to create an alert when a customer charges over a certain amount of money within a certain amount of time, whether they put all of the charges on one card or spread them across multiple cards. (That's one of the "real-world quirks" to which I referred above.)

With that observation out of the way, here is my solution. I'll present the whole thing here and then walk through it step by step.

SET STATISTICS IO ON;

WITH CustomerTrans

AS

( SELECT Customer.SSN AS SSN

,TransDtl.TransID AS TransID

,TransDtl.TransDate AS TransDate

,TransDtl.Amount AS Amount

FROM Customer

JOIN TransDtl

ON Customer.CustKey = TransDtl.CustKey

),

TransGroup

AS

( SELECT A.SSN AS SSN

,ROW_NUMBER() OVER (PARTITION BY A.SSN ORDER BY A.SSN) AS GroupID

,MIN(A.TransDate) AS StartDate

,MAX(B.TransDate) AS EndDate

FROM CustomerTrans AS A

JOIN CustomerTrans AS B

ON A.SSN = B.SSN

AND DATEDIFF(Day,A.TransDate,B.TransDate) BETWEEN 0 AND 1

GROUP BY A.SSN

),

GroupTotal

AS

( SELECT TransGroup.SSN AS SSN

,TransGroup.GroupID AS TransGroupID

,SUM(CustomerTrans.Amount) AS TransGroupAmount

FROM TransGroup

JOIN CustomerTrans

ON TransGroup.SSN = CustomerTrans.SSN

AND CustomerTrans.TransDate BETWEEN TransGroup.StartDate AND TransGroup.EndDate

GROUP BY TransGroup.SSN

,TransGroup.GroupID

HAVING SUM(CustomerTrans.Amount) >= 1000.00

),

FlaggedCustomer

AS

( SELECT SSN

FROM GroupTotal

GROUP BY SSN

)

SELECT CustomerTrans.TransID

,FlaggedCustomer.SSN

,CustomerTrans.TransDate

,CustomerTrans.Amount

FROM FlaggedCustomer

JOIN CustomerTrans

ON FlaggedCustomer.SSN = CustomerTrans.SSN;

First, you'll note that I've set Statistics IO on. That's because I want to see how many times my solution scans the data tables. That will give us some clue as to how well the solution might scale. We don't have much data to work with here, but I imagine that you have a massive amount of data to query at work. My solution scans the data a total of 5 times. Ideally, we'd like for the solution to scan the data only once. There are other factors to consider, of course, but this is just a quick and dirty rule of thumb. Based on this, I would expect my solution to perform fairly well on large data sets. (I also tried Dwain's solution with statistics IO on, and his solution does a total of 6 scans. So, the performance of the two solutions would probably be very similar.)

My solution is all one query, but I've broken the "steps" I use into Common Table Expressions (or CTE's). Each CTE builds on the previous CTE's. So, below I'll re-post each CTE separately and explain what it does and how this builds into a solution to your problem.

WITH CustomerTrans

AS

( SELECT Customer.SSN AS SSN

,TransDtl.TransID AS TransID

,TransDtl.TransDate AS TransDate

,TransDtl.Amount AS Amount

FROM Customer

JOIN TransDtl

ON Customer.CustKey = TransDtl.CustKey

),

The first CTE associates all of the individual transactions with the Customer SSN's, rather than the CustKey in the TransDtl table. We query this CTE throughout the rest of the solution rather than the source TransDtl table, because we need to find Customers that posted more than $1,000 in charges over a two day period, not individual Credit Cards.

TransGroup

AS

( SELECT A.SSN AS SSN

,ROW_NUMBER() OVER (PARTITION BY A.SSN ORDER BY A.SSN) AS GroupID

,MIN(A.TransDate) AS StartDate

,MAX(B.TransDate) AS EndDate

FROM CustomerTrans AS A

JOIN CustomerTrans AS B

ON A.SSN = B.SSN

AND DATEDIFF(Day,A.TransDate,B.TransDate) BETWEEN 0 AND 1

GROUP BY A.SSN

),

This second CTE is the core part of my solution. This is loosely based on one of Itzik Ben-Gan's solutions to the "Islands" problem (finding groups of sequential sequences). I got it from the first volume SQL Server MVP Deep Dives book, but I imagine you can find it online somewhere. The SQL Server Central article that BitBucket references above probably mentions it, too.

However, another one of those "real-world quirks" in this problem is that the intervals are not fixed. We want to find the customers with transactions totaling $1,000 or more in any two day period. So, here I join the CustomerTrans CTE I just created to itself on the Customer SSN where the the post date of the transaction in the first data set is within one day of the transaction in the second data set. Because of the way the DATEDIFF function is written here, the A transactions will only join with other transactions posted on the same day or the next day, and the B transactions will only join with other transactions posted on the same day or the previous day.

Because we don't filter out results where A.TransID = B.TransID this self-join will generate duplicate rows, but that is OK. We just want the starting date for this group of transactions, the ending date and some kind of Group Identifier that we can use later. (Besides, if a customer charges over $1,000 in a single transaction, that meets the selection criteria set out in the problem description.)

That Group Identifier is generated by the ROW_NUMBER() function and it is the key to this portion of the solution. (It is also what I borrowed from Itzik Ben-Gan's solution to the "Islands" problem.)

GroupTotal

AS

( SELECT TransGroup.SSN AS SSN

,TransGroup.GroupID AS TransGroupID

,SUM(CustomerTrans.Amount) AS TransGroupAmount

FROM TransGroup

JOIN CustomerTrans

ON TransGroup.SSN = CustomerTrans.SSN

AND CustomerTrans.TransDate BETWEEN TransGroup.StartDate AND TransGroup.EndDate

GROUP BY TransGroup.SSN

,TransGroup.GroupID

HAVING SUM(CustomerTrans.Amount) >= 1000.00

),

The third CTE joins back to our first CTE, CustomerTrans, groups the transactions by Customer SSN and the Group ID we generated in the second CTE and calculates the total transaction amount for each group. Finally it filters out the groups having a total transaction amount less than $1,000.

FlaggedCustomer

AS

( SELECT SSN

FROM GroupTotal

GROUP BY SSN

)

Our final CTE before we present our final output gets a distinct list of the customers meeting the criteria set forth in the problem.

SELECT CustomerTrans.TransID

,FlaggedCustomer.SSN

,CustomerTrans.TransDate

,CustomerTrans.Amount

FROM FlaggedCustomer

JOIN CustomerTrans

ON FlaggedCustomer.SSN = CustomerTrans.SSN;

And here is our final output. We use the list of Customers with more than $1,000 in charges within any two-day period from the last CTE, join it back to the original list of Customer transactions and list all of the transactions for Customers meeting the criteria.

I'd be very interested to see what solutions others can create. I'd also like to see how the various solutions perform on large data sets. I would expect my solution to have reasonable performance on larger data sets, but it could certainly be beaten. Perhaps Phil Factor would be interested in presenting a similar problem on his Speed Phreaks site. 🙂

Thanks again for posting your problem. I hope that this has been helpful to you. 😎