Trying to incorporate RANK or NTILE with LEAD function

  • SoCal_DBD

    Hall of Fame

    Points: 3051

    My scenario is as follows: We have accounts that pay for a particular "premium" service. It's entirely possible an account paid for this service for three consecutive months in 2013, then stopped paying, then started paying again. Why I'm trying to establish is, for the FIRST period of time the accout paid for this service, for how many consecutive months did they pay? Here is my test data:

    if object_id('tempdb..#SampleData') is not null

    drop table #SampleData

    go

    if object_id('tempdb..#DateAnalysis') is not null

    drop table #DateAnalysis

    go

    -- Create temp tables for sample data

    create table #SampleData

    (AccountID int,

    RandomDate datetime)

    create table #DateAnalysis

    (RowID int identity(1,1),

    AccountID int,

    RandomDate datetime,

    NextDate datetime,

    LeadInMonths int)

    -- Insert some sample data

    insert into #SampleData values

    (1, '1/1/13'), (1, '2/1/13'), (1, '3/1/13'), (1, '6/1/13'), (1, '10/1/13'), (1, '11/1/13'), (1, '12/1/13'),

    (2, '1/1/13'), (2, '4/1/13'), (2, '5/1/13'), (2, '6/1/13'),

    (3, '2/1/13'), (3, '3/1/13'), (3, '4/1/13'), (3, '9/1/13'), (3, '10/1/13'), (3, '11/1/13')

    -- Use lead function to determine how many months are between

    -- consecutive dates per account

    ; with DateInterval as

    (select AccountID, RandomDate,

    NextDate = lead (RandomDate, 1, NULL) over (partition by AccountID order by RandomDate)

    from #SampleData)

    insert into #DateAnalysis

    select AccountID, RandomDate, NextDate,

    datediff(mm, RandomDate, NextDate) as 'Lead'

    from DateInterval

    where NextDate is not null -- Last row will contain NULL for NextDate. Don't include these rows.

    -- Show the results

    select *,

    'NTile' = NTILE(3) over (partition by AccountID order by RandomDate),

    'RowNum' = row_number() over (partition by AccountID order by RandomDate)

    from #DateAnalysis

    Results (this is not getting me what I'm looking for):

    RowID AccountID RandomDate NextDate LeadInMonths NTile RowNum

    1 1 2013-01-01 2013-02-01 1 1 1

    2 1 2013-02-01 2013-03-01 1 1 2

    3 1 2013-03-01 2013-06-01 3 2 3

    4 1 2013-06-01 2013-10-01 4 2 4

    5 1 2013-10-01 2013-11-01 1 3 5

    6 1 2013-11-01 2013-12-01 1 3 6

    7 2 2013-01-01 2013-04-01 3 1 1

    8 2 2013-04-01 2013-05-01 1 2 2

    9 2 2013-05-01 2013-06-01 1 3 3

    10 3 2013-02-01 2013-03-01 1 1 1

    11 3 2013-03-01 2013-04-01 1 1 2

    12 3 2013-04-01 2013-09-01 5 2 3

    13 3 2013-09-01 2013-10-01 1 2 4

    14 3 2013-10-01 2013-11-01 1 3 5

    This is what I'm trying to achieve:

    RowID AccountID RandomDate NextDate LeadInMonths RankForThisLeadInMonths

    1 1 2013-01-01 2013-02-01 1 1

    2 1 2013-02-01 2013-03-01 1 1

    3 1 2013-03-01 2013-06-01 3 2

    4 1 2013-06-01 2013-10-01 4 3

    5 1 2013-10-01 2013-11-01 1 4

    6 1 2013-11-01 2013-12-01 1 4

    7 2 2013-01-01 2013-04-01 3 1

    8 2 2013-04-01 2013-05-01 1 2

    9 2 2013-05-01 2013-06-01 1 2

    10 3 2013-02-01 2013-03-01 1 1

    11 3 2013-03-01 2013-04-01 1 1

    12 3 2013-04-01 2013-09-01 5 2

    13 3 2013-09-01 2013-10-01 1 3

    14 3 2013-10-01 2013-11-01 1 3

    The problem comes with accounts like AccountID = 1. They paid consecutively to start, then skipped, then started paying consecutively again. When using window functions, I'm running into trouble attempting to partition by AccountID and LeadInMonths. It's putting all the LeadInMonths = 1 together and that will give me skewed results if I want to know the earliest and latest date within the FIRST consecutive range of dates where the account paid. I've tried NTILE but it expects an integer and there's no telling how many "tiles" would be in AccountID partition.

    I've looked at the OVER clause and the new "ROWS BETWEEN" syntax and still cannot get the desired results. Perhaps I need an entirely different approach?

    Thank you

  • jonharding2204

    SSC-Addicted

    Points: 450

    Hey there,

    This solution gives you what you've described, but with different output columns:

    --First I use LAG to find where a new payment period begins by checking the months between the current row and the previous row

    --split by AccountID

    WITH FlagPaymentPeriods AS

    (

    SELECT

    AccountID

    ,RandomDate

    ,CASE WHEN LAG(RandomDate) OVER (PARTITION BY AccountID ORDER BY RandomDate) IS NULL

    OR DATEDIFF(month, LAG(RandomDate) OVER (PARTITION BY AccountID ORDER BY RandomDate), RandomDate) > 1

    THEN 1

    ELSE 0

    END AS NewPaymentPeriodFlag

    FROM

    SampleData

    )

    --Now I can effectively do a running total of the flags again split by AccountID

    , PaymentPeriods AS

    (

    SELECT

    AccountID

    ,RandomDate

    ,SUM(NewPaymentPeriodFlag) OVER (PARTITION BY AccountID ORDER BY RandomDate) AS PaymentPeriod

    FROM

    FlagPaymentPeriods

    )

    --Now we just find how many rows we have in account where the PaymentPeriod = 1 (the first payment period)

    SELECT

    AccountID

    ,COUNT(PaymentPeriod) AS MonthsPaid

    FROM

    PaymentPeriods

    WHERE

    PaymentPeriod = 1

    GROUP BY

    AccountID

    ,PaymentPeriod

    *EDIT: Changed the aggregate in the last part of the query to use COUNT instead of SUM - makes more sense that way.

    Results are:

    AccountID MonthsPaid

    ----------- -----------

    1 3

    2 1

    3 3

    Hope that helps!

  • SoCal_DBD

    Hall of Fame

    Points: 3051

    Wow!!! That is fantastic! Thank you so much!

  • jonharding2204

    SSC-Addicted

    Points: 450

    No problem, happy to help 🙂

    In case you wanted to extend this to see how many months people pay for in subsequent payment periods, you can change the last part of the query to this:

    SELECT

    AccountID

    ,PaymentPeriod

    ,COUNT(PaymentPeriod) AS MonthsPaid

    FROM

    PaymentPeriods

    GROUP BY

    AccountID

    ,PaymentPeriod

    ORDER BY

    AccountID

    ,PaymentPeriod

  • Dwain Camps

    SSC Guru

    Points: 86883

    Are you looking for something like this?

    SELECT RowID=ROW_NUMBER() OVER (ORDER BY AccountID, RandomDate)

    ,AccountID, RandomDate, NextDate, LeadInMonths

    ,RankForThisLeadInMonths=SUM(lm) OVER

    (

    PARTITION BY AccountID ORDER BY RandomDate

    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

    )

    FROM

    (

    SELECT AccountID, RandomDate, NextDate, LeadInMonths

    ,lm=CASE WHEN LeadInMonths = LAG(LeadInMonths, 1, 0) OVER (PARTITION BY AccountID ORDER BY RandomDate)

    THEN 0 ELSE 1 END

    FROM

    (

    SELECT AccountID, RandomDate

    ,NextDate=LEAD(RandomDate, 1) OVER (PARTITION BY AccountID ORDER BY RandomDate)

    ,LeadInMonths=DATEDIFF(month, RandomDate

    ,LEAD(RandomDate, 1) OVER (PARTITION BY AccountID ORDER BY RandomDate))

    FROM #SampleData

    ) a

    WHERE NextDate IS NOT NULL

    ) a

    ORDER BY AccountID, RandomDate;


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • SoCal_DBD

    Hall of Fame

    Points: 3051

    Yes dwain.c, that would work as well! Thanks for the input.

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply