Deriving a Session Number

  • Hi Guys,

    Could really do with some help trying to derive a field (session_number).

    I have a staging table which is populated with around 10 million rows daily, each day I need to identify the session number for each row by user_id.

    To identify the session_number we need to check if there is a gap greater than 5 minutes between the previous and current start_time.

    I have been able to do this on a small subset, but when dealing with the entire 10 million rows the database seems to process for days, any ideas?

    e.g.

    user_id start_time session_number

    123 09:00:00 1

    123 09:01:00 1

    123 09:02:00 1

    123 09:03:00 1

    123 09:09:00 2

    123 09:22:00 3

    Thanks,

    Rich

  • Would this work?

    create table SessionNumber (userID int, startTime time, sessionNum int)

    insert into SessionNumber

    values (123, '09:00:00', 0)

    , (123, '09:01:00', 0)

    , (123, '09:02:00', 0)

    , (123, '09:05:00', 0)

    , (123, '09:08:00', 0)

    , (123, '09:11:00', 0)

    , (123, '09:16:00', 0)

    , (123, '09:35:00', 0)

    select userID, StartTime, (DATEPART(mi,StartTime) /5)+1 as SessionNum

    from SessionNumber

    I'm not sure how you handle different hours, i.e. does 10:00 get reset back to 1? If not, do you need to find the earliest start time for a userID and start from there?



    Microsoft Certified Master - SQL Server 2008
    Follow me on twitter: @keith_tate

    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Thanks Keith, but I don't think I explained it very well, I need to compare the start_time between the current row and previous row ordered by userid and start_time.

    e.g.

    userid starttime diffrence_from_previous_row session_number

    123 09:00 0 1

    123 09:02 2 1

    123 09:05 3 1

    123 09:11 6 2

    123 09:14 3 2

    123 09:16 2 2

    123 09:25 9 3

    As you can see above, when there is a difference of 5 or greater the session number is incremented.

    The issue I have is that we're talking about doing this everyday for around 10 million rows, so need to do this in the most efficient manner.

    Thanks,

    Rich

  • This will work, but I'm not sure how the performance will be on 10 million rows.

    insert into SessionNumber

    values (123, '09:00:00')

    , (123, '09:01:00')

    , (123, '09:02:00')

    , (123, '09:05:00')

    , (123, '09:08:00')

    , (123, '09:11:00')

    , (123, '09:19:00')

    , (123, '09:35:00')

    , (321, '09:01:00')

    , (321, '09:03:00')

    , (321, '09:14:00');

    with SessionByUser as

    (

    select userID, StartTime, 1 as SessionNumber, ROW_NUMBER() over (Partition by userID order by StartTime) as RowNum

    from SessionNumber

    )

    , cte as

    (

    select a.userID, a.startTime, datediff(mi,coalesce(b.startTime,a.StartTime),a.startTime) / 5 as MinDiff

    from SessionByUser a

    left join SessionByUser b on a.RowNum = b.RowNum + 1

    and a.userID = b.userID

    )

    select userID, StartTime, dense_RANK() over (partition by userID order by MinDiff) as SessionID

    from cte

    order by UserID,StartTime



    Microsoft Certified Master - SQL Server 2008
    Follow me on twitter: @keith_tate

    Forum Etiquette: How to post data/code on a forum to get the best help[/url]

  • Since you're using SQL2012, have you looked at LEAD and LAG?

    My first naive stab would be:

    ;WITH

    priors AS (

    SELECT

    *

    ,prev = LAG(startTime) OVER (

    PARTITION BY userID

    ORDER BY startTime

    )

    FROM SessionNumber

    ),

    starts AS (

    SELECT DISTINCT

    *

    ,sessionID = ROW_NUMBER() OVER (

    PARTITION BY userID

    ORDER BY startTime

    )

    FROM priors s

    WHERE prev IS NULL

    OR ABS(DATEDIFF(SECOND, s.startTime, s.prev)) >= (5 * 60)

    ),

    boundaries AS (

    SELECT DISTINCT

    userID

    ,sessionID

    ,startTime

    ,endTime = COALESCE(

    LEAD(startTime) OVER (

    PARTITION BY userID

    ORDER BY startTime

    ),

    CONVERT(TIME, SYSDATETIME())

    )

    FROM starts

    )

    SELECT *

    FROM boundaries b

    LEFT JOIN SessionNumber s ON

    s.userID = b.userID AND

    s.startTime >= b.startTime AND

    s.startTime < b.endTime

    ...but I'm not sure how well it will work across a large table

    J.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply