join/merge 2 status tables (check statuses at any change)

  • Hi,

    I have 2 tables indicating since when is the status valid for an object. I need to kind of join/merge/union them so I know in one table what are the both stuses at any given change.

    Table1:

    Object status1 date

    Obj1 Open 1998-01-01 00:00:00.000

    Obj1 Closed 2000-06-01 00:00:00.000

    Obj1 Open 2008-03-06 00:00:00.000

    Obj1 Closed 2013-01-01 00:00:00.000

    Table2:

    Object status2 date

    Obj1 ACTIVE 1999-01-01 00:00:00.000

    Obj1 INACTIV 2000-08-29 00:00:00.000

    Obj1 UNKNOWN 2004-05-06 00:00:00.000

    Obj1 check 2014-05-01 00:00:00.000

    as result I would need such thing:

    Object status1 status2 date

    Obj1 Open NULL 1998-01-01 00:00:00.000

    Obj1 Open ACTIVE 1999-01-01 00:00:00.000

    Obj1 Closed ACTIVE 2000-06-01 00:00:00.000

    Obj1 Closed INACTIV 2000-08-29 00:00:00.000

    Obj1 Closed UNKNOWN 2004-05-06 00:00:00.000

    Obj1 Open UNKNOWN 2008-03-06 00:00:00.000

    Obj1 Closed UNKNOWN 2013-01-01 00:00:00.000

    Obj1 Closed check 2014-05-01 00:00:00.000

    here the script to create the source tables:

    CREATE TABLE [dbo].[table2](

    [Object] [nvarchar](1000) NULL,

    [status2] [nvarchar](1000) NULL,

    [date] [datetime] NULL

    ) ON [PRIMARY]

    GO

    INSERT [dbo].[table2] ([Object], [status2], [date]) VALUES (N'Obj1', N'ACTIVE', CAST(0x00008D3F00000000 AS DateTime))

    INSERT [dbo].[table2] ([Object], [status2], [date]) VALUES (N'Obj1', N'INACTIV', CAST(0x00008F9D00000000 AS DateTime))

    INSERT [dbo].[table2] ([Object], [status2], [date]) VALUES (N'Obj1', N'UNKNOWN', CAST(0x000094DF00000000 AS DateTime))

    INSERT [dbo].[table2] ([Object], [status2], [date]) VALUES (N'Obj1', N'check', CAST(0x0000A31E00000000 AS DateTime))

    go

    CREATE TABLE [dbo].[table1](

    [Object] [nvarchar](1000) NULL,

    [status1] [nvarchar](1000) NULL,

    [date] [datetime] NULL

    ) ON [PRIMARY]

    GO

    INSERT [dbo].[table1] ([Object], [status1], [date]) VALUES (N'Obj1', N'Open', CAST(0x00008BD200000000 AS DateTime))

    INSERT [dbo].[table1] ([Object], [status1], [date]) VALUES (N'Obj1', N'Closed', CAST(0x00008F4400000000 AS DateTime))

    INSERT [dbo].[table1] ([Object], [status1], [date]) VALUES (N'Obj1', N'Open', CAST(0x00009A5700000000 AS DateTime))

    INSERT [dbo].[table1] ([Object], [status1], [date]) VALUES (N'Obj1', N'Closed', CAST(0x0000A13900000000 AS DateTime))

  • Well....

    I kind of have a solution, but it's messy (especially that I acutally have more tables like these to merge, and obviuosly there come joins with other tables).

    Maybe someone could suggest something better?

    ;WITH cte as(

    select DISTINCT t.[Object], t.date FROM table1 t

    union

    select DISTINCT t.[Object], t.date FROM table2 t

    )

    select * FROM cte

    outer APPLY (SELECT TOP 1 status1 FROM table1 tx

    WHERE tx.date<=cte.date AND tx.[Object]=cte.[OBJECT]

    ORDER BY tx.date DESC) t1

    outer APPLY (SELECT TOP 1 status2 FROM table2 tx

    WHERE tx.date<=cte.date and tx.[Object]=cte.[OBJECT]

    ORDER BY tx.date DESC) t2

  • I suppose it depends how you define 'messy'.

    You can probably achieve a better solution with a calendar table, but if you have a lot of these status tables, thats still a lot of steps.

    Here is a sample of what I'm talking about:

    CREATE TABLE #table2(

    [Object] [nvarchar](1000) NULL,

    [status2] [nvarchar](1000) NULL,

    [date] [datetime] NULL

    ) ON [PRIMARY]

    GO

    INSERT #table2 ([Object], [status2], [date]) VALUES (N'Obj1', N'ACTIVE', CAST(0x00008D3F00000000 AS DateTime))

    INSERT #table2 ([Object], [status2], [date]) VALUES (N'Obj1', N'INACTIV', CAST(0x00008F9D00000000 AS DateTime))

    INSERT #table2 ([Object], [status2], [date]) VALUES (N'Obj1', N'UNKNOWN', CAST(0x000094DF00000000 AS DateTime))

    INSERT #table2 ([Object], [status2], [date]) VALUES (N'Obj1', N'check', CAST(0x0000A31E00000000 AS DateTime))

    go

    CREATE TABLE #table1(

    [Object] [nvarchar](1000) NULL,

    [status1] [nvarchar](1000) NULL,

    [date] [datetime] NULL

    ) ON [PRIMARY]

    GO

    INSERT #table1 ([Object], [status1], [date]) VALUES (N'Obj1', N'Open', CAST(0x00008BD200000000 AS DateTime))

    INSERT #table1 ([Object], [status1], [date]) VALUES (N'Obj1', N'Closed', CAST(0x00008F4400000000 AS DateTime))

    INSERT #table1 ([Object], [status1], [date]) VALUES (N'Obj1', N'Open', CAST(0x00009A5700000000 AS DateTime))

    INSERT #table1 ([Object], [status1], [date]) VALUES (N'Obj1', N'Closed', CAST(0x0000A13900000000 AS DateTime))

    Create table #Calendar (

    calendar_datedate

    )

    insert into #Calendar

    SELECT top 10000 dateadd(dd, row_number() over (order by s1.name), 'Dec 31, 1997')

    FROM sys.syscolumns S1

    CROSS JOIN sys.syscolumns S2

    ;

    WITH table1CTE AS (

    SELECT [Object], [status1], [date], row_number() over (partition by [Object] order by [date]) AS RowNum

    FROM #table1

    ),

    table1SelfJoin AS (

    SELECT t1.Object,t1.status1,t1.date AS StartDate,COALESCE (t2.date,'Dec 31, 2030') AS EndDate

    FROM table1CTE t1

    LEFT OUTER JOIN table1CTE t2

    ON t1.Object = t2.Object

    AND t1.RowNum +1 =t2.RowNum

    ),

    table1Calendar AS (

    SELECT t1.Object,t2.status1,t2.StartDate,c.calendar_date

    FROM #Calendar c

    CROSS JOIN (select distinct [Object] from #table1) t1

    LEFT OUTER JOIN table1SelfJoin t2

    ON t1.Object = t2.Object

    AND c.calendar_date BETWEEN t2.StartDate AND t2.EndDate

    ),

    table2CTE AS (

    SELECT [Object], [status2], [date], row_number() over (partition by [Object] order by [date]) AS RowNum

    FROM #table2

    ),

    table2SelfJoin AS (

    SELECT t1.Object,t1.status2,t1.date AS StartDate,COALESCE (t2.date,'Dec 31, 2030') AS EndDate

    FROM table2CTE t1

    LEFT OUTER JOIN table2CTE t2

    ON t1.Object = t2.Object

    AND t1.RowNum +1 =t2.RowNum

    ),

    table2Calendar AS (

    SELECT t1.Object,t2.status2,t2.StartDate,c.calendar_date

    FROM #Calendar c

    CROSS JOIN (select distinct [Object] from #table2) t1

    LEFT OUTER JOIN table2SelfJoin t2

    ON t1.Object = t2.Object

    AND c.calendar_date BETWEEN t2.StartDate AND t2.EndDate

    )

    SELECT t1.Object,t1.status1,t2.status2,t1.calendar_date

    FROM table1Calendar t1

    INNER JOIN table2Calendar t2

    ON t1.Object = t2.Object

    AND t1.calendar_date = t2.calendar_date

    WHERE t1.calendar_date in (t1.StartDate,t2.StartDate)

    ORDER by t1.calendar_date

    DRop table #Calendar

    DRop table #table1

    DRop table #table2

    As you can see, as is it is FAR more convoluted than your solution.

    But, not only should the calendar table be permanent and built in advance (with some other handy reference columns in case you want to use it elsewhere), but you could also pre-build a view for each of your status tables which essentially mirrors the calendar CTEs.

    If you pre-do those steps for your status tables, you can merge any of them per your requirements with just the select at the end. And those views will also help other queries where you want to know what a status was on a specific date.

  • tomek tomek (10/16/2013)


    Well....

    I kind of have a solution, but it's messy (especially that I acutally have more tables like these to merge, and obviuosly there come joins with other tables).

    Maybe someone could suggest something better?

    ;WITH cte as(

    select DISTINCT t.[Object], t.date FROM table1 t

    union

    select DISTINCT t.[Object], t.date FROM table2 t

    )

    select * FROM cte

    outer APPLY (SELECT TOP 1 status1 FROM table1 tx

    WHERE tx.date<=cte.date AND tx.[Object]=cte.[OBJECT]

    ORDER BY tx.date DESC) t1

    outer APPLY (SELECT TOP 1 status2 FROM table2 tx

    WHERE tx.date<=cte.date and tx.[Object]=cte.[OBJECT]

    ORDER BY tx.date DESC) t2

    First, I think this is an excellent solution and, particularly, an excellent use of OUTER APPLY. Here are my suggestions:

    1) You can lose the DISTINCT operators in your cte; it causes an extra distinct sort in your query plan which adds cost but does not change your result set. You are using the UNION set operator which will guarantee unique values.

    2) I would avoid using reserved SQL words such as 'date' and 'object' in your column names.

    3) If you can add an clustered index to that table you would see notable performance improvements, especially if you will be doing a lot of joins.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • I agree with Alan.B that you've got a pretty good way already figured out. I also agree with his comment about named columns, but I assume these are just for example.

    There is another way that might be just a smidgen swifter, but you'd need to run a timing test to be sure. That is to use a Quirky Update.

    SELECT [Object], [Date], status1, status2

    INTO #Temp

    FROM

    (

    SELECT [Object], status1=NULL, status2, [date]

    FROM table2

    UNION ALL

    SELECT [Object], status1, NULL, [date]

    FROM table1

    ) a;

    DECLARE @status1 nvarchar(1000)

    ,@status2 nvarchar(1000);

    ALTER TABLE #Temp ALTER COLUMN [Date] DATETIME NOT NULL;

    ALTER TABLE #Temp ADD PRIMARY KEY([Date]);

    UPDATE #Temp WITH(TABLOCKX)

    SET @Status1 = Status1 = CASE WHEN Status1 IS NULL THEN @Status1 ELSE Status1 END

    ,@Status2 = Status2 = CASE WHEN Status2 IS NULL THEN @Status2 ELSE Status2 END

    OPTION (MAXDOP 1);

    SELECT * FROM #Temp ORDER BY [Object], [Date];

    GO

    DROP TABLE #Temp;

    You should carefully read the section on the QU in this article: Solving the Running Total and Ordinal Rank Problems[/url] by Jeff Moden so that you understand the rules that apply when using the QU.

    Initially I wrote that SQL 2012 might have a analytic function (LAG) that might make short work of this problem but I'm now not so sure, as an initial attempt didn't result in anything fruitful. There still could be a way I just don't have the time at the moment to play with it.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • I forgot to add that if there's a possibility that both status will change on the same day, you should replace the UNION ALL where the #Temp table is created with a FULL OUTER JOIN on date/object.

    BTW. Excellent job of providing DDL/consumable sample data for a first time poster. Wish all understood the value of that.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Hi All,

    First of all - Big Thanks for all your input (and good words for newbie as well:).

    @Nevyn - thanks for your solution, but as you yourself suggested - it's quite complicated and I also wonder about performance.

    If possible I try to keep my code as simple as possible (unless performance requires to make it more complicated), as it is possible that some other people will work on that code and they need to understand fast.

    @alan.B. - I'm happy you liked my way - I'm starting to see it's not that bad at all:)

    1. UNION - I feel so stupid - I always thought that union leaves the doubles in first query and removes those from the next.

    I just checked and (of course you're right). thanks for that:)

    2. those were just for example. But as I work in ETL and DWH it happens to me a lot to get the tables with SQL Words in names.

    3. I have to check the performance with clustered index. But it won't be that much of a problem, as the table isn't really that big (so again I will take cleaner code if the performance isn't the issue).

    @Dwain.C

    1. quirky update - I've seen it before. I didn't know the name. it's a great article you've sent me. Especially why one should use maxdop(1) and tablockx.

    2. full outer join vs union all - yep. I've just noticed it. It could also happen that the same status type could change more than one time during a day. But I gues I would have to filter those somehow first...

    3. I know there's lag/lead in sql 2012. I'm doing it on SQL 2008r2 so no go.

    4. DDL - it's just something I've seen in many post. Obviously if you want somebody's help, you need to make it as simple as possible.

    btw. out of 4 forums I've sent the same request, this is were I got the best responses.

    Thanks again.

  • additional question:

    The actual situation I have is like this:

    StatusType STATUS Object date

    ---------- ---------- ---------- -----------------------

    status1 Closed Obj1 2000-06-01 00:00:00.000

    status1 Closed Obj1 2013-01-01 00:00:00.000

    status1 Open Obj1 1998-01-01 00:00:00.000

    status1 Open Obj1 2008-03-06 00:00:00.000

    status2 ACTIVE Obj1 1999-01-01 00:00:00.000

    status2 check Obj1 2014-05-01 00:00:00.000

    status2 INACTIV Obj1 2000-08-29 00:00:00.000

    status2 UNKNOWN Obj1 2004-05-06 00:00:00.000

    ... with more status types.

    What I do now is I create separate CTE (or some kind of query/table) for every StatusType. I could of course pivot it as well.

    But maybe there's better solution in this situation?

    code for source table:

    CREATE TABLE [dbo].[StatusTable](

    [StatusType] [varchar](7) NOT NULL,

    [STATUS] [varchar](10) NULL,

    [Object] [varchar](10) NULL,

    [date] [datetime] NULL

    ) ON [PRIMARY]

    GO

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status1', N'Closed', N'Obj1', CAST(0x00008F4400000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status1', N'Closed', N'Obj1', CAST(0x0000A13900000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status1', N'Open', N'Obj1', CAST(0x00008BD200000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status1', N'Open', N'Obj1', CAST(0x00009A5700000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status2', N'ACTIVE', N'Obj1', CAST(0x00008D3F00000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status2', N'check', N'Obj1', CAST(0x0000A31E00000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status2', N'INACTIV', N'Obj1', CAST(0x00008F9D00000000 AS DateTime))

    INSERT [dbo].[StatusTable] ([StatusType], [STATUS], [Object], [date]) VALUES (N'status2', N'UNKNOWN', N'Obj1', CAST(0x000094DF00000000 AS DateTime))

  • tomek tomek (10/17/2013)


    Hi All,

    First of all - Big Thanks for all your input (and good words for newbie as well:).

    @Dwain.C

    1. quirky update - I've seen it before. I didn't know the name. it's a great article you've sent me. Especially why one should use maxdop(1) and tablockx.

    Long live the Quirky Update, long may it reign! To see its performance characteristic on another classic problem check here:Calculating Values within a Rolling Window in Transact SQL[/url].

    tomek tomek (10/17/2013)


    2. full outer join vs union all - yep. I've just noticed it. It could also happen that the same status type could change more than one time during a day. But I gues I would have to filter those somehow first...

    Because you're using a DATETIME, the FULL JOIN will still work for this case. You will simply get multiple records for the same day with different times in your output results.

    tomek tomek (10/17/2013)


    btw. out of 4 forums I've sent the same request, this is were I got the best responses.

    Thanks again.

    SSC rocks! Tell the world and drive them all to this site!

    Now, as to your additional question in your post that follows this one. I don't understand the expected results you want. You mention PIVOT but I'm not sure what you want to PIVOT.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply