Temporal Project

  • One of the basis relational operators is PROJECT, which Dr Codd defined in his seminal paper "A Relational Model of Large Shared Data Banks" as

    2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.

    PROJECT is implemented in SQL by specifying the columns in the SELECT section of a statement such as:

    Create table dbo.Person

    (PersonIdinteger not null

    ,CityNamevarchar(50) not null

    ,CountryNamevarchar(50) not null

    ,constraint Person_P primary key (PersonId))

    An example SQL statement for a PROJECT is:

    SELECT PersonId

    , CountryName

    from Person

    However, when the relation is made temporal, so that a history is stored, PRJECT becomes more complex. Given the below table, the desired results is the Person, Country, and a start/end when they continuously resided in that country.

    An English description of the solution is: For the each person and country,

    1. Get possible time periods - this is a triangle join

    2. Exclude where the person resided in a different country during the time period

    3. For the end, use the latest end.

    4. For the lastest end, use the earliest start

    Does anyone have a simplier or more effective solution that can be included as the definition of a view?

    Create table dbo.PersonAddress

    (PersonIdint not null

    ,EffectiveTsdatetimenot null

    ,ExpirationTsdatetimenot null

    ,CityNamevarchar(50)not null

    ,CountryNamevarchar(50)not null

    ,constraint PersonAddress_P primary key (PersonId , EffectiveTs)

    )

    The SQL is:

    Create view PersonCountry AS

    SELECT PersonCountry.PersonId

    ,PersonCountry.CountryName

    ,MIN(PersonCountry.EffectiveTs)as EffectiveTs

    ,PersonCountry.ExpirationTs

    FROM(SELECT F.PersonId

    ,F.CountryName

    ,F.EffectiveTs

    ,MAX(L.ExpirationTs) as ExpirationTs

    FROM PersonAddress F

    JOINPersonAddress L

    ONL.PersonId= F.PersonId

    ANDL.CountryName= F.CountryName

    ANDL.EffectiveTs< L.ExpirationTs

    ANDNOT EXISTS

    --Did not reside in a different country within the time period

    (SELECT * FROM PersonAddress M

    WHEREM.PersonId= F.PersonId

    ANDM.CountryName<> F.CountryName

    ANDM.EffectiveTs>F.EffectiveTs

    ANDM.EffectiveTs< L.ExpirationTs

    )

    group by F.PersonId

    ,F.CountryName

    ,F.EffectiveTs

    ) as PersonCountry

    GROUP BY PersonCountry.PersonId

    ,PersonCountry.CountryName

    ,PersonCountry.ExpirationTs

    ORDER BY PersonCountry.PersonId

    ,EffectiveTs

    Here is test data:

    truncate table dbo.PersonAddress

    insert into dbo.PersonAddress

    (PersonId , EffectiveTs, ExpirationTs ,CityName,CountryName)

    select 1, '2000-01-01','9999-12-31T23:59:59.997','Chicago','USA' union all

    select 2, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all

    select 2, '2001-01-01','9999-12-31T23:59:59.997','Detroit','USA' union all

    select 3, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all

    select 3, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all

    select 3, '2002-01-01','9999-12-31T23:59:59.997','Stratford','CAN' union all

    select 4, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all

    select 4, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all

    select 4, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all

    select 4, '2003-01-01','9999-12-31T23:59:59.997','Toronto','CAN' union all

    select 5, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all

    select 5, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all

    select 5, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all

    select 5, '2003-01-01','2003-12-31T23:59:59.997','Toronto','CAN' union all

    select 5, '2004-01-01','9999-12-31T23:59:59.997','Columbus','USA'

    -- Expected Results

    selectPersonId,CountryName

    ,CAST( EffectiveTs as datetime) as EffectiveTs

    ,CAST( ExpirationTs as datetime) as ExpirationTs

    from(

    select 1,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all

    select 2,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all

    select 3,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all

    select 3,'CAN','2002-01-01T00:00:00.000','9999-12-31T23:59:59.997'union all

    select 4,'USA','2000-01-01T00:00:00.000','2002-12-31T23:59:59.997' union all

    select 4,'CAN','2003-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all

    select 5,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all

    select 5,'CAN','2002-01-01T00:00:00.000','2003-12-31T23:59:59.997' union all

    select 5,'USA','2004-01-01T00:00:00.000','9999-12-31T23:59:59.997'

    ) as PersonCountry (PersonId,CountryName,EffectiveTs,ExpirationTs)

    SQL = Scarcely Qualifies as a Language

  • Hello Carl,

    ït took some time to think of a different approach, but in the end I came up with this. It is based on the idea, that if someone changed address inside of a country at certain point, there will be the same number of future addresses in different countries. So, all rows with the same count of future abroad addresses can be merged into one period of residence in a country.

    SELECT Q.PersonId, Q.CountryName, MIN(Q.EffectiveTs), MAX(Q.ExpirationTs)

    FROM

    /*count how many entries in future in other country?*/

    (SELECT PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs , COUNT(C.PersonId) as counter

    FROM PersonAddress PA1

    LEFT JOIN PersonAddress C ON C.PersonId=PA1.PersonId

    AND C.EffectiveTs > PA1.EffectiveTs

    AND C.CountryName <> PA1.CountryName

    GROUP BY PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs) AS Q

    GROUP BY Q.PersonId, Q.CountryName, Q.counter /*those with equal counter can be merged, because the country didn't change*/

    ORDER BY Q.PersonId, MIN(Q.EffectiveTs)

    I don't know whether there is some effect on performance, I just saw that your execution plan does 2 index scans and 1 index seek, while mine only 2 index scans... so there is a chance that my solution will perform the same as yours or a little better. But you have to test it, of course.

    BTW, check your posted desired output... there is a mistake for PersonId 4. This person moved to Canada in 2002, not 2003.

  • Read your post and an running tests including scalability tests.

    The testing is being performed versus a table containing database file io statistics that have been gathered each hourly for about 150 SQL Servers for the last 60 days. The table has about 2 million rows.

    SQL = Scarcely Qualifies as a Language

  • For a Temporal Project, where the project includes all of the primary key columns, benchmarks of three SQL Server solutions were performed and documented below. The three different solutions were 1) a cursor solution 2) Vladan's solution, 3) solution from "Developing Time-Oriented Database Applications in SQL" by Dr. Richard T. Snodgrass of the University of Arizona.

    Below is the table used for the benchmarks, is populated based on running the system function fn_virtualfilestats hourly versus 86 SQL Server 2000 instances, and the information is retained for 60 days. The table has a total of 2.2 million rows. To eliminate network IO as a factor, the results were written to a permanent table, which was then used to validate that all three solutions produced identical results.

    For an single chosen instance, here are the various cardinalities:

    Table Projected

    Database Name 17 17

    Database Files 34 34

    Total Rows 49,054 21,575

    Average Rows Per Database File 1,442 634

    Maximum Rows Per Database File 1,447 980

    Minimum Rows Per Database File 1,354 1

    The benchmarks for a single SQL Server instance that consists of thirty four database files are:

    Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825

    Logical I/O – Cursor is 104,571 , Vladan's is 2,118 and Literature is 473,825

    Elapsed Seconds – Cursor is 24, Vladan's is 109 and Literature is 3,105.

    The benchmarks for all 86 SQL Server instance are (note that the benchmark for the Literature was cancelled after running for 12 hours !)

    Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825

    Logical I/O – Cursor is 6,581,739 and Vladan's is 38,274,525

    Elapsed Seconds – Cursor is 62 and Vladan's is 6,700

    Attached are four files:

    InstanceFileIO.txt is the table DDL.

    TemporalProject.txt is the SQL solution from the literature.

    TemporalProject_Cursor.txt is a cursor based solution.

    TemporalProject_Vladan.txt is Vladan's solution.

    SQL = Scarcely Qualifies as a Language

  • Thanks for posting the results, Carl... very interesting. I knew that triangular joins don't perform well on large amounts of data, but anyway I was surprised about that much better performance of a cursor.

  • Carl: can you post your data so that we can replicate your test?

    Easiest way is probably to BCP the table and then include as an attachment in a .ZIP file. The BCP command would be something like:

    BCP {dbname}.dbo.InstancefileIO OUT D:\{path}\InstancefileIO.nat -n -T -S {ServerName}

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • As I used live production data and sp_spaceused reports about one and half million rows and 256Mb of space used, posting that much data is not possible.

    SQL = Scarcely Qualifies as a Language

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply