Temporal Project

Question

Temporal Project

Carl Federl

One Orange Chip

Points: 25384
More actions
May 29, 2008 at 2:38 pm

#207836

One of the basis relational operators is PROJECT, which Dr Codd defined in his seminal paper "A Relational Model of Large Shared Data Banks" as
2.1.2. Projection. Suppose now we select certain columns of a relation (striking out the others) and then remove from the resulting array any duplication in the rows. The final array represents a relation which is said to be a projection of the given relation.
PROJECT is implemented in SQL by specifying the columns in the SELECT section of a statement such as:
Create table dbo.Person
(PersonIdinteger not null
,CityNamevarchar(50) not null
,CountryNamevarchar(50) not null
,constraint Person_P primary key (PersonId))
An example SQL statement for a PROJECT is:
SELECT PersonId
, CountryName
from Person
However, when the relation is made temporal, so that a history is stored, PRJECT becomes more complex. Given the below table, the desired results is the Person, Country, and a start/end when they continuously resided in that country.
An English description of the solution is: For the each person and country,
1. Get possible time periods - this is a triangle join
2. Exclude where the person resided in a different country during the time period
3. For the end, use the latest end.
4. For the lastest end, use the earliest start
Does anyone have a simplier or more effective solution that can be included as the definition of a view?
Create table dbo.PersonAddress
(PersonIdint not null
,EffectiveTsdatetimenot null
,ExpirationTsdatetimenot null
,CityNamevarchar(50)not null
,CountryNamevarchar(50)not null
,constraint PersonAddress_P primary key (PersonId , EffectiveTs)
)
The SQL is:
Create view PersonCountry AS
SELECT PersonCountry.PersonId
,PersonCountry.CountryName
,MIN(PersonCountry.EffectiveTs)as EffectiveTs
,PersonCountry.ExpirationTs
FROM(SELECT F.PersonId
,F.CountryName
,F.EffectiveTs
,MAX(L.ExpirationTs) as ExpirationTs
FROM PersonAddress F
JOINPersonAddress L
ONL.PersonId= F.PersonId
ANDL.CountryName= F.CountryName
ANDL.EffectiveTs< L.ExpirationTs
ANDNOT EXISTS
--Did not reside in a different country within the time period
(SELECT * FROM PersonAddress M
WHEREM.PersonId= F.PersonId
ANDM.CountryName<> F.CountryName
ANDM.EffectiveTs>F.EffectiveTs
ANDM.EffectiveTs< L.ExpirationTs
)
group by F.PersonId
,F.CountryName
,F.EffectiveTs
) as PersonCountry
GROUP BY PersonCountry.PersonId
,PersonCountry.CountryName
,PersonCountry.ExpirationTs
ORDER BY PersonCountry.PersonId
,EffectiveTs
Here is test data:
truncate table dbo.PersonAddress
insert into dbo.PersonAddress
(PersonId , EffectiveTs, ExpirationTs ,CityName,CountryName)
select 1, '2000-01-01','9999-12-31T23:59:59.997','Chicago','USA' union all
select 2, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 2, '2001-01-01','9999-12-31T23:59:59.997','Detroit','USA' union all
select 3, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 3, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 3, '2002-01-01','9999-12-31T23:59:59.997','Stratford','CAN' union all
select 4, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 4, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 4, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all
select 4, '2003-01-01','9999-12-31T23:59:59.997','Toronto','CAN' union all
select 5, '2000-01-01','2000-12-31T23:59:59.997','Chicago','USA' union all
select 5, '2001-01-01','2001-12-31T23:59:59.997','Detroit','USA' union all
select 5, '2002-01-01','2002-12-31T23:59:59.997','Stratford','CAN' union all
select 5, '2003-01-01','2003-12-31T23:59:59.997','Toronto','CAN' union all
select 5, '2004-01-01','9999-12-31T23:59:59.997','Columbus','USA'
-- Expected Results
selectPersonId,CountryName
,CAST( EffectiveTs as datetime) as EffectiveTs
,CAST( ExpirationTs as datetime) as ExpirationTs
from(
select 1,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 2,'USA','2000-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 3,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all
select 3,'CAN','2002-01-01T00:00:00.000','9999-12-31T23:59:59.997'union all
select 4,'USA','2000-01-01T00:00:00.000','2002-12-31T23:59:59.997' union all
select 4,'CAN','2003-01-01T00:00:00.000','9999-12-31T23:59:59.997' union all
select 5,'USA','2000-01-01T00:00:00.000','2001-12-31T23:59:59.997' union all
select 5,'CAN','2002-01-01T00:00:00.000','2003-12-31T23:59:59.997' union all
select 5,'USA','2004-01-01T00:00:00.000','9999-12-31T23:59:59.997'
) as PersonCountry (PersonId,CountryName,EffectiveTs,ExpirationTs)
SQL = Scarcely Qualifies as a Language

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Vladan SSC-Insane Points: 22072 More actions · Answer 1

Hello Carl,

ït took some time to think of a different approach, but in the end I came up with this. It is based on the idea, that if someone changed address inside of a country at certain point, there will be the same number of future addresses in different countries. So, all rows with the same count of future abroad addresses can be merged into one period of residence in a country.

SELECT Q.PersonId, Q.CountryName, MIN(Q.EffectiveTs), MAX(Q.ExpirationTs)

FROM

/*count how many entries in future in other country?*/

(SELECT PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs , COUNT(C.PersonId) as counter

FROM PersonAddress PA1

LEFT JOIN PersonAddress C ON C.PersonId=PA1.PersonId

AND C.EffectiveTs > PA1.EffectiveTs

AND C.CountryName <> PA1.CountryName

GROUP BY PA1.PersonId, PA1.CountryName, PA1.EffectiveTs, PA1.ExpirationTs) AS Q

GROUP BY Q.PersonId, Q.CountryName, Q.counter /*those with equal counter can be merged, because the country didn't change*/

ORDER BY Q.PersonId, MIN(Q.EffectiveTs)

I don't know whether there is some effect on performance, I just saw that your execution plan does 2 index scans and 1 index seek, while mine only 2 index scans... so there is a chance that my solution will perform the same as yours or a little better. But you have to test it, of course.

BTW, check your posted desired output... there is a mistake for PersonId 4. This person moved to Canada in 2002, not 2003.

Carl Federl One Orange Chip Points: 25384 More actions · Answer 2

Read your post and an running tests including scalability tests.

The testing is being performed versus a table containing database file io statistics that have been gathered each hourly for about 150 SQL Servers for the last 60 days. The table has about 2 million rows.

SQL = Scarcely Qualifies as a Language

Carl Federl One Orange Chip Points: 25384 More actions · Answer 3

For a Temporal Project, where the project includes all of the primary key columns, benchmarks of three SQL Server solutions were performed and documented below. The three different solutions were 1) a cursor solution 2) Vladan's solution, 3) solution from "Developing Time-Oriented Database Applications in SQL" by Dr. Richard T. Snodgrass of the University of Arizona.

Below is the table used for the benchmarks, is populated based on running the system function fn_virtualfilestats hourly versus 86 SQL Server 2000 instances, and the information is retained for 60 days. The table has a total of 2.2 million rows. To eliminate network IO as a factor, the results were written to a permanent table, which was then used to validate that all three solutions produced identical results.

For an single chosen instance, here are the various cardinalities:

Table Projected

Database Name 17 17

Database Files 34 34

Total Rows 49,054 21,575

Average Rows Per Database File 1,442 634

Maximum Rows Per Database File 1,447 980

Minimum Rows Per Database File 1,354 1

The benchmarks for a single SQL Server instance that consists of thirty four database files are:

Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825

Logical I/O – Cursor is 104,571 , Vladan's is 2,118 and Literature is 473,825

Elapsed Seconds – Cursor is 24, Vladan's is 109 and Literature is 3,105.

The benchmarks for all 86 SQL Server instance are (note that the benchmark for the Literature was cancelled after running for 12 hours !)

Scans – Cursor is 1, Vladan's is 8 and Literature is 473,825

Logical I/O – Cursor is 6,581,739 and Vladan's is 38,274,525

Elapsed Seconds – Cursor is 62 and Vladan's is 6,700

Attached are four files:

InstanceFileIO.txt is the table DDL.

TemporalProject.txt is the SQL solution from the literature.

TemporalProject_Cursor.txt is a cursor based solution.

TemporalProject_Vladan.txt is Vladan's solution.

SQL = Scarcely Qualifies as a Language

Vladan SSC-Insane Points: 22072 More actions · Answer 4

Thanks for posting the results, Carl... very interesting. I knew that triangular joins don't perform well on large amounts of data, but anyway I was surprised about that much better performance of a cursor.

RBarryYoung SSC Guru Points: 143329 More actions · Answer 5

Carl: can you post your data so that we can replicate your test?

Easiest way is probably to BCP the table and then include as an attachment in a .ZIP file. The BCP command would be something like:

BCP {dbname}.dbo.InstancefileIO OUT D:\{path}\InstancefileIO.nat -n -T -S {ServerName}

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Carl Federl One Orange Chip Points: 25384 More actions · Answer 6

As I used live production data and sp_spaceused reports about one and half million rows and 256Mb of space used, posting that much data is not possible.

SQL = Scarcely Qualifies as a Language