Stairway to Data

Stairway to Data, Level 4: Temporal Data

,

Temporal Data

Temporal data is the hardest type of data for people to handle conceptually. It is a continuum and it is always moving. A continuum is finitely divisible, but computers cannot represent infinite units. Time is infinite back into the past as well as into the future, but the database model assumes that there is a minimum and maximum date and time which can be represented. Time is also everywhere, as Einstein's fourth dimension. Events occur in parallel and simultaneity is very important for queries.

Another major problem with time is that it is both an abstract physical measurement and a legal concept. Let me first deal with the physical side of time, then move to the legal side.

The history of time and calendars is really interesting and there are a lot of good books about that topic. I am going to worry about the way databases handle time, since I don't want to write a book.

Let's take a look at basic units:

second = The most basic unit of measurement of time in the International System of Units. Everything else is not SI; other groups give us the common units we use. It is defined as the duration of 9,192,631,770 cycles of microwave light absorbed or emitted by the hyperfine transition of Cesium-133 atoms in their ground state undisturbed by external fields. That just rolls off the tongue, doesn't it?

minute = A period of time of 60 seconds. Well, not quire. There was leap second which defined a minute as 59, 60 or 61 seconds. This has to do with “leap seconds” and it might be obsolete when you read this. It existed to adjust the astronomical clock with the atomic clock. More and more, we use the atomic clock and the traditional astronomical timekeeping system. But we don't know the difference unless we we are a scientist.

hour = A period of time of 60 minutes.

day = A period of time of 24 hours. Or maybe one rotation of the Earth.

month = A named period of 28, 29, 30 or 31 days on the Common Era calendar.

year = 365.2422 days, or maybe one rotation of the Earth around the sun.

A good website for temporal information is: http://www.timeanddate.com

Events

An event is a single, anchored moment in time. But it depends on the granularity used. Christmas day is an event, if your level of granularity is one day because you are using a wall calendar instead of a watch. The sentence "My English class starts at 15:00 Hrs" talks about an event and probably assumes a granularity of 15 minute intervals, since that is how class schedules are often posted.

In Standard SQL, this is represented by a TIMESTAMP data type. T-SQL unfortunately calls this a DATETIME or DATETIME2 and a totally different meaning for timestamp. This data type has year, month, day, hours, minutes, seconds and a decimal fraction of a second. The SQL Standard is mute on the precision required, but the FIPS-127 conformance tests looked for five decimal places. The new DATETIME2 and TIME in T-SQL goes to seven decimal places, so you are covered.

Durations or Intervals

An interval or duration is a period of time without regard to when it starts and finishes. It is usually associated with an action or state of being. The sentence "A television show takes one hour" talks about the duration of the show.

In Standard SQL, this is represented by an INTERVAL data type which T-SQL does not have as such. The size of the interval in shown with the keyword INTERVAL, a string with dashes and/or colons to separate the fields (YEAR, MONTH, DAY, HOUR, MINUTE and SECOND) and a descriptor of the fields involved. For example

UPDATE Foobar
SET test1 = test1 + INTERVAL '12:15' HOUR TO MINUTE,
test2 = test2 + INTERVAL '5' DAY,
test3 = test3 + INTERVAL '12:15' MINUTE TO SECOND;

This will add twelve hours and fifteen minutes to all the test1 values, five days to test2 values and twelve minutes and fifteen seconds to the test3 values. In T-SQL, you will need to use a DATEADD() function and decide on which interval unit you want to use.

As an aside, the term “field” in SQL refers to the parts of a temporal expression. Columns in a table are not fields. When you hear someone confusing columns and fields or rows and records, you know that we have conceptual problems.

Periods and Cycles

A period is a duration with a fixed starting point in time. Cycles are repeated periods – usually holidays. The sentence "My English class is third period, Mondays and Wednesdays" talks about the both the duration and starting time of the class and again probably assumes a granularity of minutes. Periods repeat and there is no syntax for it. You have to do this with data.

Date Display Formats

Officially, we use the the Common Era calendar. It is not called the Gregorian any more. We also do not use “A.D.” and “B.C.” any more; the correct terms are “CE” (Common Era) and “BCE” (Before Common Era). Using the ISO-8601 date formats is not a matter of choice any more. Too many of the international standards, including Standard SQL, specify the calendar date for transmission and storage of data.

In Standard SQL we use only the ISO-8601 “yyyy-mm-dd” format. There are several other display formats in the ISO-8601 Standards, but the SQL Standard does not recognize them. The reason is simple; we wanted one and only one way to show a date. If you need something fancier, then do it in the front end and not the database. ISO-8601 has been adopted as European Standard EN 28601 and is therefore now a valid standard in all EU countries and all conflicting national standards have been changed accordingly.

Standard SQL has a date range of '0001-01-01' until '9999-12-31', which should be enough for anything you are ever going to do in your lifetime. Before you ask, there are features for historical dates. The MySQL people have a handy extension to ISO-8601 using zeros for years and months. That is; '2011-01-00' is a shorthand for '2011-01-01' to '2011-01-31' range and '2011-00-00' is a shorthand for '2011-01-01' to '2011-12-31' range.

Time of Day

In the ISO Standards, the time of day is based on the UTC, which stands for Universal Coordinated Time. There is no GMT or Greenwich Mean Time and there has not been for years. GMT now refers a time zone. The display format for the time of day is: “hh:mm:ss.sss..” with the FIPS-127 (Federal Information Processing Standards) requiring at least five decimal places.

where hh is the number of complete hours that have passed since midnight (00-24), mm is the number of complete minutes that have passed since the start of the hour (00-59), and ss is the number of complete seconds since the start of the minute (00-59). The separator between the fields is a colon. If the hour value is 24, then the minute and second values must be zero. That is, 24:00:00 on this day is really the first moment of the next day.

Local Time, Lawful Time and UTC

Time zones were set up by railroads and ratified by 27 nations at a conference in Washington, DC in 1884. Today, the International Air Transportation Association (IATA) has replaced the railroads as the industry most concerned about time zones. They publish a table of the time zones of the world (SSIM) twice a year, giving the dates that certain zones switch from regular to Daylight Saving Time (DST).

Unfortunately, political changes has caused very frequent changes in the names of the time zones as well as their DST schedules. When a time is written as shown above, it is assumed to be in some local time zone. In order to indicate that a time is measured in Universal Time (UTC), you can append a capital letter Z to a time: “23:59:59Z” ; The Z stands for the "zero meridian", which goes through Greenwich, England; Hence the Greenwich Mean Time (GMT) standard before 1972. There is a spoken alphabet code used in radio communication where 'Z' is coded as "Zulu", so the ISO Standard was also called "Zulu Time", however this term should no longer

be used. The suffix “±hh:mm” on a time is used to show the displacement from UTC. For time zones east of the zero meridian, which are ahead of UTC, the notation uses a plus sign. For time zones west of the zero meridian, which are behind UTC, the notation uses a minus sign. For example, Central European Time (CET) is +01:00 and

U.S./Canadian Eastern Standard Time (EST) is -05:00. Strangely enough, there are no international standard that specifies abbreviations for civil time zones. However, the use of three letter codes is common.

In addition, politicians enjoy modifying the rules for civil time zones, especially for daylight saving times, every few years, so the only really reliable way of describing a local time zone is to specify numerically the difference of local time to UTC.

Calendar Tables

The Common Era Calendar is irregular. Trying to use computational methods does not work, so we have a SQL programming idiom; the Calendar table. Build a calendar table with one column for the calendar data as its primary key and other columns to show whatever your business needs in the way of temporal information. Do not try to calculate holidays in SQL -- Easter alone requires too much math.

CREATE TABLE Calendar
(cal_date DATE NOT NULL PRIMARY KEY,
fiscal_year INTEGER NOT NULL,
fiscal_month INTEGER NOT NULL,
week_in_year INTEGER NOT NULL, -- SQL Server is not ISO standard
holiday_type INTEGER NOT NULL
CHECK(holiday_type IN ( ..), --
day_in_year INTEGER NOT NULL,
julian_business_day INTEGER NOT NULL,
..);

The Julian business day is a good trick. I will get to it in awhile. Derek Dongray came up with a classification of the public holidays and weekends he needed to work with in multiple countries. Here is his list with more added.

  1. Fixed date every year.
  2. Days relative to Easter.
  3. Fixed date but will slide to next Monday if on a weekend
  4. Fixed date but slides to Monday if Saturday or Tuesday if Sunday (UK Boxing Day is the only one).
  5. Specific day of week after a given date (usually first/last Monday in a month but can be other days, e.g. First Thursday after November 22 = Thanksgiving)
  6. Days relative to Greek Orthodox Easter (not always the same as Western Easter)
  7. Fixed date in Hijri (Muslim) Calendar - this turns out to only be approximate due to the way the calendar works. An Imam has to see a full moon to begin the cycle and declare it.
  8. Days relative to previous Winter Solstice (Chinese holiday of Qing Ming Jie)
  9. Civil holidays set by decree, such as a National Day Of Mourning.
  10. Fixed date except Saturday slides to Friday, and Sunday slides to Monday.
  11. Fixed date, but Tuesday slides to Monday, and Thursday to Friday. (Argentina celebrates October 12, the day Columbus discovered America is a national holiday in Argentina. Except when it's a Tuesday, they back it one day to Monday.

As you can see, some of these are getting a bit esoteric and a bit fuzzy. A calendar table for US Secular holidays can be built from the data at this website, so you will get the three-day weekends: http://www.smart.net/~mmontes/ushols.html

Julianized Calendars

You can sequentially number the dates in your calendar table, play with the numbers and get a simple way to compute the number of business days between dates. The skeleton is simple.

CREATE TABLE WorkDayCalendar
(cal_date DATE NOT NULL PRIMARY KEY,
julian_business_nbr INTEGER NOT NULL,
..);

Loading the calendar is where you need to do some research. Load the table with 50 or 100 years of data and then update the Saturdays and Sundays to the julian_business_nbr of the Friday before them. Finally, renumber the holidays as needed. Here is an example from 2007:

CREATE TABLE WorkDayCalendar
(cal_date DATE NOT NULL PRIMARY KEY,
julian_business_nbr INTEGER NOT NULL,
..);

To compute the business days from Thursday of this week to next Tuesday:

SELECT (C2.julian_business_nbr – C1.julian_business_nbr) AS workday_cnt

FROM WorkDayCalendar AS C1, WorkDayCalendar AS C2

WHERE C1.cal_date = '2007-04-05',

AND C2.cal_date = '2007-04-10';

Report Period Calendars

These tables guarantee that the report periods are documented and shared by everyone in the enterprise. The skeleton looks like this:

CREATE TABLE Report_Periods
(report_period_name VARCHAR (25) NOT NULL PRIMARY KEY,
period_start_date DATE NOT NULL,
period_end_date DATE NOT NULL,
CHECK (period_start_date <= period_end_date),
report_period_type CHAR(5) NOT NULL
CHECK (report_period_type IN (..)),
etc);

The report_period_type tells you if this is fiscal, marketing, sales or whatever.

Use a predicate like "my_date BETWEEN period_start_date AND period_end_date" to put each row into the right report period. Then use "GROUP BY report_period_name" with ROLLUP, CUBE and grouping sets for the fancier stuff.

You can overlap report periods (the "Bikini Madness 2010" promo overlapped with "Swimming Pool Promo”). You can use a fiscal calendar with irregular units of measure. You can think of all kinds of options with fully Standard, portable code.

Time Slot Tables

One of the classic problems for T-SQL programmers before the introduction of TIME as a data type was to put rows with a DATETIME column into buckets or time slots over a day. The usual example is something like counting the number of hits on a website in 5 minute blocks.

The usual solution was to write procedural code to build DATETIME ranges and cut out the time part of the column. It was slow, proprietary and hard to maintain.

Instead, create a table of time slots with more precision than the data it will classify. The extra precision will make sure that you do not have boundary problems. Assume the time stamps are taken in five minutes buckets, or (24 hours * 60 minutes)/ 5 = 288 buckets.

CREATE TABLE TimeSlots
(time_slot_name CHAR(7) NOT NULL PRIMARY KEY,
slot_start_time TIME(2) NOT NULL,
slot_end_time TIME(2) NOT NULL,
CHECK (slot_start_time <= slot_end_time));

Now load the table with one day's worth of time slots. But it is not as easy as it seems. The ISO temporal model is based on half open intervals. That means the final bucket cannot go into midnight (00:00:00 or 24:00:00) the next day – that is a boundary problem.

INSERT INTO TimeSlots
VALUES ('slot_001', '00:00:00.00', '00:05:00.00'),
('slot_002', '00:05:00.10', '00:10:00.00'),
('slot_003', '00:10:00.10', '00:15:00.00'),
..
('slot_288', '23:55:00.10', '23:59:59.99');

The skeleton of the basic query to classify the raw data is:

WITH Classified_Hits (report_date, hit_id, time_slot_name)
AS
(SELECT CAST(hit_date AS DATE), H.hit_id, T.time_slot_name
FROM TimeSlots AS T, Hits AS H
WHERE H.hit_date BETWEEN T.slot_start_time AND T.slot_end_time)

SELECT report_date, time_slot_name, COUNT(hit_id) AS hit_cnt
FROM Classified_Hits
GROUP BY report_date, time_slot_name;

Again, you can extend the "GROUP BY" with ROLLUP, CUBE and grouping sets for the fancier stuff. But the most useful trick is get a running total with a simple VIEW:

CREATE VIEW Running_Buckets (time_slot_name, slot_start_time, slot_end_time)
AS
SELECT time_slot_name, '00:00:00.00', slot_end_time
FROM TimeSlots;

This article is part of the parent stairway Stairway to Data

Rate

3.67 (3)

You rated this post out of 5. Change rating

Share

Share

Rate

3.67 (3)

You rated this post out of 5. Change rating