Data Extraction Methods Part 1

, 2006-04-25

To populate the data warehouse we need to get data from the source systems.

The source system could be an RDBMS, it could be a file based system (including

XML), it could be a spreadsheet, or it could be something else (a web service,

a web page, a document, etc). In this article we discuss RDBMS source system,

i.e. the source data is located in a table on SQL Server, Oracle, Informix,

DB2, Access, etc. File based source systems and spreadsheets will be discussed

in a separate article.

There are 3 main methods to extract a table out of an RDBMS source system into

our data warehouse: get the whole table every time, get it incrementally (only

the changes), or using a fixed period. Each of these 3 basic approaches will

be explained below one by one.

There are several considerations when choosing which method to use. The first

and foremost is normally the time window, i.e. the data extraction process should

be completed in certain amount of time. The less time it requires to do data

extraction, the better it is for a data warehouse system, because this means

that we can update the data warehouse more frequently. And it generally means

that we put less load on the source system.

The second consideration is practicality, i.e. what are possible and what are

not possible. Ideally we would like to extract it incrementally, but if it is

not possible then we will have to use other method. Or the other way around:

ideally we would like to extract the whole table (just to be sure), but if there

are 100 millions records in the table, then we may have to do it incrementally

otherwise the extract could take a very long time.

You can read Part 2 here.

1. Whole Table Every time

The simplest way to download a table from the source system into data warehouse

is to extract all records every time, e.g. "SELECT

* FROM table1". For example, a store table. Say a retail group has

500 stores. Each store has a record on this table containing the store name,

address and various attributes such as store grade or classification, last inspection

date and ID of the store manager. For 500 records, even if the store table has

timestamp for both create and update, it is probably quicker to get the entire

table every time.

If we just do a select without a where clause, the extraction begins instantly.

If we put something on the where clause, such as timestamp, the extraction process

begins a few moments later (it could be a second later, it could be a few minutes

later, depending on how many records are on the table on whether the columns

used on the where clause are indexed or not). If you are connected on a LAN

with the source system, say 100 Meg, 58 records should take less than a second.

So, if the delay caused by the where clause is a few seconds, it is quicker

to download the whole table every time.

A quick test using an ETL software can easily measure how long the delay is.

But as a guideline, for a table that has less than 1000 rows, we should definitely

consider getting the whole table every time. If the table has 1000 to 10,000

rows, we should still probably consider getting the whole table every time.

If the table has more than 1 million rows , we should definitely consider downloading

it incrementally (unless the record length is very small, such as 10 bytes).

One sure way of determining whether download the whole table every time is suitable

is to do it and measure the time. If the time it takes is not possible to be

accommodated, then we need to extract the data incrementally.

2. Incremental Extract

If the table contains 2 million records, say it is the customer table, it could

take 15-20 minutes to download the whole table. The purpose giving the stats/numbers

here is only for illustration, so we can get the feel of it. It is not exact

and it is not scientific. It could vary significantly, for example if you are

using RAID 10 or using Gigabyte network. OK, back to customer table. If there

are timestamp columns for creation and updates, then it could be a sensible

idea to try to download this table incrementally, which may decrease the download

time to, say, 3-5 minutes. The number of records decreases from 2 million to

2000 (daily changes only) and the amount of data transferred decreases from

2 GB to 2 MB (assuming the record length is 1000 bytes).

In OLTP systems, customer and product are probably the biggest master tables.

But transaction tables are normally a lot larger. It does depends on the size

of the company (multinational or local), and the industry (telecommunication

and retail normally have huge tables). A transaction table with 100 million

rows are not uncommon. At early stages of building the data warehouse, the one

of the key question to answer is whether it is possible to extract the transaction

tables incrementally. It is often not practical (in terms of time window) to

extract the whole table every time. This question is so important so that for

many projects. So important that if the answer is no then the project could

not proceed.

There are 6 main ways to perform incremental extract:

a. Using timestamp columns

b. Using identity column

c. Using triggers

d. Using transaction date

e. Combination of the above

f. Using off-the-shelf tools

a. Incremental Extract Using Timestamp Columns

If the table is large (>100,000), it is always worth to find out if the

table has creation and update timestamp or datestamp columns, and whether the

timestamps are indexed. And more importantly if the timestamps are updated every

time a new record is created and updated. The index is a simple and straightforward.

We could easily prove if a column is indexed or not. Your source system DBA

would know how to. Different RDBMS has different implementation for indexing

(some platforms call it logical file), but they all speed up the query. Adding

the timestamp on the where clause of the select statement could greatly vary

the extraction time, especially for tables with millions of rows. Sadly, more

often than not, the timestamp columns are not indexed. In this case it may be

worth to find out if they could index it for you. But before you put forward

the request consider the impact to the source system - as you know the more

index there are on a table, the slower the DML e.g. insert, update and delete

process.

The second one is a little bit more difficult to find out: whether the timestamps

are updated every time a new record is created and every time an existing record

is updated. The supplier of the ERP system may confidently say that the timestamps

are always updated on insert and update throughout the application, but your

internal IT colleagues may issue a direct DML statement to the database. "It

is just a normal annual exercise every beginning of fiscal year", so you

may hear from them. Even if the timestamps are maintained by triggers, we can

not be 100% scientifically sure that the timestamps are always kept up-to-date.

Step 4 of the 'normal annual exercise' could be disabling the triggers on that

table!

To be 100% certain, put the table on daily extract based on timestamp and leave

it running for a few weeks. This period does depend on how frequent the table

is updated. If the table is updated daily then a few weeks is a sensible evaluation

period. After this period, compare the table on the data warehouse with the

table on the source system using the primary keys. Find out if there is a "leak",

i.e. records which exist on the source system but you can not find on your DW.

If you find a leak, check your logic on the date movement. Also if the date

window moved when the download process failed. If there is no leak, we need

to check that the updates are working, i.e. all columns on all rows on DW table

matches those on the source system. We do this by using CRC comparison. The

3rd thing we need to check is if there are any deletion on the source system.

This is done using the same means as checking for "leak" above, in

fact they are both done using the same process. Deletion is identified by finding

out if there are records which exist on the data warehouse but not on the source

system. If these 3 checks are satisfied, then we can use the timestamps columns

for incremental loading.

Once the timestamp columns are tested to our satisfaction, we can use them

for incremental extract. The basic syntax for the select statement is: "SELECT

* FROM table1 WHERE date_col > LSET". LSET = Last successful

extract time. Before the extract begins, we store the data warehouse system

time in a variable. If the extract is successful, we store this variable on

data warehouse control database as LSET. If the extract fails, it is important

that we don't store this variable on the database. Next time the extract runs

it will use the stored LSET so it will only get the records since the last successful

extract.

It is important to set the LSET to the time before the extraction begins, not

when the extraction finishes. And it is important to limit the maximum time

we want to extract to. For example: if the data extraction routine runs at 1

am for 10 minutes, and if the data extraction routine is running daily, then

we want to limit the time window for today's data extraction from 09/3/06 01:00:00

to 10/03/06 01:00:00. We do not want records created at 10/03/06 01:00:01, because

this record is for tomorrow's extract. Thanks to Steve Wright from C&C group

for making me aware of the maximum limit. So the where clause becomes: "SELECT

* FROM table1 WHERE date_col > LSET and col1 <= CET", where

CET = Current extract time, i.e. time before the extraction began.

The reason why we don't want records created or updated after the data extraction

began is because we can not guarantee if the record will be extracted or not.

Say for instance we have a record with timestamp of 10/03/06 01:05:00. Data

extract is running for 10 minutes, from 10/03/06 01:00:00 to 10/03/06 01:10:00.

If our select statement is "SELECT * FROM table1

WHERE date_col > 10/03/06 01:00:00" (without the CET), will we

get that record? It depends on the ISOLATION LEVEL, or CONCURRENCY CONTROL of

the source system. Different RDBMS have different implementations of isolation

level: IBM

DB2 on z/OS, IBM

DB2 UDB on iSeries, Microsoft

SQL Server 2005, Oracle

10g, IBM

Informix DS10.

So in the data warehouse control system for each table we are extracting, we

have 2 timestamps defining the range of data extraction: Last Successful Extract

Time and Current Extract Time. In the previous example the LSET is 09/3/06 01:00:00

and the CET is 10/03/06 01:00:00. If the data extraction is successful (for

this particular table), then we set the LSET = CET, so both of them becomes

10/3/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06

01:00:00 and it will extract from 10/3/06 01:00:00 to 11/03/06 01:00:00. On

the other hand, if the extract fails, we don't set the LSET = CET, so the LSET

would still be 09/3/06 01:00:00 and the CET would still be 09/03/06 01:00:00.

Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will

extract from 09/3/06 01:00:00 to 11/03/06 01:00:00.

If the date column is separated from time column, for example date column contains

'28/02/2006' or '28-02-2006' or '2006-02-28' and the time column contains '11.03.21'

or '11:03:21', the algorithm we could use is:

WHERE (date_col > min_date AND date_col < max_date)

-- first segment

OR (date_col = min_date AND time_col > min_time) -- second segment

OR (date_col = max_date AND time_col <= max_time) -- third segment

min_date and min_time are determined from LSET whilst max_date and max_time

are determined from CET.

For example: we want to extract records from 07/03/06 01:00:00 to 10/03/06 01:00:00

(because the data extraction route failed to connect to the source system from

3 days). The first segment above is to get records created/updated on 08/03/06

and 09/03/06. The second segment is to get records created/updated on 07/03/06

after 1 am. The third segment is to get records created/updated on 10/03/06

before 1 am.

It is very very important to ensure that we do not update the LSET if the extraction

fails, otherwise we will loose data. I can not stress this important enough.

If there is only 1 thing you can get from this article, I hope you get this

one: ensure and test that LSET is not updated when the extraction fails.

One apparent weakness of incremental extract using timestamp is that we can

not identify deletion. If the source system is using soft deletion (records

to be deleted are marked), we have nothing to worry about. But if they are using

hard deletion (records to be deleted are physically removed), then we need to

use other technique to identify deletion, for example: using trigger or identity

column. Fortunately, most ERP / business systems do not delete from the main

transaction table, particularly when it is on header-detail format. Instead,

they normally use status column, e.g. order status, call status or transaction

status.

This is because of cascade delete / cascade update on the PK-FK relationship,

or more widely known as "cascading referential integrity constraints".

Basically this means: if the table is a foreign key table (a child table if

you like), you would think twice before deleting rows from that table, because

you have to delete all corresponding rows from the other table (the parent table).

And vice versa: if you want to delete rows from the primary key table (the parent

table if you like) then you need to delete all corresponding rows from the foreign

key table (the child table). Otherwise you will break the referential integrity

constraints.

When the timestamp column is tested for reliability (see 9 paragraphs above,

the one begins with "To be 100% certain ..."), we will know whether

there are deletion involved. It is good to ask the client / system provider

(if there is or there isn't deletion on the transaction table), but we still

have to test it.

One last point on using timestamp for incremental extract: some times we need

to get the timestamp from other table(s), or from combination of this table

(that we want to extract) and other table. For example: order header and order

detail, linked on order number. In early stages, we may think that it is enough

to extract the header table based on the timestamp columns of the header table.

And to extract the detail table based on the timestamp columns of the detail

table. But sometimes it is possible that we will find that it is not enough.

To extract the header table, we may need to use the timestamp columns from both

the header table and the detail table. And to extract the detail table, we also

need to use the timestamp columns from both tables. The code looks like this:

SELECT h.* FROM order_header h

LEFT JOIN order_detail d

ON h.order_number = d.order_number

WHERE (h.timestamp > LSET and h.timestamp <= CET)

OR (d.timestamp > LSET and h.timestamp <= CET)

b. Incremental Extract Using Identity Column

If there is no timestamp column in the transaction table (this is unlikely),

or if the timestamp column is not reliable (this is more likely), then we could

use identity column for incrementally extract the data. Identity column is unique.

We can test the data to make sure there is no duplications as follows: (id_col

being the identity column)

SELECT id_col, count(*) FROM table1

GROUP BY id_col HAVING count(*) > 1

Identity column is sequential. This enables us to easily identify new records.

The basic syntax is WHERE id_col > LSEI. LSEI

= Last successful extract identity, i.e the value of the identity column of

the last record successfully extracted. Again, just like the time window, we

want to limit the maximum. So it becomes WHERE id_col

> LSEI and id_col <= CEI, CEI = current extract identity.

To identify deletion we compare the identity column between the source and

data warehouse. This is done by downloading the identity columns (all records)

from the source system. In many cases, especially if the table is a transactional

table, the identity column is often a primary key, but it's not always. Basically

we identify the missing ones, i.e. rows that exist on source system but not

in the data warehouse. We then delete (or move/archive) these rows from our

data warehouse.

OK, that's new records and deletion. How about updates? To identify updates

we need to download the columns that we want to load to our data warehouse (yes,

all records < LSET unfortunately), and compare these columns with the records

on the data warehouse. If they are different, we update the records in the data

warehouse.

After the source data is downloaded, the basic statement for updating incrementally

is as follows:

UPDATE dw_table dw

SET dw.col1 = src.col1, dw.col2 = src.col2

FROM stage_table src

WHERE dw.col1 <> src.col2, dw.col2 <> src.col2

In most cases they have an archiving or purging system implemented on the ERP

system, resulting in not too many rows on the active transaction table. For

example, only keeping last 12 months data on the order table, resulting in say

5 million rows, with download time of say, 30 minutes. Sometimes they keep many

years records without ever purging them to an archive database, resulting in

(say) 30 million records on the active order detail table. If this is the case,

try to identify whether it is possible to impose a time range for identifying

updates, say last 6 months records. For the purpose of identifying this 6 months

date range, if there is no timestamp / datestamp columns (created_date, last_updated_date,

etc), then transaction date column (order date, delivery date, etc) will do.

This will limit the amount of records downloaded to compare for identifying

updates.

We can also use other mechanism to identify updates, such as update trigger,

update timestamp or log files. The update timestamp doesn't always need to be

from the same table, it could be from other table. For example, in a header

detail relationship.

If the source system is Oracle, the identity column is probably implemented

on primary key column using a trigger and CREATE SEQUENCE, something like this:

oracle-base.com.

In DB2 and SQL Server, identity column is built in. Note that DB2 also support

CREATE SEQUENCE. In Informix it is SERIAL data type.

c. Incremental Extract Using Triggers

Triggers are the most convenient way of doing data extraction. It is the belt

and braces approach. There are 3 types of triggers, e.g. for insert, for update

and for delete. Most RDBMS also differentiate before and after trigger, i.e.

whether the trigger is executed before or after the insert/update/delete. By

creating triggers on the source system table, we can program the source system

to provide us with the records every time there is a new record, when updates

are made to the existing records, or when the records are deleted. We need to

be careful when implementing triggers because it slows down the source system

performance.

One way of implementing triggers in the source system is to keep the primary

keys of the changed table in a specially created table (let's call this delta

table). In the delta table we only keep the primary keys, not the whole record.

This is because different tables have different columns so it is not possible

to create a delta table which can store all tables. In delta table, the primary

keys could be stored in 1 column, using separators. The name of the primary

key columns are stored in another column, also using separators. Delta table

also contains 2 important columns: the creation timestamp column (no need to

have updated timestamp column) and the table name column. This enable us to

get to the right table and to extract it incrementally. The delta table looks

like this:

ID table_name PK_columns PK_values time_stamp
1 inventory store_id|product_id 23|5643_G 17/03/2006 10:10:46
2 order_detail order_id|line_no 454AS1|2 17/03/2006 10:11:01
3 customer customer_id 343 17/03/2006 10:11:98

The delta table needs to be cleared out when it has been processed. When processing

the delta table, it is important to impose a CET, Current Extract Time, i.e.

the system time before the data extraction begins. So the process is: get all

records from delta table where timestamp <= CET, then delete all records

from delta table where timestamp <= CET. Alternatively (preferred, for trace/history

reason, and also for performance - see having a good PK below), we can also

not clearing delta table when we have extracted them, but we simply store the

CET as LSET (Last Successful Extract Time). See the section on timestamp above

for details. If we don't clear the delta table, we need to implement 1) a good

integer identity (1,1) primary key, such as the ID column in the example above,

and please don't forget to cluster index it for performance, and 2) a good purging

mechanism (say leaving last 3 months data) otherwise the system will be slower

and slower every time - in this case a non cluster index on the time_stamp column

is useful. One note on the time_stamp column if your source system in a SQL

Server, I would prefer to put is column as datetime data type rather than timestamp

data type, for compatibility reasons both with future version and with other

RDBMS (portability of implementation).

Depending on the structure and complexity of the source system, sometimes it

is better to have a separate delta table for each table on the source system,

rather than using a single delta table. It is also common to have the delta

tables extracted to files at certain periods of the day, ready for the data

warehouse to get it.

The main difference between doing data extraction using triggers and the previously

mentioned 2 methods is that if we use trigger, we "touch" the source

system. The word touch may be a little bit too soft, the word "change"

is probably more appropriate. Yes, we change the source system. This is not

a luxury that everyone have. In many cases, the source system is an off-the-shelf

ERP system, e.g. SAP, Oracle 11i, Sage, MFG, JD Edwards, PeopleSoft, Axapta,

Sieble CRM, SalesForce, etc. In these cases, most probably, we don't have the

luxury of modifying the source system as we like. If we ask the supplier to

modify it for us, we are risking ourselves to be in the position where our ERP

is not a standard version any more, and therefore a) not upgradable to the next

version, and b) not easily supported by other software when interfacing. Luckily,

the big players in the ERP market such as SAP and Oracle already prepared themselves

and have their own data warehousing tools. Tools such as Oracle Change Data

Capture can extract data incrementally, and it is implemented either using triggers

(synchronous) or using redo log file (asynchronous).

This is the end of part 1. In part 2 we will discuss:

2c. Incremental Extract Using Transaction Date

2d. Incremental Extract Using combination of the above

2e. Incremental Extract Using Off-The-Shelf Tools

3. Fixed Period Extract

Vincent Rainardi

23/03/06

You can read Part 2 here.

Rate

4.8 (5)

Share

Share

Rate

4.8 (5)

Related content

ENTERPRISE METADATA FRAMEWORK

Over the past few years, I’ve had the opportunity to discuss enterprise metadata to a wide variety of audiences and much of this conversation is captured in this “Best Practices” implementation framework. The model has evolved over the past few years as our program continues to do the same. Of course, this summary can only be a few pages long so the depth of the content here will be a tad shallow but you should be able to get the basics from the diagram and the description that follows. Figure 1 provides the new framework and the content follows to describe each section.

2005-05-19

1,821 reads

Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3

To kick off our first column of the year, we're going to take on a challenging subject that all designers face: how to deal with changing dimensions. Unlike most OLTP systems, a major objective of a data warehouse is to track history. So, accounting for change is one of the analyst's most important responsibilities. A sales force region reassignment is a good example of a business change that may require you to alter the dimensional data warehouse. We'll discuss how to apply the right technique to account for the change historically. Hang on to your hats — this is not an easy topic.

2005-02-21

1,868 reads

Surrounding the ETL Requirements

November 13, 2004 / Issue TOC

Surrounding the ETL Requirements

Before designing an ETL system, you must first understand all of your business needs.

By Ralph Kimball , Margy Ross

Ideally, the design of your extract, transform, and load (ETL) system begins with one of the toughest challenges: surrounding the requirements. By this we mean gathering in one place all the known requirements, realities, and constraints affecting the ETL system. The list of requirements is pretty overwhelming, but it's essential to lay them on the table before launching a data warehouse project.

The requirements are mostly things you must live with and adapt your system to. Within the framework of your requirements, you'll have many places where you can make your own decisions, exercise your judgment, and leverage your creativity, but the requirements are just what they're named. They are required.

2005-02-15

1,855 reads