Data Extraction Methods Part 1

By Vincent Rainardi

To populate the data warehouse we need to get data from the source systems. The source system could be an RDBMS, a file based system (including XML), a spreadsheet, or something else (a web service, a web page, a document, etc.). In this article we discuss RDBMS source systems, i.e. the source data is located in tables on SQL Server, Oracle, Informix, DB2, Access, etc. File based source systems and spreadsheets will be discussed in a separate article.

There are 3 main methods to extract a table out of an RDBMS source system into our data warehouse: get the whole table every time, get it incrementally (only the changes), or get it using a fixed period. Each of these 3 basic approaches is explained below, one by one.

There are several considerations when choosing which method to use. The first and foremost is normally the time window, i.e. the data extraction process should be completed within a certain amount of time. The less time the extraction takes, the better it is for a data warehouse system, because it means we can update the data warehouse more frequently, and it generally means we put less load on the source system.

The second consideration is practicality, i.e. what is possible and what is not. Ideally we would like to extract incrementally, but if that is not possible we have to use another method. Or the other way around: ideally we would like to extract the whole table (just to be sure), but if there are 100 million records in the table, we may have to do it incrementally, otherwise the extract could take a very long time.

You can read Part 2 here.

1. Whole Table Every Time

The simplest way to download a table from the source system into the data warehouse is to extract all records every time, e.g. "SELECT * FROM table1". Take a store table, for example. Say a retail group has 500 stores. Each store has a record in this table containing the store name, address and various attributes such as store grade or classification, last inspection date and the ID of the store manager. For 500 records, even if the store table has timestamps for both create and update, it is probably quicker to get the entire table every time.

If we just do a select without a where clause, the extraction begins instantly. If we put something in the where clause, such as a timestamp, the extraction process begins a few moments later (it could be a second later, it could be a few minutes later, depending on how many records are in the table and whether the columns used in the where clause are indexed). If you are connected to the source system over a LAN, say 100 Mbps, 500 records should take less than a second to transfer. So, if the delay caused by the where clause is a few seconds, it is quicker to download the whole table every time.

A quick test using an ETL tool can easily measure how long the delay is. As a guideline, for a table that has fewer than 1,000 rows we should definitely consider getting the whole table every time. If the table has 1,000 to 10,000 rows, we should still probably consider getting the whole table every time. If the table has more than 1 million rows, we should definitely consider downloading it incrementally (unless the record length is very small, such as 10 bytes). One sure way of determining whether downloading the whole table every time is suitable is to do it and measure the time. If that time cannot be accommodated, then we need to extract the data incrementally.

2. Incremental Extract

If the table contains 2 million records, say it is the customer table, it could take 15-20 minutes to download the whole table. The numbers given here are for illustration only, so we can get a feel for it; they are not exact or scientific and could vary significantly, for example if you are using RAID 10 or a Gigabit network. Back to the customer table: if there are timestamp columns for creation and updates, then it could be sensible to download this table incrementally, which may decrease the download time to, say, 3-5 minutes. The number of records decreases from 2 million to 2,000 (daily changes only) and the amount of data transferred decreases from 2 GB to 2 MB (assuming the record length is 1000 bytes).

In OLTP systems, customer and product are probably the biggest master tables, but transaction tables are normally a lot larger. It does depend on the size of the company (multinational or local) and the industry (telecommunications and retail normally have huge tables). A transaction table with 100 million rows is not uncommon. At the early stages of building the data warehouse, one of the key questions to answer is whether it is possible to extract the transaction tables incrementally, because it is often not practical (in terms of time window) to extract the whole table every time. This question is so important that for many projects, if the answer is no, the project cannot proceed.

There are 6 main ways to perform incremental extract:
a. Using timestamp columns
b. Using identity column
c. Using triggers
d. Using transaction date
e. Combination of the above
f. Using off-the-shelf tools

a. Incremental Extract Using Timestamp Columns

If the table is large (say more than 100,000 rows), it is always worth finding out whether the table has creation and update timestamp or datestamp columns, whether those columns are indexed, and, more importantly, whether they are updated every time a record is created or updated. The index question is simple and straightforward: we can easily prove whether a column is indexed or not, and your source system DBA would know how. Different RDBMSs have different implementations of indexing (some platforms call it a logical file), but they all speed up the query. Adding the timestamp to the where clause of the select statement can greatly affect the extraction time, especially for tables with millions of rows. Sadly, more often than not, the timestamp columns are not indexed. In that case it may be worth finding out whether the source system team could index them for you. But before you put forward the request, consider the impact on the source system: as you know, the more indexes there are on a table, the slower the DML (insert, update and delete) becomes.
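
On SQL Server, for example, one way to check whether a timestamp column is covered by an index is to query the catalog views. A minimal sketch, assuming a hypothetical source table dbo.order_header with a last_updated column:

-- List the indexes on dbo.order_header that contain the column last_updated
-- (hypothetical table and column names; adjust to your source system)
SELECT i.name AS index_name, ic.key_ordinal, ic.is_included_column
FROM sys.indexes i
JOIN sys.index_columns ic
  ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns c
  ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.order_header')
  AND c.name = 'last_updated';

If this returns no rows, the column is not part of any index. Other platforms have equivalent catalog tables or commands; ask the source system DBA.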

The second one is a little more difficult to find out: whether the timestamps are updated every time a new record is created and every time an existing record is updated. The supplier of the ERP system may confidently say that the timestamps are always updated on insert and update throughout the application, but your internal IT colleagues may issue direct DML statements against the database. "It is just a normal annual exercise at the beginning of every fiscal year", you may hear from them. Even if the timestamps are maintained by triggers, we cannot be 100% sure that the timestamps are always kept up to date. Step 4 of the 'normal annual exercise' could be disabling the triggers on that table!

To be 100% certain, put the table on a daily extract based on timestamp and leave it running for a few weeks. This period depends on how frequently the table is updated; if the table is updated daily then a few weeks is a sensible evaluation period. After this period, compare the table in the data warehouse with the table in the source system using the primary keys. Find out if there is a "leak", i.e. records which exist in the source system but cannot be found in the data warehouse. If you find a leak, check your logic on the date movement, and also check whether the date window moved when the download process failed. If there is no leak, we need to check that the updates are working, i.e. that all columns on all rows in the DW table match those in the source system. We can do this using a CRC (checksum) comparison. The third thing we need to check is whether there are any deletions in the source system. This is done using the same means as checking for a "leak" above; in fact both are done in the same process. A deletion is identified by finding records which exist in the data warehouse but not in the source system. If these 3 checks are satisfied, then we can use the timestamp columns for incremental loading.
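
As an illustration, a minimal T-SQL sketch of the leak and update checks, assuming the source rows (or at least their keys and compared columns) have been staged into a hypothetical table src_customer, with dw_customer as the warehouse copy:

-- "Leak" check: rows that exist in the source but are missing from the warehouse.
-- Swapping the two tables in the same query identifies deletions instead.
SELECT s.customer_id
FROM src_customer s
LEFT JOIN dw_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;

-- Update check: rows whose compared columns differ.
-- CHECKSUM is used here as a cheap CRC-style comparison; a column-by-column
-- comparison is the safer (if slower) alternative.
SELECT d.customer_id
FROM dw_customer d
JOIN src_customer s ON s.customer_id = d.customer_id
WHERE CHECKSUM(d.name, d.address, d.grade) <> CHECKSUM(s.name, s.address, s.grade);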

Once the timestamp columns have been tested to our satisfaction, we can use them for incremental extract. The basic syntax of the select statement is: "SELECT * FROM table1 WHERE date_col > LSET", where LSET is the Last Successful Extract Time. Before the extract begins, we store the data warehouse system time in a variable. If the extract is successful, we store this variable in the data warehouse control database as the LSET. If the extract fails, it is important that we do not store this variable in the database. The next time the extract runs it will use the stored LSET, so it will only get the records changed since the last successful extract.

It is important to set the LSET to the time before the extraction begins, not when the extraction finishes. And it is important to limit the maximum time we want to extract to. For example: if the data extraction routine runs daily at 1 am for 10 minutes, then we want to limit the time window for today's data extraction from 09/03/06 01:00:00 to 10/03/06 01:00:00. We do not want records created at 10/03/06 01:00:01, because those records are for tomorrow's extract. Thanks to Steve Wright from C&C group for making me aware of the maximum limit. So the where clause becomes: "SELECT * FROM table1 WHERE date_col > LSET AND date_col <= CET", where CET is the Current Extract Time, i.e. the time just before the extraction began.

The reason why we don't want records created or updated after the data extraction began is that we cannot guarantee whether such a record will be extracted or not. Say for instance we have a record with a timestamp of 10/03/06 01:05:00, and the data extract runs for 10 minutes, from 10/03/06 01:00:00 to 10/03/06 01:10:00. If our select statement is "SELECT * FROM table1 WHERE date_col > '10/03/06 01:00:00'" (without the CET), will we get that record? It depends on the isolation level (concurrency control) of the source system, and different RDBMSs have different implementations of isolation levels; see the documentation for IBM DB2 on z/OS, IBM DB2 UDB on iSeries, Microsoft SQL Server 2005, Oracle 10g and IBM Informix DS10 for details.

So in the data warehouse control system, for each table we are extracting we keep 2 timestamps defining the range of the data extraction: the Last Successful Extract Time (LSET) and the Current Extract Time (CET). In the previous example the LSET is 09/03/06 01:00:00 and the CET is 10/03/06 01:00:00. If the data extraction is successful (for this particular table), then we set LSET = CET, so both become 10/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 10/03/06 01:00:00 to 11/03/06 01:00:00. On the other hand, if the extract fails, we do not set LSET = CET, so the LSET remains 09/03/06 01:00:00. Tomorrow when the extract begins, the CET will be 11/03/06 01:00:00 and it will extract from 09/03/06 01:00:00 to 11/03/06 01:00:00.
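
A minimal sketch of this control mechanism in T-SQL, assuming a hypothetical control table etl_control(table_name, LSET) and a staging target stage_table1:

DECLARE @LSET datetime, @CET datetime;

SELECT @LSET = LSET FROM etl_control WHERE table_name = 'table1';
SET @CET = GETDATE();   -- capture the current extract time BEFORE the extraction starts

INSERT INTO stage_table1 (col1, col2, date_col)
SELECT col1, col2, date_col
FROM table1
WHERE date_col > @LSET AND date_col <= @CET;

-- Only when the extract (and load) has succeeded do we move the window forward
UPDATE etl_control SET LSET = @CET WHERE table_name = 'table1';

In a real ETL package the final UPDATE would sit in the success branch of the workflow, so a failed extract leaves the LSET untouched.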

If the date column is separate from the time column, for example the date column contains '28/02/2006' or '28-02-2006' or '2006-02-28' and the time column contains '11.03.21' or '11:03:21', the algorithm we could use is:
WHERE (date_col > min_date AND date_col < max_date) -- first segment
OR (date_col = min_date AND time_col > min_time) -- second segment
OR (date_col = max_date AND time_col <= max_time) -- third segment

min_date and min_time are determined from the LSET whilst max_date and max_time are determined from the CET.
For example: we want to extract records from 07/03/06 01:00:00 to 10/03/06 01:00:00 (because the data extraction routine failed to connect to the source system for 3 days). The first segment above gets records created/updated on 08/03/06 and 09/03/06. The second segment gets records created/updated on 07/03/06 after 1 am. The third segment gets records created/updated on 10/03/06 at or before 1 am.
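
Putting the worked example into a concrete (hypothetical) sketch, with the LSET and CET split into date and time parts:

-- Hypothetical variables for the 07/03/06 01:00:00 to 10/03/06 01:00:00 window.
-- date_col and time_col are assumed to be stored in formats that sort correctly,
-- e.g. 'YYYY-MM-DD' and 'HH:MM:SS'.
DECLARE @min_date char(10), @min_time char(8), @max_date char(10), @max_time char(8);
SET @min_date = '2006-03-07';  SET @min_time = '01:00:00';
SET @max_date = '2006-03-10';  SET @max_time = '01:00:00';

SELECT *
FROM table1
WHERE (date_col > @min_date AND date_col < @max_date)    -- whole days in between (08/03 and 09/03)
   OR (date_col = @min_date AND time_col >  @min_time)   -- 07/03 after 1 am
   OR (date_col = @max_date AND time_col <= @max_time);  -- 10/03 up to and including 1 am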

It is very important to ensure that we do not update the LSET if the extraction fails, otherwise we will lose data. I cannot stress this enough. If there is only one thing you take from this article, I hope it is this: ensure, and test, that the LSET is not updated when the extraction fails.

One apparent weakness of incremental extract using timestamps is that we cannot identify deletions. If the source system uses soft deletion (records to be deleted are only marked), we have nothing to worry about. But if it uses hard deletion (records to be deleted are physically removed), then we need another technique to identify deletions, for example triggers or an identity column. Fortunately, most ERP / business systems do not delete from the main transaction tables, particularly when they are in header-detail format. Instead, they normally use a status column, e.g. order status, call status or transaction status.

This is because of the PK-FK relationships between these tables, and the cascade delete / cascade update behaviour that goes with them, more widely known as "cascading referential integrity constraints". Basically this means: if you want to delete rows from the primary key table (the parent table, if you like), you first need to delete all corresponding rows from the foreign key table (the child table), or have cascade delete configured; otherwise you will break the referential integrity constraint. Deleting an order header therefore means deleting all its order detail rows as well, which is why applications tend to think twice before physically deleting transaction rows.

When the timestamp column is tested for reliability (see the paragraph above that begins with "To be 100% certain ..."), we will also find out whether deletions are involved. It is good to ask the client / system provider whether there are deletions on the transaction table, but we still have to test it.

One last point on using timestamps for incremental extract: sometimes we need to get the timestamp from other table(s), or from a combination of the table we want to extract and other tables. For example: order header and order detail, linked on order number. In the early stages we may think it is enough to extract the header table based on the timestamp columns of the header table, and the detail table based on the timestamp columns of the detail table. But sometimes we find this is not enough: to extract the header table we may need to use the timestamp columns from both the header table and the detail table, and likewise for the detail table. The code looks like this:

SELECT DISTINCT h.* FROM order_header h
LEFT JOIN order_detail d
ON h.order_number = d.order_number
WHERE (h.timestamp > LSET AND h.timestamp <= CET)
OR (d.timestamp > LSET AND d.timestamp <= CET)

b. Incremental Extract Using Identity Column

If there is no timestamp column in the transaction table (this is unlikely), or if the timestamp column is not reliable (this is more likely), then we could use an identity column to extract the data incrementally. An identity column is unique. We can test the data to make sure there are no duplicates as follows (id_col being the identity column):
SELECT id_col, count(*) FROM table1
GROUP BY id_col HAVING count(*) > 1

An identity column is sequential, which enables us to easily identify new records. The basic syntax is WHERE id_col > LSEI, where LSEI is the Last Successful Extract Identity, i.e. the value of the identity column of the last record successfully extracted. Again, just like the time window, we want to limit the maximum, so it becomes WHERE id_col > LSEI AND id_col <= CEI, where CEI is the Current Extract Identity.
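
A minimal sketch of that workflow, assuming the same kind of hypothetical control table as before, this time holding the last successful extract identity (LSEI):

DECLARE @LSEI bigint, @CEI bigint;

SELECT @LSEI = LSEI FROM etl_control WHERE table_name = 'table1';
SELECT @CEI = MAX(id_col) FROM table1;   -- upper bound for this extract

INSERT INTO stage_table1 (id_col, col1, col2)
SELECT id_col, col1, col2
FROM table1
WHERE id_col > @LSEI AND id_col <= @CEI;

-- Only on success do we move the window forward
UPDATE etl_control SET LSEI = @CEI WHERE table_name = 'table1';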

To identify deletions we compare the identity column between the source and the data warehouse. This is done by downloading the identity column (all records) from the source system. In many cases, especially if the table is a transaction table, the identity column is also the primary key, but not always. Basically we identify the missing ones, i.e. rows that exist in the data warehouse but no longer exist in the source system. We then delete (or move/archive) these rows from our data warehouse.
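
A sketch of that comparison, assuming the source identity values have been staged into a hypothetical table src_id_list(id_col):

-- Rows present in the warehouse but no longer in the source: candidates for
-- deletion (or, in practice, often for archiving or flagging) in the warehouse
DELETE d
FROM dw_table d
LEFT JOIN src_id_list s ON s.id_col = d.id_col
WHERE s.id_col IS NULL;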

OK, that covers new records and deletions. How about updates? To identify updates we need to download the columns that we want to load into our data warehouse (yes, unfortunately for all existing records, i.e. id_col <= LSEI), and compare these columns with the records in the data warehouse. If they are different, we update the records in the data warehouse.

After the source data has been downloaded into a staging table, the basic statement for applying the updates is as follows (joining on the identity / primary key column):
UPDATE dw
SET dw.col1 = src.col1, dw.col2 = src.col2
FROM dw_table dw
INNER JOIN stage_table src ON dw.id_col = src.id_col
WHERE dw.col1 <> src.col1 OR dw.col2 <> src.col2

In most cases there is an archiving or purging mechanism implemented on the ERP system, so there are not too many rows in the active transaction table. For example, keeping only the last 12 months of data in the order table might result in, say, 5 million rows, with a download time of, say, 30 minutes. Sometimes, however, many years of records are kept without ever being purged to an archive database, resulting in (say) 30 million records in the active order detail table. If this is the case, try to identify whether it is possible to impose a time range for identifying updates, say the last 6 months of records. For the purpose of identifying this 6-month range, if there are no timestamp / datestamp columns (created_date, last_updated_date, etc.), then a transaction date column (order date, delivery date, etc.) will do. This limits the amount of data downloaded for the update comparison.
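
For example, a short sketch of limiting the comparison download to the last 6 months using a transaction date column (hypothetical table and column names):

-- Download only the last 6 months of the order detail table for the update comparison
INSERT INTO stage_order_detail (order_id, line_no, product_id, qty, order_date)
SELECT order_id, line_no, product_id, qty, order_date
FROM order_detail
WHERE order_date >= DATEADD(month, -6, GETDATE());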

We can also use other mechanisms to identify updates, such as an update trigger, an update timestamp or log files. The update timestamp does not always need to be from the same table; it could be from another table, for example in a header-detail relationship.

If the source system is Oracle, the identity column is probably implemented on the primary key column using a trigger and CREATE SEQUENCE (see oracle-base.com for examples). In DB2 and SQL Server, identity columns are built in; note that DB2 also supports CREATE SEQUENCE. In Informix it is the SERIAL data type.
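
For illustration, a typical (hypothetical) Oracle pattern for populating a surrogate key from a sequence via a trigger might look like this:

-- Illustrative pre-12c Oracle sketch, assuming a hypothetical table order_header(order_id, ...)
CREATE SEQUENCE order_header_seq START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER order_header_bi
BEFORE INSERT ON order_header
FOR EACH ROW
BEGIN
  SELECT order_header_seq.NEXTVAL INTO :NEW.order_id FROM dual;
END;
/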

c. Incremental Extract Using Triggers

Triggers are the most convenient way of doing incremental data extraction; it is the belt-and-braces approach. There are 3 types of triggers: for insert, for update and for delete. Most RDBMSs also differentiate between before and after triggers, i.e. whether the trigger is executed before or after the insert/update/delete. By creating triggers on the source system tables, we can program the source system to provide us with the records every time a record is created, updated or deleted. We need to be careful when implementing triggers because they slow down the source system.

One way of implementing triggers in the source system is to keep the primary keys of the changed rows in a specially created table (let's call this the delta table). In the delta table we only keep the primary keys, not the whole record, because different tables have different columns, so it is not possible to create a single delta table that can store the rows of every table. In the delta table, the primary key values can be stored in one column, using separators, and the names of the primary key columns in another column, also using separators. The delta table also contains 2 other important columns: a creation timestamp column (there is no need for an updated timestamp column) and a table name column. These enable us to get to the right table and to extract it incrementally. The delta table looks like this:

ID  table_name    PK_columns           PK_values   time_stamp
1   inventory     store_id|product_id  23|5643_G   17/03/2006 10:10:46
2   order_detail  order_id|line_no     454AS1|2    17/03/2006 10:11:01
3   customer      customer_id          343         17/03/2006 10:11:58
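
As an illustration, a minimal sketch of the delta table and an insert/update trigger on one source table (all names hypothetical, SQL Server syntax):

-- Delta table capturing which rows changed, and when
CREATE TABLE delta_table (
    ID          int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    table_name  varchar(128) NOT NULL,
    PK_columns  varchar(256) NOT NULL,
    PK_values   varchar(256) NOT NULL,
    time_stamp  datetime     NOT NULL DEFAULT GETDATE()
);
CREATE NONCLUSTERED INDEX ix_delta_time_stamp ON delta_table (time_stamp);
GO

-- Trigger on a hypothetical source table, recording the keys of inserted/updated rows
CREATE TRIGGER trg_customer_iu ON customer
AFTER INSERT, UPDATE
AS
BEGIN
    INSERT INTO delta_table (table_name, PK_columns, PK_values)
    SELECT 'customer', 'customer_id', CAST(i.customer_id AS varchar(50))
    FROM inserted i;
END;

A delete trigger follows the same pattern, reading from the deleted pseudo-table instead of inserted.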

The delta table needs to be cleared out once it has been processed. When processing the delta table, it is important to impose a CET (Current Extract Time), i.e. the system time just before the data extraction begins. So the process is: get all records from the delta table where timestamp <= CET, then delete all records from the delta table where timestamp <= CET. Alternatively (and preferably, for trace/history reasons and also for performance - see the note on having a good primary key below), we can choose not to clear the delta table after extracting from it, and simply store the CET as the LSET (Last Successful Extract Time); see the section on timestamps above for details. If we don't clear the delta table, we need to implement 1) a good integer identity(1,1) primary key, such as the ID column in the example above, with a clustered index on it for performance, and 2) a good purging mechanism (say, keeping only the last 3 months of data), otherwise the system will get slower and slower over time - in this case a nonclustered index on the time_stamp column is useful. One note on the time_stamp column if your source system is SQL Server: I would prefer to declare this column as the datetime data type rather than the timestamp data type, for compatibility both with future versions and with other RDBMSs (portability of the implementation).
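
A sketch of the processing step under the first option (clear after processing), with the CET captured before the extraction begins (hypothetical names again):

DECLARE @CET datetime;
SET @CET = GETDATE();   -- capture before the extraction begins

-- Join the delta table back to the source table to pick up the changed rows
INSERT INTO stage_customer (customer_id, name, address)
SELECT c.customer_id, c.name, c.address
FROM delta_table d
JOIN customer c ON c.customer_id = CAST(d.PK_values AS int)
WHERE d.table_name = 'customer' AND d.time_stamp <= @CET;

-- Clear only what has been processed
DELETE FROM delta_table
WHERE table_name = 'customer' AND time_stamp <= @CET;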

Depending on the structure and complexity of the source system, sometimes it is better to have a separate delta table for each source table, rather than a single delta table. It is also common to extract the delta tables to files at certain times of the day, ready for the data warehouse to pick up.

The main difference between extracting data using triggers and the 2 methods described earlier is that with triggers we "touch" the source system. The word touch may be a little too soft; "change" is probably more appropriate. Yes, we change the source system. This is not a luxury that everyone has. In many cases the source system is an off-the-shelf ERP system, e.g. SAP, Oracle 11i, Sage, MFG, JD Edwards, PeopleSoft, Axapta, Siebel CRM, Salesforce, etc. In these cases, most probably, we do not have the luxury of modifying the source system as we like. If we ask the supplier to modify it for us, we risk ending up with an ERP that is no longer a standard version, and therefore a) not upgradable to the next version, and b) not easily supported by other software when interfacing. Luckily, the big players in the ERP market such as SAP and Oracle have prepared for this and have their own data warehousing tools. Tools such as Oracle Change Data Capture can extract data incrementally, implemented either using triggers (synchronous) or the redo log files (asynchronous).

This is the end of part 1. In part 2 we will discuss:
2d. Incremental Extract Using Transaction Date
2e. Incremental Extract Using Combination of the Above
2f. Incremental Extract Using Off-The-Shelf Tools
3. Fixed Period Extract

Vincent Rainardi
23/03/06

You can read Part 2 here.
