In part 1 we discussed data extraction which gets the whole table every time,
and the first 3 methods of incremental data extraction, e.g. using using timestamp
columns, using identity column, and using triggers. We will now discuss the
remaining three, e.g. using transaction date, combination of the above and using
off-the-shelf tools, and finally the fixed period method.
2d. Incremental Extract Using Transaction Date
If there is no good create and update datestamp or timestamp columns in the
source table, and it is not possible to put triggers in the source system, and
there is no identity column either, then our next best option for incremental
data extraction is to use transaction date column. By this I mean columns which
contains the date the transaction happen.
Generally speaking, there are 2 major categories of tables in an ERP system
or business system, or as the data warehouse community call it, OLTP system
(in contrast with OLAP that we have in data warehouse system). The first category
is the 'master' tables. Tables such as product, customer, supplier, account,
status code, product group, store, and disposal site. In the data warehouse
star schema, these tables become the dimension tables. The second category is
"transaction" tables. Tables such as sales order header, sales order
detail, purchase order header, purchase order detail, inventory transaction,
account transaction, price review and disposal. In the data warehouse star schema,
these transaction tables become the fact tables.
If the table we extract is a transaction table, it normally has one or more
transaction date columns. In an order header table, it is common to find the
following transaction date columns
- order date - the date the order was placed by the customer
- system order date - the date the order header record was created
- estimated delivery date - the date the order supposed to be out of our door
- actual delivery date - the date the order really out of our door
The difference between order date and system order date is that: if the order
came by fax yesterday and it is typed into the system today, order date will
be typed as yesterday whilst the system order date will be today's date. Operator
has no control on this column, it is maintained internally by the system. Please
note that in better ERP systems, to enable partial delivery, the estimated and
actual delivery dates are put on the order detail table rather than order header
In the above example, system order date is an excellent transaction date. It
is guaranteed to be greater than or equal to today. Orders which happened later
have later system order date as well. That makes this column, in a way, sequential,
just like an identity column. The order date column, on the other hand, can
be deceiving, because user can type a past date on it. This cause a problem,
as we can not be sure which orders come after which orders.
The basic syntax of the ETL select statement is generally the same as the ones
for datestamp / timestamp column, e.g. SELECT col1, col2
FROM table1 WHERE trans_date_col >= LSED. If we only deals with date
(not with time as well), there is little point constraining the maximum.
So that's for inserts. How about deletes and updates? Well, simply speaking,
we can't get deletes and updates using transaction dates. We need to use other
methods to get them. This restriction made us often have to combine this method
with fixed period method (see point 3 below), for example. we download last
6 weeks every time, based on transaction date column.
2e. Combination Of The Above
Sometimes we need to combine the above 4 methods, to suit the condition of
the source table, to get all inserts, updates and deletes. Inserts are generally
the easiest to get. We can use all for methods to get inserts: timestamp column,
incremental column, triggers and transaction date column. For deletes we can
can use timestamp, incremental column or triggers. To get updates we can use
CRC and triggers. Yes triggers can be use to get all 3 kinds of changes, incrementally.
2f. Using Off-The-Shelf Tools
Now a days to get or extract the data out of business system is getting easier.
Some BI companies have written ETL package for most popular business systems
such as SAP, Oracle applications, Siebel, PeopleSoft and J.D. Edwards. In the
BI industry, these ETL packages are normally called Data Integration (DI) software.
We install the DI software, set some parameters, and off you go. It extracts
the data from our business systems into our data warehouse. Isn't that wonderful?
Well in reality it's not as straight forward as that. We still have to do data
analysis (profiling) and data quality process. We still have to customise it
a little to suit our business system and to suit our data warehouse. But we
don't start from stratch, most definitely not. Below we try to illustrate the
situation by describing a few DI software that are around in the market, along
with the links.
- IBM Web Sphere Information
Integration (formerly Ascentia)
DataStage extracts, transforms and loads data. ProfileStage analyses source
data, creates data model, generate source to target mappings, and generate
ETL routine. QualityStage detects data duplication, inconsistencies and missing
data. Package Application Connectivity Kits (PACKS) helps connect to SAP R/3
and BW, Oracle, Siebel, PeopleSoft and JD Edwards systems and extract data
from them. Supports most RDBMS.
- Informatica PowerExchange
PowerExchange Changed Data Capture operates within the source environment
to detect data updates and to deliver them in real time or batch. PowerCenter
transform and load data; it also offers cleansing and profiling. Most RDBMS
Define data mapping and transformation. Cleanse (integration with FirstLogic),
preview and profile data. Has prebuilt transformations. Creates portable ETL
jobs. Integrates with SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards.
Support most RDBMS.
- Pervasive Data Integrator
(formerly Data Junction)
Extract, transform and load data. Creates ODS. Real time or batch execution.
Integrates with SAP, Goldmine, Solomon, Great Plains, Navision, SAS, Market
Force, Salesforce, and many accounting packages: AccountMate, ACT, Champion,
Platinum , Macola, Peachtree, RealWorld, CYMA, etc. Support a comprehensive
list of RDBMS.
Extract, transform and load data. Integrates with SAP.
- SAS DataFlux
Extract, transform and load data. Has data profiling and data quality capabilities.
Integrates with SAP R/3, Siebel, Oracle Warehouse Builder, Informatica. Support
8 BI Data Integration (formerly DecisionStream)
Extract, transform and load data. Integrates with SAP R/3 and Siebel CRM.
Support most RDBMS.
Uses database log files to capture changed data to deliver them real time.
- Microsoft SQL Server
2005 Integration Services (formerly DTS)
Extract, transform and load data. Support most RDBMS.
- CA Advantage
Extract, transform and load data. Integrates with SAP R/3. Support most RDBMS.
Platinum InfoPump, InfoHub, and InfoRefiner were acquired by CA.
Extract, transform and load data. Integrates with SAP (produces ABAP code).
Carleton Pasport was acquired by Oracle.
Extract, transform and load data. Changed data capture delta agent. Data profiling
and metadata import/export. Support most RDBMS.
Replication Agents for Informix, Oracle, and Adaptive Server Anywhere. Support
- ETI Solution
Extract, transform and load data. Integrates with Siebel, WebSphere MQ and
ClickStream. Support most RDBMS.
Ascentia (now IBM) and Informatica are probably the most popular data extraction
tools in the BI market now a days. BusinessObject, Genio and DataFlux are decent,
but I prefer to use SSIS as personal choice. Informatica, DataMirror and Oracle
CDC (changed data capture) could be a life saver in some situation where we
must extract incrementally. Alkis
Simitsis compiled a comprehensive list of ETL
tools, which might worth having a look if you want to explore some of the
29th March 2006