Data Extraction Methods Part 2

,

In part 1 we discussed data extraction which gets the whole table every time,

and the first 3 methods of incremental data extraction, e.g. using using timestamp

columns, using identity column, and using triggers. We will now discuss the

remaining three, e.g. using transaction date, combination of the above and using

off-the-shelf tools, and finally the fixed period method.

2d. Incremental Extract Using Transaction Date

If there is no good create and update datestamp or timestamp columns in the

source table, and it is not possible to put triggers in the source system, and

there is no identity column either, then our next best option for incremental

data extraction is to use transaction date column. By this I mean columns which

contains the date the transaction happen.

Generally speaking, there are 2 major categories of tables in an ERP system

or business system, or as the data warehouse community call it, OLTP system

(in contrast with OLAP that we have in data warehouse system). The first category

is the 'master' tables. Tables such as product, customer, supplier, account,

status code, product group, store, and disposal site. In the data warehouse

star schema, these tables become the dimension tables. The second category is

"transaction" tables. Tables such as sales order header, sales order

detail, purchase order header, purchase order detail, inventory transaction,

account transaction, price review and disposal. In the data warehouse star schema,

these transaction tables become the fact tables.

If the table we extract is a transaction table, it normally has one or more

transaction date columns. In an order header table, it is common to find the

following transaction date columns

  • order date - the date the order was placed by the customer
  • system order date - the date the order header record was created
  • estimated delivery date - the date the order supposed to be out of our door
  • actual delivery date - the date the order really out of our door

The difference between order date and system order date is that: if the order

came by fax yesterday and it is typed into the system today, order date will

be typed as yesterday whilst the system order date will be today's date. Operator

has no control on this column, it is maintained internally by the system. Please

note that in better ERP systems, to enable partial delivery, the estimated and

actual delivery dates are put on the order detail table rather than order header

table.

In the above example, system order date is an excellent transaction date. It

is guaranteed to be greater than or equal to today. Orders which happened later

have later system order date as well. That makes this column, in a way, sequential,

just like an identity column. The order date column, on the other hand, can

be deceiving, because user can type a past date on it. This cause a problem,

as we can not be sure which orders come after which orders.

The basic syntax of the ETL select statement is generally the same as the ones

for datestamp / timestamp column, e.g. SELECT col1, col2

FROM table1 WHERE trans_date_col >= LSED. If we only deals with date

(not with time as well), there is little point constraining the maximum.

So that's for inserts. How about deletes and updates? Well, simply speaking,

we can't get deletes and updates using transaction dates. We need to use other

methods to get them. This restriction made us often have to combine this method

with fixed period method (see point 3 below), for example. we download last

6 weeks every time, based on transaction date column.

2e. Combination Of The Above

Sometimes we need to combine the above 4 methods, to suit the condition of

the source table, to get all inserts, updates and deletes. Inserts are generally

the easiest to get. We can use all for methods to get inserts: timestamp column,

incremental column, triggers and transaction date column. For deletes we can

can use timestamp, incremental column or triggers. To get updates we can use

CRC and triggers. Yes triggers can be use to get all 3 kinds of changes, incrementally.

2f. Using Off-The-Shelf Tools

Now a days to get or extract the data out of business system is getting easier.

Some BI companies have written ETL package for most popular business systems

such as SAP, Oracle applications, Siebel, PeopleSoft and J.D. Edwards. In the

BI industry, these ETL packages are normally called Data Integration (DI) software.

We install the DI software, set some parameters, and off you go. It extracts

the data from our business systems into our data warehouse. Isn't that wonderful?

Well in reality it's not as straight forward as that. We still have to do data

analysis (profiling) and data quality process. We still have to customise it

a little to suit our business system and to suit our data warehouse. But we

don't start from stratch, most definitely not. Below we try to illustrate the

situation by describing a few DI software that are around in the market, along

with the links.

  1. IBM Web Sphere Information

    Integration (formerly Ascentia)

    DataStage extracts, transforms and loads data. ProfileStage analyses source

    data, creates data model, generate source to target mappings, and generate

    ETL routine. QualityStage detects data duplication, inconsistencies and missing

    data. Package Application Connectivity Kits (PACKS) helps connect to SAP R/3

    and BW, Oracle, Siebel, PeopleSoft and JD Edwards systems and extract data

    from them. Supports most RDBMS.

  2. Informatica PowerExchange

    and PowerCenter

    PowerExchange Changed Data Capture operates within the source environment

    to detect data updates and to deliver them in real time or batch. PowerCenter

    transform and load data; it also offers cleansing and profiling. Most RDBMS

    are supported.

  3. BusinessObject

    DataIntegrator

    Define data mapping and transformation. Cleanse (integration with FirstLogic),

    preview and profile data. Has prebuilt transformations. Creates portable ETL

    jobs. Integrates with SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards.

    Support most RDBMS.

  4. Pervasive Data Integrator

    (formerly Data Junction)

    Extract, transform and load data. Creates ODS. Real time or batch execution.

    Integrates with SAP, Goldmine, Solomon, Great Plains, Navision, SAS, Market

    Force, Salesforce, and many accounting packages: AccountMate, ACT, Champion,

    Platinum , Macola, Peachtree, RealWorld, CYMA, etc. Support a comprehensive

    list of RDBMS.

  5. Hummingbird

    Genio

    Extract, transform and load data. Integrates with SAP.

  6. SAS DataFlux

    Extract, transform and load data. Has data profiling and data quality capabilities.

    Integrates with SAP R/3, Siebel, Oracle Warehouse Builder, Informatica. Support

    most RDBMS.

  7. Cognos

    8 BI Data Integration (formerly DecisionStream)

    Extract, transform and load data. Integrates with SAP R/3 and Siebel CRM.

    Support most RDBMS.

  8. DataMirror

    Transformation Server

    Uses database log files to capture changed data to deliver them real time.

  9. Microsoft SQL Server

    2005 Integration Services (formerly DTS)

    Extract, transform and load data. Support most RDBMS.

  10. CA Advantage

    Data Transformer

    Extract, transform and load data. Integrates with SAP R/3. Support most RDBMS.

    Platinum InfoPump, InfoHub, and InfoRefiner were acquired by CA.

  11. Oracle

    Warehouse Builder

    Extract, transform and load data. Integrates with SAP (produces ABAP code).

    Carleton Pasport was acquired by Oracle.

  12. Embarcadero

    DT/Studio

    Extract, transform and load data. Changed data capture delta agent. Data profiling

    and metadata import/export. Support most RDBMS.

  13. Sybase

    Enterprise Connect

    Replication Agents for Informix, Oracle, and Adaptive Server Anywhere. Support

    most RDBMS.

  14. ETI Solution

    Extract, transform and load data. Integrates with Siebel, WebSphere MQ and

    ClickStream. Support most RDBMS.

Ascentia (now IBM) and Informatica are probably the most popular data extraction

tools in the BI market now a days. BusinessObject, Genio and DataFlux are decent,

but I prefer to use SSIS as personal choice. Informatica, DataMirror and Oracle

CDC (changed data capture) could be a life saver in some situation where we

must extract incrementally. Alkis

Simitsis compiled a comprehensive list of ETL

tools, which might worth having a look if you want to explore some of the

nice players.

Vincent Rainardi

29th March 2006

 

 

Rate

2.5 (2)

Share

Share

Rate

2.5 (2)