SQL Clone
SQLServerCentral is supported by Redgate
Log in  ::  Register  ::  Not logged in

Data Extraction Methods Part 2

By Vincent Rainardi,

In part 1 we discussed data extraction which gets the whole table every time, and the first 3 methods of incremental data extraction, e.g. using using timestamp columns, using identity column, and using triggers. We will now discuss the remaining three, e.g. using transaction date, combination of the above and using off-the-shelf tools, and finally the fixed period method.

2d. Incremental Extract Using Transaction Date

If there is no good create and update datestamp or timestamp columns in the source table, and it is not possible to put triggers in the source system, and there is no identity column either, then our next best option for incremental data extraction is to use transaction date column. By this I mean columns which contains the date the transaction happen.

Generally speaking, there are 2 major categories of tables in an ERP system or business system, or as the data warehouse community call it, OLTP system (in contrast with OLAP that we have in data warehouse system). The first category is the 'master' tables. Tables such as product, customer, supplier, account, status code, product group, store, and disposal site. In the data warehouse star schema, these tables become the dimension tables. The second category is "transaction" tables. Tables such as sales order header, sales order detail, purchase order header, purchase order detail, inventory transaction, account transaction, price review and disposal. In the data warehouse star schema, these transaction tables become the fact tables.

If the table we extract is a transaction table, it normally has one or more transaction date columns. In an order header table, it is common to find the following transaction date columns

  • order date - the date the order was placed by the customer
  • system order date - the date the order header record was created
  • estimated delivery date - the date the order supposed to be out of our door
  • actual delivery date - the date the order really out of our door

The difference between order date and system order date is that: if the order came by fax yesterday and it is typed into the system today, order date will be typed as yesterday whilst the system order date will be today's date. Operator has no control on this column, it is maintained internally by the system. Please note that in better ERP systems, to enable partial delivery, the estimated and actual delivery dates are put on the order detail table rather than order header table.

In the above example, system order date is an excellent transaction date. It is guaranteed to be greater than or equal to today. Orders which happened later have later system order date as well. That makes this column, in a way, sequential, just like an identity column. The order date column, on the other hand, can be deceiving, because user can type a past date on it. This cause a problem, as we can not be sure which orders come after which orders.

The basic syntax of the ETL select statement is generally the same as the ones for datestamp / timestamp column, e.g. SELECT col1, col2 FROM table1 WHERE trans_date_col >= LSED. If we only deals with date (not with time as well), there is little point constraining the maximum.

So that's for inserts. How about deletes and updates? Well, simply speaking, we can't get deletes and updates using transaction dates. We need to use other methods to get them. This restriction made us often have to combine this method with fixed period method (see point 3 below), for example. we download last 6 weeks every time, based on transaction date column.

2e. Combination Of The Above

Sometimes we need to combine the above 4 methods, to suit the condition of the source table, to get all inserts, updates and deletes. Inserts are generally the easiest to get. We can use all for methods to get inserts: timestamp column, incremental column, triggers and transaction date column. For deletes we can can use timestamp, incremental column or triggers. To get updates we can use CRC and triggers. Yes triggers can be use to get all 3 kinds of changes, incrementally.

2f. Using Off-The-Shelf Tools

Now a days to get or extract the data out of business system is getting easier. Some BI companies have written ETL package for most popular business systems such as SAP, Oracle applications, Siebel, PeopleSoft and J.D. Edwards. In the BI industry, these ETL packages are normally called Data Integration (DI) software. We install the DI software, set some parameters, and off you go. It extracts the data from our business systems into our data warehouse. Isn't that wonderful?

Well in reality it's not as straight forward as that. We still have to do data analysis (profiling) and data quality process. We still have to customise it a little to suit our business system and to suit our data warehouse. But we don't start from stratch, most definitely not. Below we try to illustrate the situation by describing a few DI software that are around in the market, along with the links.

  1. IBM Web Sphere Information Integration (formerly Ascentia)
    DataStage extracts, transforms and loads data. ProfileStage analyses source data, creates data model, generate source to target mappings, and generate ETL routine. QualityStage detects data duplication, inconsistencies and missing data. Package Application Connectivity Kits (PACKS) helps connect to SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards systems and extract data from them. Supports most RDBMS.
  2. Informatica PowerExchange and PowerCenter
    PowerExchange Changed Data Capture operates within the source environment to detect data updates and to deliver them in real time or batch. PowerCenter transform and load data; it also offers cleansing and profiling. Most RDBMS are supported.
  3. BusinessObject DataIntegrator
    Define data mapping and transformation. Cleanse (integration with FirstLogic), preview and profile data. Has prebuilt transformations. Creates portable ETL jobs. Integrates with SAP R/3 and BW, Oracle, Siebel, PeopleSoft and JD Edwards. Support most RDBMS.
  4. Pervasive Data Integrator (formerly Data Junction)
    Extract, transform and load data. Creates ODS. Real time or batch execution. Integrates with SAP, Goldmine, Solomon, Great Plains, Navision, SAS, Market Force, Salesforce, and many accounting packages: AccountMate, ACT, Champion, Platinum , Macola, Peachtree, RealWorld, CYMA, etc. Support a comprehensive list of RDBMS.
  5. Hummingbird Genio
    Extract, transform and load data. Integrates with SAP.
  6. SAS DataFlux
    Extract, transform and load data. Has data profiling and data quality capabilities. Integrates with SAP R/3, Siebel, Oracle Warehouse Builder, Informatica. Support most RDBMS.
  7. Cognos 8 BI Data Integration (formerly DecisionStream)
    Extract, transform and load data. Integrates with SAP R/3 and Siebel CRM. Support most RDBMS.
  8. DataMirror Transformation Server
    Uses database log files to capture changed data to deliver them real time.
  9. Microsoft SQL Server 2005 Integration Services (formerly DTS)
    Extract, transform and load data. Support most RDBMS.
  10. CA Advantage Data Transformer
    Extract, transform and load data. Integrates with SAP R/3. Support most RDBMS. Platinum InfoPump, InfoHub, and InfoRefiner were acquired by CA.
  11. Oracle Warehouse Builder
    Extract, transform and load data. Integrates with SAP (produces ABAP code). Carleton Pasport was acquired by Oracle.
  12. Embarcadero DT/Studio
    Extract, transform and load data. Changed data capture delta agent. Data profiling and metadata import/export. Support most RDBMS.
  13. Sybase Enterprise Connect
    Replication Agents for Informix, Oracle, and Adaptive Server Anywhere. Support most RDBMS.
  14. ETI Solution
    Extract, transform and load data. Integrates with Siebel, WebSphere MQ and ClickStream. Support most RDBMS.

Ascentia (now IBM) and Informatica are probably the most popular data extraction tools in the BI market now a days. BusinessObject, Genio and DataFlux are decent, but I prefer to use SSIS as personal choice. Informatica, DataMirror and Oracle CDC (changed data capture) could be a life saver in some situation where we must extract incrementally. Alkis Simitsis compiled a comprehensive list of ETL tools, which might worth having a look if you want to explore some of the nice players.

Vincent Rainardi
29th March 2006



Total article views: 14159 | Views in the last 30 days: 3
Related Articles

Alter column order

Alter column order


Data Extraction Methods Part 1

The third part of a great data warehousing series from Vincent Rainardi looks different methods of e...


Columns order in the index

Columns order in the index


Generate extract with varrying column size

Generate extract with varrying column size


Extract Objects from SSRS Reports

A discrete list of database objects is extracted from SSRS using SQL, SSIS, and RegExtractor in orde...