I was recently asked to provide some performance tuning recommendations for an ETL process that extracts data from Oracle using Informatica (Unix) and loads it into some SQL Server Staging tables.
Initially the ETL process was taking 20+ hrs to complete. After some preliminary analysis it was determined that the first major bottle neck was that Informatica was doing row-by-row inserts into SQL Server stage tables !
Informatica could read data at about 3,500 rows per second from a single Oracle table but was only inserting into SQL Server at about 300 rows per second. The night ETL load is 4+ terabytes so then it became apparent why it was taking 20+ hours to complete!
Note: This installation of Informatica was on a Unix server and so it did not have the SQL drivers to do bulk inserts. If you install informatica on a windows server then you can take advantage of the SQL drivers and do bulk inserts.
Our recommendations came in the form of three options:
1. Informatica Oracle flatfile dump with SSIS Load to SQL Server Staging Tables. see image below
While this technique greatly improved ETL performance it would require completely redesigning the existing Informatica mappings and buying another server for just holding the flatfiles.
Plus it seems a tad outdated to still being doing flatfile dumps in 2011 between 2 perfectly good Database Engines simply because the ETL tool can’t do bulk inserts!
2. Upgrade DataDirect to version 6.1 which has a driver which can do a bulk insert into SQL Server. (Actually, DataDirect 6 ships with the driver disabled!!!). The downside is that costs some serious $$$ to upgrade but there is no additional development work needed once the upgrade in complete.
3. Use the SSIS Oracle Connector to pull directly form Oracle. Here are the performance results. There are many factors which can effect performance. So your results will vary.
|Oracle||SQL Server DB Local Laptop||1,000,000||6||2,778|
|Oracle||SQL Server DB||1,000,000||6.1||2,732|
Of course I prefer using the SSIS option because it performs roughly equal to Informatica but without all the additional cost. With more optimizations SSIS can even out perform Informatica. See links at the bottom of the post for more information.
Steps to Install SSIS Oracle Connector on dev box which uses BIDS
1. Download and install 32bit and 64bit SSIS Oracle Connector. BIDS debug runs in 32bit ! so you need both!
3. Setup the connection to the Oracle DB in SQL Developer
4. Setup the Oracle Connection in BIDS.
**Make sure that TNS service name is in the following format <IP>:<port>/<service name>. Also think about how to manage the credentials used to access oracle.
Also for ’Data access mode’ in the ‘Oracle Source’ dialogue box use “SQL command”. It performs a lot better!
**To install the SSIS Oracle connector follow the same step above but you won’t need to install the 32bit stuff
Good Performance Tuning Articles that are related to SSIS and ETL:
This document described techniques for bulk loading large data sets into SQL Server. It covers both the available techniques as well as methodologies to performance tune and optimize the bulk loading process.
In February 2008, Microsoft announced a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don’t have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.
Providing for high volume extract, transformation, and load (ETL) throughput requires consideration of both best practices for standard relational databases, and of high performance techniques for ETL.
This article presents a number of suggestions relating to high volume ETL throughput, specifically using SQL Server Integration Services. Aspects of both ETL performance and management are described
This paper defines a reference configuration model (known as Fast Track Data Warehouse) using an I/O balanced approach to implementing a symmetric multiprocessor (SMP)-based SQL Server data warehouse with proven performance and scalability expectations for data warehouse workloads. The goal of a Fast Track Data Warehouse reference configuration is to achieve a cost-effective balance between SQL Server data processing capability and realized component hardware throughput.
There is considerable evidence that successful data warehousing projects often produce a very high return on investment. Over the years a great deal of information has been collected about the factors that lead to a successful implementation versus an unsuccessful one. These are encapsulated here into a set of best practices, which are presented with particular reference to the features in SQL Server 2008 R2. The application of best practices to a data warehouse project is one of the best investments you can make toward the establishment of a successful Business Intelligence infrastructure.