This post is part of a series on this blog that will help me, and hopefully you, pass exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012. So far, we’ve covered:
- Tables and Schemas in the data warehouse
- Dimensions and Slowly Changing Dimensions
- Fact tables and measures
- Intro to columnstore indexes
- Columnstore indexes and partitioning
- Introduction to SQL Server Data Tools (SSDT)
- Connection Managers
- Control Flow Tasks and Containers
- Data Flow Source Adapters
- Data Flow Destination Adapters
If all we had to do was pull data from our source and shove it into a destination, developing SSIS packages would be a lot simpler. Unfortunately, in the real world, we need to massage our data before it’s ready for prime time. That’s the Transform part of the ETL cycle. And as you might imagine, SSDT comes with a bunch of different operators to help meet your transformation needs.
Transformations take input rows, modify or manipulate the incoming data, and then send the data to the next transformation or destination. Some transformations work on a single row at a time. Others operate on sets of rows. How a particular transformation operates has a big impact on performance. In SQL development, we’re often told that set-based operations are better than row-by-row or cursor-based operations. However, the same may not hold true for SSIS.
Row level transformations are applied, you guessed it, at the row level. A row is passed in, the transformation is performed, the row is sent on to the next operator. This type of transformation is called a non-blocking transformation, because rows are processed and passed on without waiting for additional data. This is what I was referring to earlier when I said that row-by-row operations may not be as bad as you think in SSIS.
- Audit – this transformation can add additional audit columns to each row, with values like the package name, the execution start time, machine name, user name, etc. The values come from system package variables within SSIS.
- Cache Transform – store a temporary copy of data in cache for use with lookup transformations
- Character Map – perform text operations like uppercase, lowercase, or converting traditional Chinese characters to simplified Chinese characters to incoming columns. The column’s data can be converted in place, or it can be output as a new column.
- Copy Column – create a duplicate copy of a column with a new name.
- Data Conversion – convert a column to a new data type, outputting it as a new column.
- Derived Column – Use an SSIS expression to replace an existing column or create a new column. For example, you could use the ISNULL expression to replace NULL values with some other value.
- Export Column – extracts BLOB data to an external file.
- Import Column – import a binary file and add it to the row’s output. For example, if I have a directory of employee photos and my input data contains a column with the path and filename for each employee’s photo, I could use the full path column as the input column to retrieve the image.
- Row Count – keep a running tally of rows that flow through the transformation.
Multi-input transformations take in data from multiple sources and output a single recordset. Some of these operations are non-blocking, just like the row-level transformations. Some, however, are partially-blocking, meaning SSIS will wait for a sufficient number of rows to be passed in before it performs the transformation and sends the resulting rows along.
- Lookup – perform a lookup from an external dataset based on incoming data. This is a non-blocking transformation.
- Merge – Combine rows from two sorted inputs. The inputs must be sorted on the same key. For example, if I’m loading orders from 2 separate sales systems, I can combine them into a single destination as long as they’re both sorted by, let’s say, order date. This is a partially-blocking transformation.
- Merge Join – Join the rows from two sorted inputs on a defined join column or set of columns. This transformation operates just like a join in T-SQL, you can configure it to be an inner join, left outer join, or full outer join. The IsSorted property for each source must be set to true and you must define the SortKeyPosition for each column the data is sorted on. These properties can be configured through the Advanced Editor. This is another partially-blocking transformation.
- Union All – Just like a UNION ALL in TSQL, this transformation combines multiple similar input sources into one output with no sorting and no de-duplication. The first input you add to the Union All transformation will define the output columns. Columns from additional input sources must be mapped to the appropriate output columns. This is a partially-blocking transformation.
Multi-output transformations take a single input an split it into multiple outputs. All of these are non-blocking operations.
- CDC Splitter – takes input from a CDC source and splits into different outputs based on the operation that was performed on the source row (e.g. insert, update, delete)
- Conditional Split – splits input based on a boolean expression
- Multicast – splits input data into multiple identical output streams
Multi-row transformations either perform work on sets of input rows or they take a single input row and output multiple rows.
- Aggregate – performs aggregations like sum, max, min, etc. based on a defined grouping criteria. This is a blocking transformation.
- Percent Sampling – only passes a defined percentage of the input rows to the output stream. This is a non-blocking transformation.
- Pivot – just like a pivot operation in TSQL, this will take multiple input rows and, well, pivot them to output multiple columns. The input data for this transformation must be sorted on the set key. This is a partially blocking transformation.
- Row Sampling – only passes a defined number of rows to the output stream. This is a blocking transformation.
- Sort – Order the input based on defined sort column(s). This is a blocking transformation.
- Unpivot – Unpivots the input data to produce multiple rows from input columns. This is a partially blocking transformation.
Advanced Data-Preparation Transformations
Finally we have some advanced transformations that are probably less common, but still powerful when used appropriately.
- DQS Cleansing – Uses Data Quality Services (DQS) to clean up your data. Partially blocking.
- OLE DB Command – Perform database operations based on parameters mapped from input rows. Non-blocking.
- Slowly Changing Dimension – Produces the transformations you need to handle type1 and type 2 SCDs. Non-blocking.
- Data Mining Query – applies rows against a data mining model. Partially blocking.
- Fuzzy Grouping – Performs de-duplication of input data based on similar string values. Blocking.
- Fuzzy Lookup – Joins input data to a reference table based on similar string values. Blocking.
- Script Component – Use .NET scripts to manipulate the input data. Non-blocking.
- Term Extraction – pulls terms from input column(s) and passes them to the output stream. Partially-blocking.
- Term Lookup – pulls terms from input column(s) and uses them to lookup the term in a reference table. The output is the term and a count of how often it appears. Partially-blocking.
Whew. That’s a lot of transformations. I’m not even going to pretend to remember it all.