SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

Tim Mitchell

Tales of my travels through SQL Server
Add to Technorati Favorites Add to Google
Author Bio
Tim Mitchell is a Microsoft SQL Server consultant, developer, speaker, and trainer. He has been working with SQL Server for over 6 years, working primarily in database development, business intelligence, ETL/SSIS, and reporting. You can find his complete profile at TimMitchell.net.
Browse by Tag : ETL (RSS)

Eliminating Empty Output Files in SSIS

By Tim Mitchell in Tim Mitchell | 09-30-2009 8:28 AM | Categories: Filed under: , ,
Rating: (not yet rated) Rate this |  Discuss | 2,428 Reads | 1338 Reads in Last 30 Days |9 comment(s)

So you’ve got some packages that regularly extract data to one or more text files, but you know that from time to time some of the queries will not return any data.  However, you find in SSIS that, in a flat file export package, the output file is created regardless of whether any rows are written to the file, and in cases where there are no rows returned from the source, you’ll end up with empty data files in your output directory.  Although not a critical failure, having empty output files can be a nuisance, and unfortunately, there aren’t any native settings in the Flat File Destination or the Flat File Connection Manager objects that will clean up empty files.  Fortunately, SSIS does provide a combination of other tools that you can use to emulate this behavior.

In our example, I’m going to create a package to extract data from a table using a query for which I know that no rows would be returned.  When I connect this data source to a flat file destination and execute the package, I’ll see an empty data file in my export directory.  Next, to demonstrate the intended behavior, I’ll add a Row Count transformation to store the number of affected rows in a variable, and create a File System Task object to delete the output file.  Finally, I’ll use an expression constraint to only delete the file if the variable attached to the row count is zero.

First, let’s create a test table to query:

USE testdb
GO

CREATE TABLE invoices (
  invoiceid     INT    IDENTITY ( 1 , 1 ),
  vendorid      INT,
  invoicedate   DATETIME,
  invoiceamount DECIMAL(10,2))

INSERT invoices
VALUES(12, '8/3/2009', 4125.50),
      (53, '8/13/2009', 1095.25),
      (46, '8/15/2009', 729.50),
      (33, '8/23/2009', 3421.50)


Now, I’ll create a basic package that will export to text the invoices for the past 30 days, a reasonable business requirement.  Since we don’t currently have any invoices matching that criteria, we’ll end up with an empty output file. The original package is shown below:

df_f1

 

Now for the new-and-improved version, I’m going to drop a Row Count transformation into the data flow to save the number of affected rows to a variable:

 

df_f2

 

Finally, I’ll create a File System Task and configure it to delete the output file.  To insure that a valid data file is not deleted, I’ll create a precedence constraint using an expression to only execute the delete if the row count variable is equal to 0:

 

df_f3


When you execute this package, you’ll see that the File System Task object is executed because there are no rows matching our query.  You can test the package by inserting another row into the database that will be returned by the query, and you’ll see that the data file is exported but not deleted.

I’ve attached the before and after SSIS packages if you’d like to take it for a test drive.  Enjoy!


Space Sensitivity in SSIS Lookups

By Tim Mitchell in Tim Mitchell | 09-29-2009 7:15 AM | Categories: Filed under: , ,
Rating: (not yet rated) Rate this |  Discuss | 1,448 Reads | 942 Reads in Last 30 Days |2 comment(s)

It's been well-documented through myriad blogs and forum posts about the case sensitivity of the comparisons in the SSIS lookup transformation (a good review can be found here). In a nutshell, a comparison using the lookup transformation is case sensitive when using the default setting of Full Cache, even if the values in the database are stored in a case insensitive collation. This happens because, in Full Cache mode, the comparison is done by the SSIS engine rather than the database engine, the former of which differentiates between "Value", "VALUE", and "VaLuE" regardless of the collation setting.

But did you know that this transformation is space sensitive as well? Consider the following T-SQL code:

SELECT Cast('char value' AS CHAR(20)) [val1]
INTO   test1

SELECT Cast('char value' AS CHAR(40)) [val2]
INTO   test2

SELECT t1.val1,
       t2.val2
FROM   test1 t1
       INNER JOIN test2 t2
         ON t1.val1 = t2.val2


As you would expect, executing this code results in a successful match (INNER JOIN), even though we're comparing CHAR values of differing lengths (for more information, see this article for more information about spaces and padding in SQL Server).

However, when the same comparison is run through an SSIS lookup transformation in Full Cache mode, the lookup on our sample data will not result in a match. Similar to the case sensitive lookup, you'll find that the SSIS engine would treat the strings 'Hello World' and 'Hello World  ' (note the trailing spaces) as dissimilar values.  Unlike in SQL Server, trailing whitespace is significant in SSIS value comparisons.

As a workaround, you can use the TRIM() function in SSIS and the RTRIM() T-SQL function to insure that your comparisons are ignorant of trailing whitespace.  Alternatively, you could use a cache mode other than Full Cache, but you should be aware of the other implications before making such a change. 

Note that this behavior is limited to fixed-length character fields, but could lead to some unexpected and hard-to-detect problems if you aren’t aware of the behavior.


SQL Saturday 14 - Pensacola

By Tim Mitchell in Tim Mitchell | 06-07-2009 10:07 PM | Categories: Filed under: , , ,
Rating: (not yet rated) Rate this |  Discuss | 2,424 Reads | 295 Reads in Last 30 Days |2 comment(s)

I’ve just arrived home from a quick trip to Pensacola to speak at SQL Saturday 14 in Pensacola, FL.  I’m quite happy with the event; the planning and organization was handled very well, and I believe the event was a big success in the eyes of the attendees.

I flew into Pensacola on American Eagle, arriving at 3:30pm on Friday.  I usually dread flying American, but this time wasn’t so bad; the flight was on time and the flight attendant (just one – it was a small plane) was polite and attentive.  My rental car company, which shall remain nameless, was less impressive, since I had to deal with the cigarette residue from the previous renter, and I had to return all the way to the ticket counter to report some undocumented damage to the car before leaving.  Nevertheless, I did get one good surprise upon arrival: Thinking that I had crossed into Eastern Time, I subconsciously added an hour to the time, only to realize upon arrival that western Florida is actually on Central Time.  A free hour!

I got settled into my hotel room, which was about 5 miles from the event location.  Note to self: Don’t be cheap, spend the extra $50/night next time and stay nearby.  In a nice hotel.  Andy Warren and I had tentatively planned to meet before the speaker reception, so I called him and we met up at McGuire’s Irish Pub.  It’s a neat little place, and I recommend it if you find yourself near downtown Pensacola.  One word of caution: read the signs on the outside of the restrooms carefully.  By the way, the ladies room at McGuire’s is quite nice.

The speaker’s reception at the Fish House was a big hit.  I don’t know if anyone kept count, but I’m guessing that we had 35-40 people that showed up over the course of several hours.  It got a little loud when the peripheral crowds arrived, but a good time nonetheless.

The event itself went well, in my opinion.  It got off to a slow start since the building was still locked at 7:45, but there were a lot of volunteers to help out with the heavy lifting.  Traffic flow was very good and there were few bottlenecks during registration.  According to the last figures I heard, there were 180 or so registered, and 170+ showed up (this included a number of walk-ups).  I heard of one speaker that didn’t make it, but it was due to a family issue and Karla was able to adjust the schedule beforehand to allow for this.  The book giveaway was done throughout the day rather than at the end, which helped to thin out the end-of-the-day swag bottleneck.  Among the items given away was a pass to go on a deep sea fishing trip the day after (Sunday), which I personally would have loved to do but just couldn’t fit it into my Sunday schedule.

I spoke again on SSIS Scripting, and the session was full and seemed to be well-received.  I neglected to plug in my laptop during setup – and believe me, a warning message stating “Your battery is at 7% – plug into a power source immediately” will throw a kink into the best presentation.  I had an issue with the resolution on the projector which kept me from going full-screen on my VM, and this slowed me down during the demos.  Despite these minor glitches, everything else went well, and we had some excellent questions and good discussion during and after the presentation.  If you’re interested, I’ll be publishing the slide deck and code on my website in just a while.

I talked to several attendees during and after the event, and all of them that I spoke with spoke very highly of this SQL Saturday event.  Those who attended were gracious, attentive, and polite, and I got the sense that they felt their time was well spent.

I did get to meet a number of people whom I’ve “known” for months or years but had never had a face-to-face chat, including Jack Corbett, Steve Jones, Brad McGehee, and Ken Simmons.  I also got to visit with several folks that I had met at previous events, including Rodney Landrum, Karla Remail, Nathan Heaivilin, Pam Shaw, Jessica Moss, Joe Healy, and a number of others.  I also shared a flight home on Sunday with Brad McGehee, so he and I got to visit for a while as we waited to fly back to Dallas.  It’s great to share the company of these smart folks, because it 1) makes for interesting conversation and 2) gives me hope that I’ll collect a few IQ points by osmosis.  For me, this event was also an experiment in personal networking, but that will be covered in another post.

My hat is off to Karla and the volunteers from the Pensacola SQL Server User Group.  This event was well planned, organized, publicized, and executed.  I’m hopeful that we can replicate this event, and its success, in the Dallas area sometime next calendar year.


Data Quality, Part 1: The Discovery

By Tim Mitchell in Tim Mitchell | 09-28-2008 10:42 AM | Categories: Filed under: ,
Rating: (not yet rated) Rate this |  Discuss | 2,396 Reads | 215 Reads in Last 30 Days |no comments

I wrote a post a few months back about a healthcare data conversion project that I’ve been working on for the better part of 2 years.  My task on this project is to convert data from an old UNIX-based Universe database to a SQL Server-based application; the database we are extracting from is quite old, both in terms of technology as well as the length of time this application has been in use at this facility.  Universe, and IBM product, is a multivalue database, which in my understanding is like the anti-RDBMS, allowing for multiple values within each row/column intersection.  Getting data stored in this manner extracted and imported to SQL Server has been the quite challenging, but it pales in comparison with the difficulty we have had with the quality of the data itself. 

We discovered early in the conversion spec process that we would need to make some significant changes to the data to allow it to fit the data model expected by our new software vendor.  For example, the current application allows registration staff to enter a patient’s employer information by either direct text entry or through a lookup of existing employers in a lookup table.  This was not a problem in itself, as we could reference the lookup table during the conversion to retrieve the employer information.  However, a “feature” of the application is that a person can delete records from the lookup table, and since the patient’s visit information stored only the employer code, that information became orphaned; Bob’s Lumber Yard might be BLMYD or similar jibberish, and unless a tenured employee knew the code by heart, that data was lost to the bit bucket.

Another oddity we found was that the application allowed for the deletion of general ledger accounts, even if those accounts had transactions attached to them.  Now I’m not an accounting guy (at least I wasn’t before this project – I’ve since gotten a crash course) but even I can tell you that orphaning those transactions can throw off your bottom line, and the effects could be felt for many quarters to come.  Even though the transactions in question were several years old and I was able to address them relatively easily in the conversion, the overall quality of the data briefly came into question simply due to this little oddity.

Not all of the data quality issues we found were related to shortcomings of the system.  The most dangerous threat to data quality – user input – remains the most unpredictable wildcard in the entire process.  Staff turnover, training, and administrative policy changes through the years all contributed to inconsistencies in the data, causing us to create complex and often case-specific algorithms to correct.

Through the course of this conversion, we found a number of other similar issues that caused us grief.  However, this old application has been a workhorse and has served its purpose well despite being very difficult to learn and expensive to maintain.  For years, users have been able to use the system effectively on a day-to-day basis, and there is very little downtime.  Reporting, on the other hand, has been the key shortcoming of this dinosaur.  These data quality issues, along with the lack of an intuitive reporting tool, have made reporting from this system all but impossible.  This organization has had to spend a painful amount of staff time creating reports because of the manual work required to do so.  More often than not, the reporting process goes like this:

 The user generates a text-based report using one of two integrated reporting tools (neither of which is user-friendly or intuitive).
 Said user runs a copy/paste operation from the text report into Excel.
 Extraneous text such as page headers, etc., must be manually removed from the Excel file.
 The user creates extra rows or columns and copies/pastes the data from one cell to another to properly align the data as required for reporting.
 The user then performs all of the summarizing/sorting on the Excel document.

This is done for most of the critical reports each month.  Some reports must be generated weekly, which of course further increases staff workload.  The most discouraging thing, though, was that many users simply accepted this and didn’t know how much easier reporting could – and should – be.

In this first of what will hopefully be many posts about data quality, I’ve painted sort of a bleak picture – I generally start out on a more positive note, so please bear with me for now.  In future installments I’ll break down some of the particulars I ran into and some common (and a few uncommon) methods that can be used to improve the quality of data.


A little fun with SSIS Expression Language

By Tim Mitchell in Tim Mitchell | 05-08-2008 4:21 PM | Categories: Filed under: ,
Rating: |  Discuss | 3,578 Reads | 258 Reads in Last 30 Days |2 comment(s)

The SSIS expression language is a powerful yet enigmatic entity.  Once you get used to its syntax - which is part C#, part T-SQL and part *WTH?!?* - it's actually somewhat fun to use.  However, one thing it is lacking (as far as I can tell - correct me if I've missed something) is the ability to use an IF...ELSE IF...ELSE statement.

So I ran into a situation earlier in which I needed to do an advanced conditional expression.  Here's the situation:  My input has a column named provider.  This column comes to me as an integer but has to be padded with zeros where necessary to make it a total length of 3 ("2" becomes "002", "13" becomes "013", etc.).  A special case exception is the value "1" which is to be left as is.

I knew that the *proper* way to do this was to either create a script component and drop down into VB.NET to map the value appropriately, or use a Conditional Split transform and then merge the values together.  However, a colleague had just asked an unrelated question about the ternary operator in the expression language (officially called the Conditional Operator in Redmond), and I decided to push it a bit and force this processing into the Derived Column transform using the Conditional Operator.

So just for fun, I've posted the expression below.  It's actually three ternary Conditional Operators working together to provide an advance IF...ELSE IF...ELSE statement.  [For the record and in case it's not clear from the code below, the syntax for the Conditional Operator is (value_to_test) ? (value_to_return_if_true) : (value_to_return_if_false).]

(LEN(provider) < 3 && LEN(provider) > 0 && provider != "1") ? (LEN(provider) == 1 ? "00" + provider: "0" + provider) : (LEN(provider) == 0 ? "000" : provider)