Conditional Split and Expressions issue

  • I have an issue with a flat file that contains a lot of dates which are not correctly formatted.

    The issue with the dates is the following:
    - Some dates get delivered as 0 or 200; these should be caught and the value changed to 99991231
    - Some dates only have 3 digits; for example, 112 actually means 20000112
    - Some dates only have 4 or 5 digits; for example, 1212 actually means 20001212 and 11212 actually means 20011212
    - Most dates have 6 digits; for example, 911212 means 19911212 and 101212 means 20101212 (if there is anything good about this, there's no 80s data)
    - There are no 7- or 8-digit dates in the flat file

    So I extract all available data into nchar(8) columns (they are formatted as YYYYMMDD).

    The table looks like this:

    CREATE TABLE dbo.ImportDates(
        [PAMA-DA-LAMO] nchar(8) NULL,
        [PAMA-DA-REC] nchar(8) NULL
    ) ON [PRIMARY];
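
    For reference, this is the kind of raw data that ends up in there, together with what each value should become (the pairings across the two columns are made up purely for illustration):

    INSERT INTO dbo.ImportDates ([PAMA-DA-LAMO], [PAMA-DA-REC])
    VALUES (N'0',      N'200'),     -- both should end up as 99991231
           (N'112',    N'1212'),    -- 20000112 and 20001212
           (N'11212',  N'911212'),  -- 20011212 and 19911212
           (N'101212', N'911212');  -- 20101212 and 19911212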

    From there I want the columns converted to the date data type, so they look like this: YYYY-MM-DD, and then inserted into this table:

    CREATE TABLE dbo.CleanedDates(
        [PAMA-DA-LAMO] date NULL,
        [PAMA-DA-REC] date NULL
    ) ON [PRIMARY];

    So my DFT looks like this:

    Datasource ImportDates -> Derived Column (TRIM both Columns) -> Conditional Split -> Derived Column Based on LEN(Column) -> Derived Column (Reformat to YYYY-MM-DD) -> Insert to Destination Table

    Unfortunately, the results are not as expected. Most dates seem to be processed correctly, but not all of them, so once I get to the DC that reformats to YYYY-MM-DD, rows fail because one of the two columns fails (either the first or the second), and so the whole row is discarded for the insert.

    My Conditional Split looks like this:

    Order 1 / Output Name: PAMA-DA-LAMO SL3 / Condition: (LEN([PAMA-DA-LAMO])) == 0 || (LEN([PAMA-DA-LAMO])) == 3 || (LEN([PAMA-DA-LAMO])) == 1
    Order 2 / Output Name: PAMA-DA-REC SL3 / Condition: (LEN([PAMA-DA-REC])) == 0 || (LEN([PAMA-DA-REC])) == 3 || (LEN([PAMA-DA-LAMO])) == 1
    Order 3 & 4 / Output Name: PAMA-DA* SL4 / Condition: (LEN([PAMA-DA-LAMO])) == 4 and (LEN([PAMA-DA-REC])) == 4 respectively
    Order 5 & 6 / Same as 3 & 4, just with == 5
    Order 7 & 8 / Same as 3 & 4, just with == 6
    Conditional Split Default Output

    Except for the Conditional Split Default Output, all other outputs are connected to Derived Columns.
    Question here: is this actually the right order to start with, or should I have the string length 6 outputs on Order 1 & 2? Is this actually split correctly? On Order 1 & 2 I check not only for lengths 3 and 1 (if a 0 was actually inserted, this should catch it?) but also for zero-length strings, in case the value comes through empty.

    Next, the Derived Column definitions:

    String Length 6:

    (LEN([PAMA-DA-REC])) == 6 && (((SUBSTRING([PAMA-DA-REC],1,2)) >= "90") && ((SUBSTRING([PAMA-DA-REC],1,2)) <= "99")) ? ("19" + ([PAMA-DA-REC])) : ("20" + ([PAMA-DA-REC]))


    Here I first check whether the SL (string length) is 6 and whether the first 2 characters of the string are between 90 and 99; if so, prepend 19, otherwise 20. At least that's the plan. There is a DC (Derived Column) with the same expression for the other column.

    SL 5:

    (LEN([PAMA-DA-LAMO])) == 5 && (([PAMA-DA-LAMO]) == "0") ? ("99991231") : ("200" + ([PAMA-DA-LAMO]))

    The assumption for SL 5 and SL 4 is that the Conditional Split worked as desired, so I just rule out that the value is 0 and prepend 200 (2000 for SL 4) to the string. I'm not entirely sure whether I should have the SL check here again.

    SL 3:

    LEN([PAMA1-DA-LAMO]) <= 3 && (([PAMA1-DA-LAMO] == "0") || ([PAMA1-DA-LAMO] == "200")) ? ("99991231") : ("20000" + ([PAMA1-DA-LAMO]))

    So for SL 3 and below (checking the table output, the SL is either 3 or 1) we want to check whether the string is either 0 or 200 and turn that into 99991231; otherwise prepend 20000 to the string.

    Casting whatever comes out of this to YYYY-MM-DD fails, with one column or the other failing on the Conditional Split or on the DC for the SL transformations; on some rows the first column looks good and the second bad, on others it's the other way round.
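
    Just to spell out the complete mapping I'm after: if I wrote it for one column as a single nested expression instead of the split (only a sketch, assuming the column has already been trimmed; the second column would need the same expression), it would be roughly:

    LEN([PAMA-DA-LAMO]) == 0 || [PAMA-DA-LAMO] == "0" || [PAMA-DA-LAMO] == "200" ? "99991231" :
    LEN([PAMA-DA-LAMO]) == 6 && SUBSTRING([PAMA-DA-LAMO],1,2) >= "90" && SUBSTRING([PAMA-DA-LAMO],1,2) <= "99" ? "19" + [PAMA-DA-LAMO] :
    LEN([PAMA-DA-LAMO]) == 6 ? "20" + [PAMA-DA-LAMO] :
    LEN([PAMA-DA-LAMO]) == 5 ? "200" + [PAMA-DA-LAMO] :
    LEN([PAMA-DA-LAMO]) == 4 ? "2000" + [PAMA-DA-LAMO] :
    "20000" + [PAMA-DA-LAMO]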

    What am I missing? Is there a better approach to something like this (apart from telling the customer to get his dates sorted out properly)?

    Thanks and Cheers

  • I've had to import some fairly rough-looking data in the past, and I import it (as you have) into a column that will accept pretty much anything.
    Then I call a stored procedure that is written to sort it all out, rejecting any rows it can't process to another table.

    Personally I find this approach better because it doesn't rely on an ever-expanding list of data transformations and (more importantly) it is isolated from the SSIS package. I can change the stored procedure time and again until I'm happy with it (testing it in isolation too), without having to rebuild/import the SSIS package every time.

    I do like SSIS but I try to keep what it does as simple as possible and rely on stored procedures to do the complicated stuff - it makes maintenance easier and I can break a problem down into simpler steps with this approach.
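
    For what it's worth, here's a cut-down sketch of the sort of thing I mean, using the rules from your first post (the procedure name, the reject table dbo.RejectedDates and the use of TRY_CONVERT, which needs SQL Server 2012 or later, are just assumptions on my part):

    CREATE PROCEDURE dbo.usp_CleanImportDates
    AS
    BEGIN
        SET NOCOUNT ON;

        -- first expand every raw value to an 8-character YYYYMMDD string
        SELECT
            CASE WHEN LTRIM(RTRIM([PAMA-DA-LAMO])) IN (N'', N'0', N'200') THEN N'99991231'
                 WHEN LEN(LTRIM(RTRIM([PAMA-DA-LAMO]))) = 6
                      AND LEFT(LTRIM(RTRIM([PAMA-DA-LAMO])), 2) BETWEEN N'90' AND N'99'
                      THEN N'19' + LTRIM(RTRIM([PAMA-DA-LAMO]))
                 -- everything else is 1-6 digits and belongs to 20xx: pad to 7 digits, put the 2 back on
                 ELSE N'2' + RIGHT(REPLICATE(N'0', 7) + LTRIM(RTRIM([PAMA-DA-LAMO])), 7)
            END AS Lamo8,
            CASE WHEN LTRIM(RTRIM([PAMA-DA-REC])) IN (N'', N'0', N'200') THEN N'99991231'
                 WHEN LEN(LTRIM(RTRIM([PAMA-DA-REC]))) = 6
                      AND LEFT(LTRIM(RTRIM([PAMA-DA-REC])), 2) BETWEEN N'90' AND N'99'
                      THEN N'19' + LTRIM(RTRIM([PAMA-DA-REC]))
                 ELSE N'2' + RIGHT(REPLICATE(N'0', 7) + LTRIM(RTRIM([PAMA-DA-REC])), 7)
            END AS Rec8
        INTO #Fixed
        FROM dbo.ImportDates;

        -- rows that convert cleanly go to the target table
        INSERT INTO dbo.CleanedDates ([PAMA-DA-LAMO], [PAMA-DA-REC])
        SELECT TRY_CONVERT(date, Lamo8), TRY_CONVERT(date, Rec8)
        FROM #Fixed
        WHERE TRY_CONVERT(date, Lamo8) IS NOT NULL
          AND TRY_CONVERT(date, Rec8) IS NOT NULL;

        -- anything that still won't convert is rejected to another table
        -- (dbo.RejectedDates with two nchar(8) columns is assumed here)
        INSERT INTO dbo.RejectedDates ([PAMA-DA-LAMO], [PAMA-DA-REC])
        SELECT Lamo8, Rec8
        FROM #Fixed
        WHERE TRY_CONVERT(date, Lamo8) IS NULL
           OR TRY_CONVERT(date, Rec8) IS NULL;
    END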

  • I'm not sure whether going for a stored procedure would be any better in this case, as that sounds even more complicated to me, or at least as messy as fiddling with SSIS expressions. Sounds a bit like a candidate for a "How many CASEs can you state in one query?" challenge 😉

    but yeah, SSIS is fun at times ...

    SELECT r.*
    FROM dbo.Tbl1 r
    WHERE [LOAD_DATE_TIMESTAMP] =
        ( SELECT MAX([LOAD_DATE_TIMESTAMP])
          FROM dbo.Tbl1
          WHERE Col1 = r.Col1 )

    gives me: "Could not convert Col2 because of a potential loss of data". SSIS recognizes Col2 as DT_WSTR, and changing the data types on the Input/Output columns doesn't make SQL Server care any more than before (it just resets back to DT_WSTR).

    Just to clarify: if I specify all the columns in the first SELECT, it works fine. 🙂
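
    In other words, something along these lines instead of r.* (the column list is just illustrative; an explicit CAST on a column is another way to control the type the data flow picks up):

    SELECT r.Col1,
           r.Col2,
           r.[LOAD_DATE_TIMESTAMP]
    FROM dbo.Tbl1 r
    WHERE [LOAD_DATE_TIMESTAMP] =
        ( SELECT MAX([LOAD_DATE_TIMESTAMP])
          FROM dbo.Tbl1
          WHERE Col1 = r.Col1 );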

  • In the past I've used a staging table that has columns for the original data (a column that can contain anything) and a column in the format that I actually require. A series of statements then looks for specific forms of data in the original column and inserts the corrected data into the new column. That way I can see the original data and what I made from it in the new column.
    It doesn't require complex CASE statements: deal with each problem as a separate piece of code. Try to keep it as simple as possible. It makes for lengthy code but very easy to understand and maintain.
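
    As a rough sketch of the shape of it, using the dates from this topic (table and column names are placeholders, TRY_CONVERT assumes SQL Server 2012 or later, and each UPDATE deals with one form of bad data):

    -- staging table: original value alongside the cleaned-up value
    CREATE TABLE dbo.StageDates
    (
        RawDate   nchar(8) NULL,   -- the value exactly as delivered in the file
        CleanDate date     NULL    -- the format actually required
    );

    -- dummy values (empty, 0, 200) become the 9999-12-31 marker
    UPDATE dbo.StageDates
    SET    CleanDate = '99991231'
    WHERE  LTRIM(RTRIM(RawDate)) IN (N'', N'0', N'200');

    -- six digits starting 90-99 are 19xx dates
    UPDATE dbo.StageDates
    SET    CleanDate = TRY_CONVERT(date, N'19' + LTRIM(RTRIM(RawDate)))
    WHERE  CleanDate IS NULL
      AND  LEN(LTRIM(RTRIM(RawDate))) = 6
      AND  LEFT(LTRIM(RTRIM(RawDate)), 2) BETWEEN N'90' AND N'99';

    -- remaining six-digit values are 20xx dates
    UPDATE dbo.StageDates
    SET    CleanDate = TRY_CONVERT(date, N'20' + LTRIM(RTRIM(RawDate)))
    WHERE  CleanDate IS NULL
      AND  LEN(LTRIM(RTRIM(RawDate))) = 6;

    -- five-digit values: 11212 -> 20011212
    UPDATE dbo.StageDates
    SET    CleanDate = TRY_CONVERT(date, N'200' + LTRIM(RTRIM(RawDate)))
    WHERE  CleanDate IS NULL
      AND  LEN(LTRIM(RTRIM(RawDate))) = 5;

    -- four-digit values: 1212 -> 20001212
    UPDATE dbo.StageDates
    SET    CleanDate = TRY_CONVERT(date, N'2000' + LTRIM(RTRIM(RawDate)))
    WHERE  CleanDate IS NULL
      AND  LEN(LTRIM(RTRIM(RawDate))) = 4;

    -- three-digit values: 112 -> 20000112
    UPDATE dbo.StageDates
    SET    CleanDate = TRY_CONVERT(date, N'20000' + LTRIM(RTRIM(RawDate)))
    WHERE  CleanDate IS NULL
      AND  LEN(LTRIM(RTRIM(RawDate))) = 3;

    -- anything still NULL in CleanDate could not be handled and can be reviewed or rejected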

    I'm away for a few days but if you do have problems with this method, post some example data and I'll endeavour to produce an example upon my return.
