SSIS Data Flow with CASE statements

  • Hi

    I would like to know how I can add the following sample code to my source data in a Data Flow in SSIS, or what other options there are. The main issue is time, as we are talking about hundreds of millions of rows.

    select Sample,
        CASE
            WHEN Sample IS NULL THEN NULL
            WHEN SUBSTRING(Sample, 1, 6) IS NULL THEN ' '
            ELSE RTRIM(SUBSTRING(Sample, 1, 6))
        END AS [Sample_1_6]
    from TestTable

    What I have done at this stage is just to create a SQL Task with an INSERT INTO:

    INSERT INTO [dbo].[TestTable1]
        ([Sample], [Sample_1_6])
    select Sample,
        CASE
            WHEN Sample IS NULL THEN NULL
            WHEN SUBSTRING(Sample, 1, 6) IS NULL THEN ' '
            ELSE RTRIM(SUBSTRING(Sample, 1, 6))
        END AS [Sample_1_6]
    from TestTable

    If there is a way of adding this to a Data Flow so I can use fast load, that would really be the best solution. I know there are Derived Columns, but would this really be faster than the straight INSERT INTO in a SQL Task? If this is the way to go, what code would I use in the Derived Column, or is there any other option?

  • Your entire case statement can be rewritten as RTRIM(SUBSTRING(Sample, 1, 6)).

    This will return NULL when Sample is NULL. SUBSTRING on a non-null string will never return NULL, so you can ignore that branch of the case statement.

    Since no CASE is needed, I would just keep your INSERT statement as-is. SSIS will just have more network overhead than the pure T-SQL solution (unless SSIS is on the same box, of course).
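    If you do keep it in T-SQL, a minimal sketch of the simplified statement (reusing the [dbo].[TestTable1] target and TestTable source from your post) could look like this:

    INSERT INTO [dbo].[TestTable1] ([Sample], [Sample_1_6])
    SELECT Sample,
        RTRIM(SUBSTRING(Sample, 1, 6)) AS [Sample_1_6] -- NULL in, NULL out; otherwise the first 6 characters, right-trimmed
    FROM TestTable;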

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Would that also work for this CASE statement?

    CASE
        WHEN Sample IS NULL THEN NULL
        WHEN SUBSTRING(REPLICATE('0', 9 - LEN(Sample)) + CAST(Sample AS VARCHAR(9)), 1, 3) IS NULL THEN 0
        ELSE RTRIM(SUBSTRING(REPLICATE('0', 9 - LEN(Sample)) + CAST(Sample AS VARCHAR(9)), 1, 3))
    END AS [Sample_1_3]

    We have a lot of these types of scripts in triggers, so it would help if you could give me the script to simplify this. At times we populate it with '', other times with 0, as in this specific example.

    I am currently doing an ETL where a long-winded method was used, running everything through a cursor to populate about 5 or 6 fields like this and like the other example you helped with. If I can find a way to get this straight into a fast load, that would really be super. Thanks so far for the help.

  • This CASE statement does the following:

    * if Sample is NULL, return NULL

    * if Sample has a length greater than 9, return 0

    * else, return the first 3 characters of Sample left-padded with zeros to a length of 9

    Is this the correct business logic?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

    I see your point, but this data is weird and we do have substrings, for example (4, 6) or (6, 20), so some of the values may in fact be NULL.

    So, for example, the field has a length of 30 characters, but say in one row the value is only 10 characters long:

    the substring (1, 3) would not be NULL,

    but the substring (2, 15) is totally different.

    In the first example, modified a bit, say your data looks like this:

    row 1 - 1234567890
    row 2 - null
    row 3 - 123456789012

    Running the select below would give you the following values:

    row 1 - ''
    row 2 - null
    row 3 - '12'

    select Sample,
        CASE
            WHEN Sample IS NULL THEN NULL
            WHEN SUBSTRING(Sample, 11, 2) IS NULL THEN ' '
            ELSE RTRIM(SUBSTRING(Sample, 11, 2))
        END AS [Sample_11_2]
    from TestTable
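    A quick, self-contained way to check that behaviour (just a throwaway table variable, names for illustration only):

    DECLARE @TestTable TABLE (Sample VARCHAR(30));
    INSERT INTO @TestTable (Sample)
    VALUES ('1234567890'), (NULL), ('123456789012');

    SELECT Sample,
        SUBSTRING(Sample, 11, 2) AS Raw_11_2, -- '' for row 1 (past the end, but not NULL), NULL for row 2, '12' for row 3
        RTRIM(SUBSTRING(Sample, 11, 2)) AS [Sample_11_2]
    FROM @TestTable;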

    Hope this makes sense

  • The issue with this really strange data is that I need to find a way to do a fast load of the data, if that is going to be possible with quite a number of these fields. My options are doing a fast load with a trigger on the destination table, which I think is going to be very slow considering there are over 500,000,000 rows to update. At this stage I am doing it with an INSERT INTO with the CASE statements, but it's also running for a while, although not as long as the previous method currently in use, which runs everything through a cursor to do the update.

    Off topic though: part of the ETL is changing data types, field names etc., so I create a new table with the correct field names and data types and then run a fast load on the tables without data updates. Now I am stuck on those tables where we have some fields that we want to auto-populate using the CASE statements.

  • OK.

    Keep in mind though that in the second CASE example, the simplified expression would give you NULL (rather than 0) when the length of Sample is greater than 9, because REPLICATE with a negative count returns NULL.

    Personally, I would check the length of Sample instead of taking the substring and then verifying whether it is NULL or not. That makes the intention clearer.

    The first branch of the CASE can always be dropped. You don't need to check for NULL and then return NULL.
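    Something along these lines, as a sketch (I'm assuming the 0 branch should stay a character value, '0', so the whole expression keeps a single data type):

    SELECT Sample,
        CASE
            WHEN LEN(Sample) > 9 THEN '0' -- overlong values: the original CASE returned 0 here
            ELSE RTRIM(SUBSTRING(REPLICATE('0', 9 - LEN(Sample)) + CAST(Sample AS VARCHAR(9)), 1, 3))
        END AS [Sample_1_3] -- a NULL Sample falls through to the ELSE and stays NULL
    FROM TestTable;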

    I wouldn't convert this to SSIS. Chances are this will be at least as fast in T-SQL.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

    Thanks, what you say makes sense. I will keep the INSERT INTO in a SQL Task though, as we are adding all the other tables via SSIS. The performance is almost the same for this select, but the fast load on the other tables is great, and the big improvement, and the reason I keep it in SSIS, is cutting out all the other steps: renaming fields, changing data types, etc. The run times are roughly similar between SSIS and SQL, but I am working on SSIS settings that are gradually improving its performance compared to SQL. Our large clients' ETLs are currently taking about 2-4 weeks, so by adding it to SSIS I can also run different tasks at the same time. I think this is the greatest thing about SSIS at this point, unless I can get real increases in performance out of SSIS, which would be great.

    Any extra tips on getting more performance out of SQL and/or SSIS would be great. The tables are really huge; one table had 1,800,000,000 rows.

    Thanks for the advice, I will definitely try to implement as much as possible.
