Parse Info Out of FileName

Question

Parse Info Out of FileName

monkeypoi

Newbie

Points: 8
More actions
September 20, 2013 at 1:40 pm

#278215

I am new to SSIS and have a package I have been working on. My issue is I don't know how to parse out a portion of the incoming file name so I can use that to help name the final flat file. I have files coming in as both csv and txt files that prevents me from just looking for the file extension. The structure of the incoming file name is C:/Location/Folder/Subfolder/filename20130101.(csv) or (txt). What I need is to get the numbers on the end of the file, store them than do a manipulation to them so I can add the previous day's date in the same format to my final flat file. I also have to include in the logic for end of month, year, etc.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

pietlinden SSC Guru Points: 63428 More actions · Answer 1

I'll give you that I'm not an SSIS guy, but if the format is always the same, you should be able to create a variable for the filename, then strip off the extension (last 4 characters?), and then if the date is always 8 characters, you should be able to use SUBSTRING() to split the pieces out, convert the strings to numbers, and then that whole thing to a real date that you can do math on.

monkeypoi Newbie Points: 8 More actions · Answer 2

Unfortunately the files that come in aren't spaced the exact the same. One group of files comes in a csv and are named/spaced one way, however the other file comes in as a txt and is spaced differently. I already tried that route. What I need is to verify that the files have the same date on them then use that date to derive the previous date to name my final file with the previous day's date.

reitzen Newbie Points: 9 More actions · Answer 3

ltrim(rtrim(SUBSTRING([filename], charindex('Invoice', [filename])+8, LEN([filename]) - (CHARINDEX('Invoice', [filename])+11))))

[p]This snippet pulls out the invoice number from a filename. The filename is in a consistent format "MyCustomerName_Invoice_12345678.pdf". The snippet above uses the word "Invoice" as the anchor to get the invoice number and it doesn't care what the file extension is. You could use a CASE statement to test for a TXT or CSV extension. You could also use CASE to test for any consistent part of the filename. This is a T-SQL snippet so I don't know if this going to help you out of the box, but for what it's worth...[/p]

SQL & Cold Fusion

EricEyster SSCrazy Points: 2974 More actions · Answer 4

At some point, this type of logic becomes easier to write and maintain as a script task embedded in a package.

Ed Wagner SSC Guru Points: 287024 More actions · Answer 5

EricEyster (2/24/2014)
At some point, this type of logic becomes easier to write and maintain as a script task embedded in a package.

Agreed. I'm not at all an SSIS guy beyond spelling it, but if you're going to do this in T-SQL, an alternate approach would be to stop thinking of it in terms of parsing the string by character and start thinking of splitting it out by delimiter. If you're not familiar with Jeff Moden's DelimitedSplit8K function, the article is at http://www.sqlservercentral.com/articles/Tally+Table/72993/. It'll change the way you look at data and think about performance.

with cte(filename) as (

select 'MyCustomerName_Invoice_12345678.pdf' union all

select 'YourCustomerName_Invoice_23456.pdf' union all

select 'SomeLongerCustomerName_Invoice_123.pdf' union all

select 'ShortName_Invoice_7891234567890.pdf')

select filename, s.item

from cte

cross apply DelimitedSplit8K(cte.filename, '_') s

where s.ItemNumber = 3;

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Dwain Camps SSC Guru Points: 86908 More actions · Answer 6

I'm going to go out on a limb here since I am not an "SSIS guy" either and say that if you can use DelimitedSplit8K you can also use PatternSplitCM (see the 4th article in my signature links). In that article it gives a number of examples for parsing a file name, some of which may be applicable to your case.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 7

This should handle pretty much anything you can throw at it.

-- DROP TABLE #FileName

--===== Create a test table with some normal and "odd" file/path names.

SELECT FileName = 'C:/Location/Folder/Subfolder/filename20130101.xxx'

INTO #FileName UNION ALL

SELECT FileName = 'C:/Location/Folder/Subfolder/filename20130102' UNION ALL

SELECT FileName = 'filename20130103' UNION ALL

SELECT FileName = 'filename20130104.xxx' UNION ALL

SELECT FileName = '20130105.xxx' UNION ALL

SELECT FileName = '20130106' UNION ALL

SELECT FileName = '20130107def' UNION ALL

SELECT FileName = 'abc20130108def' UNION ALL

SELECT FileName = 'abc20130109def.xxx' UNION ALL

SELECT FileName = '123abc20130110def.xxx' UNION ALL

SELECT FileName = 'abcdef.xxx' --will not show up because has no numbers

;

--===== Get just the last set of numbers.

SELECT fn.FileName, DateOnly = LEFT(FromTheNumbers,ISNULL(NULLIF(PATINDEX('%[0-9][^0-9]%',FromTheNumbers),0),8000))

FROM #FileName fn

CROSS APPLY (SELECT RIGHT(FileName,ISNULL(NULLIF(PATINDEX('%[0-9][^0-9]%',REVERSE(FileName)),0),8000))) ca (FromTheNumbers)

WHERE fn.FileName LIKE '%[0-9]%'

;

Results:

FileName DateOnly

------------------------------------------------- --------

C:/Location/Folder/Subfolder/filename20130101.xxx 20130101

C:/Location/Folder/Subfolder/filename20130102 20130102

filename20130103 20130103

filename20130104.xxx 20130104

20130105.xxx 20130105

20130106 20130106

20130107def 20130107

abc20130108def 20130108

abc20130109def.xxx 20130109

123abc20130109def.xxx 20130110

(10 row(s) affected)

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ed Wagner SSC Guru Points: 287024 More actions · Answer 8

Jeff Moden (2/25/2014)
This should handle pretty much anything you can throw at it.
-- DROP TABLE #FileName
--===== Create a test table with some normal and "odd" file/path names.
SELECT FileName = 'C:/Location/Folder/Subfolder/filename20130101.xxx'
INTO #FileName UNION ALL
SELECT FileName = 'C:/Location/Folder/Subfolder/filename20130102' UNION ALL
SELECT FileName = 'filename20130103' UNION ALL
SELECT FileName = 'filename20130104.xxx' UNION ALL
SELECT FileName = '20130105.xxx' UNION ALL
SELECT FileName = '20130106' UNION ALL
SELECT FileName = '20130107def' UNION ALL
SELECT FileName = 'abc20130108def' UNION ALL
SELECT FileName = 'abc20130109def.xxx' UNION ALL
SELECT FileName = '123abc20130110def.xxx' UNION ALL
SELECT FileName = 'abcdef.xxx' --will not show up because has no numbers
;
--===== Get just the last set of numbers.
SELECT fn.FileName, DateOnly = LEFT(FromTheNumbers,ISNULL(NULLIF(PATINDEX('%[0-9][^0-9]%',FromTheNumbers),0),8000))
FROM #FileName fn
CROSS APPLY (SELECT RIGHT(FileName,ISNULL(NULLIF(PATINDEX('%[0-9][^0-9]%',REVERSE(FileName)),0),8000))) ca (FromTheNumbers)
WHERE fn.FileName LIKE '%[0-9]%'
;
Results:
FileName DateOnly
------------------------------------------------- --------
C:/Location/Folder/Subfolder/filename20130101.xxx 20130101
C:/Location/Folder/Subfolder/filename20130102 20130102
filename20130103 20130103
filename20130104.xxx 20130104
20130105.xxx 20130105
20130106 20130106
20130107def 20130107
abc20130108def 20130108
abc20130109def.xxx 20130109
123abc20130109def.xxx 20130110
(10 row(s) affected)

Now that's a cool approach, Jeff, and it does 1M rows in about 12 seconds.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 9

Ed Wagner (2/26/2014)
... and it does 1M rows in about 12 seconds.

Heh... damn. I must be gettin' old and starting to slip. 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ed Wagner SSC Guru Points: 287024 More actions · Answer 10

Jeff Moden (2/26/2014)
Ed Wagner (2/26/2014)
... and it does 1M rows in about 12 seconds.
Heh... damn. I must be gettin' old and starting to slip. 😛

If you keep the underscores and use DelimitedSplit8K, it does 1M in about 8 seconds, but the regular expression has never been known to be a speed demon in any language. Nonetheless, for what it does, I'd say that he OP is not going to have 1M incoming files in the file system without running into a host of other problems.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jeff Moden SSC Guru Points: 1003861 More actions · Answer 11

I guess "keeping the underscores" would be a prob lem for the OP, though. His original format had no underscores.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ed Wagner SSC Guru Points: 287024 More actions · Answer 12

Jeff Moden (2/27/2014)
I guess "keeping the underscores" would be a prob lem for the OP, though. His original format had no underscores.

DOH!!! You're absolutely right. Now my question is where I got the underscores in the first place. :crazy: Now I must be getting old...starting to see things.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions