RE: Importing Poorly Formatted Text File

SSC Enthusiast

Points: 171

July 31, 2014 at 1:43 am

Jeff Moden (7/30/2014)
marg 14154 (7/20/2014)
Jeff Moden (7/18/2014)
Ed Wagner (7/18/2014)
Jeff Moden (7/17/2014)
Ok. Here we go. Time for some good ol' fashioned "Black Arts". 🙂
The first thing is, is to understand the true nature of the data-file example that was posted. I did that by "quoting the post" that had the example and some remarkable stuff showed up that the HTML rendering on this site was hiding. There's a ton of leading and embedded spaces. Here's what the file data really looks like...
REQ que stn all
STN
DN 326 0000
LOC GHUT ILCE 01 2 12 01
OPT CFW 1MR DGT RTP 0 ALTO NACT AWS NACT !NPA CFB CFD ENQC
CFAN
STN
DN 326 0001
LOC GHUT ILCE 01 2 12 02
OPT SUS CFW 1MR CWT DGT RTP 0 ALTO 3261281 ACT AWS NACT !NPA !ECO !RIT !124 CFB
CFD VM ENQC
STN
DN 326 0003
LOC GHUT ILCE 03 2 16 14
OPT 1MR DGT RTP 0 !TKD CFB CFD
STN
DN 326 0004
LOC GHUT ILCE 03 2 19 20
OPT 1MR DGT RTP 0 !LOC CFB CFD
The blank line actually contain 4 spaces. To help us see the spaces, I changed them all to a dash so they're easier to see.
REQ---que-stn-all
----
----STN
----DN---326-0000
----LOC--GHUT-ILCE-01-2-12-01-
----OPT--CFW-1MR-DGT-RTP-0-ALTO-NACT-AWS-NACT-!NPA-CFB-CFD-ENQC-
---------CFAN-
----
----STN
----DN---326-0001
----LOC--GHUT-ILCE-01-2-12-02-
----OPT--SUS-CFW-1MR-CWT-DGT-RTP-0-ALTO-3261281-ACT-AWS-NACT-!NPA-!ECO-!RIT-!124-CFB-
---------CFD-VM-ENQC-
----
----STN
----DN---326-0003
----LOC--GHUT-ILCE-03-2-16-14-
----OPT--1MR-DGT-RTP-0-!TKD-CFB-CFD-
----
----STN
----DN---326-0004
----LOC--GHUT-ILCE-03-2-19-20-
----OPT--1MR-DGT-RTP-0-!LOC-CFB-CFD
That's almost a fixed field format. In fact, the data to the right of what I'm calling the "RowType" is the only place where it's not fixed field. It's "ragged right" but it does start in the same position in the line even for the "next line extension" of the OPT RowType. That's EASY.
The first thing to do is build a BCP FORMAT FILE to control the import. I recommend creating a new directory somewhere on the server itself to hold such format files. Save the following as a format file. On my test box, I saved it under "C:\SQL Examples\BulkData.fmt".
9.0
4
1 SQLCHAR 0 4 "" 0 Blanks01 ""
2 SQLCHAR 0 3 "" 2 RowType ""
3 SQLCHAR 0 2 "" 0 Blanks02 ""
4 SQLCHAR 0 8000 "\ n\r" 3 BulkData "" <-----<<<< Take the space out between the \ and the n. This site "consumes" backslash n as a control character.
The BCP FORMAT FILE skips the 4 spaces at the beginning of each line, picks up the next 3 characters (spaces or not) as the RowType, skips 2 more spaces, and then picks up to the next 8000 characters (ending with a carriage return/line feed) as the "bulk data" for each RowType.
The rest is pretty easy. I saved your file data example as "C:\SQL Examples\Weird Telephone File.txt". You'll, of course, need to change that and the location of the BCP FORMAT FILE. If all of this works out according to you, then I'll show you how to parameterize those two file paths so that you could turn this into a stored procedure.
Here's the code. As normal, the details are in the comments in the code.
--===== If the work table exists, drop it to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#BulkData','U') IS NOT NULL
DROP TABLE #BulkData
;
--===== Create the work table with the quintessential CLUSTERED index on
-- the RowNum column.
CREATE TABLE #BulkData
(
RowNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,RowType CHAR(3)
,BulkData VARCHAR(8000)
,DN CHAR(8)
)
;
--===== Using the previously constructed BCP FORMAT FILE,
-- load the file into the work table.
-- Because of the structure of the file and the BCP FORMAT FILE,
-- this will skip the blank rows, as well. Makes life real easy.
BULK INSERT #BulkData
FROM 'C:\SQL Examples\Weird Telephone File.txt'
WITH (
FORMATFILE = 'C:\SQL Examples\BulkData.fmt'
,FIRSTROW = 3
,TABLOCK
)
;
--===== Declare a "per row" variable that will be updated for every row.
-- Think "pseudo-cursor" here where we use the natural looping behind
-- the scenes of the UPDATE as if it were a declared cursor.
DECLARE @DN CHAR(8)
;
--===== Using the old fashioned method of a "Quirky Update",
-- do a "data smear" to copy each DN entry "down" to the next rows
-- until we hit another DN entry and then copy that "down", etc.
-- This is where we also assign a RowType to prep for the pivoting
-- that we'll do using an old fashiond CROSSTAB.
-- Notice that this is where we also create the "OFL" or "OVERFLOw"
-- RowType for the "OPT" row "extensions".
-- Heh... when I say "old fashioned", think "Black Arts".
UPDATE tgt
SET @DN = DN = CASE WHEN RowType = 'DN' THEN LEFT(BulkData,8) ELSE ISNULL(@DN,'') END
,RowType = CASE WHEN RowType = '' AND RTRIM(LTRIM(BulkData)) <> 'STN' THEN 'OFL' ELSE RowType END
,BulkData = LTRIM(RTRIM(BulkData))
FROM #BulkData tgt WITH(TABLOCKX,INDEX(1))
OPTION (MAXDOP 1)
;
--===== Let's see what we have in the work table...
SELECT * FROM #BulkData
;
--===== Now, let's simultaneously pivot and parse the data using a very high performance CROSSTAB.
SELECT PhoneNumber = DN
,StationCode = MAX(CASE WHEN RowType = 'LOC' THEN RTRIM(SUBSTRING(BulkData,1,CHARINDEX(' ',BulkData))) ELSE '' END)
,Port = MAX(CASE WHEN RowType = 'LOC' THEN RTRIM(LTRIM(SUBSTRING(BulkData,CHARINDEX(' ',BulkData),8000))) ELSE '' END)
,Options = MAX(CASE WHEN RowType = 'OPT' THEN RTRIM(BulkData) ELSE '' END)
+ MAX(CASE WHEN RowType = 'OFL' THEN ' '+LTRIM(RTRIM(BulkData)) ELSE '' END)
FROM #BulkData
GROUP BY DN
;
This is what ends up in the "work" table... (scroll-right to see the new "DN" column)
RowNum RowType BulkData DN
1 DN 326 0000 326 0000
2 LOC GHUT ILCE 01 2 12 01 326 0000
3 OPT CFW 1MR DGT RTP 0 ALTO NACT AWS NACT !NPA CFB CFD ENQC 326 0000
4 OFL CFAN 326 0000
5 STN 326 0000
6 DN 326 0001 326 0001
7 LOC GHUT ILCE 01 2 12 02 326 0001
8 OPT SUS CFW 1MR CWT DGT RTP 0 ALTO 3261281 ACT AWS NACT !NPA !ECO !RIT !124 CFB 326 0001
9 OFL CFD VM ENQC 326 0001
10 STN 326 0001
11 DN 326 0003 326 0003
12 LOC GHUT ILCE 03 2 16 14 326 0003
13 OPT 1MR DGT RTP 0 !TKD CFB CFD 326 0003
14 STN 326 0003
15 DN 326 0004 326 0004
16 LOC GHUT ILCE 03 2 19 20 326 0004
17 OPT 1MR DGT RTP 0 !LOC CFB CFD 326 0004
Notice how the DN has been "smeared down" for all rows where it first occurs until a new DN row shows up and then that is "smeared down", etc, etc. That's the key that we GROUP on to do the CROSS APPLY to pivot everything and to combine the any existing OFL rows with their respective OPT rows.
Here's what the final output looks like courtesy of the CROSS APPLY.
PhoneNumber StationCode Port Options
----------- ----------- --------------- ---------------------------------------------------------------------------------------
326 0000 GHUT ILCE 01 2 12 01 CFW 1MR DGT RTP 0 ALTO NACT AWS NACT !NPA CFB CFD ENQC CFAN
326 0001 GHUT ILCE 01 2 12 02 SUS CFW 1MR CWT DGT RTP 0 ALTO 3261281 ACT AWS NACT !NPA !ECO !RIT !124 CFB CFD VM ENQC
326 0003 GHUT ILCE 03 2 16 14 1MR DGT RTP 0 !TKD CFB CFD
326 0004 GHUT ILCE 03 2 19 20 1MR DGT RTP 0 !LOC CFB CFD
This is the "easy" part, so far. If that's ok by you, the next part would be to split the options (OPT) into separate columns but, I have to ask, do you REALLY want to do that?
Nicely done, Jeff. I'm filing this one away for later, as I'm sure the approach will come in handy some day.
There's a correction that I just made in the original post because this site ate the backslash N. In your quotes above, the \ n\r should be \r\ n (with the appropriate space removal). I had originally pasted the \ n into the wrong spot apparently due to caffeine depletion in the bloodstream. :blush:
I also do absolutely agree with Koen that SSIS does make some things (like gated parallel processing) a whole lot easier. If you combine the features of SSIS with the ability for it to call upon stored procedures and the like, it can be a formidable tool.
@Koen...
Is there a way to easily copy packages from one instance of SSIS to another? THAT would make it even more formidable.
@jeff Yes the the OPTion values have to be separated. There is a catch though when separating the OPTions:
1. the paymode is determined from the OPTions, if OPTions contain any one or more of these values: !124,32,82,13,34,38,!RIT,!TDK then its a Prepaid, if in these values: 3,52,DMS10i,58,4,DMS100,8,19,5,9,21,10,7,26,2,37 then its a postpaid. If none of the values are specified, then the default Postpaid.
2. From all possible values of OPTions, the only OPTions' columns needed are:
!BAR !ECO !IDD !LOC !NPA !OPR !STD !VAS 1FR 1MR ACT ALTO AWS CDF CFAN CFB CFBU CFD CFDU CFW CLT CND CNDB COB CWT DGT DNH DOR DTM ENQC FNT FPSV FREQ FRST MCT MTR NACT OCB OPT RTP SUS SUSO SUST VM
Thank you.
Sorry about the long quote above. I wanted to keep this all together. And, sorry about the delay. I was on vacation last week and forgot all about it.
Getting back on track, the following will do what you've asked for in the last post above. It uses the same BCP Format File, the same file, and almost all of the code is the same except for how we split up the "BulkData" column.
The first thing that you'll need to do is to build the "DelimitedSplit8K" function in whatever database you're doing all of this. It's a fully documented splitter function. You can get the function from the "Resources" section near the bottom of the article at the following link.
http://www.sqlservercentral.com/articles/Tally+Table/72993/
Once you've installed the "DelimitedSplit8K" function above, the following code will do as you've asked...
/* BCP Format File used (same as previous)
9.0
4
1 SQLCHAR 0 4 "" 0 Blanks01 ""
2 SQLCHAR 0 3 "" 2 RowType ""
3 SQLCHAR 0 2 "" 0 Blanks02 ""
4 SQLCHAR 0 8000 "\r\ n" 3 BulkData "" <-----<<<< Take the space out between the \ and the n. This site "consumes" backslash n as a control character.
*/
--===== If the work table exists, drop it to make reruns in SSMS easier.
IF OBJECT_ID('tempdb..#BulkData','U') IS NOT NULL
DROP TABLE #BulkData
;
--===== Create the work table with the quintessential CLUSTERED index on
-- the RowNum column.
CREATE TABLE #BulkData
(
RowNum INT IDENTITY(1,1) PRIMARY KEY CLUSTERED
,RowType CHAR(3)
,BulkData VARCHAR(8000)
,DN CHAR(8)
)
;
--===== Using the previously constructed BCP FORMAT FILE,
-- load the file into the work table.
-- Because of the structure of the file and the BCP FORMAT FILE,
-- this will skip the blank rows, as well. Makes life real easy.
BULK INSERT #BulkData
FROM 'C:\SQL Examples\Weird Telephone File.txt'
WITH (
FORMATFILE = 'C:\SQL Examples\BulkData.fmt'
,FIRSTROW = 3
,TABLOCK
)
;
--===== Declare a "per row" variable that will be updated for every row.
-- Think "pseudo-cursor" here where we use the natural looping behind
-- the scenes of the UPDATE as if it were a declared cursor.
DECLARE @DN CHAR(8)
;
--===== Using the old fashioned method of a "Quirky Update",
-- do a "data smear" to copy each DN entry "down" to the next rows
-- until we hit another DN entry and then copy that "down", etc.
-- This is where we also assign a RowType to prep for the pivoting
-- that we'll do using an old fashiond CROSSTAB.
-- Notice that this is where we also create the "OFL" or "OVERFLOw"
-- RowType for the "OPT" row "extensions".
-- Heh... when I say "old fashioned", think "Black Arts".
UPDATE tgt
SET @DN = DN = CASE WHEN RowType = 'DN' THEN LEFT(BulkData,8) ELSE ISNULL(@DN,'') END
,RowType = CASE WHEN RowType = '' AND RTRIM(LTRIM(BulkData)) <> 'STN' THEN 'OFL' ELSE RowType END
,BulkData = LTRIM(RTRIM(BulkData))
FROM #BulkData tgt WITH(TABLOCKX,INDEX(1))
OPTION (MAXDOP 1)
;
--===== Let's see what we have in the work table...
-- SELECT * FROM #BulkData
--;
--===== Now, let's simultaneously pivot and parse the data using a very high performance CROSSTAB.
SELECT [PhoneNumber] = DN
,[StationCode] = MAX(CASE WHEN RowType = 'LOC' THEN RTRIM(SUBSTRING(BulkData,1,CHARINDEX(' ',BulkData))) ELSE '' END)
,[Port] = MAX(CASE WHEN RowType = 'LOC' THEN RTRIM(LTRIM(SUBSTRING(BulkData,CHARINDEX(' ',BulkData),8000))) ELSE '' END)
,[PayMode] = MAX(CASE --Note that the second "WHEN" isn't necessary here but does make the code "future bullet-proof"
WHEN split.Item IN ('!124','32','82','13','34','38','!RIT','!TDK') THEN 'Prepaid'
WHEN split.Item IN ('3','52','DMS10i','58','4','DMS100','8','19','5','9','21','10','7','26','2','37') THEN 'Postpaid'
ELSE 'Postpaid'
END)
,[!BAR] = MAX(CASE WHEN split.Item = '!BAR' THEN split.Item ELSE '' END)
,[!ECO] = MAX(CASE WHEN split.Item = '!ECO' THEN split.Item ELSE '' END)
,[!IDD] = MAX(CASE WHEN split.Item = '!IDD' THEN split.Item ELSE '' END)
,[!LOC] = MAX(CASE WHEN split.Item = '!LOC' THEN split.Item ELSE '' END)
,[!NPA] = MAX(CASE WHEN split.Item = '!NPA' THEN split.Item ELSE '' END)
,[!OPR] = MAX(CASE WHEN split.Item = '!OPR' THEN split.Item ELSE '' END)
,[!STD] = MAX(CASE WHEN split.Item = '!STD' THEN split.Item ELSE '' END)
,[!VAS] = MAX(CASE WHEN split.Item = '!VAS' THEN split.Item ELSE '' END)
,[1FR] = MAX(CASE WHEN split.Item = '1FR' THEN split.Item ELSE '' END)
,[1MR] = MAX(CASE WHEN split.Item = '1MR' THEN split.Item ELSE '' END)
,[ACT] = MAX(CASE WHEN split.Item = 'ACT' THEN split.Item ELSE '' END)
,[ALTO] = MAX(CASE WHEN split.Item = 'ALTO' THEN split.Item ELSE '' END)
,[AWS] = MAX(CASE WHEN split.Item = 'AWS' THEN split.Item ELSE '' END)
,[CDF] = MAX(CASE WHEN split.Item = 'CDF' THEN split.Item ELSE '' END)
,[CFAN] = MAX(CASE WHEN split.Item = 'CFAN' THEN split.Item ELSE '' END)
,[CFB] = MAX(CASE WHEN split.Item = 'CFB' THEN split.Item ELSE '' END)
,[CFBU] = MAX(CASE WHEN split.Item = 'CFBU' THEN split.Item ELSE '' END)
,[CFD] = MAX(CASE WHEN split.Item = 'CFD' THEN split.Item ELSE '' END)
,[CFDU] = MAX(CASE WHEN split.Item = 'CFDU' THEN split.Item ELSE '' END)
,[CFW] = MAX(CASE WHEN split.Item = 'CFW' THEN split.Item ELSE '' END)
,[CLT] = MAX(CASE WHEN split.Item = 'CLT' THEN split.Item ELSE '' END)
,[CND] = MAX(CASE WHEN split.Item = 'CND' THEN split.Item ELSE '' END)
,[CNDB] = MAX(CASE WHEN split.Item = 'CNDB' THEN split.Item ELSE '' END)
,[COB] = MAX(CASE WHEN split.Item = 'COB' THEN split.Item ELSE '' END)
,[CWT] = MAX(CASE WHEN split.Item = 'CWT' THEN split.Item ELSE '' END)
,[DGT] = MAX(CASE WHEN split.Item = 'DGT' THEN split.Item ELSE '' END)
,[DNH] = MAX(CASE WHEN split.Item = 'DNH' THEN split.Item ELSE '' END)
,[DOR] = MAX(CASE WHEN split.Item = 'DOR' THEN split.Item ELSE '' END)
,[DTM] = MAX(CASE WHEN split.Item = 'DTM' THEN split.Item ELSE '' END)
,[ENQC] = MAX(CASE WHEN split.Item = 'ENQC' THEN split.Item ELSE '' END)
,[FNT] = MAX(CASE WHEN split.Item = 'FNT' THEN split.Item ELSE '' END)
,[FPSV] = MAX(CASE WHEN split.Item = 'FPSV' THEN split.Item ELSE '' END)
,[FREQ] = MAX(CASE WHEN split.Item = 'FREQ' THEN split.Item ELSE '' END)
,[FRST] = MAX(CASE WHEN split.Item = 'FRST' THEN split.Item ELSE '' END)
,[MCT] = MAX(CASE WHEN split.Item = 'MCT' THEN split.Item ELSE '' END)
,[MTR] = MAX(CASE WHEN split.Item = 'MTR' THEN split.Item ELSE '' END)
,[NACT] = MAX(CASE WHEN split.Item = 'NACT' THEN split.Item ELSE '' END)
,[OCB] = MAX(CASE WHEN split.Item = 'OCB' THEN split.Item ELSE '' END)
,[OPT] = MAX(CASE WHEN split.Item = 'OPT' THEN split.Item ELSE '' END)
,[RTP] = MAX(CASE WHEN split.Item = 'RTP' THEN split.Item ELSE '' END)
,[SUS] = MAX(CASE WHEN split.Item = 'SUS' THEN split.Item ELSE '' END)
,[SUSO] = MAX(CASE WHEN split.Item = 'SUSO' THEN split.Item ELSE '' END)
,[SUST] = MAX(CASE WHEN split.Item = 'SUST' THEN split.Item ELSE '' END)
,[VM] = MAX(CASE WHEN split.Item = 'VM' THEN split.Item ELSE '' END)
FROM #BulkData bd
CROSS APPLY (SELECT CAST(Item AS CHAR(4)) FROM dbo.DelimitedSplit8K(bd.BulkData,' ')) split (Item)
GROUP BY DN
;
Here's what the result set looks like...
PhoneNumber StationCode Port PayMode !BAR !ECO !IDD !LOC !NPA !OPR !STD !VAS 1FR 1MR ACT ALTO AWS CDF CFAN CFB CFBU CFD CFDU CFW CLT CND CNDB COB CWT DGT DNH DOR DTM ENQC FNT FPSV FREQ FRST MCT MTR NACT OCB OPT RTP SUS SUSO SUST VM
----------- ----------- --------------- -------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
326 0000 GHUT ILCE 01 2 12 01 Postpaid !NPA 1MR ALTO AWS CFAN CFB CFD CFW DGT ENQC NACT RTP
326 0001 GHUT ILCE 01 2 12 02 Prepaid !ECO !NPA 1MR ACT ALTO AWS CFB CFD CFW CWT DGT ENQC NACT RTP SUS VM
326 0003 GHUT ILCE 03 2 16 14 Postpaid 1MR CFB CFD DGT RTP
326 0004 GHUT ILCE 03 2 19 20 Postpaid !LOC 1MR CFB CFD DGT RTP
Again, sorry for the delay and thanks for the problem. This was fun. If you have any questions about how this thing works, please post back. If I can't get back to it, the heavy hitters on this forum are pretty familiar with all of these techniques and I'm sure that they'll be happy to answer in my stead.

Jeff Thank you so much for assistance. I should be ashamed, you actually did my job for me but then again I am so grateful as you saved me a lot of time and head scratching. 😀 .

I have another file I am trying to import as well, a bit different. I was thinking if you assist me with the BCP Format File. (Please refer to attached) . The columns titles have a colon at the end of it.

E.g. in

DN: 3210000

TYPE: PILOT OF DNH HUNT GROUP

SNPA: 675 SIG: DT LNATTIDX: 0

HUNT GROUP: 51 HUNT MEMBER: 0

LINE EQUIPMENT NUMBER: HOST 04 1 00 05

the columns are DN, TYPE, SNPA, SIG, LNATTIDX, HUNT GROUP and LINE EQUIPMENT NUMBER. These lines ---------- depict the end of a phone detail.

Thank you.