Processing a flat file with no delimiters

  • Morning All,

    I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.

    The only identifier that I have for each row is that its first line begins with "B12C....". The other issue is that the number of columns varies from row to row.

    Below is a sample dataset:

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1')

  • I realised that I could use the LEAD function to process the next rows as separate columns:

    Select *
    FROM
    (
        SELECT ID,
            [Data],
            --Concat('(''',[Data] Collate SQL_Latin1_General_CP1_CI_AS,'''),'),
            --Case when [Data] like 'A12b%' then 1 else 0 END,
            Lead ([Data],1,0) over (Order by ID) A,
            Lead ([Data],2,0) over (Order by ID) b,
            Lead ([Data],3,0) over (Order by ID) c,
            Lead ([Data],4,0) over (Order by ID) d,
            Lead ([Data],5,0) over (Order by ID) e,
            Lead ([Data],6,0) over (Order by ID) f,
            Lead ([Data],7,0) over (Order by ID) g,
            Lead ([Data],8,0) over (Order by ID) h,
            Lead ([Data],9,0) over (Order by ID) i
        FROM dbo.[DataTEST]
    ) A
    where [Data] like 'B12c%'

  • Although this works in your example, what happens when a "column" is blank? Does the file contain a blank line, or does it go straight to the next column? Just want to make sure you don't run into any unexpected errors.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Here is a quick suggestion for filtering the data rows from the set:

    😎

    IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1');

    DECLARE @NUM_DATA_ROWS INT = 10;

    ;WITH BASE_DATA AS
    (
        SELECT
            DT.ID
           ,DT.Data
           ,SUM(CASE
                    WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
                    ELSE 0
                END) OVER
                (
                    ORDER BY DT.ID
                    ROWS BETWEEN UNBOUNDED PRECEDING
                             AND CURRENT ROW
                ) AS GROUP_ID
        FROM dbo.DataTEST DT
    )
    ,NUMBERED_LINE_SET AS
    (
        SELECT
            BD.ID
           ,ROW_NUMBER() OVER
            (
                PARTITION BY BD.GROUP_ID
                ORDER BY BD.GROUP_ID
            ) AS LINE_RID
           ,BD.GROUP_ID
           ,BD.Data
        FROM BASE_DATA BD
        WHERE BD.GROUP_ID > 0
    )
    SELECT
        NLS.ID
       ,NLS.GROUP_ID
       ,NLS.LINE_RID
       ,NLS.Data
    FROM NUMBERED_LINE_SET NLS
    WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
    ;

    Output from the sample data:

    ID  GROUP_ID  LINE_RID  Data
    --  --------  --------  -------------------------------
    10  1         1         b12c0000008gowM
    11  1         2         003b000001DolFS
    12  1         3         001b000003KtIjG
    13  1         4         Conference and Team building
    14  1         5         Word of Mouth
    15  1         6         08/07/2016
    16  1         7         0
    17  1         8         31/03/2016
    18  1         9         8th or 14th. £57ddr
    19  1         10        Edwards £3500 or Smithson £1500
    20  2         1         b12c0000008gomb
    21  2         2         003b000001STRE
    22  2         3         004r000004tes5
    23  2         4         Flight Training
    24  2         5         Dave Gorman
    25  2         6         08/09/2016
    26  2         7         0
    27  2         8         12/04/2016
    28  2         9         Dave to confrim numbers
    29  2         10        Currently using Suite 1

  • This will pivot the data sets to columns:

    😎

    IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1');

    DECLARE @NUM_DATA_ROWS INT = 10;

    ;WITH BASE_DATA AS
    (
        SELECT
            DT.ID
           ,DT.Data
           ,SUM(CASE
                    WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
                    ELSE 0
                END) OVER
                (
                    ORDER BY DT.ID
                    ROWS BETWEEN UNBOUNDED PRECEDING
                             AND CURRENT ROW
                ) AS GROUP_ID
        FROM dbo.DataTEST DT
    )
    ,NUMBERED_LINE_SET AS
    (
        SELECT
            BD.ID
           ,ROW_NUMBER() OVER
            (
                PARTITION BY BD.GROUP_ID
                ORDER BY BD.GROUP_ID
            ) AS LINE_RID
           ,BD.GROUP_ID
           ,BD.Data
        FROM BASE_DATA BD
        WHERE BD.GROUP_ID > 0
    )
    SELECT
        NLS.GROUP_ID
        --,NLS.LINE_RID
       ,MAX(CASE WHEN NLS.LINE_RID = 1  THEN NLS.Data END) AS COL_001
       ,MAX(CASE WHEN NLS.LINE_RID = 2  THEN NLS.Data END) AS COL_002
       ,MAX(CASE WHEN NLS.LINE_RID = 3  THEN NLS.Data END) AS COL_003
       ,MAX(CASE WHEN NLS.LINE_RID = 4  THEN NLS.Data END) AS COL_004
       ,MAX(CASE WHEN NLS.LINE_RID = 5  THEN NLS.Data END) AS COL_005
       ,MAX(CASE WHEN NLS.LINE_RID = 6  THEN NLS.Data END) AS COL_006
       ,MAX(CASE WHEN NLS.LINE_RID = 7  THEN NLS.Data END) AS COL_007
       ,MAX(CASE WHEN NLS.LINE_RID = 8  THEN NLS.Data END) AS COL_008
       ,MAX(CASE WHEN NLS.LINE_RID = 9  THEN NLS.Data END) AS COL_009
       ,MAX(CASE WHEN NLS.LINE_RID = 10 THEN NLS.Data END) AS COL_010
    FROM NUMBERED_LINE_SET NLS
    WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
    GROUP BY NLS.GROUP_ID
    ;

    Output from the sample data:

    GROUP_ID  COL_001          COL_002          COL_003          COL_004                       COL_005        COL_006     COL_007  COL_008     COL_009                  COL_010
    --------  ---------------  ---------------  ---------------  ----------------------------  -------------  ----------  -------  ----------  -----------------------  -------------------------------
    1         b12c0000008gowM  003b000001DolFS  001b000003KtIjG  Conference and Team building  Word of Mouth  08/07/2016  0        31/03/2016  8th or 14th. £57ddr      Edwards £3500 or Smithson £1500
    2         b12c0000008gomb  003b000001STRE   004r000004tes5   Flight Training               Dave Gorman    08/09/2016  0        12/04/2016  Dave to confrim numbers  Currently using Suite 1

  • A quick and simple solution with good old joins.

    Select
        CustomeridID.Data as CustomeridID,
        AccountID.Data as AccountID,
        EnquiryName.Data as EnquiryName,
        Source.Data as Source,
        EventDate.Data as EventDate,
        EventDays.Data as EventDays,
        EnquiryCreatedDate.Data as EnquiryCreatedDate,
        Description.Data as Description
    FROM dbo.[DataTEST] d
    join dbo.[DataTEST] CustomeridID on CustomeridID.ID = d.id + 1
    join dbo.[DataTEST] AccountID on AccountID.ID = d.id + 2
    join dbo.[DataTEST] EnquiryName on EnquiryName.ID = d.id + 3
    join dbo.[DataTEST] Source on Source.ID = d.id + 4
    join dbo.[DataTEST] EventDate on EventDate.ID = d.id + 5
    join dbo.[DataTEST] EventDays on EventDays.ID = d.id + 6
    join dbo.[DataTEST] EnquiryCreatedDate on EnquiryCreatedDate.ID = d.id + 7
    join dbo.[DataTEST] Description on Description.ID = d.id + 8
        and (d.id % 10) = 0

    Roshan Joe

    Jeff Moden - Forum Etiquette: How to post data/code on a forum to get the best help

  • joeroshan (9/21/2016)


    A quick and simple solution with good old joins.

    Select
        CustomeridID.Data as CustomeridID,
        AccountID.Data as AccountID,
        EnquiryName.Data as EnquiryName,
        Source.Data as Source,
        EventDate.Data as EventDate,
        EventDays.Data as EventDays,
        EnquiryCreatedDate.Data as EnquiryCreatedDate,
        Description.Data as Description
    FROM dbo.[DataTEST] d
    join dbo.[DataTEST] CustomeridID on CustomeridID.ID = d.id + 1
    join dbo.[DataTEST] AccountID on AccountID.ID = d.id + 2
    join dbo.[DataTEST] EnquiryName on EnquiryName.ID = d.id + 3
    join dbo.[DataTEST] Source on Source.ID = d.id + 4
    join dbo.[DataTEST] EventDate on EventDate.ID = d.id + 5
    join dbo.[DataTEST] EventDays on EventDays.ID = d.id + 6
    join dbo.[DataTEST] EnquiryCreatedDate on EnquiryCreatedDate.ID = d.id + 7
    join dbo.[DataTEST] Description on Description.ID = d.id + 8
        and (d.id % 10) = 0

    Careful there, this is very inefficient: the plan has multiple table scans, (number of columns + 1), for a total of 9 scans in your query.

    😎

  • Can you start by sharing a sample of the file? It might be possible to do it during the import, but I can't be sure without a sample.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Eirikur Eiriksson (9/21/2016)


    Careful there, this is very inefficient: the plan has multiple table scans, (number of columns + 1), for a total of 9 scans in your query.

    😎

    This can be made sargable with a clustered index on ID:

    Create clustered index IX_DataTESTID on DataTEST(ID)

    However, the modulo operator would still need a scan.
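    If that scan ever mattered at scale, one possible workaround (purely a sketch of my own, not from this thread, and it assumes the anchor rows always land on IDs where ID % 10 = 0) is to persist the modulo in a computed column and index it:

    ```sql
    -- Hypothetical: persist (ID % 10) so a filter such as WHERE d.IdMod10 = 0
    -- can use an index seek instead of evaluating d.id % 10 on every row.
    ALTER TABLE dbo.DataTEST ADD IdMod10 AS (ID % 10) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_DataTEST_IdMod10
        ON dbo.DataTEST (IdMod10) INCLUDE (Data);
    ```

    The join query would then filter on d.IdMod10 = 0 instead of (d.id % 10) = 0.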

    Roshan Joe

  • Thanks for all your responses, guys.

    Regarding the multiple-join approach, IO is one of the things that we try our hardest to limit as a development team, and having reviewed Eirikur Eiriksson's solution, I found it to work best with the full dataset.

    That being said, I'm humbled by everyone's positive responses and contributions.

  • 😎

    Multiple joins need not mean bad performance. I tried with a larger test set generated by executing:

    INSERT INTO dbo.[DataTEST] (Data)
    Select data from dbo.[DataTEST] where ID > 9 order by id
    GO 15

    With 65,000 rows the join was faster, especially after creating the clustered index.

    All said, I'm glad that your problem is solved.

    Let us know if you face any challenges later.

    Roshan Joe

  • waxb18 (9/21/2016)


    Morning All,

    I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.

    The only identifier that I have for each row is that its first line begins with "B12C....". The other issue is that the number of columns varies from row to row.

    Below is a sample dataset:

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1')

    Why are there only 9 heading rows when there are 10 rows of data in each group? I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    It doesn't look like there's any PII in that data. Would it be possible to attach the file? If you compress it, please use a ZIP file instead of RAR or any other compression method. This approach would also limit IO to the bare minimum, occurring only during the import, and wouldn't be more than what your current import incurs.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/21/2016)


    I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    One should be really good at a tool to know when not to use it :-D. It's always great learning from you, Jeff.

    Roshan Joe

  • joeroshan (9/22/2016)


    Jeff Moden (9/21/2016)


    I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    One should be really good at a tool to know when not to use it :-D. It's always great learning from you, Jeff.

    Thanks, Joe.

    Just to explain a bit more: if the number of header lines consistently equals the number of data lines, and the data lines are consistently in the same order as the header lines, then the file actually does have delimiters, in the form of the end-of-line characters on each line. That can easily be defined in a BCP format file, and the "pivot" will auto-magically be accomplished during the input. BCP/BULK INSERT is NOT relegated to 1 line per row.

    If the file (as sometimes happens) has a blank line between the groups of items, that can easily be incorporated into the BCP format file as the delimiter for the last "field".
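    To illustrate (only a sketch: the field names and 255-byte lengths are taken from the sample data above, the tenth column name is made up since only nine headers were posted, and the real file's line endings would need checking), a non-XML BCP format file that maps one line per field, with the blank line between groups folded into the last field's terminator, might look like:

    ```
    12.0
    10
    1   SQLCHAR  0  255  "\r\n"      1   TaskID              SQL_Latin1_General_CP1_CI_AS
    2   SQLCHAR  0  255  "\r\n"      2   CustomeridID        SQL_Latin1_General_CP1_CI_AS
    3   SQLCHAR  0  255  "\r\n"      3   AccountID           SQL_Latin1_General_CP1_CI_AS
    4   SQLCHAR  0  255  "\r\n"      4   EnquiryName         SQL_Latin1_General_CP1_CI_AS
    5   SQLCHAR  0  255  "\r\n"      5   Source              SQL_Latin1_General_CP1_CI_AS
    6   SQLCHAR  0  255  "\r\n"      6   EventDate           SQL_Latin1_General_CP1_CI_AS
    7   SQLCHAR  0  255  "\r\n"      7   EventDays           SQL_Latin1_General_CP1_CI_AS
    8   SQLCHAR  0  255  "\r\n"      8   EnquiryCreatedDate  SQL_Latin1_General_CP1_CI_AS
    9   SQLCHAR  0  255  "\r\n"      9   Description         SQL_Latin1_General_CP1_CI_AS
    10  SQLCHAR  0  255  "\r\n\r\n"  10  Comments            SQL_Latin1_General_CP1_CI_AS
    ```

    It could then be fed to bcp or BULK INSERT, e.g. BULK INSERT dbo.Enquiries FROM 'C:\import\enquiries.txt' WITH (FORMATFILE = 'C:\import\enquiries.fmt'); (table name and paths are hypothetical).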

    I frequently do the same with the output of PowerShell, because the column widths are unreliable when the data is delivered in a horizontal format, but the number of items per group and the order of the items are very reliable when it is returned in a vertical format. A simple split to remove the embedded column header on each line afterwards (I normally import to a table that has a persisted computed column for each item) makes it instantly complete on import. You can even do a bit of data-typing in such computed columns.
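    As a sketch of that persisted-computed-column idea (the table name, column names, and the 'Header : value' line shape are hypothetical, not taken from a real import):

    ```sql
    -- Hypothetical staging table: each imported line is assumed to look like
    -- 'EventDate : 08/07/2016'. The persisted computed column strips off the
    -- embedded header as the row arrives, so the value is complete on import.
    CREATE TABLE dbo.PoSh_Staging (
        RawLine   varchar(255) NULL,
        ItemValue AS LTRIM(SUBSTRING(RawLine, CHARINDEX(':', RawLine) + 1, 255)) PERSISTED
    );
    ```

    A CAST or TRY_CONVERT around the SUBSTRING would be where the data-typing Jeff mentions could happen.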

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks a lot for the detailed explanation, Jeff. Looks like I need to play around with some flat files this weekend. Hope waxb18 is still watching the thread and will benefit from your suggestion.

    Roshan Joe
