Processing a flat file with no delimiters

  • Morning All,

    I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.

    The only identifier that I have for each row is that its first line begins with "B12C....". The other issue is that the number of columns varies from row to row.

    Below is a sample dataset:

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1')

  • I realised that I could use the LEAD function to process the next rows as separate columns:

    Select *
    FROM
    (
        SELECT ID,
            [Data],
            --Concat('(''',[Data] Collate SQL_Latin1_General_CP1_CI_AS,'''),'),
            --Case when [Data] like 'A12b%' then 1 else 0 END,
            Lead ([Data],1,0) over (Order by ID) A,
            Lead ([Data],2,0) over (Order by ID) b,
            Lead ([Data],3,0) over (Order by ID) c,
            Lead ([Data],4,0) over (Order by ID) d,
            Lead ([Data],5,0) over (Order by ID) e,
            Lead ([Data],6,0) over (Order by ID) f,
            Lead ([Data],7,0) over (Order by ID) g,
            Lead ([Data],8,0) over (Order by ID) h,
            Lead ([Data],9,0) over (Order by ID) i
        FROM dbo.[DataTEST]
    ) A
    where [Data] like 'B12c%'

  • Although this works in your example, what happens when a "column" is blank? Does the file contain a blank line, or does it go straight to the next column? Just want to make sure you don't run into any unexpected errors.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Here is a quick suggestion for filtering the data rows from the set:

    😎

    IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1');

    DECLARE @NUM_DATA_ROWS INT = 10;

    ;WITH BASE_DATA AS
    (
        SELECT
            DT.ID
           ,DT.Data
           ,SUM(CASE
                    WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
                    ELSE 0
                END) OVER
                (
                    ORDER BY DT.ID
                    ROWS BETWEEN UNBOUNDED PRECEDING
                             AND CURRENT ROW
                ) AS GROUP_ID
        FROM dbo.DataTEST DT
    )
    ,NUMBERED_LINE_SET AS
    (
        SELECT
            BD.ID
           ,ROW_NUMBER() OVER
            (
                PARTITION BY BD.GROUP_ID
                ORDER BY BD.GROUP_ID
            ) AS LINE_RID
           ,BD.GROUP_ID
           ,BD.Data
        FROM BASE_DATA BD
        WHERE BD.GROUP_ID > 0
    )
    SELECT
        NLS.ID
       ,NLS.GROUP_ID
       ,NLS.LINE_RID
       ,NLS.Data
    FROM NUMBERED_LINE_SET NLS
    WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
    ;

    Output from the sample data:

    ID  GROUP_ID  LINE_RID  Data
    --  --------  --------  -------------------------------
    10  1         1         b12c0000008gowM
    11  1         2         003b000001DolFS
    12  1         3         001b000003KtIjG
    13  1         4         Conference and Team building
    14  1         5         Word of Mouth
    15  1         6         08/07/2016
    16  1         7         0
    17  1         8         31/03/2016
    18  1         9         8th or 14th. £57ddr
    19  1         10        Edwards £3500 or Smithson £1500
    20  2         1         b12c0000008gomb
    21  2         2         003b000001STRE
    22  2         3         004r000004tes5
    23  2         4         Flight Training
    24  2         5         Dave Gorman
    25  2         6         08/09/2016
    26  2         7         0
    27  2         8         12/04/2016
    28  2         9         Dave to confrim numbers
    29  2         10        Currently using Suite 1

  • This will pivot the data sets to columns:

    😎

    IF OBJECT_ID(N'dbo.DataTEST') IS NOT NULL DROP TABLE dbo.DataTEST;

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1');

    DECLARE @NUM_DATA_ROWS INT = 10;

    ;WITH BASE_DATA AS
    (
        SELECT
            DT.ID
           ,DT.Data
           ,SUM(CASE
                    WHEN CHARINDEX('b12c',DT.Data,1) = 1 THEN 1
                    ELSE 0
                END) OVER
                (
                    ORDER BY DT.ID
                    ROWS BETWEEN UNBOUNDED PRECEDING
                             AND CURRENT ROW
                ) AS GROUP_ID
        FROM dbo.DataTEST DT
    )
    ,NUMBERED_LINE_SET AS
    (
        SELECT
            BD.ID
           ,ROW_NUMBER() OVER
            (
                PARTITION BY BD.GROUP_ID
                ORDER BY BD.GROUP_ID
            ) AS LINE_RID
           ,BD.GROUP_ID
           ,BD.Data
        FROM BASE_DATA BD
        WHERE BD.GROUP_ID > 0
    )
    SELECT
        NLS.GROUP_ID
        --,NLS.LINE_RID
       ,MAX(CASE WHEN NLS.LINE_RID = 1  THEN NLS.Data END) AS COL_001
       ,MAX(CASE WHEN NLS.LINE_RID = 2  THEN NLS.Data END) AS COL_002
       ,MAX(CASE WHEN NLS.LINE_RID = 3  THEN NLS.Data END) AS COL_003
       ,MAX(CASE WHEN NLS.LINE_RID = 4  THEN NLS.Data END) AS COL_004
       ,MAX(CASE WHEN NLS.LINE_RID = 5  THEN NLS.Data END) AS COL_005
       ,MAX(CASE WHEN NLS.LINE_RID = 6  THEN NLS.Data END) AS COL_006
       ,MAX(CASE WHEN NLS.LINE_RID = 7  THEN NLS.Data END) AS COL_007
       ,MAX(CASE WHEN NLS.LINE_RID = 8  THEN NLS.Data END) AS COL_008
       ,MAX(CASE WHEN NLS.LINE_RID = 9  THEN NLS.Data END) AS COL_009
       ,MAX(CASE WHEN NLS.LINE_RID = 10 THEN NLS.Data END) AS COL_010
    FROM NUMBERED_LINE_SET NLS
    WHERE NLS.LINE_RID <= @NUM_DATA_ROWS
    GROUP BY NLS.GROUP_ID
    ;

    Output from the sample data:

    GROUP_ID  COL_001          COL_002          COL_003          COL_004                       COL_005        COL_006     COL_007  COL_008     COL_009                  COL_010
    --------  ---------------  ---------------  ---------------  ----------------------------  -------------  ----------  -------  ----------  -----------------------  -------------------------------
    1         b12c0000008gowM  003b000001DolFS  001b000003KtIjG  Conference and Team building  Word of Mouth  08/07/2016  0        31/03/2016  8th or 14th. £57ddr      Edwards £3500 or Smithson £1500
    2         b12c0000008gomb  003b000001STRE   004r000004tes5   Flight Training               Dave Gorman    08/09/2016  0        12/04/2016  Dave to confrim numbers  Currently using Suite 1

  • A quick and simple solution with good old joins.

    Select
        CustomeridID.Data as CustomeridID,
        AccountID.Data as AccountID,
        EnquiryName.Data as EnquiryName,
        Source.Data as Source,
        EventDate.Data as EventDate,
        EventDays.Data as EventDays,
        EnquiryCreatedDate.Data as EnquiryCreatedDate,
        Description.Data as Description
    FROM dbo.[DataTEST] d
    join dbo.[DataTEST] CustomeridID on CustomeridID.ID = d.id + 1
    join dbo.[DataTEST] AccountID on AccountID.ID = d.id + 2
    join dbo.[DataTEST] EnquiryName on EnquiryName.ID = d.id + 3
    join dbo.[DataTEST] Source on Source.ID = d.id + 4
    join dbo.[DataTEST] EventDate on EventDate.ID = d.id + 5
    join dbo.[DataTEST] EventDays on EventDays.ID = d.id + 6
    join dbo.[DataTEST] EnquiryCreatedDate on EnquiryCreatedDate.ID = d.id + 7
    join dbo.[DataTEST] Description on Description.ID = d.id + 8
        and (d.id % 10) = 0

    Roshan Joe

    Jeff Moden - Forum Etiquette: How to post data/code on a forum to get the best help

  • joeroshan (9/21/2016)


    A quick and simple solution with good old joins.

    Select
        CustomeridID.Data as CustomeridID,
        AccountID.Data as AccountID,
        EnquiryName.Data as EnquiryName,
        Source.Data as Source,
        EventDate.Data as EventDate,
        EventDays.Data as EventDays,
        EnquiryCreatedDate.Data as EnquiryCreatedDate,
        Description.Data as Description
    FROM dbo.[DataTEST] d
    join dbo.[DataTEST] CustomeridID on CustomeridID.ID = d.id + 1
    join dbo.[DataTEST] AccountID on AccountID.ID = d.id + 2
    join dbo.[DataTEST] EnquiryName on EnquiryName.ID = d.id + 3
    join dbo.[DataTEST] Source on Source.ID = d.id + 4
    join dbo.[DataTEST] EventDate on EventDate.ID = d.id + 5
    join dbo.[DataTEST] EventDays on EventDays.ID = d.id + 6
    join dbo.[DataTEST] EnquiryCreatedDate on EnquiryCreatedDate.ID = d.id + 7
    join dbo.[DataTEST] Description on Description.ID = d.id + 8
        and (d.id % 10) = 0

    Careful there, this is very inefficient: the plan has multiple table scans, (number of columns + 1), for a total of 9 scans in your query.

    😎

  • Can you start by sharing a sample of the file? It might be possible to do it during the import, but I can't be sure without a sample.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Eirikur Eiriksson (9/21/2016)


    Careful there, this is very inefficient: the plan has multiple table scans, (number of columns + 1), for a total of 9 scans in your query.

    😎

    This can be made sargable with a clustered index on ID:

    Create clustered index IX_DataTESTID on DataTEST(ID)

    However, the modulo operator would still need a scan.
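    If that scan ever mattered at scale, one possible workaround (purely a sketch of my own, not from this thread, and it assumes the anchor rows always land on IDs where ID % 10 = 0) is to persist the modulo in a computed column and index it:

    ```sql
    -- Hypothetical: persist (ID % 10) so a filter such as WHERE d.IdMod10 = 0
    -- can use an index seek instead of evaluating d.id % 10 on every row.
    ALTER TABLE dbo.DataTEST ADD IdMod10 AS (ID % 10) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_DataTEST_IdMod10
        ON dbo.DataTEST (IdMod10) INCLUDE (Data);
    ```

    The join query would then filter on d.IdMod10 = 0 instead of (d.id % 10) = 0.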

    Roshan Joe

  • Thanks for all your responses, guys.

    Regarding the multiple-join approach, IO is one of the things that we try our hardest to limit as a development team, and having reviewed Eirikur Eiriksson's solution, I found it to work best with the full dataset.

    That being said, I'm humbled by everyone's positive responses and contributions.

  • 😎

    Multiple joins need not mean bad performance. I tried with a larger test set generated by executing:

    INSERT INTO dbo.[DataTEST] (Data)
    Select data from dbo.[DataTEST] where ID > 9 order by id
    GO 15

    With 65,000 rows the join was faster, especially after creating the clustered index.

    All said, I'm glad that your problem is solved.

    Let us know if you face any challenges later.

    Roshan Joe

  • waxb18 (9/21/2016)


    Morning All,

    I've received a flat file which needs to be processed into a SQL table. The only problem is that the file has no column delimiters, so when imported it appears as a single column.

    The only identifier that I have for each row is that its first line begins with "B12C....". The other issue is that the number of columns varies from row to row.

    Below is a sample dataset:

    CREATE TABLE [dbo].[DataTEST](
        [ID] [int] IDENTITY(1,1) NOT NULL,
        Data [varchar](255) NULL
    ) ON [PRIMARY]

    INSERT INTO dbo.[DataTEST]
        (Data)
    VALUES
        ('TaskID'),
        ('CustomeridID'),
        ('AccountID'),
        ('EnquiryName'),
        ('Source'),
        ('EventDate'),
        ('EventDays'),
        ('EnquiryCreatedDate'),
        ('Description'),
        ('b12c0000008gowM'),
        ('003b000001DolFS'),
        ('001b000003KtIjG'),
        ('Conference and Team building'),
        ('Word of Mouth'),
        ('08/07/2016'),
        ('0'),
        ('31/03/2016'),
        ('8th or 14th. £57ddr'),
        ('Edwards £3500 or Smithson £1500'),
        ('b12c0000008gomb'),
        ('003b000001STRE'),
        ('004r000004tes5'),
        ('Flight Training'),
        ('Dave Gorman'),
        ('08/09/2016'),
        ('0'),
        ('12/04/2016'),
        ('Dave to confrim numbers'),
        ('Currently using Suite 1')

    Why are there only 9 heading rows when there are 10 rows of data in each group? I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    It doesn't look like there's any PII in that data. Would it be possible to attach the file? If you compress it, please use a ZIP file instead of RAR or any other compression method. This approach would also limit IO to the bare minimum, occurring only during the import, and wouldn't be more than what your current import incurs.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/21/2016)


    I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    One should be really good at a tool to know when not to use it :-D. It's always great learning from you, Jeff.

    Roshan Joe

  • joeroshan (9/22/2016)


    Jeff Moden (9/21/2016)


    I ask because the easiest and absolutely fastest way to import this data would be to create a BCP format file with line sensitivity. It would eliminate the need for pivoting or any other T-SQL prestidigitation.

    One should be really good at a tool to know when not to use it :-D. It's always great learning from you, Jeff.

    Thanks, Joe.

    Just to explain a bit more: if the number of header lines consistently equals the number of data lines, and the data lines are consistently in the same order as the header lines, then the file actually does have delimiters, in the form of the end-of-line characters on each line. That can easily be defined in a BCP format file, and the "pivot" will auto-magically be accomplished during the input. BCP/BULK INSERT is NOT relegated to 1 line per row.

    If the file (as sometimes happens) has a blank line between the groups of items, that can easily be incorporated into the BCP format file as the delimiter for the last "field".
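    To illustrate (only a sketch: the field names and 255-byte lengths are taken from the sample data above, the tenth column name is made up since only nine headers were posted, and the real file's line endings would need checking), a non-XML BCP format file that maps one line per field, with the blank line between groups folded into the last field's terminator, might look like:

    ```
    12.0
    10
    1   SQLCHAR  0  255  "\r\n"      1   TaskID              SQL_Latin1_General_CP1_CI_AS
    2   SQLCHAR  0  255  "\r\n"      2   CustomeridID        SQL_Latin1_General_CP1_CI_AS
    3   SQLCHAR  0  255  "\r\n"      3   AccountID           SQL_Latin1_General_CP1_CI_AS
    4   SQLCHAR  0  255  "\r\n"      4   EnquiryName         SQL_Latin1_General_CP1_CI_AS
    5   SQLCHAR  0  255  "\r\n"      5   Source              SQL_Latin1_General_CP1_CI_AS
    6   SQLCHAR  0  255  "\r\n"      6   EventDate           SQL_Latin1_General_CP1_CI_AS
    7   SQLCHAR  0  255  "\r\n"      7   EventDays           SQL_Latin1_General_CP1_CI_AS
    8   SQLCHAR  0  255  "\r\n"      8   EnquiryCreatedDate  SQL_Latin1_General_CP1_CI_AS
    9   SQLCHAR  0  255  "\r\n"      9   Description         SQL_Latin1_General_CP1_CI_AS
    10  SQLCHAR  0  255  "\r\n\r\n"  10  Comments            SQL_Latin1_General_CP1_CI_AS
    ```

    It could then be fed to bcp or BULK INSERT, e.g. BULK INSERT dbo.Enquiries FROM 'C:\import\enquiries.txt' WITH (FORMATFILE = 'C:\import\enquiries.fmt'); (table name and paths are hypothetical).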

    I frequently do the same with the output of PowerShell, because the column widths are unreliable when the data is delivered in a horizontal format, but the number of items per group and the order of the items are very reliable when it is returned in a vertical format. A simple split to remove the embedded column header on each line afterwards (I normally import to a table that has a persisted computed column for each item) makes it instantly complete on import. You can even do a bit of data-typing in such computed columns.
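    As a sketch of that persisted-computed-column idea (the table name, column names, and the 'Header : value' line shape are hypothetical, not taken from a real import):

    ```sql
    -- Hypothetical staging table: each imported line is assumed to look like
    -- 'EventDate : 08/07/2016'. The persisted computed column strips off the
    -- embedded header as the row arrives, so the value is complete on import.
    CREATE TABLE dbo.PoSh_Staging (
        RawLine   varchar(255) NULL,
        ItemValue AS LTRIM(SUBSTRING(RawLine, CHARINDEX(':', RawLine) + 1, 255)) PERSISTED
    );
    ```

    A CAST or TRY_CONVERT around the SUBSTRING would be where the data-typing Jeff mentions could happen.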

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks a lot for the detailed explanation, Jeff. Looks like I need to play around with some flat files this weekend. Hope waxb18 is still watching the thread and will benefit from your suggestion.

    Roshan Joe
