Parse String By Column Position

  • Greetings,

    I am looking for a way to simplify chopping up a fixed width string into its individual fields. I realize I can just use SUBSTRING() to get each part, but I have a feeling there may be a better way. An example of what I have and want I need is this:

    [font="Courier New"]Input file:

    20150801 N6743 Doe John 23.56 M

    20150802 Y6534 Doe Jane 26.32 F

    Output File:

    * Record 1 *

    Entry_Date : 2015-08-01 -- Positions 1 - 8

    Prior_Cust : N -- Position 11 - 11

    Cust_ID : 6743 -- Position 12 - 15

    LastName : Doe -- Position 17 - 26

    FirstName : John -- Position 27 - 36

    Amt_Paid : 23.56 -- Position 37 - 42

    Reserved : -- Position 43 - 47

    Gender : M -- Position 48 - 48

    * Record 2 *

    Entry_Date : 2015-08-02 -- Positions 1 - 8

    Prior_Cust : Y -- Position 11 - 11

    Cust_ID : 6534 -- Position 12 - 15

    LastName : Doe -- Position 17 - 26

    FirstName : Jane -- Position 27 - 36

    Amt_Paid : 26.32 -- Position 37 - 42

    Reserved : -- Position 43 - 47

    Gender : F -- Position 48 - 48

    [/font]

    My thought that maybe a XML schema could be used to parse the single record line into the individual fields. This could even be a 2+ step approach. One part to split the string into smaller strings and then other parts to convert the smaller strings into their respective types.

    To add to the complexity, the files would have a single header and total record. The good part is all 3 records have an indicator in the same location for identification. If there is an easy way to do this, then would there also be a way to perform the opposite? Take a single record and return it as a single fixed position string?

    Thank you for your time.

    Terry Steadman

  • Terry,

    before you do much of anything, I would recommend reading Jeff Moden's article[/url] (and borrow his code) for his DelimitedSplit8K TVF. It will split your data up really easily...

    Here's my imperfect example:

    SELECT *

    FROM

    (SELECT '20150801 N6743 Doe John 23.56 M' As Data

    UNION ALL

    SELECT '20150802 Y6534 Doe Jane 26.32 F') x

    CROSS APPLY dbo.DelimitedSplit8K(Data,' ');

    You could use the function to split your data into a new table, and then deal with the data that way.

  • Greetings,

    Yes, his splitters are really good. But, in this case, you can't split the string on a space. The post had taken out all of the extra spaces in the example which made it appear to be able to split by a single space, but if you re-expand it, then you can see it wouldn't work.

    I replaced the spaces with a period to show you how the file may look like.

    20150801..N6743.Doe.......John.......23.56.....M

    20150802..Y6534.Doe.......Jane.......26.32.....F

    It really needs to get the fields based on the column position, not by a delimiter. If you try splitting by a space, then you could get a lot of records. You could try collapsing the spaces groups into a single space each but it would also fail if any of the field column positions were blank (missing data).

    Thank you for the reply.

  • For fixed width, SUBSTRING is the way to go. To reverse the process will be more difficult as you're creating different types of rows that would need validation when importing from flat files.

    SELECT '* Record ' +CAST( ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS varchar(5)) + ' *

    Entry_Date : ' + SUBSTRING(Data, 1,4) + '-' + SUBSTRING(Data, 5,2) + '-' + SUBSTRING(Data, 7,2) + '

    Prior_Cust : ' + SUBSTRING(Data, 11,1) + '

    Cust_ID : ' + SUBSTRING(Data, 12,5) + '

    LastName : ' + SUBSTRING(Data, 17,10) + '

    FirstName : ' + SUBSTRING(Data, 27,10) + '

    Amt_Paid : ' + SUBSTRING(Data, 37,6) + '

    Reserved : ' + SUBSTRING(Data, 43,5) + '

    Gender : ' + SUBSTRING(Data, 48,1) + '

    '

    FROM

    (SELECT '20150801 N6743 Doe John 23.56 M' As Data

    UNION ALL

    SELECT '20150802 Y6534 Doe Jane 26.32 F') x

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • terrance.steadman (8/5/2015)


    Greetings,

    I am looking for a way to simplify chopping up a fixed width string into its individual fields. I realize I can just use SUBSTRING() to get each part, but I have a feeling there may be a better way. An example of what I have and want I need is this:

    [font="Courier New"]Input file:

    20150801 N6743 Doe John 23.56 M

    20150802 Y6534 Doe Jane 26.32 F

    Output File:

    * Record 1 *

    Entry_Date : 2015-08-01 -- Positions 1 - 8

    Prior_Cust : N -- Position 11 - 11

    Cust_ID : 6743 -- Position 12 - 15

    LastName : Doe -- Position 17 - 26

    FirstName : John -- Position 27 - 36

    Amt_Paid : 23.56 -- Position 37 - 42

    Reserved : -- Position 43 - 47

    Gender : M -- Position 48 - 48

    * Record 2 *

    Entry_Date : 2015-08-02 -- Positions 1 - 8

    Prior_Cust : Y -- Position 11 - 11

    Cust_ID : 6534 -- Position 12 - 15

    LastName : Doe -- Position 17 - 26

    FirstName : Jane -- Position 27 - 36

    Amt_Paid : 26.32 -- Position 37 - 42

    Reserved : -- Position 43 - 47

    Gender : F -- Position 48 - 48

    [/font]

    My thought that maybe a XML schema could be used to parse the single record line into the individual fields. This could even be a 2+ step approach. One part to split the string into smaller strings and then other parts to convert the smaller strings into their respective types.

    To add to the complexity, the files would have a single header and total record. The good part is all 3 records have an indicator in the same location for identification. If there is an easy way to do this, then would there also be a way to perform the opposite? Take a single record and return it as a single fixed position string?

    Thank you for your time.

    Terry Steadman

    I have to respectfully disagree with everyone that has responded, so far.

    Since the data originates in a file and the fields in the file are, in fact, fixed width with no delimiters, the absolute best thing to do would be to create a BCP format file to identify the fixed width of each column and then use BULK INSERT to bring the data in already parsed. Nothing else will beat it for performance or simplicity.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I agree with Jeff on this one, nothing is going to be faster than bcp, are all lines in the incoming file the same format/layout?

    😎

    Now just for fun, here is a simple method of parsing a fixed width file, just in case you already got the content in a table.

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo..#PARSE_FIXED') IS NOT NULL DROP TABLE #PARSE_FIXED;

    GO

    SELECT

    X.LNO

    ,X.IMP_STR

    INTO #PARSE_FIXED

    FROM

    (

    VALUES

    ( 1, '20150801 N6743 Doe John 23.56 M')

    ,( 2, '20150802 Y6534 Doe Jane 26.32 F')

    ) AS X(LNO,IMP_STR)

    DECLARE @FORMAT TABLE

    (

    COL_NO INT NOT NULL PRIMARY KEY CLUSTERED

    ,COL_START INT NOT NULL

    ,COL_LEN INT NOT NULL

    ,COL_CAPTION VARCHAR(20) NOT NULL

    );

    INSERT INTO @FORMAT

    (

    COL_NO

    ,COL_START

    ,COL_LEN

    ,COL_CAPTION

    )

    VALUES

    (1, 1,8,'Entry_Date')

    ,(2,10,1,'Prior_Cust')

    ,(3,11,4,'Cust_ID' )

    ,(4,16,4,'LastName' )

    ,(5,20,4,'FirstName' )

    ,(6,25,5,'Amt_Paid' )

    ,(7,30,1,'Reserved' )

    ,(8,31,1,'Gender' );

    SELECT

    PF.LNO

    ,FM.COL_CAPTION

    ,SUBSTRING(PF.IMP_STR,FM.COL_START,FM.COL_LEN)

    FROM @FORMAT FM

    CROSS APPLY #PARSE_FIXED PF

    ;

    Output

    LNO COL_CAPTION

    ---- ------------ ---------

    1 Entry_Date 20150801

    1 Prior_Cust N

    1 Cust_ID 6743

    1 LastName Doe

    1 FirstName John

    1 Amt_Paid 23.56

    1 Reserved

    1 Gender M

    2 Entry_Date 20150802

    2 Prior_Cust Y

    2 Cust_ID 6534

    2 LastName Doe

    2 FirstName Jane

    2 Amt_Paid 26.32

    2 Reserved

    2 Gender F

  • Greetings All,

    Thank you for the current replies. I had a feeling that the way Jeff had said would end up being the primary solution. Still, when running tests with a script, it would have been nice to have a way to keep the test file within the database in a format I could adjust and rip apart as needed.

    I had also thought about using some method to check the files that we create to see if the data is going into the correct columns as necessary. So, basically, being able to convert from a single string into fields and from fields into a single string.

    I guess I was considering how things used to be when memory wasn't protected and variables were nothing more than memory address locations with an expected length. In languages like Assembly, you could define a variable that was 3,000 characters long to hold your single string. Then, define a bunch of variables inside the same memory location. Each one referencing the slice for it's own variable value. This also meant converting from a single string to multiple fields, and vice - versa was rather fast. You just write the one big field and the others were already populated at the same time. You could adjust 1 of the distinct fields and the single line record was adjusted at the same time.

    Aahh. The simplicity of strings in the past (* oops, my age is showing *).

    On a more serious note, I had considered that the FOR XML abilities might have contained the ability to either rip apart, or put together a single string from multiple string fields. Quite possibly with the format file as suggested by Jeff.

    Thank you for your time.

  • terrance.steadman (8/6/2015)


    Greetings All,

    Thank you for the current replies. I had a feeling that the way Jeff had said would end up being the primary solution. Still, when running tests with a script, it would have been nice to have a way to keep the test file within the database in a format I could adjust and rip apart as needed.

    I had also thought about using some method to check the files that we create to see if the data is going into the correct columns as necessary. So, basically, being able to convert from a single string into fields and from fields into a single string.

    I guess I was considering how things used to be when memory wasn't protected and variables were nothing more than memory address locations with an expected length. In languages like Assembly, you could define a variable that was 3,000 characters long to hold your single string. Then, define a bunch of variables inside the same memory location. Each one referencing the slice for it's own variable value. This also meant converting from a single string to multiple fields, and vice - versa was rather fast. You just write the one big field and the others were already populated at the same time. You could adjust 1 of the distinct fields and the single line record was adjusted at the same time.

    Aahh. The simplicity of strings in the past (* oops, my age is showing *).

    On a more serious note, I had considered that the FOR XML abilities might have contained the ability to either rip apart, or put together a single string from multiple string fields. Quite possibly with the format file as suggested by Jeff.

    Thank you for your time.

    If you create the right kind of BCP format file, the right kind of staging table, and set BULK INSERT to sequester any errors, you'll have all the things you say. I strong recommend you lookup "BCP Format File" and "BULK INSERT". It's code that works the "old way" (nasty fast, easy to use, always works when setup correctly) but without having to actually write loops.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply