Taking a single 80 character piece of Text and converting to columns in a table.

  • Thom A wrote:

    I don't really feel like I should be defending myself here, but 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' is 57 characters long, the length of all those columns is 80 characters. The OP never actually said (at least prior to my reply Jeff quoted)) that it was fixed width, and supplying a 57 character length string, and saying that they have an 80 character length one didn't help that.

    This information was drip fed, and had to be worked out through intuition/guess work, and it certainly wasn't obvious when I made the reply that Jeff quoted; which was my first reply on the thread.

    The OP does, in a later reply, say that the value is actually 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D                  P', however, that string is also only 74 characters long; though helps identify that we are likely more on the path of fixed width.

    I think the file is the CIF "Timetable Information Data Feed Interface Specification"

    Specifications here. https://www.raildeliverygroup.com/files/Publications/services/rsp/RSPS5046_timetable_information_data_feed_interface_specification.pdf

    The OP will have to split lines differently depending on the "Record Identity" (first two characters of each line).

    It's probably a job best performed by an external program, for example, .NET FileIO has a FixedWidth file reader which when given the format will automatically split a line into an array containing all the required substrings. It shouldn't take more than a couple of minutes to read in the file, then added to that would be the time taken to insert 8.4 million rows into the database, which if done in bulk inserts shouldn't take more than a few minutes.

  • The original question was how to import the data without having to use SUBSTRING to do it.  The answer there is to use a BCP Format File along with BULK INSERT.

    The rest of it is extra.  As we now know, the OP is trying to distribute the data found in a single file to multiple hopefully normalized tables and we can't help there because we simply don't know enough about those tables.  Like you, though, I'd be interested in what the OP changed to speed up his existing process.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry... Missed Jonathan's post, which appears to be spot on.  Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it?  Is this something you've dealt with before?

    And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file.  I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting.  Sounds like a BCP Format File on steroids.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    Sorry... Missed Jonathan's post, which appears to be spot on.  Jonathan, how did you figure out it was a rail schedule and where to get the documentation on it?  Is this something you've dealt with before?

    And, I agree... a simple BCP format file isn't going to work so well for this because of the varying unit record layouts within the same file.  I'm not sure that I'd write external code for it, but that fixed width read DOES sound mighty interesting.  Sounds like a BCP Format File on steroids. 

    What gave it away was columns like trainuid, trainservicecode. I used Google search to find the specifications. I have seen and used CIF files before. CIF format is historically the main way timetable information was distributed across the British railways.

  • After 50 replies, I believe forum rules officially declare this a post hijack 😉

     

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Here's an XML approach.

    declare @t table 
    (
    RowID int identity,
    line varchar(80)
    )

    insert into @t (line)
    select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
    select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'

    ;
    with
    pd as
    (
    select
    row_number() over (partition by x.RowID order by (select (null))) as colID,
    x.RowID,
    ca.val
    from
    (
    select
    RowID,
    cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
    from
    @t as t
    )
    as x
    cross apply
    (
    select
    xrow.D.value('.', 'varchar(max)') as val
    from
    x.xmlCol.nodes('X') as xrow(D)
    )
    as ca
    )
    select
    *
    from
    pd
    pivot
    (
    max(val)
    for
    colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
    )
    as p
  • ariski 64798 wrote:

    Here's an XML approach.

    declare @t table 
    (
    RowID int identity,
    line varchar(80)
    )

    insert into @t (line)
    select 'BSNC001481905191912080000001 PEE5P01 124650005 EMU 090D P' union
    select 'BSNC002647745191912080000001 PTE5P02 146650455 RTY 094D Q'

    ;
    with
    pd as
    (
    select
    row_number() over (partition by x.RowID order by (select (null))) as colID,
    x.RowID,
    ca.val
    from
    (
    select
    RowID,
    cast('<X>' + replace(t.line, ' ', '</X><X>') + '</X>' as xml) as xmlCol
    from
    @t as t
    )
    as x
    cross apply
    (
    select
    xrow.D.value('.', 'varchar(max)') as val
    from
    x.xmlCol.nodes('X') as xrow(D)
    )
    as ca
    )
    select
    *
    from
    pd
    pivot
    (
    max(val)
    for
    colID in ([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
    )
    as p

    I think you missed the main conversation, the OP is not using he spaces as seperators. its a fixed length string that needs to be broken up in chunks based on character position

    but I do like some of the ideas in your code - a little over engineered, but I can see uses for it

    MVDBA

  • The XML approach is one of the slowest mostly because of "ExpAnsive Concatenation".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to one and all for your replies

    I have mange to re-code my script and able to import the 8.4 million rows into their respective tables in 6 - 7 minutes, but it still uses substring to place the data into their tables. So the real problems was down to IO throughput rather than the substring processing.

     

     

  • Andrew, can you confirm that it is the railway schedule CIF file(s) you're working with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, yes it is the railway schedule data

  • Thanks, Andrew.  Sounds like you've got is sussed.  What did you do to decrease the I/O problems that you were having?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    basically I generated a work table with the basic information where the datasets begin and end and used direct insert into <table> select substring(fieldname,1,2) as recordident from basetable where recordid = 'BS' and between xxx and yyyy which are keyid value generated when the data is added to basetable. Theres also a index for the recordident  to help things along.

     

    Hope fully that made sense !!!

     

    Andrew

  • It does, indeed.  Even if you're only extracting the "BS" record types, it might make sense to split out the first two characters as the record type during import using a BCP Format File and, of course, import the other 78 characters into a CHAR(78) column for the other processing you do.  That might also speed up your identification of where each dataset begins and ends because it'll be an indexable column.  You could skip the BCP Format File if you added a persisted computed column to extract the record type and simply inserted into a view that pointed to your staging table to not include that computed column or your dataset "identifier" column(s) that you create.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 14 posts - 61 through 73 (of 73 total)

You must be logged in to reply to this topic. Login to reply