Splitting Table Column on Delimiter

  • I would swear I've done this before, but I'm drawing a blank today.

    I have a table with lots of rows; one column of data contains 6 elements delimited by the pipe character.

    Instead of one field, I need the table split by the delimiter. I think the solution involves a sub-select, but I'll be darned if I can remember how today.

    Here's some example data, and what I was getting using a tally table.

    I'm looking for results similar to this:

    [font="Courier New"] col1 col2 ol3 col4 col5 col6

    1000000 Timmerman Pond Dam Y NULL NULL U.S. Army Corps blah blah.....[/font]

    [font="Courier New"]

    --tally table if needed:

    --===== Create and populate the Tally table on the fly

    SELECT TOP 11000 --equates to more than 30 years of dates        

      IDENTITY(INT,1,1) AS N  

    INTO dbo.Tally  

    FROM MASTER.dbo.SysColumns sc1, MASTER.dbo.SysColumns sc2

    --===== Add a Primary Key to maximize performance  

    ALTER TABLE dbo.Tally    

      ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    CREATE TABLE rawdata(rawinput VARCHAR(8000))

    INSERT INTO rawdata

    SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers.  Dams ....' UNION

    SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers.  Dams and Reservoirs List,  Washington, DC, 1981. ' UNION

    SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers.  Dams and Reservoirs List,  Washington, DC, 1981. ' UNION

    SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers.  Dams and Reservoirs List,  Washington, DC, 1981. ' UNION

    SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers.  Dams and Reservoirs List,  Washington, DC, 1981. '

    --assuming you have a tally table, this returns each element as a row, but I'm looking for a table with 6 columns

    SELECT t.n,
           SUBSTRING(yt.rawinput + '|', t.n, CHARINDEX('|', yt.rawinput + '|', t.n) - t.n) AS element
    FROM rawdata AS yt,
         dbo.Tally t
    WHERE t.n <= LEN(yt.rawinput)
      AND SUBSTRING('|' + yt.rawinput, t.n, 1) = '|'
      AND DATALENGTH(rawinput) - DATALENGTH(REPLACE(rawinput, '|', '')) = 5

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

  • Probably the simplest solution is to use PIVOT after you break it up into rows. Otherwise, it's one sub-query per column.
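
    A minimal sketch of the PIVOT variant, assuming the split has already produced one row per element with a row key and a 0-5 position (the RowNum/ColNum/String shape that appears later in this thread; dbo.SplitRows is a hypothetical name for wherever those rows live):

    -- hypothetical: dbo.SplitRows(RowNum, ColNum, String) holds one row per element,
    -- with ColNum running 0 through 5 within each original row
    SELECT RowNum,
           [0] AS Col1, [1] AS Col2, [2] AS Col3,
           [3] AS Col4, [4] AS Col5, [5] AS Col6
    FROM (SELECT RowNum, ColNum, String FROM dbo.SplitRows) AS src
    PIVOT (MAX(String) FOR ColNum IN ([0], [1], [2], [3], [4], [5])) AS p;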

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • If you can be sure your data has no < or & symbols (or can live with a character swap), the following approach will work. You can use any function that splits a string into a table set.

    -- function to split list by delimiter

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    create function [dbo].[fListToText]( @list varchar(max), @delim varchar(max) = ',')

    returns @returnTable table

    ( item varchar(255) not null )

    as begin

    declare @xml XML

    set @xml = '<item>' + REPLACE(@list, @delim, '</item><item>') + '</item>'

    insert into @returnTable

    SELECT data.item.value('.','varchar(255)')

    FROM @xml.nodes('//item') as data(item)

    return

    end

    go

    select rawinput,

    min(case when seq % 6 = 0 then item else null end) as field1,

    min(case when seq % 6 = 1 then item else null end) as field2,

    min(case when seq % 6 = 2 then item else null end) as field3,

    min(case when seq % 6 = 3 then item else null end) as field4,

    min(case when seq % 6 = 4 then item else null end) as field5,

    min(case when seq % 6 = 5 then item else null end) as field6

    from

    (

    -- zero-base the sequence so seq % 6 = 0 lines up with field1
    select R.rawinput, f.item, row_number() over ( order by R.rawinput ) - 1 as seq

    from rawdata R cross apply dbo.fListToText( R.rawinput, '|' ) as f ) as data

    group by rawinput

  • That's what I was looking for, Antonio; I remember using a different kind of split function for the same type of results. It's the CROSS APPLY and the ROW_NUMBER() OVER that I forgot about using. Too much SQL 2000, not enough 2005.

    Thank you!

    Lowell



  • CROSS APPLY = CORRELATED SUBQUERY

    CORRELATED SUBQUERY = RBAR

    RBAR + UDF + TABLE VARIABLE = S....L....O....W

    Heh... Lowell!!! :blink: You of all people should know that the Tally Table method will blow the doors off every other split there is! Just on 5 rows, the XML UDF method takes about 7 times longer than the Tally Table... try this out, buddy... it also happens to give you the exact result set you asked for. No extra columns... no data out of order...

    CREATE TABLE rawdata(rawinput VARCHAR(8000))

    INSERT INTO rawdata

    SELECT '1000000|Timmerman Pond Dam|Y|||U.S. Army Corps of Engineers. Dams ....' UNION

    SELECT '1000001|Thomas Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION

    SELECT '1000002|Thomas Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION

    SELECT '1000003|Walsh Fishing Lake|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. ' UNION

    SELECT '1000004|Walsh Fishing Lake Dam|Y|||U.S. Army Corps of Engineers. Dams and Reservoirs List, Washington, DC, 1981. '

    ;WITH cteMySplit AS

    (

    SELECT DENSE_RANK() OVER (ORDER BY RawInput) AS RowNum,

    ROW_NUMBER() OVER (PARTITION BY RawInput ORDER BY t.N)-1 AS ColNum,

    NULLIF(SUBSTRING('|'+h.RawInput+'|', t.N+1, CHARINDEX('|', '|'+h.RawInput+'|', t.N+1)-t.N-1),'') AS String

    FROM dbo.Tally t,

    dbo.RawData h

    WHERE SUBSTRING('|'+h.RawInput+'|', t.N, 1) = '|'

    AND t.N < LEN('|'+h.RawInput)

    )

    SELECT MAX(CASE WHEN ColNum % 6 = 0 THEN String ELSE NULL END) AS Col1,

    MAX(CASE WHEN ColNum % 6 = 1 THEN String ELSE NULL END) AS Col2,

    MAX(CASE WHEN ColNum % 6 = 2 THEN String ELSE NULL END) AS Col3,

    MAX(CASE WHEN ColNum % 6 = 3 THEN String ELSE NULL END) AS Col4,

    MAX(CASE WHEN ColNum % 6 = 4 THEN String ELSE NULL END) AS Col5,

    MAX(CASE WHEN ColNum % 6 = 5 THEN String ELSE NULL END) AS Col6

    FROM cteMySplit

    GROUP BY RowNum

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm putting this through its paces now, Jeff.

    The data I'm trying to parse is some FIPS 55 data with every place name in the United States; 560 MB of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one. Thanks.

    Lowell



  • Why not just create a 'real' table, bcp out the rawdata, and then bcp that back in with a pipe (|) delimiter to the 'real' table?
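
    Something along these lines, perhaps (a rough sketch only: the database, table, server, and path names are assumptions, and it presumes xp_cmdshell is enabled and a trusted connection works):

    -- export the single-column staging table as plain character data
    EXEC master..xp_cmdshell 'bcp MyDb.dbo.rawdata out C:\temp\rawdata.txt -c -T -S MyServer';

    -- re-import the same file into the six-column "real" table, splitting on the pipe
    EXEC master..xp_cmdshell 'bcp MyDb.dbo.ParsedData in C:\temp\rawdata.txt -c -t "|" -T -S MyServer';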

  • Spot on, Antonio... the real key is that the data should never have come into the database in an unsplit fashion to begin with. They should use Bulk Insert or BCP to load the raw data.
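
    A minimal BULK INSERT sketch, assuming the six-column destination table already exists (the table and file names are illustrative):

    -- load the pipe-delimited file straight into the destination table,
    -- skipping the single-column staging step entirely
    BULK INSERT dbo.ParsedData
    FROM 'C:\import\placenames.txt'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);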

    --Jeff Moden



  • Lowell (4/1/2008)


    I'm putting this through its paces now, Jeff.

    The data I'm trying to parse is some FIPS 55 data with every place name in the United States; 560 MB of data, so the SQL is taking a while to run (10+ minutes on a dev server). I couldn't get my mental arms around this one. Thanks.

    Then I agree with Antonio... the data should not be in this condition in the database to begin with... it should be loaded from a file using Bulk Insert or BCP with "|" as the delimiter. Bulk Insert will load 5.1 million 20-column rows in 60 seconds.

    --Jeff Moden



  • Boy, I really agree; it's a stupid project requirement that all "raw data" be imported from text into the SQL Server database, and the data massaged from there.

    The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing, without needing my machine and its specific paths to the text files... in case something wasn't right or needed adaptation.

    Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.

    Lowell



  • Then you should be able to import the same file into both the "rawdata" table and the 'real/parsed' table in the same load script. (You'll need a separate fmt file to skip fields and handle different field terminators; a sketch follows.)
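
    Purely illustrative, a minimal non-XML format file of the kind meant here, assuming a SQL Server 2005 target (hence the 9.0 version line) with six varchar columns; the lengths and column names are made up, and setting a field's server column order to 0 is how you would skip it:

    9.0
    6
    1  SQLCHAR  0  10    "|"      1  Col1  ""
    2  SQLCHAR  0  100   "|"      2  Col2  ""
    3  SQLCHAR  0  1     "|"      3  Col3  ""
    4  SQLCHAR  0  50    "|"      4  Col4  ""
    5  SQLCHAR  0  50    "|"      5  Col5  ""
    6  SQLCHAR  0  8000  "\r\n"   6  Col6  ""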

    If the load scripts can't be modified, you can write an extended stored proc that determines where the temp path is, exports "rawdata" to it, calls bcp to import that file, and then deletes it.

    Convoluted? Yes. But that's the problem with stupid 'requirements'. Don't get me started on that subject again.

  • Lowell (4/1/2008)


    Boy, I really agree; it's a stupid project requirement that all "raw data" be imported from text into the SQL Server database, and the data massaged from there.

    The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing, without needing my machine and its specific paths to the text files... in case something wasn't right or needed adaptation.

    Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.

    Lowell... Can you attach the first, say, 100 rows or so as a file to your next post? Dunno about Antonio, but I wouldn't mind taking a crack at it.

    Also, can you outline the (macro) steps you take to do the import... I think we might be able to make a couple of suggestions there, as well.

    --Jeff Moden



  • p.s. The steps you take to get the raw data file from wherever should also be included. Of course, no username/passwords, please.

    --Jeff Moden



  • Lowell (4/1/2008)


    Boy, I really agree; it's a stupid project requirement that all "raw data" be imported from text into the SQL Server database, and the data massaged from there.

    The theory is that if I got struck by lightning, anyone could re-run any step in the scripts to duplicate the work I'm doing, without needing my machine and its specific paths to the text files... in case something wasn't right or needed adaptation.

    Which really makes no sense, since the script to bulk insert the raw data is path-specific anyway; I had to bulk import it on a tilde (~) separator that I knew didn't exist in the data, instead of the pipe character.

    I guess I don't have the same definition of "massage" (I know - that sounds a bit like Bill Clinton...:hehe: ) as your project folks. I wouldn't consider parsing things into separate fields "massaging". Now - data cleanup, reformatting so that you can cast it - THAT's "massaging" in my book.

    Where are you firing the script from? How are you firing it? There are rather straightforward ways to fire off such scripts with the UNC path as a variable. If you did it in DTS/SSIS, you could also be running it under credentials that have access to that path, which would also help with the external security.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • To add to what Matt just said... if the server logs in as a domain super-user that can see all the machines and at least some "shares", using a little dynamic SQL in conjunction with BULK INSERT will blow the doors off of any other type of import. If you have the luxury of being able to use BCP, BCP will find bad rows and store them in a "bad file" so you can repair them later without having the main import fail. It's not quite as fast as BULK INSERT but is still way faster than DTS. Rumor has it that SSIS also has a "Bulk Insert" method, but I don't know if it'll check for bad rows "auto-magically".
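
    A quick sketch of that dynamic SQL approach, with the UNC path held in a variable (the path and table name are assumptions):

    -- build the BULK INSERT statement with the path supplied at run time
    DECLARE @Path VARCHAR(260), @SQL VARCHAR(8000);
    SET @Path = '\\FileServer\ImportShare\placenames.txt';
    SET @SQL = 'BULK INSERT dbo.ParsedData FROM ''' + @Path + ''''
             + ' WITH (FIELDTERMINATOR = ''|'', ROWTERMINATOR = ''\n'', TABLOCK)';
    EXEC (@SQL);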

    --Jeff Moden


