Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • thanks for the Response Sean.  attaching sample data

    Attachments:
    You must be logged in to view attached files.
  • tellery_frmt wrote:

    thanks for the Response Sean.  attaching sample data

    No offense mate but not going to pull down an excel document to load up some sample data. Post it as text and I will help.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • here is the text versions:  fulltxtstring is what i start with;  parsedtxtstrings is the required split results

     

    thanks much for your help.

     

     

    Attachments:
    You must be logged in to view attached files.
  • Sean Lange wrote:

    tellery_frmt wrote:

    thanks for the Response Sean.  attaching sample data

    No offense mate but not going to pull down an excel document to load up some sample data. Post it as text and I will help.

    I would recommend opening a new question for this instead of tacking onto this thread.  Other individuals may not see this or might ignore it

     

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • The issue with parsing is there are 6 different delimiters (including 'Supervisor') when you really only need 1.  With 1 delimiter an ordinal splitter (which is based on a tally table) could do the job.  I tested to make sure the string didn't contain a tilde '~' then used that as the delimiter.

    declare @string         nvarchar(max)=N'|System| |Sean Colley|06/02/2015 05:28:55 PM|New entry created.|User|NTCLSC04|Sean C|06/02/2015 05:41:25 PM|recd atty rep letter from Char Rosaa Jr with Rosaxx, Xmnnnd & XXXmilia 999-999-5572 999-998-4870 fax  misc details unknown.|System| |Tiffany H|06/03/2015 07:21:13 AM| status change.|User|COCLTH03|Tiffany H|06/03/2015 07:34:48 AM|attempted contact with any office. 999-998-5572 place was closed.|User|COCLTH03|Tiffany H|07/01/2015 07:41:22 AM|attempted contact with any office. |System| |Chrystal R|10/21/2015 03:52:40 PM|Reassigned by  Chrystal  from Tiffany to John .|User|COCLJD02|John D|12/05/2015 04:47:40 PM|Police were called on 8/14/13  -sent papers to Supe for approval|User|COCLJD02|John D|01/03/2016 10:22:58 PM|EOR returned to from post officer |User|COCLJD02|John D|01/03/2016 10:40:04 PM| - including recorded statement: attempts made  Confirmed date Recommendation:   |Supervisor| |Gale D|01/04/2016 09:34:55 AM|Agree to proceed with the process.|User|COCLJD02|John D|01/05/2016 11:37:29 PM|EOR has been reissued to returned to sender|User|COCLJD02|John D|01/05/2016 11:40:57 PM|Sent notice of explanation |System| |John Difranco|01/05/2016 11:41:57 PM|Closed.';

    select dl.*
    from (values (replace(replace(replace(replace(replace(replace(@string,
    N'|Supervisor|', N'~|Supervisor|'),
    N'|Contact|', N'~|Contact|'),
    N'|Manager|', N'~|Manager|'),
    N'|Note|', N'~|Note|'),
    N'|User|', N'~|User|'),
    N'|System|', N'~|System|')
    )) v(string)
    cross apply dbo.DelimitedSplit8K_LEAD(v.string, '~') dl
    where len(trim(dl.Item))>0
    order by dl.ItemNumber;
    --where v.string not like '%~%'
    ItemNumberItem
    2|System| |Sean Colley|06/02/2015 05:28:55 PM|New entry created.
    3|User|NTCLSC04|Sean C|06/02/2015 05:41:25 PM|recd atty rep letter from Char Rosaa Jr with Rosaxx, Xmnnnd & XXXmilia 999-999-5572 999-998-4870 fax misc details unknown.
    4|System| |Tiffany H|06/03/2015 07:21:13 AM| status change.
    5|User|COCLTH03|Tiffany H|06/03/2015 07:34:48 AM|attempted contact with any office. 999-998-5572 place was closed.
    6|User|COCLTH03|Tiffany H|07/01/2015 07:41:22 AM|attempted contact with any office.
    7|System| |Chrystal R|10/21/2015 03:52:40 PM|Reassigned by Chrystal from Tiffany to John .
    8|User|COCLJD02|John D|12/05/2015 04:47:40 PM|Police were called on 8/14/13 -sent papers to Supe for approval
    9|User|COCLJD02|John D|01/03/2016 10:22:58 PM|EOR returned to from post officer
    10|User|COCLJD02|John D|01/03/2016 10:40:04 PM| - including recorded statement: attempts made Confirmed date Recommendation:
    11|Supervisor| |Gale D|01/04/2016 09:34:55 AM|Agree to proceed with the process.
    12|User|COCLJD02|John D|01/05/2016 11:37:29 PM|EOR has been reissued to returned to sender
    13|User|COCLJD02|John D|01/05/2016 11:40:57 PM|Sent notice of explanation
    14|System| |John Difranco|01/05/2016 11:41:57 PM|Closed.

     

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Thank you, Steve.   greatly appreciated!

  • There are some caveats which should've been mentioned.  The string is declared as NVARCHAR(MAX) because I don't know that that's not appropriate.  The ordinal splitter is name '%8K_LEAD' because it's targeted at up to VARCHAR(8000).  The query should use a Unicode version of the splitter.  Is there one of %_LEAD idk?

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Sean Lange wrote:

    tellery_frmt wrote:

    Hi Jeff.

    your tally table approach to parsing is the best i've seen.   i have a real world problem;  wondering if you have any thoughts.

    i need to parse a large clob/text field using VARIABLE delimiters!   each segment of the clob is delimited by one of the following 'tags':  |system| ,|user|, |note|,|manager|,|contact|  .   the end of the segment is then delimited by the NEXT occurrence of one of the delimiters .

    for example: |system| bunch of system data...|user| some data about users xyz.....|contact| more data for the contact...

    Jeff's delimiter is one of the best tools for doing what it designed for. However, Jeff's delimiter is 1000000% the wrong tool for this job. It sounds like you would have far better luck using either XML or JSON, likely utilizing CLR. Posting some consumable sample data would make this a lot easier.

    Perhaps not.  If someone split on the "|" symbol, that would create a return where odd numbered rows would contain the "tag" and the even numbered rows contain the "message" for the tag.  If we used Eirikur Eiriksson optimation for 2012 and above ( https://www.sqlservercentral.com/articles/reaping-the-benefits-of-the-window-functions-in-t-sql-2 ), it will like still beat a conversion to XML.  As someone once told me, "While it might be 3 times slower when working with LOBs, what's 3 times slower than greased lightning"?

    @tellery_frmt - My recommendation would be to start a new post an attach a file with some of the sample data you speak of so that people can give it a shot for you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • tellery_frmt wrote:

    Thank you, Steve.   greatly appreciated!

    Ah, be careful.  He may not be correct here.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Steve Collins wrote:

    There are some caveats to that should've been mentioned.  The string is declared as NVARCHAR(MAX) because I don't know that that's not appropriate.  The ordinal splitter is name '%8K_LEAD' because it's targeted at up to VARCHAR(8000).  The query should use a Unicode version of the splitter.  Is there one of %_LEAD idk?

    No but it wouldn't be difficult to make one.

    But before we even think of getting started on this, let's NOT do it here.  Let's open a separate thread and have the OP provide some reasonable sample data that mimics his actual data.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden wrote:

    But before we even think of getting started on this, let's NOT do it here.  Let's open a separate thread and have the OP provide some reasonable sample data that mimics his actual data.

    Completely agree.  I wouldn't mind seeing my prior mess (big mistake on Xmas *cough*) in this thread removed.  That would be editing of content which was actually appreciated

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Jeff Moden wrote:

    Steve Collins wrote:

    There are some caveats to that should've been mentioned.  The string is declared as NVARCHAR(MAX) because I don't know that that's not appropriate.  The ordinal splitter is name '%8K_LEAD' because it's targeted at up to VARCHAR(8000).  The query should use a Unicode version of the splitter.  Is there one of %_LEAD idk?

    No but it wouldn't be difficult to make one.

    But before we even think of getting started on this, let's NOT do it here.  Let's open a separate thread and have the OP provide some reasonable sample data that mimics his actual data.

    Steve Collins wrote:

    Jeff Moden wrote:

    But before we even think of getting started on this, let's NOT do it here.  Let's open a separate thread and have the OP provide some reasonable sample data that mimics his actual data.

    Completely agree.  I wouldn't mind seeing my prior mess (big mistake on Xmas *cough*) in this thread removed.  That would be editing of content which was actually appreciated

    I've already written an nvarchar(MAX) splitter function that can be used on SQL Server 2012 and above.

    IF OBJECT_ID('[dbo].[STRING_SPLIT]','IF') IS NULL BEGIN
    EXEC ('CREATE FUNCTION [dbo].[STRING_SPLIT] () RETURNS TABLE AS RETURN SELECT 1 X')
    END
    GO
    ALTER FUNCTION [dbo].[STRING_SPLIT]
    (
    @string nvarchar(MAX),
    @separator nvarchar(MAX)
    )
    RETURNS TABLE WITH SCHEMABINDING
    AS RETURN
    WITH X(N) AS (SELECT 'Table1' FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) T(C)),
    Y(N) AS (SELECT 'Table2' FROM X A1, X A2, X A3, X A4, X A5, X A6, X A7, X A8) , -- Up to 16^8 = 4 billion
    T(N) AS (SELECT TOP(ISNULL(LEN(@string),0)+1) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1 N FROM Y),
    Delim(Pos) AS (SELECT t.N FROM T WHERE SUBSTRING(@string, t.N, LEN(@separator+'x')-1) LIKE @separator OR t.N = 0),
    Separated(value,position) AS (SELECT SUBSTRING(@string, d.Pos + CASE WHEN Pos=0 THEN 1 ELSE LEN(@separator+'x')-1 END , LEAD(d.Pos,1,2147483647) OVER (ORDER BY (SELECT NULL)) - CASE WHEN d.Pos=0 THEN 2-LEN(@separator+'x') ELSE d.Pos END - (LEN(@separator+'x')-1)),
    d.pos
    FROM Delim d
    WHERE @string IS NOT NULL)
    SELECT LTRIM(s.value) value,
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) position
    FROM Separated s
    WHERE s.value <> @separator;
    GO

    The code to do the same as Steve Collins code but using the function above is:

    declare @string         nvarchar(max)=N'|System| |Sean Colley|06/02/2015 05:28:55 PM|New entry created.|User|NTCLSC04|Sean C|06/02/2015 05:41:25 PM|recd atty rep letter from Char Rosaa Jr with Rosaxx, Xmnnnd & XXXmilia 999-999-5572 999-998-4870 fax  misc details unknown.|System| |Tiffany H|06/03/2015 07:21:13 AM| status change.|User|COCLTH03|Tiffany H|06/03/2015 07:34:48 AM|attempted contact with any office. 999-998-5572 place was closed.|User|COCLTH03|Tiffany H|07/01/2015 07:41:22 AM|attempted contact with any office. |System| |Chrystal R|10/21/2015 03:52:40 PM|Reassigned by  Chrystal  from Tiffany to John .|User|COCLJD02|John D|12/05/2015 04:47:40 PM|Police were called on 8/14/13  -sent papers to Supe for approval|User|COCLJD02|John D|01/03/2016 10:22:58 PM|EOR returned to from post officer |User|COCLJD02|John D|01/03/2016 10:40:04 PM| - including recorded statement: attempts made  Confirmed date Recommendation:   |Supervisor| |Gale D|01/04/2016 09:34:55 AM|Agree to proceed with the process.|User|COCLJD02|John D|01/05/2016 11:37:29 PM|EOR has been reissued to returned to sender|User|COCLJD02|John D|01/05/2016 11:40:57 PM|Sent notice of explanation |System| |John Difranco|01/05/2016 11:41:57 PM|Closed.';

    select dl.position ItemNumber,dl.value Item
    from (values (replace(replace(replace(replace(replace(replace(@string,
    N'|Supervisor|', N'~|Supervisor|'),
    N'|Contact|', N'~|Contact|'),
    N'|Manager|', N'~|Manager|'),
    N'|Note|', N'~|Note|'),
    N'|User|', N'~|User|'),
    N'|System|', N'~|System|')
    )) v(string)
    cross apply [dbo].[STRING_SPLIT](v.string, '~') dl
    where len(trim(dl.value))>0
    order by dl.position;
    --where v.string not like '%~%'

    Link to the script: https://www.sqlservercentral.com/scripts/a-varcharmax-string_split-function-for-sql-2012-and-above

     

  • tellery_frmt wrote:

    Hi Jeff.

    your tally table approach to parsing is the best i've seen.   i have a real world problem;  wondering if you have any thoughts.

    i need to parse a large clob/text field using VARIABLE delimiters!   each segment of the clob is delimited by one of the following 'tags':  |system| ,|user|, |note|,|manager|,|contact|  .   the end of the segment is then delimited by the NEXT occurrence of one of the delimiters .

    for example: |system| bunch of system data...|user| some data about users xyz.....|contact| more data for the contact...

    I've looked at your sample data and this is actually a pretty easy thing to do.  Where is the CLOB data coming from?  Is it (hopefully) coming from a file somewhere?  But, again... can you please start a new thread on this and post the link on this thread?  Thanks.

     

    p.s. No "splitter" is actually required for this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • tellery_frmt wrote:

    Hi Jeff.

    your tally table approach to parsing is the best i've seen.   i have a real world problem;  wondering if you have any thoughts.

    i need to parse a large clob/text field using VARIABLE delimiters!   each segment of the clob is delimited by one of the following 'tags':  |system| ,|user|, |note|,|manager|,|contact|  .   the end of the segment is then delimited by the NEXT occurrence of one of the delimiters .

    for example: |system| bunch of system data...|user| some data about users xyz.....|contact| more data for the contact...

     

    Ok... You seem reluctant to open a new thread so let me ask just two more questions...

    1. Is the data in a file or in a VARCHAR(MAX) column in a table?
    2. What is the maximum length of the data?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I'm thinking the OP may have gotten an answer from somewhere else and so may no longer be interest in an answer from this thread and certainly not one where he has to open a new thread.

    With that idea in mind, let's pretend the max size of the string to split will be <= 8,000 bytes.  Using the given string, here's a solution that uses DelimitedSplit8K and the ancient "Black Art" known as a CROSSTAB.  It splits every implied Row out to an actual separate row and each row has all 5 elements in separate columns that have been assigned what seem like the appropriate datatype for this set of data.

    The code below can be executed as is with the only "setup" being that you actually do have to have DelimitedSplit8K (and I do use Eirikur Eiriksson's very high performance LEAD modification made possible after 2012 came out).

    DECLARE @FullTxtString VARCHAR(8000) = '|System| |Sean Colley|06/02/2015 05:28:55 PM|New entry created.|User|NTCLSC04|Sean C|06/02/2015 05:41:25 PM|recd atty rep letter from Char Rosaa Jr with Rosaxx, Xmnnnd & XXXmilia 999-999-5572 999-998-4870 fax  misc details unknown.|System| |Tiffany H|06/03/2015 07:21:13 AM| status change.|User|COCLTH03|Tiffany H|06/03/2015 07:34:48 AM|attempted contact with any office. 999-998-5572 place was closed.|User|COCLTH03|Tiffany H|07/01/2015 07:41:22 AM|attempted contact with any office. |System| |Chrystal R|10/21/2015 03:52:40 PM|Reassigned by  Chrystal  from Tiffany to John .|User|COCLJD02|John D|12/05/2015 04:47:40 PM|Police were called on 8/14/13  -sent papers to Supe for approval|User|COCLJD02|John D|01/03/2016 10:22:58 PM|EOR returned to from post officer |User|COCLJD02|John D|01/03/2016 10:40:04 PM| - including recorded statement: attempts made  Confirmed date Recommendation:   |Supervisor| |Gale D|01/04/2016 09:34:55 AM|Agree to proceed with the process.|User|COCLJD02|John D|01/05/2016 11:37:29 PM|EOR has been reissued to returned to sender|User|COCLJD02|John D|01/05/2016 11:40:57 PM|Sent notice of explanation |System| |John Difranco|01/05/2016 11:41:57 PM|Closed. '
    ;
    DROP TABLE IF EXISTS dbo.FullString;
    WITH cteGrp AS
    (
    SELECT Grp = SUM(IIF (Item IN ('System','Supervisor')
    AND (ItemNumber-1)%5 = 0 ,1,0)
    ) OVER (ORDER BY ItemNumber)
    ,Row = (ItemNumber-1)/5
    ,Col = (ItemNumber-1)%5
    ,Item
    FROM dbo.DelimitedSplit8K(SUBSTRING(@FullTxtString,2,8000),'|') --Skips Leading Delimiter
    )
    SELECT RowNum = ROW_NUMBER() OVER (ORDER BY Grp,Row)
    ,Grp = CONVERT(INT,Grp)
    ,GrpRow = ROW_NUMBER() OVER (PARTITION BY Grp ORDER BY Row)
    ,RowType = CONVERT(CHAR(10) ,MAX(IIF(Col=0,Item,'')))
    ,UserID = CONVERT(CHAR( 8) ,MAX(IIF(Col=1,Item,'')))
    ,Name = CONVERT(CHAR(50) ,MAX(IIF(Col=2,Item,'')))
    ,EntryDT = CONVERT(DATETIME2(0),MAX(IIF(Col=3,Item,'')))
    ,Comment = CONVERT(VARCHAR(500),MAX(IIF(Col=4,Item,'')))
    FROM cteGrp
    GROUP BY Grp,Row
    ORDER BY Grp,Row
    ;

    I added 3 extra columns at the beginning to maintain and exploit the same order of the data as it is in then original string.  It's a real shame that the SQL Server STRING_SPLIT() function was born crippled with the ordinal order of the individual elements returned.  You just have to wonder why Microsoft was so short sighted there.

    If the data is actually in a file, no biggee there either but I'll save that for another time.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 961 through 975 (of 982 total)

You must be logged in to reply to this topic. Login to reply