
Split input string into multicolumn - multirows
Posted Saturday, December 22, 2012 1:29 AM


SSCommitted


Group: General Forum Members
Last Login: Today @ 2:06 AM
Points: 1,785, Visits: 5,677
Jeff Moden (12/21/2012)
Steven Willis (12/21/2012)
First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.



There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.



Just to be clear, I am not suggesting that the XML method is in any way a replacement for the DelimitedSplit8K function, which is blindingly fast.

I am suggesting that it may be useful in this particular exercise because the length of the input may exceed 8000.
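To make that concrete, here is a minimal sketch of the XML-based splitting approach for inputs that can exceed 8000 characters (the variable names and sample string are mine, not from an earlier post):

```sql
-- Sketch of an XML-based splitter for VARCHAR(MAX) inputs.
-- Caveat: if the data can contain XML special characters (&, <, >),
-- they must be escaped before the CAST or the conversion will fail.
DECLARE @str VARCHAR(MAX) = 'abc,def,ghi';
DECLARE @x XML;

-- Turn 'abc,def,ghi' into '<i>abc</i><i>def</i><i>ghi</i>'
SET @x = CAST('<i>' + REPLACE(@str, ',', '</i><i>') + '</i>' AS XML);

-- Shred one row per <i> element
SELECT n.v.value('.', 'VARCHAR(MAX)') AS Item
FROM @x.nodes('/i') AS n(v);
```

Because the work happens in the XML engine rather than in a character-level join against the blob, this sidesteps the penalty described above, at the cost of the concatenation overhead on short strings.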


MM


  • MMGrid Addin
  • MMNose Addin


  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

Post #1399664
    Posted Saturday, December 22, 2012 10:28 AM


    SSC-Dedicated


    Group: General Forum Members
    Last Login: Yesterday @ 7:56 PM
    Points: 36,775, Visits: 31,230
    mister.magoo (12/22/2012)
    Jeff Moden (12/21/2012)
    Steven Willis (12/21/2012)
    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.



    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.



    Just to be clear, I am not suggesting that the XML method is in any way a replacement for the DelimitedSplit8K function, which is blindingly fast.

    I am suggesting that it may be useful in this particular exercise because the length of the input may exceed 8000.


    Not to worry, ol' friend. I absolutely understood that from the beginning. That's why I said "N-i-i-i-i-c-c-c-c-e-e-e!" about your code post previously. It was "Thinking out of the box" at its best.

    I just want everyone to know that the DelimitedSplit8K function has the term "8K" in it for a bloody good reason. It's absolutely not meant to split blobs and, if modified to do so, is going to lose just about any race in a terrible fashion, as would any code that joins to a blob at the character level.


    --Jeff Moden
    "RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

    (play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

    Helpful Links:
    How to post code problems
    How to post performance problems
    Post #1399691
    Posted Saturday, December 22, 2012 10:43 AM
    Grasshopper


    Group: General Forum Members
    Last Login: Friday, December 28, 2012 9:21 PM
    Points: 23, Visits: 53
    Jeff Moden (12/21/2012)
    murthyvs (12/11/2012)
    Hello all - I am having hard time to split an input string into multicolumn - multirows.

    Task - Create a stored procedure that reads an input string with pre-defined field and row terminators; splits the string into multicolumn - multirows; and inserts records into a table.


    So, a couple of questions remain to solve this for you properly...

    1. Can you use CLR on your servers or not?
    2. If not, what is the guaranteed absolute maximum length of your delimited parameters?



    Wow... lots of activity when I wasn't around :). Thanks a lot, gurus! I certainly learned a whole lot in the last hour of reading the entire thread.

    Here are my answers:
    1. Sorry. I have no clue on CLR - I am a newbie with serious appetite to learn new things.
    2. The max characters for field1 = 10; field2 = 20; possible field delimiters = ";", "tab"; row delimiters = "|", "newline".
    The total string length is at most 50K, although 100K would be nice to have.
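    For reference, a hypothetical input matching that spec (';' between fields, '|' between rows) could look like this:

```sql
-- Hypothetical sample only: field1 up to 10 chars, field2 up to 20 chars,
-- ';' as the field delimiter and '|' as the row delimiter.
DECLARE @input VARCHAR(MAX) =
    'AAA1;first value|BBB22;second value|CCC333;third value';
```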

    Thanks to everyone.

    Post #1399693
    Posted Saturday, December 22, 2012 10:55 AM
    SSC-Addicted


    Group: General Forum Members
    Last Login: Sunday, September 29, 2013 1:24 AM
    Points: 429, Visits: 1,721
    Jeff Moden (12/21/2012)
    Steven Willis (12/21/2012)
    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.



    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are (with the understanding that I've not had the time to do the testing the code on this thread deserves) typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations.



    Thanks for your input Jeff. I was aware of the VARCHAR(MAX) issue and neglected to note that I did use VARCHAR(8000) whenever possible. Only on the later test runs for the XML versions did I switch to VARCHAR(MAX) and that is the code that got posted. For the XML versions the 8000 limit was easily exceeded and I was getting XML errors because the strings were getting truncated and creating "bad" XML that the XML handler wouldn't accept.

    I tested each method both ways with the given inputs and reported the best result for each test run. I also tried testing an older version of the tally table splitter using an actual tally table but the performance was so poor compared to ANY of the other methods that I didn't bother to report the results. All of this was done on my work laptop so performance would certainly improve for all the methods on a faster server.

    For various reasons, I need to do array splits often on data I do not directly own. (I work mainly on other people's databases for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, then put back together again using a coalesce function because of someone's poor design. I also need to process data that is posted from websites that have multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays as output that my procedures must process and convert to proper row/column format. Most of my procedures use your DelimitedSplit8K function and that has always seemed to be the fastest solution. Whenever I have to revisit an old procedure and see an older split function I replace it with a DelimitedSplit8K version if time and budget allow.

    However, in this case we were dealing with a two-dimensional array situation. I posted a solution somewhere around page 2 or 3 above that I have used frequently. That method basically used DelimitedSplit8K to do two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods it did not perform as well. So I wanted to explore alternatives. I have used the XML splitter method on one-dimensional arrays before and based on the performance of "Mr Magoo's" method I decided to try and incorporate that into one function that could handle both one- and two-dimensional arrays. But I wanted to see how each performed and if my new function would perform any better than what I had been using.
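    For readers who haven't seen it, that double CROSS APPLY pattern can be sketched as follows (this assumes dbo.DelimitedSplit8K from the "Tally OH!" article is installed; the sample string is mine):

```sql
-- Split on the row delimiter first, then split each piece on the
-- column delimiter. Each element carries its row and column position.
DECLARE @str VARCHAR(8000) = '123,456|789,000';

SELECT r.ItemNumber AS RowNum,
       c.ItemNumber AS ColNum,
       c.Item       AS Value
FROM (SELECT @str AS s) AS d
CROSS APPLY dbo.DelimitedSplit8K(d.s, '|') AS r
CROSS APPLY dbo.DelimitedSplit8K(r.Item, ',') AS c;
```

It is a clean generic solution, but as noted it calls the splitter once per row of the first split, which is where the performance cost comes from.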

    So I did the testing and posted my results. I realize no testing situation is perfect because each method has its pros and cons depending on the form the input data takes and the scale of the split being performed. I think my test results--as imperfect as they may be--reinforced the idea that each split method will perform differently in different situations. I was neither trying to come up with the "perfect" universal splitter nor trying to prove any of these methods were bad. I think they are all good and each needs to be tested as possible alternatives when faced with any delimited split requirement.

     
    Post #1399694
    Posted Saturday, December 22, 2012 12:05 PM


    SSC-Dedicated


    Group: General Forum Members
    Last Login: Yesterday @ 7:56 PM
    Points: 36,775, Visits: 31,230
    Understood.

    I agree that the "old" Tally Table method was terrible on performance, but that was almost entirely because of the concatenation of delimiters that most folks (including myself in the early days) used. That was the whole purpose of the "Tally OH!" article... to get rid of that nasty little dependency.

    Understood and agreed on the XML truncation errors on the 8K inputs. To wit, I'd never use anything but VARCHAR(MAX) on the input of such a function even if it was supposedly guaranteed to only ever receive a thousand bytes.

    For various reasons, I need to do array splits often on data I do not directly own. (I work mainly on other people's databases for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, then put back together again using a coalesce function because of someone's poor design. I also need to process data that is posted from websites that have multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays as output that my procedures must process and convert to proper row/column format.


    Heh... I almost cry when I see something like that (storing delimited data), and that also includes when I see someone storing XML in a database, for all the same reasons, not to mention the additional complexity and resource usage that comes about because XML is hierarchical in nature even when used on supposedly "flat" data.

    Most of my procedures use your DelimitedSplit8K function and that has always seemed to be the fasted solution. Whenever I have to revisit an old procedure and see an older split function I replace it with a DelimitedSplit8K version if time and budget allow.


    I'm honored and glad to have been able to help. To be clear, though, that baby isn't mine. It was actually developed over time by many people with some great inputs. The latest version (which is posted in the article itself, now) doesn't use the precise method that the article originally used. The article showed how to get rid of the concatenation, and a couple of good folks in the discussion took that huge improvement and made it even faster. Truly, DelimitedSplit8K is a community effort of which I'm honored to have been a part.

    However, in this case we were dealing with a two-dimensional array situation. I posted a solution somewhere around page 2 or 3 above that I have used frequently. That method basically used DelimitedSplit8K to do two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods it did not perform as well. So I wanted to explore alternatives.


    THAT's part of the reason why I've been watching this thread with great interest. A lot of people haven't been exposed to it, but a lot of the world (apparently) revolves around CSVs that have been created by spreadsheets (DoubleClick.net, for example, provides their data in such a fashion). The data is, in fact, multi-dimensional (4 dimensions, in this case) and the number of columns passed is frequently "unknown" (although the columns do have a pattern expressed in the column headers). I've written several custom splitters/normalizers for such data, but it's always interesting (to me, anyway) to see how other people approach the problem. I agree that, although using DelimitedSplit8K in a CROSS APPLY for each dimension is a good generic solution, it's not exactly the bee's knees when it comes to performance, so other solutions are welcome.

    To wit, my inputs on this thread shouldn't be considered to be defensive in nature. I just didn't want people to think that DelimitedSplit8K is super slow just because they don't understand that it wasn't meant to be used against MAX datatypes or if it's used incorrectly. Neither do I want them to think that DelimitedSplit8K is the only way because, patently, it is not.

    I'm also (perhaps overly) concerned when it comes to testing solutions. For example, lots of folks were using the same row of data, which causes "vertical grooving" in the logic, which sometimes works in favor of one method or another and gives the false impression that it's faster. Again, my inputs are only meant as suggestions, and I was very happy to see someone make that particular change.
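    One simple way to avoid that "vertical grooving" is to make every test row unique, for example (a sketch only; the shape and row count are arbitrary):

```sql
-- Generate 1,000 distinct two-field CSV rows so no two rows share the
-- same delimiter positions. NEWID() differs on every call, and the
-- CHECKSUM trick varies each field's length between 1 and 8 characters.
SELECT TOP (1000)
       LEFT(CAST(NEWID() AS VARCHAR(36)), ABS(CHECKSUM(NEWID())) % 8 + 1)
     + ','
     + LEFT(CAST(NEWID() AS VARCHAR(36)), ABS(CHECKSUM(NEWID())) % 8 + 1) AS CsvRow
INTO #TestData
FROM sys.all_columns a CROSS JOIN sys.all_columns b;
```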

    Anyway, thanks for the feedback on what you're doing. I'll eventually get to the point where I can setup my own tests for some of the solutions on this thread and apologize for not yet doing so. I'm so far behind with things that I need to be twins to catch up.


    --Jeff Moden
    "RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

    (play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

    Helpful Links:
    How to post code problems
    How to post performance problems
    Post #1399698
    Posted Friday, December 28, 2012 6:07 PM
    SSC-Addicted


    Group: General Forum Members
    Last Login: Sunday, September 29, 2013 1:24 AM
    Points: 429, Visits: 1,721
    I've done some extensive testing using a modified version of the test harness from Jeff Moden's "Tally OH!" article to test the various split methods in this thread. My test tables consist of randomly generated alphanumeric value pairs. Each iteration of the test creates an increasing number of value pairs per row and also increases the size of each value in each pair in an orderly progression. I did some testing with Jeff's original test data and also some testing with VARCHAR(MAX) data. But for the last run of the test (#17) it is all VARCHAR(8000).

    That last run is significant. Up to that point the CLR Split was ALWAYS fastest. Of the non-CLR methods, Mr. Magoo's XML Split was the clear winner with the advantage that it worked equally well with VARCHAR(MAX) as with VARCHAR(8000). But I decided to make some modifications to the original DelimitedSplit8K function to allow two delimiters for a 2-dimensional split. The new function DelimitedSplit8K_2DIM does the 2-way split within the function rather than trying to use the original function (or CLR) with a cross apply.

    Remarkably, this new function actually outperforms the CLR with smaller rows and smaller value pairs. Even at the larger end of the scale it is still very close to the performance of the CLR Split, which uses a cross apply. Like the original DelimitedSplit8K, the function gets increasingly less efficient with large value pairs and is, of course, limited to strings under the 8K limit. So for anything within these parameters I think this new variation has the best performance. If the data must be VARCHAR(MAX), or for larger data pairs, Mr Magoo's XML Split method comes in a very close second.

    I've attached below all of my testing code (for most of which I must give credit to Jeff Moden) as well as a spreadsheet with the results of my own testing.

    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.


    CREATE FUNCTION [dbo].[DelimitedSplit8K_2DIM]
    (
        @pString     VARCHAR(8000)
       ,@pDelimiter1 CHAR(1)
       ,@pDelimiter2 CHAR(1) = NULL
    )
    RETURNS TABLE WITH SCHEMABINDING AS
    RETURN
    WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),                                   --10E+1 or 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS (
        SELECT 0 UNION ALL
        SELECT TOP (DATALENGTH(ISNULL(@pString,1)))
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM E4
    ),
    cteStart(N1) AS (
        -- Start positions: position 0 plus the character after each row delimiter
        SELECT t.N+1
        FROM cteTally t
        WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter1 OR t.N = 0)
    )
    SELECT
        ItemNumber
       ,Item1
       ,Item2 = REPLACE(Item2,Item1+@pDelimiter2,'')
    FROM (
        SELECT
            ItemNumber = ROW_NUMBER() OVER (ORDER BY s.N1)
           ,Item1 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter2,@pString,s.N1),0)-s.N1,8000))
           ,Item2 = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter1,@pString,s.N1),0)-s.N1,8000))
        FROM cteStart s
    ) i1


    /*
    Example usage (run each example in its own batch, since @str is redeclared):

    DECLARE @str VARCHAR(8000);
    SET @str = '123,456|789,000';

    SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@str,'|',',');

    DECLARE @str VARCHAR(8000);
    SET @str = '0WQDNw|aXxbzu,durPP|7yaFpK,UMERA5|FLN2G,HUdZv|QUQy5,3MbdqS|JWUgPp,F23jqp|kWbSBn,nSWunU|uh1zR,pqBJ4U|eNnZzE,jbu7R|cwd4E,1hNMC|Ru7ar';

    SELECT * FROM [dbo].[DelimitedSplit8K_2DIM](@str,',','|');
    */




    Post Attachments
    RunSplitTestWith2DimArrays.zip (39.08 KB)
    Post #1401102
    Posted Friday, December 28, 2012 8:19 PM


    SSC-Dedicated


    Group: General Forum Members
    Last Login: Yesterday @ 7:56 PM
    Points: 36,775, Visits: 31,230
    Steven Willis (12/28/2012)
    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.

    Very cool. Thanks for posting that code. Just remember that the second dimension can still only have 2 elements in this case.


    --Jeff Moden
    "RBAR" is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

    (play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

    Helpful Links:
    How to post code problems
    How to post performance problems
    Post #1401110
    Posted Saturday, December 29, 2012 12:11 PM
    SSC-Addicted


    Group: General Forum Members
    Last Login: Sunday, September 29, 2013 1:24 AM
    Points: 429, Visits: 1,721
    Jeff Moden (12/28/2012)
    Steven Willis (12/28/2012)
    Here's the new splitter function as I tested it...I'm sure someone may be able to improve it even more.

    Very cool. Thanks for posting that code. Just remember that the second dimension can still only have 2 elements in this case.

    Yeah, while I was working on testing the new version I was thinking to myself how cool it would be if the function could allow any number of delimited values within "groups."

    For example:

    '12,34,56,78|12,78|34,56,78'
    or
    '12,34,56,78|12,78|34,56,78~56,67|34,23|67'

    I've had the misfortune to face such strings and splitting them can be a nightmare.

    I had one string to split lately that I had no control over that looked like this (and this is a simplified version):

    '{TeamIDs:[1,2|3,4|5,6|7,8]}
    ,{Scores:[88,70|90,56|67,70|88,87|45,52|78,77|81,70]}
    ,{Dates:[12/22/2012|12/22/2012|12/22/2012|12/22/2012|12/29/2012|12/29/2012|12/31/2012]}'

    The scores correlated to the results of each team pair AND each subsequent round of the bracket. No keys except the order of the data... which meant I had to figure out the winner of each round based on the score pairs to get the value pairs of each round so I could create a scheduled event for each game.

    Then to make it worse, once I had the scheduled events I had to write a query to create a similar string to pass back to the application so it could display the bracket and game results. I used a coalesce function to "reverse" split the results. Perhaps some testing could be done on some "reverse" split functions?
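    For anyone curious, the usual "reverse" split on SQL Server of that era is the FOR XML PATH concatenation trick (on 2017 and later, STRING_AGG does the same job); a minimal sketch with made-up data:

```sql
-- Re-assemble a column of values into one delimited string.
-- STUFF strips the leading delimiter produced by the concatenation.
DECLARE @t TABLE (Item VARCHAR(20));
INSERT INTO @t (Item) VALUES ('12'), ('34'), ('56');

SELECT STUFF(
           (SELECT ',' + Item FROM @t FOR XML PATH('')),
           1, 1, '') AS Rejoined;   -- yields '12,34,56'
```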

     
    Post #1401201

