• Jeff Moden (12/21/2012)


    Steven Willis (12/21/2012)


    First test used the same data as generated in a post above using a Tally table CTE. I generated test data with 100 rows, 1,000 rows, and 10,000 rows.

    There is a major fault with this testing. It's a well-advertised fact that the DelimitedSplit8K method doesn't work well with blobs. In fact, any method that internally joins to a blob at the character level is guaranteed to run much slower than other methods such as XML. That fact is even stated in the "Tally OH!" article.

    Methods like XML, however, are typically much slower than the DelimitedSplit8K method on datatypes of VARCHAR(8000) or less because of the required concatenations (with the understanding that I've not had the time to test the code on this thread the way it deserves).
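    For readers following along, a minimal sketch of the XML splitter pattern being discussed may help. The variable names are hypothetical; the point is the REPLACE-based concatenation that rebuilds the whole string as an XML document before shredding it, which is the cost referred to above:

```sql
-- Sketch of the XML splitter idiom (hypothetical names).
-- The REPLACE + CAST step is the concatenation cost discussed above.
-- Known limitation: characters such as & or < in the input break the
-- CAST to XML, which is the "bad XML" problem mentioned in this thread.
DECLARE @List  VARCHAR(8000) = 'a,b,c,d';
DECLARE @Delim CHAR(1)       = ',';

SELECT x.Item.value('(./text())[1]', 'VARCHAR(8000)') AS Item
FROM (
    SELECT CAST('<i>' + REPLACE(@List, @Delim, '</i><i>') + '</i>' AS XML) AS Doc
) AS d
CROSS APPLY d.Doc.nodes('/i') AS x(Item);
```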

    Thanks for your input, Jeff. I was aware of the VARCHAR(MAX) issue but neglected to note that I used VARCHAR(8000) whenever possible. Only on the later test runs for the XML versions did I switch to VARCHAR(MAX), and that is the code that got posted. For the XML versions the 8000-character limit was easily exceeded, and I was getting XML errors because the strings were being truncated, creating "bad" XML that the XML handler wouldn't accept.

    I tested each method both ways with the given inputs and reported the best result for each test run. I also tried testing an older version of the tally-table splitter that uses an actual tally table, but its performance was so poor compared to ANY of the other methods that I didn't bother to report the results. All of this was done on my work laptop, so performance would certainly improve for all the methods on a faster server.

    For various reasons, I often need to do array splits on data I do not directly own. (I work mainly on other people's databases, for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, and then put back together again using a COALESCE-based concatenation because of someone's poor design. I also need to process data posted from websites with multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays that my procedures must convert to proper row/column format. Most of my procedures use your DelimitedSplit8K function, and it has always seemed to be the fastest solution. Whenever I revisit an old procedure and see an older split function, I replace it with a DelimitedSplit8K version if time and budget allow.
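    The split-process-reassemble cycle described above can be sketched roughly as follows. This assumes dbo.DelimitedSplit8K exists as published in the "Tally OH!" article, and the COALESCE trick is the reassembly idiom the post refers to:

```sql
-- Hypothetical sketch: split a delimited column, process each item
-- (here, just trimming whitespace), then rebuild the delimited list
-- with the COALESCE variable-concatenation idiom mentioned above.
-- Caveat: the order of variable concatenation is not guaranteed by
-- the engine, so ORDER BY ItemNumber is a best-effort measure here.
DECLARE @List    VARCHAR(8000) = 'red, green , blue';
DECLARE @Rebuilt VARCHAR(8000);

SELECT @Rebuilt = COALESCE(@Rebuilt + ',', '') + LTRIM(RTRIM(s.Item))
FROM dbo.DelimitedSplit8K(@List, ',') AS s
ORDER BY s.ItemNumber;

SELECT @Rebuilt AS Rebuilt;
```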

    However, in this case we were dealing with a two-dimensional array. I posted a solution somewhere around page 2 or 3 above that I have used frequently; it basically applies DelimitedSplit8K in two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods, it did not perform as well, so I wanted to explore alternatives. I have used the XML splitter method on one-dimensional arrays before, and based on the performance of "Mr Magoo's" method I decided to incorporate that into one function that could handle both one- and two-dimensional arrays. But I wanted to see how each performed and whether my new function would perform any better than what I had been using.
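    The two-CROSS APPLY approach described above looks roughly like this. The delimiters and sample data are illustrative only; it again assumes dbo.DelimitedSplit8K from the "Tally OH!" article:

```sql
-- Sketch of the nested-split approach: the outer call splits the
-- string into rows on ';', the inner call splits each row into
-- columns on ','. Output is one row per (RowNum, ColNum, Value).
DECLARE @Array VARCHAR(8000) = '1,2,3;4,5,6;7,8,9';

SELECT r.ItemNumber AS RowNum,
       c.ItemNumber AS ColNum,
       c.Item       AS Value
FROM dbo.DelimitedSplit8K(@Array, ';') AS r
CROSS APPLY dbo.DelimitedSplit8K(r.Item, ',') AS c;
```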

    So I did the testing and posted my results. I realize no testing situation is perfect, because each method has its pros and cons depending on the form the input data takes and the scale of the split being performed. I think my test results--as imperfect as they may be--reinforced the idea that each split method performs differently in different situations. I was neither trying to come up with the "perfect" universal splitter nor trying to prove any of these methods was bad. I think they are all good, and each needs to be tested as a possible alternative when faced with any delimited-split requirement.