Understood.

    I agree that the "old" Tally Table method was terrible on performance, but that was almost entirely because of the concatenation of delimiters that most folks (including myself in the early days) used. That was the whole purpose of the "Tally OH!" article... to get rid of that nasty little dependency.

    Understood and agreed on the XML truncation errors on the 8K inputs. To wit, I'd never use anything but VARCHAR(MAX) on the input of such a function even if it was supposedly guaranteed to only ever receive a thousand bytes.

    For various reasons, I need to do array splits often on data I do not directly own. (I work mainly on other people's databases for which I usually had no design input.) It is particularly annoying to see a delimited list stored in a varchar column that I have to split, process, then put back together again using a coalesce function because of someone's poor design. I also need to process data that is posted from websites that have multiselect dropdowns and/or checkbox lists--both of which produce delimited arrays as output that my procedures must process and convert to proper row/column format.
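    The split-process-reassemble round trip described above can be sketched as follows. This is only an illustrative sketch: the table and column names (`SomeTable`, `DelimitedCol`, `SomeID`) are hypothetical, the "process" step is just `UPPER()` as a stand-in, and reassembly uses the classic variable-concatenation trick with COALESCE that predates STRING_AGG.

    ```sql
    -- Hypothetical example: split a delimited column, "process" each element,
    -- then rebuild the delimited string with COALESCE-based concatenation.
    DECLARE @Rebuilt VARCHAR(8000);

    SELECT @Rebuilt = COALESCE(@Rebuilt + ',', '') + UPPER(split.Item)
      FROM dbo.SomeTable st
     CROSS APPLY dbo.DelimitedSplit8K(st.DelimitedCol, ',') split
     WHERE st.SomeID = 1
     ORDER BY split.ItemNumber;

    SELECT @Rebuilt;  -- the reprocessed, re-delimited string
    ```

    On SQL Server 2017 and later, STRING_AGG would replace the COALESCE trick, but the older pattern is what you'll find in most of the databases being described here.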

    Heh... I almost cry when I see something like that (storing delimited data), and that also includes when I see someone storing XML in a database, for all the same reasons... not to mention the additional complexity and resource usage that comes about because XML is hierarchical in nature even when used on supposedly "flat" data.

    Most of my procedures use your DelimitedSplit8K function, and that has always seemed to be the fastest solution. Whenever I have to revisit an old procedure and see an older split function, I replace it with a DelimitedSplit8K version if time and budget allow.
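    For anyone following along who hasn't used it, a typical call looks like this (DelimitedSplit8K returns an ItemNumber/Item pair per element, per the "Tally OH!" article):

    ```sql
    -- Split a comma-delimited string into ordered rows.
    SELECT s.ItemNumber, s.Item
      FROM dbo.DelimitedSplit8K('apple,banana,cherry', ',') s
     ORDER BY s.ItemNumber;
    -- Returns 3 rows: (1,'apple'), (2,'banana'), (3,'cherry')
    ```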

    I'm honored :blush: and glad to be able to have helped. To be clear, though, that baby isn't mine. It was actually developed over time by many people with some great inputs. The latest version (which is posted in the article itself, now) doesn't use the precise method that the article originally used. The article showed how to get rid of the concatenation, and a couple of good folks in the discussion took that huge improvement and made it even faster. Truly, DelimitedSplit8K is a community effort of which I'm honored to have been a part.

    However, in this case we were dealing with a two-dimensional array situation. I posted a solution somewhere around page 2 or 3 above that I have used frequently. That method basically uses DelimitedSplit8K in two CROSS APPLYs to split the data. When speed tests compared that method to some of the other methods, it did not perform as well, so I wanted to explore alternatives.
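    For reference, the double CROSS APPLY approach looks something like this. The delimiters here (semicolon between "rows", comma between "columns") and the sample string are assumptions for illustration, not necessarily what was used in the thread:

    ```sql
    -- Two-dimensional split: first split on the row delimiter,
    -- then split each resulting row on the column delimiter.
    DECLARE @Array VARCHAR(8000) = '1,2,3;4,5,6;7,8,9';

    SELECT row_split.ItemNumber AS RowNum,
           col_split.ItemNumber AS ColNum,
           col_split.Item       AS Value
      FROM dbo.DelimitedSplit8K(@Array, ';') row_split
     CROSS APPLY dbo.DelimitedSplit8K(row_split.Item, ',') col_split;
    ```

    It's generic and easy to read, but, as noted, the nested split is where the performance cost comes from.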

    THAT's part of the reason why I've been watching this thread with great interest. A lot of people haven't been exposed to it, but a lot of the world (apparently) revolves around CSVs that have been created by spreadsheets (DoubleClick.net, for example, provides their data in such a fashion). The data is, in fact, multi-dimensional (4 dimensions, in this case) and the number of columns passed is frequently unknown (although the columns do have a pattern expressed in the column headers). I've written several custom splitters/normalizers for such data, but it's always interesting (to me, anyway) to see how other people approach the problem. I agree that, although using DelimitedSplit8K in a CROSS APPLY for each dimension is a good generic solution, it's not exactly the bee's knees when it comes to performance, so other solutions are welcome.

    To wit, my inputs on this thread shouldn't be considered to be defensive in nature. I just didn't want people to think that DelimitedSplit8K is super slow just because they don't understand that it wasn't meant to be used against MAX datatypes or if it's used incorrectly. Neither do I want them to think that DelimitedSplit8K is the only way because, patently, it is not.

    I'm also (perhaps, overly) concerned when it comes to testing solutions. For example, lots of folks were using the same row of data, which causes "vertical grooving" in the logic; that sometimes works in favor of one method or another and gives the false impression that it's faster. Again, my inputs are only meant as suggestions, and I was very happy to see someone make that particular change.
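    One way to avoid that grooving is to make the test rows vary in shape. The sketch below is one possible approach (not the harness used in this thread): each row gets a different element count and random element values, so no splitter gets an artificial advantage from every row being identical.

    ```sql
    -- Sketch: generate 1000 CSV test rows with varying element counts (1 to 20)
    -- and random element values, using the classic FOR XML PATH concatenation.
    WITH Numbers AS (
        SELECT TOP (1000)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N
          FROM sys.all_columns
    )
    SELECT n1.N AS RowID,
           STUFF((SELECT ',' + CONVERT(VARCHAR(10), ABS(CHECKSUM(NEWID())) % 100000)
                    FROM Numbers n2
                   WHERE n2.N <= n1.N % 20 + 1
                     FOR XML PATH('')), 1, 1, '') AS CsvData
      FROM Numbers n1;
    ```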

    Anyway, thanks for the feedback on what you're doing. I'll eventually get to the point where I can set up my own tests for some of the solutions on this thread, and I apologize for not yet doing so. I'm so far behind with things that I need to be twins to catch up. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)