Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 61 through 75 (of 991 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 1

UMG Developer (5/2/2011)
Thanks for the excellent article that was both entertaining and instructional. I will be making use of your new splitter soon!

Thanks, UMG. Like I said to someone else, usage is the highest form of flattery. 🙂 Thanks for the kind words.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 2

gary.rumble (5/2/2011)
I'd be interested in seeing the data you used for benchmarking.

It's in the third paragraph of the Prologue, Gary.

Shifting gears, there are 3 "obviously named" attachments to this article. [font="Arial Black"]One is a fully documented stand-alone test script[/font], one is a 5 sheet Excel spreadsheet with a performance chart on each sheet, and one is a zipped file which contains copies of the new VARCHAR(8000) and NVARCHAR(4000) splitters that I think you'll like.

I tried a method that used a cte to generate a table of offsets for the delimiters. Then I joined that on itself to find the starts and ends and used that to get the substrings from the original string.
I was curious how that would perform compared to the other methods.

Now that you know where the test code is, give it a try a let us know. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mark hutchinson SSC Eights! Points: 878 More actions · Answer 3

Jeff,

Glad I brought a wide smile or laugh to your face/heart. But I was serious. I'm used to minimizing operations in the normal course of performance-tuning. If this minimized-operation approach performs worse than the tally table approach, I'd really like to know so I might adjust my thinking in the T-SQL environment.

Mark

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 4

tfifield (5/2/2011)
Jeff,
Brilliant piece of work! I did something a bit similar but cheated by always appending the last delimiter to the string in the function call. I always felt that it was a hack but I was under some pressure to get the project done. Yours is a much cleaner and way more versatile approach.
The article is so well written that I was coming up with the ISNULL/NULLIF right along with you - a testament to the logical progression of the thing.
The switch to always getting N+1 and starting with zero was also brilliant.
By the way, I always referred to the 'Nibbler' approach as a 'Shuffler' approach. I think I like 'Nibbler' a bit better.
Todd Fifield

Awesome feedback, Todd. :blush: Glad you liked it. BTW... the dust bunnies are tickled that you like the "Nibbler" name... it was their idea. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 5

Nadrek (5/2/2011)
Jeff Moden (5/2/2011)
...
I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore).
...
Thank you for the excellent reminder; I looked at my ASCII chart for a refresher:
CHAR(28) --File Separator
CHAR(29) --Group Separator
CHAR(30) --Record Separator
CHAR(31) --Unit Separator
I move we start trying to use these separator characters again! 30 and 31 are particularly appropriate for the usual two-dimensional split into table techniques.
On another note, which tricks did I miss for a permanent, 0-based tally table adaptation?
CREATE FUNCTION [dbo].[YourFunction](@pString [varchar](8000), @pDelimiter [char](1))
RETURNS TABLE
RETURN
WITH cteStart(N1) AS (--==== This limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP ((DATALENGTH(ISNULL(@pString,1)))+1) N+1
FROM YourDatabase.dbo.YourZeroBasedTallyTable t
WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
ORDER BY N
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT Item = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
FROM cteStart s
;
And Record Separator test:
select * from YourFunction('first'+CHAR(30)+'second'+CHAR(30)+REPLICATE('last',60),CHAR(30));
And thank you for giving me a solid reason to finally have a good old-fashioned 0-based set as well.

Well, that makes 3 of us. We have a quorum! 😛 Phil Factor wrote an article fairly recently on all the XML, JSON, and other gobilty-gook and suggested that someone should come up with a better way. I was considering writing an article about the all-too-"secret" characters 28-31 and a couple of other goodies in "control character land". Whatcha tink? Worthwhile or not?

I haven't run it but your real Tally Table splitter looks right. You picked up on everything including the use of the "TOP".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

gary.rumble SSCommitted Points: 1947 More actions · Answer 6

Well, so far I got:

(2 row(s) affected)

Msg 8623, Level 16, State 1, Line 3

The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query. If you believe you have received this message in error, contact Customer Support Services for more information.

Msg 3701, Level 11, State 5, Line 4

Cannot drop the table 'dbo.Csv8K', because it does not exist or you do not have permission.

I think your code is too much for my server. 😉

14090 SW TENNESSEE LN

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 7

UMG Developer (5/2/2011)
mtassin (5/2/2011)
On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.
I too noticed that, and would be interested in seeing the graph compared a few little further out to see if the CLR solution continued it surge to the top.

I think the darned thing is running so fast that just about any blip on the CPU will cause a delay. BUT, you can bet I'll test it because that CLR will also be used in the split "MAX" article when I can finally get to it. There was also a downward trend that's real hard to see early in the game on that chart. It might be why the final point looks like an upward trend with line smoothing turned on.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 8

gary.rumble (5/2/2011)
Well, so far I got:
...
I think your code is too much for my server. 😉

Yowch. The code is even split into batches. What you may have to do is run the code a section at a time up to where the test loop begins and the let the test loop rip.

Thanks for trying, Gary. If for some reason, your server just won't take it, let me know and we'll test your code for you.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 9

mark hutchinson (5/2/2011)
Jeff,
Glad I brought a wide smile or laugh to your face/heart. But I was serious. I'm used to minimizing operations in the normal course of performance-tuning. If this minimized-operation approach performs worse than the tally table approach, I'd really like to know so I might adjust my thinking in the T-SQL environment.
Mark

I really did think you were simply being ironic but wasn't sure.

I can't imagine that a cursor stepping through the individual characters to find a delimiter would be faster but that might be a box that a lot of us are stuck in that we should try to get out of. In fact, that's the way the CLR was written. Do you have some T-SQL code for what you propose?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mark hutchinson SSC Eights! Points: 878 More actions · Answer 10

Jeff

I think you already identified the basic code in your articles. Since you have a test bed for performance comparison, I thought you might have tested a loop to populate a temp table with CharIndex() values of the delimiter positions as opposed to the Substring() operation for every character in the string.

Once you have the delimiter positions in your temp table, use a cursor to iterate through the delimiter positions, keeping track of the prior position in a local variable (initialized to zero, perhaps).

In answer to your question...No. I don't have any code to offer. I just thought that you were comparing your Tally Table method against a less-than-optimized looping method.

I'm even interested in a performance comparison between a CharIndex() loop to find delimiter positions and your Tally Table method. The Tally Table method invokes a SubString() function for every character to get the same result. The optimized OH method invokes a CharIndex(), to return the next delimiter position, in a similar manner that I suggest for the initial delimiter position calculation.

I've been around relational databases for a long time (mid 80s), but my DB duties are much less interesting than this topic. I hope to learn something from you in this post-article discussion.

Anipaul SSC-Insane Points: 24681 More actions · Answer 11

As usual, you have written another excellent article Jeff. It was great one.:-)

Johan Bijnens SSC Guru Points: 135604 More actions · Answer 12

It's a pity some obscure character caught the attention of CNN, Bloomberg, Al Jazera, ... the day your great article got published.

For sure it would have been a headliner !

Great code, great benchmark info,

Great job Jeff

Thanks again for sharing the knowledge.

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data and code to get the best help

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Who am I ? Sometimes this is me but most of the time this is me

Adnan SSC-Addicted Points: 451 More actions · Answer 13

Great job! Thank you all those who have written, contributed or inspired Jeff.

I have done some tests on real data before using this new one, executing both under same parameters for 10.000 times and logging the times.

The previous splitter function we used took 1084, 1986 and 633 milliseconds (average, max and min) on a given parameter. This new one takes 286, 416 and 176 milliseconds (again avg, max, min).

It's around 4 times faster in my scenario.

So, just congrats and thanks!

srikant maurya Hall of Fame Points: 3293 More actions · Answer 14

srikant maurya

Hall of Fame

Points: 3293

May 3, 2011 at 6:08 am

#1318993

Great Job,

I will use these code for future use.

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 15

mark hutchinson (5/2/2011)
Jeff
I think you already identified the basic code in your articles. Since you have a test bed for performance comparison, I thought you might have tested a loop to populate a temp table with CharIndex() values of the delimiter positions as opposed to the Substring() operation for every character in the string.
Once you have the delimiter positions in your temp table, use a cursor to iterate through the delimiter positions, keeping track of the prior position in a local variable (initialized to zero, perhaps).
In answer to your question...No. I don't have any code to offer. I just thought that you were comparing your Tally Table method against a less-than-optimized looping method.
I'm even interested in a performance comparison between a CharIndex() loop to find delimiter positions and your Tally Table method. The Tally Table method invokes a SubString() function for every character to get the same result. The optimized OH method invokes a CharIndex(), to return the next delimiter position, in a similar manner that I suggest for the initial delimiter position calculation.
I've been around relational databases for a long time (mid 80s), but my DB duties are much less interesting than this topic. I hope to learn something from you in this post-article discussion.

The "Inchworm" While Loop method I tested uses the optimization you speak of.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)