Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 466 through 480 (of 991 total)

You must be logged in to reply to this topic. Login to reply

John Hardin Grasshopper Points: 20 More actions · Answer 1

Lynn Pettis (12/30/2012)
John Hardin (12/30/2012)
L' Eomot Inversé (12/30/2012)
Jeff Moden (12/28/2012)
If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.
Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.
Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.
On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁
Page 27 passed on how many posts per page? I ask as I display 50 posts per page. It would help if you posted the url of your post so we could go directly to that post.

D'oh! I didn't consider non-default posts-per-page settings. Sorry. Here's the direct link: http://www.sqlservercentral.com/Forums/FindPost1129283.aspx

The second code block is what I am referring to.

Paul White SSC Guru Points: 150468 More actions · Answer 2

L' Eomot Inversé (12/30/2012)
Why can't the optimiser look and see if spooling would enhance performance (only in the cases where it make no semantic difference, of course) and use it when it does? Or if that's too difficult a task for the optimiser (it shouldn't be, but they haven't done it so I can imagine them claiming it is) why can't we have a query hint that tells them to do it? I hope Paul or someone will jump in and explain all this - maybe tell me I've got it all wrong, but that's OK too, I like learning.

You're right that the optimizer doesn't consider CTE materialization today (they are strictly in-line view definitions, expanded once per reference) and there is no hint for materialization either.

There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 3

Tony.l (12/29/2012)
Hi
unbelievable - so good. I have been doing these puter things for 30 years now and i am still amazed how little i know or probably at just how clever other people are. Really well explained and totally useful.
keep up the good work.
APL. My initials not the language.

Thanks for the great feedback, Tony. I feel the same way. I've been working with computers in one form or another since 1968 and I'm still learning something new about them everyday.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 4

John Hardin (12/30/2012)
On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

You'll have to forgive me a bit on that. If you look at even the most recent pages of this thread, there are still people saying "Try this" or "How about this"? Even you said...

How about this to get rid of the CHARINDEX() string operation:
{snip}
...how does that affect performance?

Two days later (lots of water had flowed under the bridge by then and we're not notified by edits), you added...

edited to add: Actually performance goes down a bit because the subquery in making cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

...and followed that up with...

(Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)

I just didn't (and still don't) have the time to test everyone's suggestion. That's why I spent so much time on building a test harness for everyone to use.

If you think your method has merit in the area of performance, please test it and post the test results. Who knows? You might be on to something.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TomThomson SSC Guru Points: 104768 More actions · Answer 5

SQL Kiwi (12/30/2012)
There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

Well, I've added my vote to that. It will probably make no difference, but that isn't an excuse for not doing so whem I think it's right.

Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to http://www.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

Tom

Paul White SSC Guru Points: 150468 More actions · Answer 6

L' Eomot Inversé (12/31/2012)
For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.
Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to http://www.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

Thanks Tom, I have corrected the link in the original post.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Dwain Camps SSC Guru Points: 86978 More actions · Answer 7

Nice update to this article Jeff!

I especially liked your mysteriously labelled black line (????). Nice trick with the 0 based Tally table too - will need to remember that one.

I have already replaced the old DelimitedSplit8K in my sandbox and now I must seek out other places where it may have been used and do the same.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

MarbryHardin Old Hand Points: 363 More actions · Answer 8

Interesting stuff, thank you.

Table type parameters for stored procedures have eliminated much of my need for doing this within SQL, but it's always nice to have options.

klini Mr or Mrs. 500 Points: 508 More actions · Answer 9

Jeff or Paul:

I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

Paul, would you mind sharing that with us? I would love to see its guts.

Kurt

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 10

klini (1/3/2013)
I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

It's in one of the zip files at the bottom of the original article: Tally OH! An Improved SQL 8K “CSV Splitter” Function[/url]

telcogod SSC Enthusiast Points: 143 More actions · Answer 11

hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

DROP FUNCTION [dbo].[DelimitedSplit8K]

go

CREATE FUNCTION [dbo].[DelimitedSplit8K]

(@pString VARCHAR(8000), @pDelimiter CHAR(1))

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

, cteStart(N1,L1) AS (

SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

UNION ALL

SELECT t.N+1

, ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

FROM cteTally t

WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

)

--select * from cteStart

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

Item = SUBSTRING(@pString, l.N1, l.L1)

FROM cteStart l

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 12

telcogod (1/4/2013)
hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen
IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL
DROP FUNCTION [dbo].[DelimitedSplit8K]
go
CREATE FUNCTION [dbo].[DelimitedSplit8K]
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
, cteStart(N1,L1) AS (
SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1
UNION ALL
SELECT t.N+1
, ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1
FROM cteTally t
WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
)
--select * from cteStart
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteStart l

Pretty good performance compared to DelimitedSplit8K! However, when running the tests using Jeff Moden's test harness the function failed whenever there was only one element and was still not quite as fast. Very close in performance though, so the change you made didn't really make much difference.

RowNumSplitterNameNumberOfRowsNumberOfElementsMinElementLengthMaxElementLengthDurationMinLengthAvgLengthMaxLength

1DelimitedSplit8K100011100.123001510

2DelimitedSplit_TELCO100011100.076001510

3DelimitedSplit8K100021100.1530031121

4DelimitedSplit_TELCO100021100.1860031121

5DelimitedSplit8K100041100.2600092443

6DelimitedSplit_TELCO100041100.1500092443

7DelimitedSplit8K100081100.33000215073

8DelimitedSplit_TELCO100081100.13000215073

9DelimitedSplit8K1000161100.3460064102140

10DelimitedSplit_TELCO1000161100.4660064102140

11DelimitedSplit8K1000321100.33300158206268

12DelimitedSplit_TELCO1000321100.41600158206268

13DelimitedSplit8K1000641100.53600349415511

14DelimitedSplit_TELCO1000641100.48600349415511

15DelimitedSplit8K10001281100.87600701830920

16DelimitedSplit_TELCO10001281100.96600701830920

17DelimitedSplit8K10002561101.59600153016631835

18DelimitedSplit_TELCO10002561101.96000153016631835

19DelimitedSplit8K10005121103.09300307533233541

20DelimitedSplit_TELCO10005121103.32300307533233541

21DelimitedSplit8K100011501107.23000714174667790

22DelimitedSplit_TELCO100011501108.41600714174667790

23DelimitedSplit8K1000110200.01000101420

24DelimitedSplit_TELCO1000110200.01000101420

25DelimitedSplit8K1000210200.03600213041

26DelimitedSplit_TELCO1000210200.05300213041

27DelimitedSplit8K1000410200.10000476282

28DelimitedSplit_TELCO1000410200.08000476282

29DelimitedSplit8K1000810200.1200093126155

30DelimitedSplit_TELCO1000810200.1260093126155

31DelimitedSplit8K10001610200.20000219254292

32DelimitedSplit_TELCO10001610200.23600219254292

33DelimitedSplit8K10003210200.36000458511559

34DelimitedSplit_TELCO10003210200.39300458511559

35DelimitedSplit8K10006410200.7100093110211099

36DelimitedSplit_TELCO10006410200.7600093110211099

37DelimitedSplit8K100012810201.45300194320472158

38DelimitedSplit_TELCO100012810201.48300194320472158

39DelimitedSplit8K100025610202.92000394640954238

40DelimitedSplit_TELCO100025610202.98300394640954238

41DelimitedSplit8K100048010205.37000743776767900

42DelimitedSplit_TELCO100048010205.66300743776767900

43DelimitedSplit8K1000120300.05000202530

44DelimitedSplit_TELCO1000120300.01000202530

45DelimitedSplit8K1000220300.02300415161

46DelimitedSplit_TELCO1000220300.05600415161

47DelimitedSplit8K1000420300.0700084102120

48DelimitedSplit_TELCO1000420300.1030084102120

49DelimitedSplit8K1000820300.13600179206232

50DelimitedSplit_TELCO1000820300.16300179206232

51DelimitedSplit8K10001620300.26600372414449

52DelimitedSplit_TELCO10001620300.32600372414449

53DelimitedSplit8K10003220300.52600768830896

54DelimitedSplit_TELCO10003220300.57000768830896

55DelimitedSplit8K10006420301.04300159316631733

56DelimitedSplit_TELCO10006420301.09600159316631733

57DelimitedSplit8K100012820302.09000318233283441

58DelimitedSplit_TELCO100012820302.20000318233283441

59DelimitedSplit8K100025620304.24000649766546803

60DelimitedSplit_TELCO100025620304.30300649766546803

61DelimitedSplit8K100029020304.75300733175387696

62DelimitedSplit_TELCO100029020304.88600733175387696

63DelimitedSplit8K1000130400.01300303440

64DelimitedSplit_TELCO1000130400.01000303440

65DelimitedSplit8K1000230400.02600617181

66DelimitedSplit_TELCO1000230400.03000617181

67DelimitedSplit8K1000430400.09300126143161

68DelimitedSplit_TELCO1000430400.12300126143161

69DelimitedSplit8K1000830400.18300263287315

70DelimitedSplit_TELCO1000830400.23300263287315

71DelimitedSplit8K10001630400.35300529575619

72DelimitedSplit_TELCO10001630400.38300529575619

73DelimitedSplit8K10003230400.69300109811501212

74DelimitedSplit_TELCO10003230400.73300109811501212

75DelimitedSplit8K10006430401.38300221923012391

76DelimitedSplit_TELCO10006430401.43300221923012391

77DelimitedSplit8K100012830402.87300449646054736

78DelimitedSplit_TELCO100012830402.83300449646054736

79DelimitedSplit8K100021030404.57600741275617710

80DelimitedSplit_TELCO100021030404.67000741275617710

81DelimitedSplit8K1000140500.01600404450

82DelimitedSplit_TELCO1000140500.01000404450

83DelimitedSplit8K1000240500.043008191101

84DelimitedSplit_TELCO1000240500.030008191101

85DelimitedSplit8K1000440500.14000167182201

86DelimitedSplit_TELCO1000440500.14300167182201

87DelimitedSplit8K1000840500.22000342366395

88DelimitedSplit_TELCO1000840500.25300342366395

89DelimitedSplit8K10001640500.43600693735779

90DelimitedSplit_TELCO10001640500.49000693735779

91DelimitedSplit8K10003240500.86000139914701530

92DelimitedSplit_TELCO10003240500.91300139914701530

93DelimitedSplit8K10006440501.71600285229423022

94DelimitedSplit_TELCO10006440501.81300285229423022

95DelimitedSplit8K100012840503.56600576758886012

96DelimitedSplit_TELCO100012840503.59000576758886012

97DelimitedSplit8K100016540504.44600746575907717

98DelimitedSplit_TELCO100016540504.55000746575907717

telcogod SSC Enthusiast Points: 143 More actions · Answer 13

ah well, I have seen that before where you think removing the extra select will improve performance but it doesn't. Anyway I am working on a look ma no ddl version as well. This is brilliant stuff I always knew had to be a better way than select substring(string,charindex(delimiter,string), substring(string(charindex(delimiter,string,(charindex(delimiter,string))) etc. Thanks for checking I do not have a sandbox at work where is ok to freeproccache.

peter-757102 SSCertifiable Points: 6877 More actions · Answer 14

SQL Kiwi (12/30/2012)
Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

More generalized matching is exactly what I always wanted, and would hate to see a query hint specific for subexpression reuse. CTE's expanding as they do now is logical as on each site, the "inline view" defined in the CTE can be used differently and thus must be optimised seperately as well. A more generalised optimisation as in the paper will exploit the similarities still found afer that process.

As for the paper, I only quickly scanned it to see what they were up to, and hope their table signatures are based on what is there after expanding contributing code to a query/batch. You want CTE's and non inline views expanded to have a wide as possible base for finding matches. Thus the optimisation should be 100% seperate from the SQL code and no query hint should be wished for!

aochss SSCommitted Points: 1677 More actions · Answer 15

The performance of this is amazing, as is the thought behind the code. These kind of posts really make you think and learn a whole new set of features in SQL

Thanks Jeff.

A couple of questions (sorry if it has been addressed - I didn't see it mentioned in the thread):

I would not expect a record back if the string being sent in is NULL, a single space or all spaces. In my case, I have records where the string being split is often blank and I would want no records coming back from the CROSS APPLY. I could filter them out using in the WHERE clause. I fixed this by adding LTRIM and RTRIM to the source string in the function.

Also, some of my "not so good" data also had multiple spaces in between the values causing numerous records returned that were blank.

For example:

dbo.DelimitedSplit8k('123 467 789', ' ') -- There are two spaces between 467 and 789

ItemNumber Item

1 123

2 467

3 <-- Empty String

4 789

Thanks Again,

Anton