Splitting Strings Based on Patterns

Question

Splitting Strings Based on Patterns

Dwain Camps

SSC Guru

Points: 86978
More actions
November 28, 2012 at 11:58 pm

#265270

Comments posted to this topic are about the item Splitting Strings Based on Patterns
My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Viewing 15 posts - 1 through 15 (of 56 total)

You must be logged in to reply to this topic. Login to reply

apsalescoordinator Valued Member Points: 66 More actions · Answer 1

dwain.c (11/28/2012)
Comments posted to this topic are about the item <A HREF="/articles/String+Manipulation/94365/">Splitting Strings Based on Patterns</A>

dplaut 49149 SSC Eights! Points: 909 More actions · Answer 2

Is there a way to download these examples, copying text out of those scrolling boxes is a pain and carries along formatting code.

Mike DiRenzo SSCrazy Points: 2013 More actions · Answer 3

Mike DiRenzo

SSCrazy

Points: 2013

November 29, 2012 at 6:03 am

#1563719

WOW. What a great read!

peter-757102 SSCertifiable Points: 6877 More actions · Answer 4

I haven't read the article properly and just scanned it to see what it is about and if it is interesting.

But one thing I noticed quickly was a quote at end saying:

SQL Server Execution Times: CPU time = 249 ms, elapsed time = 247 ms.

Which fired all alarm bells in my head at once as one shoud not use set statistics time on for any serious benchmarking, ever! Measuring performance this way can have a really big impact on the actual execution times themselfs. This has to do with how SQL Server internally processes various constructs and its points of measurement.

The most reliable way of measuring is to have long sufficiently long runs and start with a GetDate() and end with a GetDate() and compute the difference after the test has run. You will require a variable to store the start time for this, but it won't affect the results as much.

Other then that, the article looks as quite a bit of decdication and work has gone in, and I will sure read it in full when I have the time. In the meantime, please recheck the conclusions based on this different method of measuring, you might be surprised!

MarbryHardin Old Hand Points: 363 More actions · Answer 5

It does seem like that would be one instance where a CLR udf would do very well, especially to support more complex examples. That's precisely the type of thing regular expressions are good at.

Although, string parsing like that in SQL typically smells like bad design or something that should have been done outside SQL in the first place.

Peter H SSC Enthusiast Points: 127 More actions · Answer 6

My version of SQL (2005) does not seem to handle the "VALUES" in

SELECT TOP(DATALENGTH(@List))

n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))

FROM

(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),

(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),

(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),

(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n)

Could anyone help me expand on this please?

fregatepllada SSCommitted Points: 1702 More actions · Answer 7

Why we have to be inventive and create such acrobatic solution in a first place? Just to compensate for bad design and implementation decisions.

If original DB developers did not use it as a data dump, there is no need for such acrobatics. Good article to refresh my memory.

GRUMPY OLD MEN

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 8

Irozenberg 1347 (11/29/2012)
Why we have to be inventive and create such acrobatic solution in a first place? Just to compensate for bad design and implementation decisions.

In a word, YES! 😉

If original DB developers did not use it as a data dump, there is no need for such acrobatics. Good article to refresh my memory.

I make a good part of my living by fixing such data dumps. Thank goodness there's a need for such acrobatics! 😀

"Fellow GRUMPY OLD MAN".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 9

Peter H (11/29/2012)
My version of SQL (2005) does not seem to handle the "VALUES" in
SELECT TOP(DATALENGTH(@List))
n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
FROM
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) f (n),
(VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) g (n)
Could anyone help me expand on this please?

That form of the VALUES clause came out in 2008. The following can be used as a replacement that will work at approximately the same speed...

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to

-- 10,000... enough to cover VARCHAR(8000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b) --10E+4 or 10,000 rows max

--===== Generate the numbers from 1 to the length of @List

SELECT TOP (DATALENGTH(@List))

n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM E4

;

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 10

peter-757102 (11/29/2012)
I haven't read the article properly and just scanned it to see what it is about and if it is interesting.
But one thing I noticed quickly was a quote at end saying:
SQL Server Execution Times: CPU time = 249 ms, elapsed time = 247 ms.
Which fired all alarm bells in my head at once as one shoud not use set statistics time on for any serious benchmarking, ever! Measuring performance this way can have a really big impact on the actual execution times themselfs. This has to do with how SQL Server internally processes various constructs and its points of measurement.
The most reliable way of measuring is to have long sufficiently long runs and start with a GetDate() and end with a GetDate() and compute the difference after the test has run. You will require a variable to store the start time for this, but it won't affect the results as much.
Other then that, the article looks as quite a bit of decdication and work has gone in, and I will sure read it in full when I have the time. In the meantime, please recheck the conclusions based on this different method of measuring, you might be surprised!

Although strongly I agree with what you said (proof is here... http://www.sqlservercentral.com/articles/T-SQL/91724/), don't forget that CPU time can and frequently does exceed DURATION if parallelism is involved.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 11

Nicely done, Dwain! I've not done a deep dive on it all but it's a great read that you really did put your heart into. Due to a NEC problem (Not Enough Coffee :-D), I couldn't figure out where you were going with all of this until I saw the first result set (3rd string) and the lights suddenly went on. Well done!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Dwain Camps SSC Guru Points: 86978 More actions · Answer 12

dplaut 49149 (11/29/2012)
Is there a way to download these examples, copying text out of those scrolling boxes is a pain and carries along formatting code.

Sorry but in this article I didn't post downloadable examples except for the final test harness for DelimitedSplit8K. I'll make an attempt to do so in a couple of days, as I am traveling out of the country over the weekend on business and I'll have limited time.

I'll try to remember to PM you once that is done.

Edit: I have uploaded the scripts to here: http://www.sqlservercentral.com/Forums/Topic1390297-3122-4.aspx

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 13

Mike DiRenzo (11/29/2012)
WOW. What a great read!

Thanks Mike! Glad you stopped by.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Dwain Camps SSC Guru Points: 86978 More actions · Answer 14

MarbryHardin (11/29/2012)
It does seem like that would be one instance where a CLR udf would do very well, especially to support more complex examples. That's precisely the type of thing regular expressions are good at.
Although, string parsing like that in SQL typically smells like bad design or something that should have been done outside SQL in the first place.

To your first point, yes and no. Certainly there are regular expression CLR parsers out there that you could adapt. So if you substitute the regular expression for the PATINDEX argument, this could work well. However if you need to use the SQL PATINDEX for some reason, I suspect creating that in a CLR would be a major headache (waiting for someone to prove me wrong).

To you second point, in general I agree but often times we forget that we're not the only designers out there, so unfortunately there are a few bad ones. 🙂 But more importantly, there are times where incoming data to our SQL Server will take the much more rudimentary format of strings that must be parsed to derive the incoming information. Any application that takes feeds from a device (e.g., GPS unit) will typically receive messages that need to be parsed in some format.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St