Find a String in a String in SQL Server

Question

Find a String in a String in SQL Server

Viewing 15 posts - 31 through 45 (of 47 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 1

quagmired (3/18/2016)
Jeff Moden (3/18/2016)
quagmired (3/17/2016)
Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?
Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?
Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...
Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.
Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.
Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.

Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀

For the things that it is good at, it's pretty good at it. While it may not be great at ...

a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character

... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.

And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

quagmired Old Hand Points: 324 More actions · Answer 2

Jeff Moden (3/18/2016)
quagmired (3/18/2016)
Jeff Moden (3/18/2016)
quagmired (3/17/2016)
Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?
Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?
Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...
Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.
It's 2016. Looks like we have some CSV support in SSMS, but not with newlines, omg, lol, etc.
Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.
Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.
Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀
For the things that it is good at, it's pretty good at it. While it may not be great at ...
a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character
... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.
And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.

Your cited thread indicates that you SQL folks (and I am one of you) are avoiding full-blown regex like the plague. Keep in mind, you are already a carrier.

My point is that due to lack of native (!) modern regex support, TSQL forces users into cumbersome, inelegant, non-performant solutions. Your cited thread, which I have read (I did not recognize your mention of it due to your characterization of it as heads-handed) is in no way heads-handed to regex proponents--Solomon's solution addresses a particular need that IS NOT MET by SQL out of the box and is not a vindication for LACK of NATIVE regex support in SQL. You can not compare sub-optimal performance of auxiliary CLR functions to elementary built-ins and claim THAT is a legitimate argument against regex in general. Native regex lives almost everywhere, including SQL server. SQL server needs the modern flavor.

skeleton567 SSCertifiable Points: 5925 More actions · Answer 3

I've watched this discussion for a number of days now, and this morning the thought struck me that the solution probably lies, and the effort might be well spent, in a modification that fixes the original design mistake and gets the data into proper variables at the source. I know there will be the usual nay-sayers that argue 'we can't do that', but I know from years of doing it that sometimes you just have to move ahead and make it 'just happen' as a part of some other maintenance.

As far as the PO - or other - data being buried in text, provide the proper location at the source, put proper requirements and validations in, and force the issue.

What's the expression? Don't be so mamby-pamby about it.

Rick
Disaster Recovery = Backup ( Backup ( Your Backup ) )

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 4

quagmired (3/19/2016)
Jeff Moden (3/18/2016)
quagmired (3/18/2016)
Jeff Moden (3/18/2016)
quagmired (3/17/2016)
Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?
Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?
Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...
Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.
It's 2016. Looks like we have some CSV support in SSMS, but not with newlines, omg, lol, etc.
Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.
Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.
Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀
For the things that it is good at, it's pretty good at it. While it may not be great at ...
a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character
... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.
And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.
Your cited thread indicates that you SQL folks (and I am one of you) are avoiding full-blown regex like the plague. Keep in mind, you are already a carrier.
My point is that due to lack of native (!) modern regex support, TSQL forces users into cumbersome, inelegant, non-performant solutions. Your cited thread, which I have read (I did not recognize your mention of it due to your characterization of it as heads-handed) is in no way heads-handed to regex proponents--Solomon's solution addresses a particular need that IS NOT MET by SQL out of the box and is not a vindication for LACK of NATIVE regex support in SQL. You can not compare sub-optimal performance of auxiliary CLR functions to elementary built-ins and claim THAT is a legitimate argument against regex in general. Native regex lives almost everywhere, including SQL server. SQL server needs the modern flavor.

That's exactly the point I've been trying to make. If you run Regex from SQL Server, it's going to suck for performance.

I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

quagmired Old Hand Points: 324 More actions · Answer 5

Jeff Moden (3/19/2016)
I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx

This is on 2008 R2 Enterprise Edition 64 bit.

Trivial operation on 1M rows:

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

SELECT COUNT(*)

FROM dbo.testEnvironment

WHERE len(randomName) % 2 = 1

SET STATISTICS TIME OFF;

SET STATISTICS IO OFF;

Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 124 ms, elapsed time = 50 ms.

Count:498164

Like pattern match on 1M rows:

SET STATISTICS IO ON;

SET STATISTICS TIME ON;

SELECT COUNT(*)

FROM dbo.testEnvironment

WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

SET STATISTICS TIME OFF;

SET STATISTICS IO OFF;

Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 794 ms, elapsed time = 166 ms.

Count: 111960

And this is php5.5.1/PCRE8.32 2012-11-30 running on virtual openSuse Linux

<?php

$start = microtime(1);

require_once 'database/db.php';

$db = new db('util');

$timeStart = microtime(1);

$values = $db->values("select randomName from dbo.testEnvironment");

$timeEnd = microtime(1);

echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s

";

$timeStart = microtime(1);

$oddLength = 0;

foreach($values as $value) if(strlen($value) % 2) ++$oddLength;

$timeEnd = microtime(1);

echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";

echo "Odd length matches: ",$oddLength,"

";

$timeStart = microtime(1);

$matches = 0;

foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;

$timeEnd = microtime(1);

echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

echo "Total matches: ",$matches,"

";

$timeStart = microtime(1);

$matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });

$timeEnd = microtime(1);

echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

echo "Total matches: ",count($matches),"

";

$timeStart = microtime(1);

$matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);

$timeEnd = microtime(1);

echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";

echo "Total matches: ",count($matches),"

";

$end = microtime(1);

echo 'Elapsed time: ',round($end - $start,4),"

";

Retrieve time: 3.2316s

Iterate time, cheap-op: 0.3293s

Odd length matches: 498164

Iterate time, PCRE matching: 0.8386s

Total matches: 111960

Map time, PCRE matching: 1.7313s

Total matches: 111960

Internal preg_grep, matching: 0.4964s

Total matches: 111960

Elapsed time: 6.6327s

I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.

Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.

preg_grep is the most comparable to LIKE as it simply filters values.

Jeff Moden SSC Guru Points: 1004682 More actions · Answer 6

quagmired (3/19/2016)
Jeff Moden (3/19/2016)
I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?
Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx
This is on 2008 R2 Enterprise Edition 64 bit.
Trivial operation on 1M rows:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE len(randomName) % 2 = 1
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 124 ms, elapsed time = 50 ms.
Count:498164
Like pattern match on 1M rows:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 794 ms, elapsed time = 166 ms.
Count: 111960
And this is php5.5.1/PCRE8.32 2012-11-30 running on virtual openSuse Linux
<?php
$start = microtime(1);
require_once 'database/db.php';
$db = new db('util');
$timeStart = microtime(1);
$values = $db->values("select randomName from dbo.testEnvironment");
$timeEnd = microtime(1);
echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s
";
$timeStart = microtime(1);
$oddLength = 0;
foreach($values as $value) if(strlen($value) % 2) ++$oddLength;
$timeEnd = microtime(1);
echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";
echo "Odd length matches: ",$oddLength,"
";
$timeStart = microtime(1);
$matches = 0;
foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;
$timeEnd = microtime(1);
echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",$matches,"
";
$timeStart = microtime(1);
$matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });
$timeEnd = microtime(1);
echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",count($matches),"
";
$timeStart = microtime(1);
$matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);
$timeEnd = microtime(1);
echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",count($matches),"
";
$end = microtime(1);
echo 'Elapsed time: ',round($end - $start,4),"
";
Retrieve time: 3.2316s
Iterate time, cheap-op: 0.3293s
Odd length matches: 498164
Iterate time, PCRE matching: 0.8386s
Total matches: 111960
Map time, PCRE matching: 1.7313s
Total matches: 111960
Internal preg_grep, matching: 0.4964s
Total matches: 111960
Elapsed time: 6.6327s
I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.
Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.
preg_grep is the most comparable to LIKE as it simply filters values.

Thanks for the test code and the results timings. I appreciate it.

Each PHP snippet (except Preg_Grep) would appear to have lost to LIKE by quit a bit especially since you had to do an initial load. Preg_Grep appears to be a contender for LIKE unless you consider the initial load. I'm fairly ignorant of PHP and its built in tools so don't laugh too hard when I ask if the "Preg_" functionality is considered to be Regex or not.

Correct me if I've made a mistake in interpreting what you've posted, please, but I'm not seeing an advantage compared to LIKE when LIKE is capable of doing the job for something simple like this. In fact, I see a couple of disadvantages in that this had to be done outside the database and, if you had to write the rows back to a table, you'll have the extra overhead of doing so instead of just using the result set in the LIKE query.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

quagmired Old Hand Points: 324 More actions · Answer 7

Jeff Moden (3/19/2016)
quagmired (3/19/2016)
Jeff Moden (3/19/2016)
I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?
Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx
This is on 2008 R2 Enterprise Edition 64 bit.
Trivial operation on 1M rows:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE len(randomName) % 2 = 1
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 124 ms, elapsed time = 50 ms.
Count:498164
Like pattern match on 1M rows:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 794 ms, elapsed time = 166 ms.
Count: 111960
And this is php5.5.1/PCRE8.32 2012-11-30 running on virtual openSuse Linux
<?php
$start = microtime(1);
require_once 'database/db.php';
$db = new db('util');
$timeStart = microtime(1);
$values = $db->values("select randomName from dbo.testEnvironment");
$timeEnd = microtime(1);
echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s
";
$timeStart = microtime(1);
$oddLength = 0;
foreach($values as $value) if(strlen($value) % 2) ++$oddLength;
$timeEnd = microtime(1);
echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";
echo "Odd length matches: ",$oddLength,"
";
$timeStart = microtime(1);
$matches = 0;
foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;
$timeEnd = microtime(1);
echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",$matches,"
";
$timeStart = microtime(1);
$matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });
$timeEnd = microtime(1);
echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",count($matches),"
";
$timeStart = microtime(1);
$matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);
$timeEnd = microtime(1);
echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";
echo "Total matches: ",count($matches),"
";
$end = microtime(1);
echo 'Elapsed time: ',round($end - $start,4),"
";
Retrieve time: 3.2316s
Iterate time, cheap-op: 0.3293s
Odd length matches: 498164
Iterate time, PCRE matching: 0.8386s
Total matches: 111960
Map time, PCRE matching: 1.7313s
Total matches: 111960
Internal preg_grep, matching: 0.4964s
Total matches: 111960
Elapsed time: 6.6327s
I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.
Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.
preg_grep is the most comparable to LIKE as it simply filters values.
Thanks for the test code and the results timings. I appreciate it.
Each PHP snippet (except Preg_Grep) would appear to have lost to LIKE by quit a bit especially since you had to do an initial load. Preg_Grep appears to be a contender for LIKE unless you consider the initial load. I'm fairly ignorant of PHP and its built in tools so don't laugh too hard when I ask if the "Preg_" functionality is considered to be Regex or not.
Correct me if I've made a mistake in interpreting what you've posted, please, but I'm not seeing an advantage compared to LIKE when LIKE is capable of doing the job for something simple like this. In fact, I see a couple of disadvantages in that this had to be done outside the database and, if you had to write the rows back to a table, you'll have the extra overhead of doing so instead of just using the result set in the LIKE query.

You asked for example performance numbers of regex outside of SQL Server. I have provided a simple example of PCRE matches in a single thread in PHP user land.

Cadavre's post of a test of Solomon Rutzky's SQL# regex functions with an apparent CPU time of 43 seconds tells me the problem is not regex (http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx).

Here are more examples on better hardware, loading the test data from a newline-delimited file:

Retrieve time: 0.1751s

Value count: 1000000

Character classes

Pattern: /[A-C][A-F][P-Z]/

Matches: 111960

Random Matches: DUCWEPSACU WHONSORCAPO RSUTUCERGO KKATBUHBEWHI NAPRAFRONSE

Elapsed: 0.1118s

*

Pattern: /AB*C/i

Matches: 36665

Random Matches: RVEPLACROVO HROHREPACSI RVOKACWUWHE NACLUJUSI MGIWACSOGE

Elapsed: 0.0744s

Possessive *

Pattern: /AB*+C/i

Matches: 36665

Random Matches: FLEWCUPACSI LIWHUZACRU HBOLBAFACHA MGIWACLISA NACLUCSESHI

Elapsed: 0.1139s

Alternation

Pattern: /(ABC)|(BCA)|D/i

Matches: 149917

Random Matches: RTIWCIRDAZO TUDOHWEBRI JEHRITURDA DRAWSOBIGU PLUDROSKEVRO

Elapsed: 0.3457s

Alternation with *

Pattern: /(A.*C)|(B.*A)|D/i

Matches: 328915

Random Matches: STESHAGHACLA SWOPRAHWUCO ZUNSUWSARDU SKOLIDOGI PSASANOWCI

Elapsed: 0.2878s

Anchoring

Pattern: /^PA/i

Matches: 5000

Random Matches: PAPAGLELBA PAGLORTUGHA PAWTATWIBWE PAGLOSATWA PAMARHODO

Elapsed: 0.0697s

Anchoring

Pattern: /WE$/i

Matches: 15000

Random Matches: CROZEHREWE RDAGEHSOTWE NUBOSWIWE JUHSAHWUWE FOTWASWIWE

Elapsed: 0.0724s

Anchoring, non-greedy *

Pattern: /^S.*?E.*O$/i

Matches: 14239

Random Matches: SNEVORGUSCO SWOSLEGLURHO SWOSLEFUHSO SKORDEDOSKO SWUHEHROLPO

Elapsed: 0.0893s

Elapsed time: 1.3622s

That's 8 relatively simple patterns, 1 million matches each, including one, AB*C, which in TSQL would require a custom scalar function with iteration, all in < 1.4 seconds in a single thread in PHP user land. I'm not suggesting a replacement for LIKE. I'm suggesting that regex isn't the performance horror some people think it is.

Edit: I was not able to attach this to the forum post. Here is my test data if anyone would like to chew on it with another runtime: https://www.dropbox.com/s/2s8mlpd9pm5sgzh/test%20data.txt?dl=0

Alan Burstein SSC Guru Points: 61152 More actions · Answer 8

Steve Jones - SSC Editor (3/10/2016)
PATINDEX is powerful, without a doubt. I have more of a fondness for charindex, mostly because I can indicate a start position. However, as Luis showed, you can get around this.
Certainly there is power in PATINDEX, and I didn't tackle that here. I wanted to focus on a few common problems that I have seen across the last 6 months in questions. It seems many people are struggling with simple parsing.
A more complex treatment of PATINDEX would have to wait for another piece. Unless you want to write one 😉

I started an article and will submit it this week. 😀

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Alan Burstein SSC Guru Points: 61152 More actions · Answer 9

quagmired (3/17/2016)
Jeff Moden (3/17/2016)
quagmired (3/17/2016)
My personal favorite is, given the same accuracy, which ever is the fastest and least resource intensive. That's usually not REGEX as called from SQL SERVER even when using a CLR. There have been a ton of threads on this very subject where it has been found that if CHARINDEX or PATINDEX can be used in a straight forward manner (and sometimes in a complicated manner), it beats and sometimes blows the doors off of Regex calls.
All the more reason to have a nimble regex library (PCRE) running natively within SQL server with a instance-wide cache of compiled patterns. The cumbersome TSQL solutions for even relatively simple patterns are a joke when a short, sweet, regex will suffice. It's 2016. Why don't we have this?
So what happens when something doesn't fit a "compiled pattern"?
Mid string searches in SQL suck. Regex has it's own problems. The reason either of them exist is to be able to interrogate improperly formed data. The best thing (although rarely the easiest thing) to do would be to have a "Moden Style" pork chop dinner with whomever the data provider is and get them to fix their junk. 😉
Not looking to turn this into a drag 'em out, but...
Most runtimes maintain a cache of compiled patterns. If a new one comes along, compile it, and keep it around for the next time. I have no idea, but I'd be willing to bet that SQL Server does this with LIKE patterns.
Regex or any variety of pattern matching exists for a whole lot more than mid-string searches. Think data validation. I absolutely agree that storing structured data out of place in a blob is a piss-poor practice. But pattern matching still has its place in SQL, especially in analytics, and providing a subset of full blown regex is just putting up unnecessary hoops and hurdles.
Name a problem with regex that is not a problem with what you get in standard SQL pattern matching.
Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?

I have to run and will be back later to add my 2 cents but here's how I'd do it.

USE tempdb -- safe place to test

GO

-- There's a reason I'm recomending a perm tally table here

IF OBJECT_ID('dbo.tally') IS NULL

BEGIN

CREATE TABLE dbo.tally (N int not null);

INSERT dbo.tally

SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a, sys.all_columns b;

ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY(N) WITH FILLFACTOR = 100;

ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE(N);

END

GO

-- test data

IF OBJECT_ID('tempdb.dbo.strings') IS NOT NULL DROP TABLE dbo.strings;

CREATE TABLE dbo.strings(StrID int identity primary key, StrTxt varchar(100) NOT NULL);

INSERT dbo.strings

SELECT s

FROM

(VALUES

('10xxxxx01'),('1yyyy11'),('zz0zz'),('xx11xx'),('110'), -- these don't match the pattern

('xx10001'),('ff101'),('100000000000000000012345') -- these do match the pattern

)t(s);

-- technique #1: only return rows that match the pattern

SELECT StrID, StrTxt

FROM dbo.strings s

CROSS JOIN dbo.tally t

WHERE t.N < LEN(s.StrTxt)

AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

-- technique #2: Return all rows with a flag

SELECT StrID, StrTxt, MatchesPatternFlag = ISNULL(N/N,0)

FROM dbo.strings s

LEFT JOIN dbo.tally t

ON t.N < LEN(s.StrTxt)

AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

quagmired Old Hand Points: 324 More actions · Answer 10

Alan.B (3/20/2016)
I have to run and will be back later to add my 2 cents but here's how I'd do it.
USE tempdb -- safe place to test
GO
-- There's a reason I'm recomending a perm tally table here
IF OBJECT_ID('dbo.tally') IS NULL
BEGIN
CREATE TABLE dbo.tally (N int not null);
INSERT dbo.tally
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a, sys.all_columns b;
ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY(N) WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE(N);
END
GO
-- test data
IF OBJECT_ID('tempdb.dbo.strings') IS NOT NULL DROP TABLE dbo.strings;
CREATE TABLE dbo.strings(StrID int identity primary key, StrTxt varchar(100) NOT NULL);
INSERT dbo.strings
SELECT s
FROM
(VALUES
('10xxxxx01'),('1yyyy11'),('zz0zz'),('xx11xx'),('110'), -- these don't match the pattern
('xx10001'),('ff101'),('100000000000000000012345') -- these do match the pattern
)t(s);
-- technique #1: only return rows that match the pattern
SELECT StrID, StrTxt
FROM dbo.strings s
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt)
AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
-- technique #2: Return all rows with a flag
SELECT StrID, StrTxt, MatchesPatternFlag = ISNULL(N/N,0)
FROM dbo.strings s
LEFT JOIN dbo.tally t
ON t.N < LEN(s.StrTxt)
AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million. Run this against a table containing '1' + replicate('0',999998) + '1'.

If we had a decent flavor of regular expressions, a solution would look like: SELECT whatever FROM wherever WHERE someString REGEX_MATCH '10*+1'

And what if we wanted "a 1 followed by arbitrarily many 0s or 2s followed by a 1"? SELECT whatever FROM wherever WHERE someString REGEX_MATCH '1[02]*+1'

Back to the drawing board with TSQL as it is.

Alan Burstein SSC Guru Points: 61152 More actions · Answer 11

quagmired (3/20/2016)
This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

GO

CREATE FUNCTION dbo.fnTally

/**********************************************************************************************************************

Purpose:

Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

Usage:

--===== Syntax example (Returns BIGINT)

SELECT t.N

FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

;

Notes:

1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

Refer to the following URLs for how it works and introduction for how it replaces certain loops.

http://www.sqlservercentral.com/articles/T-SQL/62867/

http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

will cause the sequence to start at 1.

3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

5. If @MaxN is negative or NULL, a "TOP" error will be returned.

6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

that many values, you should consider using a different tool. ;-)

7. There will be a substantial reduction in performance if "N" is sorted in descending order. If a descending

sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

DECLARE @MaxN BIGINT;

SELECT @MaxN = 1000;

SELECT DescendingN = @MaxN-N+1

FROM dbo.fnTally(1,@MaxN);

8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

Revision History:

Rev 00 - Unknown - Jeff Moden

- Initial creation with error handling for @MaxN.

Rev 01 - 09 Feb 2013 - Jeff Moden

- Modified to start at 0 or 1.

Rev 02 - 16 May 2013 - Jeff Moden

- Removed error handling for @MaxN because of exceptional cases.

**********************************************************************************************************************/

(@ZeroOrOne BIT, @MaxN INT)

RETURNS TABLE WITH SCHEMABINDING AS RETURN

WITH

E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1) --10E1 or 10 rows

, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

UNION ALL

SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

GO

... Now we don't need a table at all.

I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

GO

-- Use logic from solution #1 for an indexed view

-- 1st the index

CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

SELECT StrID, StrTxt

FROM dbo.strings

CROSS JOIN dbo.tally t

WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

GO

CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

Now when we run this query:

SELECT StrID, StrTxt

FROM dbo.strings s

CROSS JOIN dbo.tally t

WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

Or this query:

SELECT * FROM dbo.vw_StringPat1;

We get this index plan:

Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

SELECT s.*

FROM dbo.strings s

CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

WHERE item LIKE '%1%0%1%';

You would have to modify the function however to support varchar(max) (which is not that hard to do).

Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

quagmired Old Hand Points: 324 More actions · Answer 12

Alan.B (3/20/2016)
quagmired (3/20/2016)
This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.
Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):
IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally
GO
CREATE FUNCTION dbo.fnTally
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.
As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.
Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;
Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URLs for how it works and introduction for how it replaces certain loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool. ;-)
7. There will be a substantial reduct
ion in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.
DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);
8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN INT)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1) --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN
GO
... Now we don't need a table at all.
I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:
IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;
GO
-- Use logic from solution #1 for an indexed view
-- 1st the index
CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS
SELECT StrID, StrTxt
FROM dbo.strings
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
GO
CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);
Now when we run this query:
SELECT StrID, StrTxt
FROM dbo.strings s
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
Or this query:
SELECT * FROM dbo.vw_StringPat1;
We get this index plan:
Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.
UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:
SELECT s.*
FROM dbo.strings s
CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')
WHERE item LIKE '%1%0%1%';
You would have to modify the function however to support varchar(max) (which is not that hard to do).
Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

Perhaps you will consider the rest of my response to your prior post later on?

I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

Name that regex solution (and given the barrier of entry for regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

Seriously, for a string of length N, creating N strings of 0s of all lengths up to N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

quagmired Old Hand Points: 324 More actions · Answer 13

Alan.B (3/20/2016)
quagmired (3/20/2016)
This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.
Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):
IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally
GO
CREATE FUNCTION dbo.fnTally
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.
As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.
Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;
Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URLs for how it works and introduction for how it replaces certain loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool. ;-)
7. There will be a substantial reduct
ion in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.
DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);
8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN INT)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1) --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN
GO
... Now we don't need a table at all.
I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:
IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;
GO
-- Use logic from solution #1 for an indexed view
-- 1st the index
CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS
SELECT StrID, StrTxt
FROM dbo.strings
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
GO
CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);
Now when we run this query:
SELECT StrID, StrTxt
FROM dbo.strings s
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
Or this query:
SELECT * FROM dbo.vw_StringPat1;
We get this index plan:
Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.
UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:
SELECT s.*
FROM dbo.strings s
CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')
WHERE item LIKE '%1%0%1%';
You would have to modify the function however to support varchar(max) (which is not that hard to do).
Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

Perhaps you will consider the rest of my response to your prior post later on?

I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

Name that regex solution (and given the barrier of entry for modern regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

quagmired Old Hand Points: 324 More actions · Answer 14

Alan.B (3/20/2016)
quagmired (3/20/2016)
This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.
Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):
IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally
GO
CREATE FUNCTION dbo.fnTally
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.
As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.
Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;
Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URLs for how it works and introduction for how it replaces certain loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool. ;-)
7. There will be a substantial reduct
ion in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.
DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);
8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN INT)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1) --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN
GO
... Now we don't need a table at all.
I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:
IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;
GO
-- Use logic from solution #1 for an indexed view
-- 1st the index
CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS
SELECT StrID, StrTxt
FROM dbo.strings
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
GO
CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);
Now when we run this query:
SELECT StrID, StrTxt
FROM dbo.strings s
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
Or this query:
SELECT * FROM dbo.vw_StringPat1;
We get this index plan:
Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.
UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:
SELECT s.*
FROM dbo.strings s
CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')
WHERE item LIKE '%1%0%1%';
You would have to modify the function however to support varchar(max) (which is not that hard to do).
Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

Perhaps you will consider the rest of my response to your prior post later on?

I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

Name that regex solution (and given the barrier of entry for real regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

quagmired Old Hand Points: 324 More actions · Answer 15

Alan.B (3/20/2016)
quagmired (3/20/2016)
This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.
Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):
IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally
GO
CREATE FUNCTION dbo.fnTally
/**********************************************************************************************************************
Purpose:
Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.
As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.
Usage:
--===== Syntax example (Returns BIGINT)
SELECT t.N
FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
;
Notes:
1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
Refer to the following URLs for how it works and introduction for how it replaces certain loops.
http://www.sqlservercentral.com/articles/T-SQL/62867/
http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type
will cause the sequence to start at 1.
3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
5. If @MaxN is negative or NULL, a "TOP" error will be returned.
6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger
number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with
that many values, you should consider using a different tool. ;-)
7. There will be a substantial reduct
ion in performance if "N" is sorted in descending order. If a descending
sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.
DECLARE @MaxN BIGINT;
SELECT @MaxN = 1000;
SELECT DescendingN = @MaxN-N+1
FROM dbo.fnTally(1,@MaxN);
8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
Revision History:
Rev 00 - Unknown - Jeff Moden
- Initial creation with error handling for @MaxN.
Rev 01 - 09 Feb 2013 - Jeff Moden
- Modified to start at 0 or 1.
Rev 02 - 16 May 2013 - Jeff Moden
- Removed error handling for @MaxN because of exceptional cases.
**********************************************************************************************************************/
(@ZeroOrOne BIT, @MaxN INT)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1) --10E1 or 10 rows
, E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows
, E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows
SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
UNION ALL
SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN
GO
... Now we don't need a table at all.
I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:
IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;
GO
-- Use logic from solution #1 for an indexed view
-- 1st the index
CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS
SELECT StrID, StrTxt
FROM dbo.strings
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
GO
CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);
Now when we run this query:
SELECT StrID, StrTxt
FROM dbo.strings s
CROSS JOIN dbo.tally t
WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';
Or this query:
SELECT * FROM dbo.vw_StringPat1;
We get this index plan:
Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.
UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:
SELECT s.*
FROM dbo.strings s
CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')
WHERE item LIKE '%1%0%1%';
You would have to modify the function however to support varchar(max) (which is not that hard to do).
Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

Perhaps you will consider the rest of my response to your prior post later on?

I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

Name that regex solution (and given the barrier of entry for real regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.