Find a String in a String in SQL Server

  • Jeff Moden

    SSC Guru

    Points: 995157

    quagmired (3/18/2016)


    Jeff Moden (3/18/2016)


    quagmired (3/17/2016)


    Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?

    Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?

    Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...

    Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.

    Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.

    Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.

    Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀

    For the things that it is good at, it's pretty good at it. While it may not be great at ...

    a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character

    ... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.

    And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • quagmired

    Old Hand

    Points: 324

    Jeff Moden (3/18/2016)


    quagmired (3/18/2016)


    Jeff Moden (3/18/2016)


    quagmired (3/17/2016)


    Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?

    Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?

    Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...

    Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.

    It's 2016. Looks like we have some CSV support in SSMS, but not with newlines, omg, lol, etc.

    Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.

    Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.

    Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀

    For the things that it is good at, it's pretty good at it. While it may not be great at ...

    a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character

    ... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.

    And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.

    Your cited thread indicates that you SQL folks (and I am one of you) are avoiding full-blown regex like the plague. Keep in mind, you are already a carrier.

    My point is that due to lack of native (!) modern regex support, TSQL forces users into cumbersome, inelegant, non-performant solutions. Your cited thread, which I have read (I did not recognize your mention of it due to your characterization of it as heads-handed) is in no way heads-handed to regex proponents--Solomon's solution addresses a particular need that IS NOT MET by SQL out of the box and is not a vindication for LACK of NATIVE regex support in SQL. You can not compare sub-optimal performance of auxiliary CLR functions to elementary built-ins and claim THAT is a legitimate argument against regex in general. Native regex lives almost everywhere, including SQL server. SQL server needs the modern flavor.

  • skeleton567

    SSCarpal Tunnel

    Points: 4970

    I've watched this discussion for a number of days now, and this morning the thought struck me that the solution probably lies, and the effort might be well spent, in a modification that fixes the original design mistake and gets the data into proper variables at the source. I know there will be the usual nay-sayers that argue 'we can't do that', but I know from years of doing it that sometimes you just have to move ahead and make it 'just happen' as a part of some other maintenance.

    As far as the PO - or other - data being buried in text, provide the proper location at the source, put proper requirements and validations in, and force the issue.

    What's the expression? Don't be so mamby-pamby about it.

    Rick
    Simplicity is the ultimate sophistication.
    - L. DaVinci

  • Jeff Moden

    SSC Guru

    Points: 995157

    quagmired (3/19/2016)


    Jeff Moden (3/18/2016)


    quagmired (3/18/2016)


    Jeff Moden (3/18/2016)


    quagmired (3/17/2016)


    Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?

    Starting thought.... Have you ever actually had to do that anywhere? If so, what was it for?

    Also, there's a whole thread where people spent the time trying to justify RegEx from SQL and got their heads handed to them by coded proof. I'll see if I can find it...

    Fine, how about a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character. When have I ever had to match 10*1? Maybe in automata class. Why does it matter? It's a common class of problems that the SQL community, apparently, does not want to own, and the waste of effort and bloated TSQL code that ensues is really very much a joke, hence this article/forum topic.

    It's 2016. Looks like we have some CSV support in SSMS, but not with newlines, omg, lol, etc.

    Don't pay any attention to the fact that LIKE pattern matching already available in SQL is dinosaur regex.

    Handed heads: if that coded "proof" involves a native engine with caching and features that prevent runaway patterns, (non)?greedy, possessive, anchored, etc (again, PCRE), then perhaps that's proof, but I really doubt it. Looking forward to a link.

    Heh... Why does it matter? I will agree that there are certain things in this world that SQL is going to suck at and shouldn't be used for. Resolving and interrogating heterogeneous data is one of those things that it's not particularly good at. Its also not real good for drawing pretty pictures and it won't make coffee unless you call an interrupt on the coffee pot using xp_CmdShell. 😀

    For the things that it is good at, it's pretty good at it. While it may not be great at ...

    a string with 'PO:' preceded by a non-alphanumeric character or nothing at all, followed by an arbitrary number of spaces followed by up to ten decimal digits, terminating, or followed by a non decimal character

    ... even the dinosaur fudge known as "LIKE" will be faster than a call to Regex from T-SQL for those things that LIKE or PATINDEX can resolve and that's my point. Regex may be a panacea for resolving complex patterns that the likes of "LIKE" have little chance of solving but it won't be a panacea of performance. See the discussion and the test code at the link I posted. Even the people that tout the virtues of Regex admit that performance will suffer. And please don't say that performance doesn't matter. It always matters.

    And it's not that SQL folks are trying to avoid Regex... they're trying to avoid bad data that would require something like Regex to resolve... and they pretty good at that, too.

    Your cited thread indicates that you SQL folks (and I am one of you) are avoiding full-blown regex like the plague. Keep in mind, you are already a carrier.

    My point is that due to lack of native (!) modern regex support, TSQL forces users into cumbersome, inelegant, non-performant solutions. Your cited thread, which I have read (I did not recognize your mention of it due to your characterization of it as heads-handed) is in no way heads-handed to regex proponents--Solomon's solution addresses a particular need that IS NOT MET by SQL out of the box and is not a vindication for LACK of NATIVE regex support in SQL. You can not compare sub-optimal performance of auxiliary CLR functions to elementary built-ins and claim THAT is a legitimate argument against regex in general. Native regex lives almost everywhere, including SQL server. SQL server needs the modern flavor.

    That's exactly the point I've been trying to make. If you run Regex from SQL Server, it's going to suck for performance.

    I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • quagmired

    Old Hand

    Points: 324

    Jeff Moden (3/19/2016)

    I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

    Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx

    This is on 2008 R2 Enterprise Edition 64 bit.

    Trivial operation on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE len(randomName) % 2 = 1

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 124 ms, elapsed time = 50 ms.

    Count:498164

    Like pattern match on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 794 ms, elapsed time = 166 ms.

    Count: 111960

    And this is php5.5.1/PCRE 8.32 2012-11-30 running on virtual openSuse Linux

    <?php

    $start = microtime(1);

    require_once 'database/db.php';

    $db = new db('util');

    $timeStart = microtime(1);

    $values = $db->values("select randomName from dbo.testEnvironment");

    $timeEnd = microtime(1);

    echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s

    ";

    $timeStart = microtime(1);

    $oddLength = 0;

    foreach($values as $value) if(strlen($value) % 2) ++$oddLength;

    $timeEnd = microtime(1);

    echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";

    echo "Odd length matches: ",$oddLength,"

    ";

    $timeStart = microtime(1);

    $matches = 0;

    foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;

    $timeEnd = microtime(1);

    echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",$matches,"

    ";

    $timeStart = microtime(1);

    $matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });

    $timeEnd = microtime(1);

    echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $timeStart = microtime(1);

    $matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);

    $timeEnd = microtime(1);

    echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $end = microtime(1);

    echo 'Elapsed time: ',round($end - $start,4),"

    ";

    Retrieve time: 3.2316s

    Iterate time, cheap-op: 0.3293s

    Odd length matches: 498164

    Iterate time, PCRE matching: 0.8386s

    Total matches: 111960

    Map time, PCRE matching: 1.7313s

    Total matches: 111960

    Internal preg_grep, matching: 0.4964s

    Total matches: 111960

    Elapsed time: 6.6327s

    I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.

    Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.

    preg_grep is the most comparable to LIKE as it simply filters values.

  • Jeff Moden

    SSC Guru

    Points: 995157

    quagmired (3/19/2016)


    Jeff Moden (3/19/2016)

    I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

    Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx

    This is on 2008 R2 Enterprise Edition 64 bit.

    Trivial operation on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE len(randomName) % 2 = 1

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 124 ms, elapsed time = 50 ms.

    Count:498164

    Like pattern match on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 794 ms, elapsed time = 166 ms.

    Count: 111960

    And this is php5.5.1/PCRE 8.32 2012-11-30 running on virtual openSuse Linux

    <?php

    $start = microtime(1);

    require_once 'database/db.php';

    $db = new db('util');

    $timeStart = microtime(1);

    $values = $db->values("select randomName from dbo.testEnvironment");

    $timeEnd = microtime(1);

    echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s

    ";

    $timeStart = microtime(1);

    $oddLength = 0;

    foreach($values as $value) if(strlen($value) % 2) ++$oddLength;

    $timeEnd = microtime(1);

    echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";

    echo "Odd length matches: ",$oddLength,"

    ";

    $timeStart = microtime(1);

    $matches = 0;

    foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;

    $timeEnd = microtime(1);

    echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",$matches,"

    ";

    $timeStart = microtime(1);

    $matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });

    $timeEnd = microtime(1);

    echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $timeStart = microtime(1);

    $matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);

    $timeEnd = microtime(1);

    echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $end = microtime(1);

    echo 'Elapsed time: ',round($end - $start,4),"

    ";

    Retrieve time: 3.2316s

    Iterate time, cheap-op: 0.3293s

    Odd length matches: 498164

    Iterate time, PCRE matching: 0.8386s

    Total matches: 111960

    Map time, PCRE matching: 1.7313s

    Total matches: 111960

    Internal preg_grep, matching: 0.4964s

    Total matches: 111960

    Elapsed time: 6.6327s

    I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.

    Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.

    preg_grep is the most comparable to LIKE as it simply filters values.

    Thanks for the test code and the results timings. I appreciate it.

    Each PHP snippet (except Preg_Grep) would appear to have lost to LIKE by quit a bit especially since you had to do an initial load. Preg_Grep appears to be a contender for LIKE unless you consider the initial load. I'm fairly ignorant of PHP and its built in tools so don't laugh too hard when I ask if the "Preg_" functionality is considered to be Regex or not.

    Correct me if I've made a mistake in interpreting what you've posted, please, but I'm not seeing an advantage compared to LIKE when LIKE is capable of doing the job for something simple like this. In fact, I see a couple of disadvantages in that this had to be done outside the database and, if you had to write the rows back to a table, you'll have the extra overhead of doing so instead of just using the result set in the LIKE query.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems
    Create a Tally Function (fnTally)

  • quagmired

    Old Hand

    Points: 324

    Jeff Moden (3/19/2016)


    quagmired (3/19/2016)


    Jeff Moden (3/19/2016)

    I'm a big fan of using the right tool to the job (even when others don't think it's the right tool). With that in mind, what I'd like to see is how well Regex performs outside of SQL Server in some form of a comparison against SQL Server. Have you got anything there?

    Borrowing test data directly from Cadavre, http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx

    This is on 2008 R2 Enterprise Edition 64 bit.

    Trivial operation on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE len(randomName) % 2 = 1

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 124 ms, elapsed time = 50 ms.

    Count:498164

    Like pattern match on 1M rows:

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    SELECT COUNT(*)

    FROM dbo.testEnvironment

    WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

    SET STATISTICS TIME OFF;

    SET STATISTICS IO OFF;

    Table 'testEnvironment'. Scan count 7, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 794 ms, elapsed time = 166 ms.

    Count: 111960

    And this is php5.5.1/PCRE 8.32 2012-11-30 running on virtual openSuse Linux

    <?php

    $start = microtime(1);

    require_once 'database/db.php';

    $db = new db('util');

    $timeStart = microtime(1);

    $values = $db->values("select randomName from dbo.testEnvironment");

    $timeEnd = microtime(1);

    echo "Retrieve time: ",round($timeEnd - $timeStart,4),"s

    ";

    $timeStart = microtime(1);

    $oddLength = 0;

    foreach($values as $value) if(strlen($value) % 2) ++$oddLength;

    $timeEnd = microtime(1);

    echo "Iterate time, cheap-op: ",round($timeEnd - $timeStart,4),"s";

    echo "Odd length matches: ",$oddLength,"

    ";

    $timeStart = microtime(1);

    $matches = 0;

    foreach($values as $value) if(preg_match('/[A-C][A-F][P-Z]/i',$value)) ++$matches;

    $timeEnd = microtime(1);

    echo "Iterate time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",$matches,"

    ";

    $timeStart = microtime(1);

    $matches = array_filter(array_map('preg_match',array_fill(0,count($values),'/[A-C][A-F][P-Z]/i'),$values),function($v){ return $v > 0; });

    $timeEnd = microtime(1);

    echo "Map time, PCRE matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $timeStart = microtime(1);

    $matches = preg_grep('/[A-C][A-F][P-Z]/i',$values);

    $timeEnd = microtime(1);

    echo "Internal preg_grep, matching: ",round($timeEnd - $timeStart,4),"s";

    echo "Total matches: ",count($matches),"

    ";

    $end = microtime(1);

    echo 'Elapsed time: ',round($end - $start,4),"

    ";

    Retrieve time: 3.2316s

    Iterate time, cheap-op: 0.3293s

    Odd length matches: 498164

    Iterate time, PCRE matching: 0.8386s

    Total matches: 111960

    Map time, PCRE matching: 1.7313s

    Total matches: 111960

    Internal preg_grep, matching: 0.4964s

    Total matches: 111960

    Elapsed time: 6.6327s

    I have not profiled server stats but am trying to provide the relative cost of the preg_match, preg_grep call.

    Iterating on large arrays in php user land is not a good idea, but the numbers aren't bad for preg_match, which does a lot more than filtering.

    preg_grep is the most comparable to LIKE as it simply filters values.

    Thanks for the test code and the results timings. I appreciate it.

    Each PHP snippet (except Preg_Grep) would appear to have lost to LIKE by quit a bit especially since you had to do an initial load. Preg_Grep appears to be a contender for LIKE unless you consider the initial load. I'm fairly ignorant of PHP and its built in tools so don't laugh too hard when I ask if the "Preg_" functionality is considered to be Regex or not.

    Correct me if I've made a mistake in interpreting what you've posted, please, but I'm not seeing an advantage compared to LIKE when LIKE is capable of doing the job for something simple like this. In fact, I see a couple of disadvantages in that this had to be done outside the database and, if you had to write the rows back to a table, you'll have the extra overhead of doing so instead of just using the result set in the LIKE query.

    You asked for example performance numbers of regex outside of SQL Server. I have provided a simple example of PCRE matches in a single thread in PHP user land.

    Cadavre's post of a test of Solomon Rutzky's SQL# regex functions with an apparent CPU time of 43 seconds tells me the problem is not regex (http://www.sqlservercentral.com/Forums/Topic1296195-60-2.aspx).

    Here are more examples on better hardware, loading the test data from a newline-delimited file:

    Retrieve time: 0.1751s

    Value count: 1000000

    Character classes

    Pattern: /[A-C][A-F][P-Z]/

    Matches: 111960

    Random Matches: DUCWEPSACU WHONSORCAPO RSUTUCERGO KKATBUHBEWHI NAPRAFRONSE

    Elapsed: 0.1118s

    *

    Pattern: /AB*C/i

    Matches: 36665

    Random Matches: RVEPLACROVO HROHREPACSI RVOKACWUWHE NACLUJUSI MGIWACSOGE

    Elapsed: 0.0744s

    Possessive *

    Pattern: /AB*+C/i

    Matches: 36665

    Random Matches: FLEWCUPACSI LIWHUZACRU HBOLBAFACHA MGIWACLISA NACLUCSESHI

    Elapsed: 0.1139s

    Alternation

    Pattern: /(ABC)|(BCA)|D/i

    Matches: 149917

    Random Matches: RTIWCIRDAZO TUDOHWEBRI JEHRITURDA DRAWSOBIGU PLUDROSKEVRO

    Elapsed: 0.3457s

    Alternation with *

    Pattern: /(A.*C)|(B.*A)|D/i

    Matches: 328915

    Random Matches: STESHAGHACLA SWOPRAHWUCO ZUNSUWSARDU SKOLIDOGI PSASANOWCI

    Elapsed: 0.2878s

    Anchoring

    Pattern: /^PA/i

    Matches: 5000

    Random Matches: PAPAGLELBA PAGLORTUGHA PAWTATWIBWE PAGLOSATWA PAMARHODO

    Elapsed: 0.0697s

    Anchoring

    Pattern: /WE$/i

    Matches: 15000

    Random Matches: CROZEHREWE RDAGEHSOTWE NUBOSWIWE JUHSAHWUWE FOTWASWIWE

    Elapsed: 0.0724s

    Anchoring, non-greedy *

    Pattern: /^S.*?E.*O$/i

    Matches: 14239

    Random Matches: SNEVORGUSCO SWOSLEGLURHO SWOSLEFUHSO SKORDEDOSKO SWUHEHROLPO

    Elapsed: 0.0893s

    Elapsed time: 1.3622s

    That's 8 relatively simple patterns, 1 million matches each, including one, AB*C, which in TSQL would require a custom scalar function with iteration, all in < 1.4 seconds in a single thread in PHP user land. I'm not suggesting a replacement for LIKE. I'm suggesting that regex isn't the performance horror some people think it is.

    Edit: I was not able to attach this to the forum post. Here is my test data if anyone would like to chew on it with another runtime: https://www.dropbox.com/s/2s8mlpd9pm5sgzh/test%20data.txt?dl=0

  • Alan Burstein

    SSC Guru

    Points: 61067

    Steve Jones - SSC Editor (3/10/2016)


    PATINDEX is powerful, without a doubt. I have more of a fondness for charindex, mostly because I can indicate a start position. However, as Luis showed, you can get around this.

    Certainly there is power in PATINDEX, and I didn't tackle that here. I wanted to focus on a few common problems that I have seen across the last 6 months in questions. It seems many people are struggling with simple parsing.

    A more complex treatment of PATINDEX would have to wait for another piece. Unless you want to write one 😉

    I started an article and will submit it this week. 😀

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan Burstein

    SSC Guru

    Points: 61067

    quagmired (3/17/2016)


    Jeff Moden (3/17/2016)


    quagmired (3/17/2016)


    My personal favorite is, given the same accuracy, which ever is the fastest and least resource intensive. That's usually not REGEX as called from SQL SERVER even when using a CLR. There have been a ton of threads on this very subject where it has been found that if CHARINDEX or PATINDEX can be used in a straight forward manner (and sometimes in a complicated manner), it beats and sometimes blows the doors off of Regex calls.

    All the more reason to have a nimble regex library (PCRE) running natively within SQL server with a instance-wide cache of compiled patterns. The cumbersome TSQL solutions for even relatively simple patterns are a joke when a short, sweet, regex will suffice. It's 2016. Why don't we have this?

    So what happens when something doesn't fit a "compiled pattern"?

    Mid string searches in SQL suck. Regex has it's own problems. The reason either of them exist is to be able to interrogate improperly formed data. The best thing (although rarely the easiest thing) to do would be to have a "Moden Style" pork chop dinner with whomever the data provider is and get them to fix their junk. 😉

    Not looking to turn this into a drag 'em out, but...

    Most runtimes maintain a cache of compiled patterns. If a new one comes along, compile it, and keep it around for the next time. I have no idea, but I'd be willing to bet that SQL Server does this with LIKE patterns.

    Regex or any variety of pattern matching exists for a whole lot more than mid-string searches. Think data validation. I absolutely agree that storing structured data out of place in a blob is a piss-poor practice. But pattern matching still has its place in SQL, especially in analytics, and providing a subset of full blown regex is just putting up unnecessary hoops and hurdles.

    Name a problem with regex that is not a problem with what you get in standard SQL pattern matching.

    Parting thought: how would you match a 1 followed by an arbitrary number of 0s followed by a 1 with T-SQL?

    I have to run and will be back later to add my 2 cents but here's how I'd do it.

    USE tempdb -- safe place to test

    GO

    -- There's a reason I'm recomending a perm tally table here

    IF OBJECT_ID('dbo.tally') IS NULL

    BEGIN

    CREATE TABLE dbo.tally (N int not null);

    INSERT dbo.tally

    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a, sys.all_columns b;

    ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY(N) WITH FILLFACTOR = 100;

    ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE(N);

    END

    GO

    -- test data

    IF OBJECT_ID('tempdb.dbo.strings') IS NOT NULL DROP TABLE dbo.strings;

    CREATE TABLE dbo.strings(StrID int identity primary key, StrTxt varchar(100) NOT NULL);

    INSERT dbo.strings

    SELECT s

    FROM

    (VALUES

    ('10xxxxx01'),('1yyyy11'),('zz0zz'),('xx11xx'),('110'), -- these don't match the pattern

    ('xx10001'),('ff101'),('100000000000000000012345') -- these do match the pattern

    )t(s);

    -- technique #1: only return rows that match the pattern

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt)

    AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    -- technique #2: Return all rows with a flag

    SELECT StrID, StrTxt, MatchesPatternFlag = ISNULL(N/N,0)

    FROM dbo.strings s

    LEFT JOIN dbo.tally t

    ON t.N < LEN(s.StrTxt)

    AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • quagmired

    Old Hand

    Points: 324

    Alan.B (3/20/2016)


    I have to run and will be back later to add my 2 cents but here's how I'd do it.

    USE tempdb -- safe place to test

    GO

    -- There's a reason I'm recomending a perm tally table here

    IF OBJECT_ID('dbo.tally') IS NULL

    BEGIN

    CREATE TABLE dbo.tally (N int not null);

    INSERT dbo.tally

    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a, sys.all_columns b;

    ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY(N) WITH FILLFACTOR = 100;

    ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE(N);

    END

    GO

    -- test data

    IF OBJECT_ID('tempdb.dbo.strings') IS NOT NULL DROP TABLE dbo.strings;

    CREATE TABLE dbo.strings(StrID int identity primary key, StrTxt varchar(100) NOT NULL);

    INSERT dbo.strings

    SELECT s

    FROM

    (VALUES

    ('10xxxxx01'),('1yyyy11'),('zz0zz'),('xx11xx'),('110'), -- these don't match the pattern

    ('xx10001'),('ff101'),('100000000000000000012345') -- these do match the pattern

    )t(s);

    -- technique #1: only return rows that match the pattern

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt)

    AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    -- technique #2: Return all rows with a flag

    SELECT StrID, StrTxt, MatchesPatternFlag = ISNULL(N/N,0)

    FROM dbo.strings s

    LEFT JOIN dbo.tally t

    ON t.N < LEN(s.StrTxt)

    AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million. Run this against a table containing '1' + replicate('0',999998) + '1'.

    If we had a decent flavor of regular expressions, a solution would look like: SELECT whatever FROM wherever WHERE someString REGEX_MATCH '10*+1'

    And what if we wanted "a 1 followed by arbitrarily many 0s or 2s followed by a 1"? SELECT whatever FROM wherever WHERE someString REGEX_MATCH '1[02]*+1'

    Back to the drawing board with TSQL as it is.

  • Alan Burstein

    SSC Guru

    Points: 61067

    quagmired (3/20/2016)


    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

    Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

    IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

    GO

    CREATE FUNCTION dbo.fnTally

    /**********************************************************************************************************************

    Purpose:

    Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

    As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

    Usage:

    --===== Syntax example (Returns BIGINT)

    SELECT t.N

    FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

    ;

    Notes:

    1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

    Refer to the following URLs for how it works and introduction for how it replaces certain loops.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

    2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

    will cause the sequence to start at 1.

    3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

    5. If @MaxN is negative or NULL, a "TOP" error will be returned.

    6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

    that many values, you should consider using a different tool. 😉

    7. There will be a substantial reduction in performance if "N" is sorted in descending order. If a descending

    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;

    SELECT @MaxN = 1000;

    SELECT DescendingN = @MaxN-N+1

    FROM dbo.fnTally(1,@MaxN);

    8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    Revision History:

    Rev 00 - Unknown - Jeff Moden

    - Initial creation with error handling for @MaxN.

    Rev 01 - 09 Feb 2013 - Jeff Moden

    - Modified to start at 0 or 1.

    Rev 02 - 16 May 2013 - Jeff Moden

    - Removed error handling for @MaxN because of exceptional cases.

    **********************************************************************************************************************/

    (@ZeroOrOne BIT, @MaxN INT)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1) --10E1 or 10 rows

    , E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

    , E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

    SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

    UNION ALL

    SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

    GO

    ... Now we don't need a table at all.

    I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

    IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

    GO

    -- Use logic from solution #1 for an indexed view

    -- 1st the index

    CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

    SELECT StrID, StrTxt

    FROM dbo.strings

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    GO

    CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

    Now when we run this query:

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    Or this query:

    SELECT * FROM dbo.vw_StringPat1;

    We get this index plan:

    Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

    UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

    SELECT s.*

    FROM dbo.strings s

    CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

    WHERE item LIKE '%1%0%1%';

    You would have to modify the function however to support varchar(max) (which is not that hard to do).

    Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • quagmired

    Old Hand

    Points: 324

    Alan.B (3/20/2016)


    quagmired (3/20/2016)


    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

    Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

    IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

    GO

    CREATE FUNCTION dbo.fnTally

    /**********************************************************************************************************************

    Purpose:

    Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

    As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

    Usage:

    --===== Syntax example (Returns BIGINT)

    SELECT t.N

    FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

    ;

    Notes:

    1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

    Refer to the following URLs for how it works and introduction for how it replaces certain loops.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

    2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

    will cause the sequence to start at 1.

    3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

    5. If @MaxN is negative or NULL, a "TOP" error will be returned.

    6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

    that many values, you should consider using a different tool. 😉

    7. There will be a substantial reduct

    ion in performance if "N" is sorted in descending order. If a descending

    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;

    SELECT @MaxN = 1000;

    SELECT DescendingN = @MaxN-N+1

    FROM dbo.fnTally(1,@MaxN);

    8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    Revision History:

    Rev 00 - Unknown - Jeff Moden

    - Initial creation with error handling for @MaxN.

    Rev 01 - 09 Feb 2013 - Jeff Moden

    - Modified to start at 0 or 1.

    Rev 02 - 16 May 2013 - Jeff Moden

    - Removed error handling for @MaxN because of exceptional cases.

    **********************************************************************************************************************/

    (@ZeroOrOne BIT, @MaxN INT)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1) --10E1 or 10 rows

    , E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

    , E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

    SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

    UNION ALL

    SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

    GO

    ... Now we don't need a table at all.

    I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

    IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

    GO

    -- Use logic from solution #1 for an indexed view

    -- 1st the index

    CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

    SELECT StrID, StrTxt

    FROM dbo.strings

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    GO

    CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

    Now when we run this query:

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    Or this query:

    SELECT * FROM dbo.vw_StringPat1;

    We get this index plan:

    Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

    UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

    SELECT s.*

    FROM dbo.strings s

    CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

    WHERE item LIKE '%1%0%1%';

    You would have to modify the function however to support varchar(max) (which is not that hard to do).

    Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

    Perhaps you will consider the rest of my response to your prior post later on?

    I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

    That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

    Name that regex solution (and given the barrier of entry for regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

    Seriously, for a string of length N, creating N strings of 0s of all lengths up to N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

    And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

  • quagmired

    Old Hand

    Points: 324

    Alan.B (3/20/2016)


    quagmired (3/20/2016)


    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

    Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

    IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

    GO

    CREATE FUNCTION dbo.fnTally

    /**********************************************************************************************************************

    Purpose:

    Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

    As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

    Usage:

    --===== Syntax example (Returns BIGINT)

    SELECT t.N

    FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

    ;

    Notes:

    1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

    Refer to the following URLs for how it works and introduction for how it replaces certain loops.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

    2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

    will cause the sequence to start at 1.

    3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

    5. If @MaxN is negative or NULL, a "TOP" error will be returned.

    6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

    that many values, you should consider using a different tool. 😉

    7. There will be a substantial reduct

    ion in performance if "N" is sorted in descending order. If a descending

    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;

    SELECT @MaxN = 1000;

    SELECT DescendingN = @MaxN-N+1

    FROM dbo.fnTally(1,@MaxN);

    8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    Revision History:

    Rev 00 - Unknown - Jeff Moden

    - Initial creation with error handling for @MaxN.

    Rev 01 - 09 Feb 2013 - Jeff Moden

    - Modified to start at 0 or 1.

    Rev 02 - 16 May 2013 - Jeff Moden

    - Removed error handling for @MaxN because of exceptional cases.

    **********************************************************************************************************************/

    (@ZeroOrOne BIT, @MaxN INT)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1) --10E1 or 10 rows

    , E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

    , E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

    SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

    UNION ALL

    SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

    GO

    ... Now we don't need a table at all.

    I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

    IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

    GO

    -- Use logic from solution #1 for an indexed view

    -- 1st the index

    CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

    SELECT StrID, StrTxt

    FROM dbo.strings

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    GO

    CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

    Now when we run this query:

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    Or this query:

    SELECT * FROM dbo.vw_StringPat1;

    We get this index plan:

    Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

    UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

    SELECT s.*

    FROM dbo.strings s

    CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

    WHERE item LIKE '%1%0%1%';

    You would have to modify the function however to support varchar(max) (which is not that hard to do).

    Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

    Perhaps you will consider the rest of my response to your prior post later on?

    I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

    That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

    Name that regex solution (and given the barrier of entry for modern regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

    Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

    And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

  • quagmired

    Old Hand

    Points: 324

    Alan.B (3/20/2016)


    quagmired (3/20/2016)


    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

    Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

    IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

    GO

    CREATE FUNCTION dbo.fnTally

    /**********************************************************************************************************************

    Purpose:

    Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

    As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

    Usage:

    --===== Syntax example (Returns BIGINT)

    SELECT t.N

    FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

    ;

    Notes:

    1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

    Refer to the following URLs for how it works and introduction for how it replaces certain loops.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

    2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

    will cause the sequence to start at 1.

    3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

    5. If @MaxN is negative or NULL, a "TOP" error will be returned.

    6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

    that many values, you should consider using a different tool. 😉

    7. There will be a substantial reduct

    ion in performance if "N" is sorted in descending order. If a descending

    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;

    SELECT @MaxN = 1000;

    SELECT DescendingN = @MaxN-N+1

    FROM dbo.fnTally(1,@MaxN);

    8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    Revision History:

    Rev 00 - Unknown - Jeff Moden

    - Initial creation with error handling for @MaxN.

    Rev 01 - 09 Feb 2013 - Jeff Moden

    - Modified to start at 0 or 1.

    Rev 02 - 16 May 2013 - Jeff Moden

    - Removed error handling for @MaxN because of exceptional cases.

    **********************************************************************************************************************/

    (@ZeroOrOne BIT, @MaxN INT)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1) --10E1 or 10 rows

    , E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

    , E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

    SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

    UNION ALL

    SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

    GO

    ... Now we don't need a table at all.

    I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

    IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

    GO

    -- Use logic from solution #1 for an indexed view

    -- 1st the index

    CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

    SELECT StrID, StrTxt

    FROM dbo.strings

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    GO

    CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

    Now when we run this query:

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    Or this query:

    SELECT * FROM dbo.vw_StringPat1;

    We get this index plan:

    Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

    UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

    SELECT s.*

    FROM dbo.strings s

    CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

    WHERE item LIKE '%1%0%1%';

    You would have to modify the function however to support varchar(max) (which is not that hard to do).

    Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

    Perhaps you will consider the rest of my response to your prior post later on?

    I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

    That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

    Name that regex solution (and given the barrier of entry for real regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

    Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

    And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

  • quagmired

    Old Hand

    Points: 324

    Alan.B (3/20/2016)


    quagmired (3/20/2016)


    This is exactly what I'm talking about when I mention "cumbersome TSQL solution". For arbitrary 0s, I'd say a tally table would have to have over 2 billion records, but we can stick with 1 million.

    Nothing cumbersome about it. A tally table has many, many uses but I digress. If we're talking about anything other than varchar(max) all we need is a tally table with 8000 records. If we're talking about varchar(max) then we could get the job done using a tally table function such as the one below (written by Jeff Moden coincidentally):

    IF OBJECT_ID('dbo.fnTally') IS NOT NULL DROP FUNCTION dbo.fnTally

    GO

    CREATE FUNCTION dbo.fnTally

    /**********************************************************************************************************************

    Purpose:

    Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Billion.

    As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

    Usage:

    --===== Syntax example (Returns BIGINT)

    SELECT t.N

    FROM dbo.fnTally(@ZeroOrOne,@MaxN) t

    ;

    Notes:

    1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.

    Refer to the following URLs for how it works and introduction for how it replaces certain loops.

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers

    2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertable to the BIT data-type

    will cause the sequence to start at 1.

    3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.

    5. If @MaxN is negative or NULL, a "TOP" error will be returned.

    6. @MaxN must be a positive number from >= the value of @ZeroOrOne up to and including 1 Billion. If a larger

    number is used, the function will silently truncate after 1 Billion. If you actually need a sequence with

    that many values, you should consider using a different tool. 😉

    7. There will be a substantial reduct

    ion in performance if "N" is sorted in descending order. If a descending

    sort is required, use code similar to the following. Performance will decrease by about 27% but it's still

    very fast especially compared with just doing a simple descending sort on "N", which is about 20 times slower.

    If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

    DECLARE @MaxN BIGINT;

    SELECT @MaxN = 1000;

    SELECT DescendingN = @MaxN-N+1

    FROM dbo.fnTally(1,@MaxN);

    8. There is no performance penalty for sorting "N" in ascending order because the output is explicity sorted by

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    Revision History:

    Rev 00 - Unknown - Jeff Moden

    - Initial creation with error handling for @MaxN.

    Rev 01 - 09 Feb 2013 - Jeff Moden

    - Modified to start at 0 or 1.

    Rev 02 - 16 May 2013 - Jeff Moden

    - Removed error handling for @MaxN because of exceptional cases.

    **********************************************************************************************************************/

    (@ZeroOrOne BIT, @MaxN INT)

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1) --10E1 or 10 rows

    , E3(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c) --10E3 or 1 Thousand rows

    , E9(N) AS (SELECT 1 FROM E3 a, E3 b, E3 c) --10E9 or 1 Billion rows

    SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.

    UNION ALL

    SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E9; -- Values from 1 to @MaxN

    GO

    ... Now we don't need a table at all.

    I don't want to stray from the point I was trying to make so indulge me for moment and let's pretend we're dealing with a varchar(8000), nvarchar(4000) column or smaller. Now let's take my solution a little further using the tally table and sample data and apply that logic to an indexed view:

    IF OBJECT_ID('tempdb.dbo.vw_StringPat1') IS NOT NULL DROP VIEW dbo.vw_StringPat1;

    GO

    -- Use logic from solution #1 for an indexed view

    -- 1st the index

    CREATE VIEW dbo.vw_StringPat1 WITH SCHEMABINDING AS

    SELECT StrID, StrTxt

    FROM dbo.strings

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    GO

    CREATE UNIQUE CLUSTERED INDEX pk_vw_StringPat1 ON dbo.vw_StringPat1(StrID, StrTxt);

    Now when we run this query:

    SELECT StrID, StrTxt

    FROM dbo.strings s

    CROSS JOIN dbo.tally t

    WHERE t.N < LEN(s.StrTxt) AND StrTxt LIKE '%1'+REPLICATE('0',t.N)+'1%';

    Or this query:

    SELECT * FROM dbo.vw_StringPat1;

    We get this index plan:

    Again, for my example we're talking about varchar(8000), nvarchar(4000) or less. And this would not always be the best solution there. That said, for resolving your 1<1 or more 0's>1 problem this solution blows the pants off a regex-based solution in performance and number of reads. I'm not trying to trash regex or belittle your point. I'm just saying that you may be underestimating the power of the relational model, SQL Server and T-SQL.

    UPDATE: Here's another solution that works great for varchar(<anything>). Using the Pattern Based Splitter referenced in my signature (PatternSplitCM), we can do this:

    SELECT s.*

    FROM dbo.strings s

    CROSS APPLY dbo.PatternSplitCM(StrTxt,'%[01]%')

    WHERE item LIKE '%1%0%1%';

    You would have to modify the function however to support varchar(max) (which is not that hard to do).

    Edit: Added new solution, fixed a couple typos, added comment about modifying PatternSplitCM to varchar(max)

    Perhaps you will consider the rest of my response to your prior post later on?

    I am not attempting to discredit tally tables. I'm not sure what you mean by the "relational model" in this context but I whole-heartedly endorse a set-based approach anywhere it can be had. But what we are talking about has nothing to do with that.

    That said, for resolving your 1<1 or more 0's>>1 problem this solution blows the pants off a regex-based solution in performance and number of reads.

    Name that regex solution (and given the barrier of entry for real regex to SQL, PROVE IT). It certainly does not exist in SQL Server to date, and if it did, it would kick the crap out of your solution.

    Seriously, for a string of length N, creating N strings of 0s of lengths 1 through N per value to compare against on top of %LIKE%? That's ridiculous to do against one string. Run that against a million values, one or two of which are VERY long. You just wasted multiple kilowatt-hours.

    And here we are talking about ONE of an infinitude of simple problems that must be handled outside of SQL Server.

Viewing 15 posts - 31 through 45 (of 47 total)

You must be logged in to reply to this topic. Login to reply