Find a String in a String

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 714316

    Comments posted to this topic are about the item Find a String in a String

  • Luis Cazares

    SSC Guru

    Points: 183516

    It's a nice article that explains this problem which unfortunately is more common that it should be.

    I'd take a different approach which includes some "safe nets". It seems that your last code is trunctating the last digit.

    SELECT *

    FROM dbo.Customers

    CROSS APPLY ( SELECT CHARINDEX('PO:', CustomerNotes + 'PO:') + 3) x(StartofPO)

    CROSS APPLY ( SELECT SUBSTRING(CustomerNotes, StartofPO, 20)) y(TruncNotes)

    CROSS APPLY ( SELECT LEFT(TruncNotes, PATINDEX('%[^0-9]%', TruncNotes + '.')-1)) z(PO);

    SELECT *

    FROM dbo.Customers

    CROSS APPLY ( SELECT CHARINDEX('PO:', CustomerNotes + 'PO:') + 2) x(StartofPO)

    CROSS APPLY ( SELECT STUFF(CustomerNotes, 1, StartofPO, '')) y(TruncNotes)

    CROSS APPLY ( SELECT LEFT(TruncNotes, PATINDEX('%[^0-9]%', TruncNotes + '.')-1)) z(PO);

    Basically, I remove the first part of the string and then identifying the end of the PO. I can't make it in a single step because the PATINDEX does not have a starting position parameter.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Steve Jones - SSC Editor

    SSC Guru

    Points: 714316

    Thanks for the catch. I updated the math to get the PO.

    I wonder which of these works better in complex queries. It's worth some performance testing to see. I hadn't thought about cross apply here, but it's an interesting approach.

  • Alan Burstein

    SSC Guru

    Points: 61006

    Good article Steve. I think you under sold the power of PATINDEX...

    CHARINDEX can start at a certain position in the string while PATINDEX can take wildcards. In this simplistic case, we can use either one.

    PATINDEX can accept patterns, as Luis demonstrates above... wildcard or otherwise. This is what makes it so special. It's what makes things PatternSplitCM (referenced in my signature) possible.

    Over the years I have always believed that, for complex string problems, you need Regular Expressions (Regex) but no longer feel that way. With CHARINDEX, PATINDEX, SUBSTRING, STUFF, a few other T-SQL string functions I can't think of at the moment and a Tally table you can do pretty much anything you would do using Regex.

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 714316

    PATINDEX is powerful, without a doubt. I have more of a fondness for charindex, mostly because I can indicate a start position. However, as Luis showed, you can get around this.

    Certainly there is power in PATINDEX, and I didn't tackle that here. I wanted to focus on a few common problems that I have seen across the last 6 months in questions. It seems many people are struggling with simple parsing.

    A more complex treatment of PATINDEX would have to wait for another piece. Unless you want to write one 😉

  • ManicStar

    SSCoach

    Points: 17992

    This is why i loathe long free-flowing 'note' fields where people embed data they then expect reports or queries to pull from in a consistent way.

    Good example is credit card #'s in note fields. Even using regular expression processing its difficult to find all types of patterns that represent the way the user entered the CC#'s in the 'note'.

    If you need to report or find PO data, then it should go into a field in a table called something like 'PO Number', NOT in a note.

    Note fields should be for misc notes, not data fields.

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 714316

    Completely agree. However, when the data is in there, even if you have modified and built a good data model, this is one way you might move that data back to the correct fields.

    Ideally, you'd also remove the data from here.

  • ManicStar

    SSCoach

    Points: 17992

    Steve Jones - SSC Editor (3/10/2016)


    Completely agree. However, when the data is in there, even if you have modified and built a good data model, this is one way you might move that data back to the correct fields.

    Ideally, you'd also remove the data from here.

    I've written way too many 'scrub' routines to pull data out of notes and populate a table.

    What usually ends up happening is they still keep using the note field, in one case i turned over my regular expressions to the dev so they could write a validater to keep the credit card #'s out of the notes.

    Note fields are evil, basically, and should only be used in Notepad or Notepad ++ 🙂

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 714316

    However, I'm not sure we'll ever get rid of "notes" fields.

  • ManicStar

    SSCoach

    Points: 17992

    Steve Jones - SSC Editor (3/10/2016)


    However, I'm not sure we'll ever get rid of "notes" fields.

    *sigh* probably not. Its the 'cheap','fast' way to extend the data model, just add the new field to the 'note'.

    Then 12 months down the road, 'pull the <whatever> out of the notes so we can search on it'.

    *headdesk*

  • Jeff Moden

    SSC Guru

    Points: 993774

    Manic Star (3/10/2016)


    This is why i loathe long free-flowing 'note' fields where people embed data they then expect reports or queries to pull from in a consistent way.

    Good example is credit card #'s in note fields. Even using regular expression processing its difficult to find all types of patterns that represent the way the user entered the CC#'s in the 'note'.

    If you need to report or find PO data, then it should go into a field in a table called something like 'PO Number', NOT in a note.

    Note fields should be for misc notes, not data fields.

    I loath such things not so much because there is a notes "field" but because people do really, really sssttttooooppppiiiidddd things like violating PCI and PII standards by putting things like Credit Card Numbers and SSNs in the notes "field". One "easy" way to protect the data at rest is to encrypt notes "fields". That does not, however, do the job correctly because anyone that has the necessary casual access to such "fields" can see that PCI or PII information without the necessary controls. When I did work as a "front ender" (more than a decade ago), I had a module that would replace the digits of such information with "X"s and would still leave phone numbers and other numbers intact. It's not that difficult and well worth the effort to protect the customers, which inherently protects the company.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • Jeff Moden

    SSC Guru

    Points: 993774

    Shifting gears, this is a good introductory article for those people that really have no clue on how to do such a thing. It's not meant to be the end-all to be-all for mid-string lookups. I suspect that some people have read it as such and graded it low because of their misunderstanding of the purpose of the article.

    I'll also say that grading an article low without providing feedback to the author of what the problem is accomplishes nothing because it doesn't tell the author what they may have done wrong and that also doesn't help people looking for a solution.

    Yes, I realize that 3 stars means "average" but I'd recommend that any vote with 3 stars or less be accompanied with a brief description of why the article is consider to be "only" average and what the author could do to make it better in your opinion.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • ManicStar

    SSCoach

    Points: 17992

    Jeff Moden (3/10/2016)


    I loath such things not so much because there is a notes "field" but because people do really, really sssttttooooppppiiiidddd things like violating PCI and PII standards by putting things like Credit Card Numbers and SSNs in the notes "field". One "easy" way to protect the data at rest is to encrypt notes "fields". That does not, however, do the job correctly because anyone that has the necessary casual access to such "fields" can see that PCI or PII information without the necessary controls. When I did work as a "front ender" (more than a decade ago), I had a module that would replace the digits of such information with "X"s and would still leave phone numbers and other numbers intact. It's not that difficult and well worth the effort to protect the customers, which inherently protects the company.

    Totally agree. I had a version of that module too 🙂

  • ekampel

    Valued Member

    Points: 61

    The patindex function and use of substrings is very useful for creating spiders to parse data from web pages.

  • skeleton567

    SSCarpal Tunnel

    Points: 4937

    Great article. Steve, especially for those who want to really reach an expert level of SQL use. Microsoft SQL Server has by far the best set of string manipulation operators that I ever saw in 42 years of IT experience, including front-end development, SQL development, database administration and 11 years of IT management. Anyone aspiring to more than a cursory knowledge of query-writing should be sure they completely MASTER string manipulation. I turned 73 years old today, and I still actively create SQL code, now just for myself to use in maintaining and updating my retirement portfolio and investment transaction history. I have used Quicken for my records since 1986 and can account for every penny of my retirement savings and spending since that point. You can't count on software development companies to consistently cover all your needs for data in their packages, so I routinely develop code that will scan and parse data out of text emails and PDF documents and such, run it through SQL Server to store, manipulate, and create proper formats for importing into Quicken. One word of advice here is to never develop code that depends on finding data at a hard-coded position within a string, since any and all textual information is subject to change without notice or reason. Thus, your SQL string-parsing code must always handle text data on a 'this time' basis instead of 'it's always ...' . For instance, you can't count on the PO number being nine digits and beginning in a certain fixed position within the string. Safer is to search for the first non-blank CHARACTER, other than punctuation such as the colon, located following the literal PO (don't assume it's numeric!) and continuing until you find a space, and allowing for special characters such as hyphens which might be included. This usually involves evaluating each character by type and extracting those wanted, appending them to a new string data element, then if appropriate converting to numeric if all individual positions have passed the numeric test, and all such variations of evaluation so you avoid SQL errors in the process. Incidentally, a good way to throw away unneeded spaces to routinely use MyString = LTRIM(RTRIM(MyString)) to remove leading data you have parsed and extracted, and move you to the next succeeding data element. Always, always, ALWAYS test the data-type of an extracted variable BEFORE you move it to it's final variable location. String parsing without causing SQL errors, is definitely an art form, and extremely important for reliable code.

    Rick
    Simplicity is the ultimate sophistication.
    - L. DaVinci

Viewing 15 posts - 1 through 15 (of 47 total)

You must be logged in to reply to this topic. Login to reply