Wildcard Searches

  • Comments posted to this topic are about the item Wildcard Searches

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hi,

    Thanks for the article. I decided to try your email wildcard like search on our email list with a small tweak.

    Not all email address end with .com or .something. As in 90% of the emails in my db they end with .co.za, .co.gh, .co.zm - you get the idea.

    Herewith the tweaked code:

    SELECT *

    FROM dbo.Emails

    WHERE EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%.[A-Za-z][A-Za-z]%'

    OR EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%'

    I'm sure this should pick up most email adress combinations, but has anyone else come across other nuances in emails?

    Ciao!

    Michael

    Michael Gilchrist
    Database Specialist
    There are 10 types of people in the world, those who understand binary and those that don't. 😀

  • It would be interesting to see some performance information with some large data to see how this performs.

    I'd always thought that a lot of these kinds of operations would be fairly slow on anything other than trivially small amount of data (i.e. < 100K)

  • Thanks for this nice piece Luis!

    😎

  • @Peter.row:

    Of course it would use high CPU if you use a complicated pattern against a big data set.

    And if you have no covering index (where the column is part of the index or included) it will result in a very I/O intensive table scan.

    Depending on your scenario you could deliver all possible matching data to your application and do the filtering / searches at the client side (the most programming languages supports RegEx). It's your goal to decide if the bottleneck of your server is the CPU or the network.

  • Hey Luis

    In the section "Starting by ‘some string’"

    The comments of this

    --C) Return all rows when the name starts by any character between A and L

    SELECT *

    FROM dbo.LIKETest

    WHERE Name LIKE '[A-D]%';

    It must say

    --C) Return all rows when the name starts by any character between A and D

    Just replace the last letter.

  • 3 comments:

    SELECT *

    FROM dbo.LIKETest

    WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';

    That is insane.

    '[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'

    TIL 9000-19-39 is a valid date.

    Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.

    Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?

  • Great article Luis! Informative, to the point. I picked up a couple things.

    Interesting technique using REPLICATE, I would add that I like to use CHAR() more often in my searches. E.g CHAR(32) for spaces, CHAR(9) for tabs when I'm using LIKE, PATINDEX or CHARINDEX. Much easier to distinguish than the literal version of each.

    There are lots of things that can be done with patterns, like a pattern based splitter.

    That would make for a great article 😉

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Michael G (9/29/2015)


    Hi,

    Thanks for the article. I decided to try your email wildcard like search on our email list with a small tweak.

    Not all email address end with .com or .something. As in 90% of the emails in my db they end with .co.za, .co.gh, .co.zm - you get the idea.

    Herewith the tweaked code:

    SELECT *

    FROM dbo.Emails

    WHERE EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%.[A-Za-z][A-Za-z]%'

    OR EmailAddress LIKE N'%[A-Za-z0-9.+_]@[A-Za-z0-9.+_]%.[A-Za-z][A-Za-z]%'

    I'm sure this should pick up most email adress combinations, but has anyone else come across other nuances in emails?

    Ciao!

    Michael

    Actually, the code in the article should include your examples as well as the ones that have no domain type, only country code. The problem is the false positives that might show up, such as the nonsense from Bruce Banner.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • peter.row (9/29/2015)


    It would be interesting to see some performance information with some large data to see how this performs.

    I'd always thought that a lot of these kinds of operations would be fairly slow on anything other than trivially small amount of data (i.e. < 100K)

    I didn't include any performance testing because there aren't alternatives to this. If there are, please share them with us. I can measure the queries, but without equivalent queries, there's no point to it.

    This is about getting the correct/desired information regardless of performance, which is a real world problem.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • standardbluecaboose (9/29/2015)


    3 comments:

    SELECT *

    FROM dbo.LIKETest

    WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';

    That is insane.

    I'm not sure if this is good or bad.

    '[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'

    TIL 9000-19-39 is a valid date.

    That's why it's referred as a basic validation. A single pattern can't validate for every single date, especially when dealing with Feb-29 on leap years. I used one validation in the past that would even check if the date was YYYY/MM/DD, DD/MM/YYYY or MM/DD/YYYY with some priority over the last two formats. That's only for data cleansing and starting on SQL 2012 we can forget about all this by using TRY_CONVERT()

    Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.

    Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?

    I'm sorry, I did forget about this part.

    It's actually a simple set of good habits to allow an index seek whenever possible. A few examples in here:

    SELECT *

    FROM dbo.LIKETest

    WHERE Name LIKE 'B%'; -- SARGable, Uses Index Seek

    SELECT *

    FROM dbo.LIKETest

    WHERE CHARINDEX( 'B', Name) = 1; -- non SARGable, Uses Index Scan

    SELECT *

    FROM dbo.LIKETest

    WHERE LEFT( Name, 1) = 'B'; -- non SARGable, Uses Index Scan

    SELECT *

    FROM dbo.LIKETest

    WHERE Name LIKE '[A-D]%'; -- SARGable, Uses Index Seek

    SELECT *

    FROM dbo.LIKETest

    WHERE PATINDEX( '[A-D]%', Name) = 1; -- non SARGable, Uses Index Scan

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hi Luis,

    If possible a sequel of this article could be a performance comparison between LIKE and PATINDEX, and maybe also include CONTAINS.

    Also if Master Services is installed, you could use the REGEX match function from that database to look for patterns.

    Thanks for the explanation about LIKE wildcards.

    I have had the issue that if I use too complex wildcards, then PATINDEX doesn't work, and in those cases I have to stay with LIKE.

    Regards,

    José

  • Alan.B (9/29/2015)


    Great article Luis! Informative, to the point. I picked up a couple things.

    Interesting technique using REPLICATE, I would add that I like to use CHAR() more often in my searches. E.g CHAR(32) for spaces, CHAR(9) for tabs when I'm using LIKE, PATINDEX or CHARINDEX. Much easier to distinguish than the literal version of each.

    There are lots of things that can be done with patterns, like a pattern based splitter.

    That would make for a great article 😉

    Interesting, I prefer the single space over the CHAR(32) but the CHAR(9) over the tab literal. Maybe I'd just have to get used to it.

    About the article, you mean writing about the things that can be done? or the splitter? because the splitter already has an article written by Dwain.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • jose.pla (9/29/2015)


    Hi Luis,

    If possible a sequel of this article could be a performance comparison between LIKE and PATINDEX, and maybe also include CONTAINS.

    Also if Master Services is installed, you could use the REGEX match function from that database to look for patterns.

    Thanks for the explanation about LIKE wildcards.

    I have had the issue that if I use too complex wildcards, then PATINDEX doesn't work, and in those cases I have to stay with LIKE.

    Regards,

    José

    I haven't had the opportunity to work with CONTAINS or REGEX. If you are familiar with them, write an article and submit it. It would be very interesting to see how it compares with the basic T-SQL. I know the flexibility would be different and there are additional requirements to implement those options, but some people might find them worth it for their applications.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (9/29/2015)


    standardbluecaboose (9/29/2015)


    3 comments:

    SELECT *

    FROM dbo.LIKETest

    WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%';

    That is insane.

    I'm not sure if this is good or bad.

    '[1-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'

    TIL 9000-19-39 is a valid date.

    That's why it's referred as a basic validation. A single pattern can't validate for every single date, especially when dealing with Feb-29 on leap years. I used one validation in the past that would even check if the date was YYYY/MM/DD, DD/MM/YYYY or MM/DD/YYYY with some priority over the last two formats. That's only for data cleansing and starting on SQL 2012 we can forget about all this by using TRY_CONVERT()

    Some people use the PATINDEX or CHARINDEX (which doesn’t allow wildcards) functions to validate if a string contains certain characters. That’s a bad idea and it’ll be discussed further in this article.

    Did I miss the part where the author discusses why it's a bad idea? Did I skim over it?

    I'm sorry, I did forget about this part.

    It's actually a simple set of good habits to allow an index seek whenever possible. A few examples in here:

    SELECT *

    FROM dbo.LIKETest

    WHERE Name LIKE 'B%'; -- SARGable, Uses Index Seek

    SELECT *

    FROM dbo.LIKETest

    WHERE CHARINDEX( 'B', Name) = 1; -- non SARGable, Uses Index Scan

    SELECT *

    FROM dbo.LIKETest

    WHERE LEFT( Name, 1) = 'B'; -- non SARGable, Uses Index Scan

    SELECT *

    FROM dbo.LIKETest

    WHERE Name LIKE '[A-D]%'; -- SARGable, Uses Index Seek

    SELECT *

    FROM dbo.LIKETest

    WHERE PATINDEX( '[A-D]%', Name) = 1; -- non SARGable, Uses Index Scan

    Thanks for the quick reply.

    The WHERE ' ' + Name + ' ' LIKE '%[^A-Za-z]Richard[^A-Za-z]%'; is insane in a bad way. I have no suggestions for how to make it better, but it feels like such a hack. However, a lot about SQL feels like a hack to me.

    Thank you for the update about the PATINDEX.

    I will say that Column LIKE '%X%'` and `PATINDEX('X', Column) both do a full scan.

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply