Help with using mulitple replace in a function

  • I would like to create a function that performs several replaces.

    Our collation is case insensitive.

    I want to compare two different records to try to determine if they are duplicates. Names and addresses often have different punctuation. I'd like to strip that out so that when I compare two names or addresses, I will find more matches. So O'Malley becomes OMalley, Mary Ann becomes MaryAnn, etc. I'd also like to change letters with tildes, accents, etc. to letters without them. What would be an efficient way to code this? Here's what I've done so far.

    CREATE FUNCTION [dbo].[WRT_simple_name]

    (

    @LAST_NAME char(30)

    )

    RETURNS char(30)

    AS

    BEGIN

    -- Declare the return variable here

    DECLARE @L_NAME char(30)

    -- Add the T-SQL statements to compute the return value here

    Set @L_NAME =

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(

    replace(rtrim(@LAST_NAME),'''','')

    ,'.','')

    ,' ','')

    ,'-','')

    ,'_','')

    ,'~','')

    ,'ž','z')

    ,'ó','o')

    ,'š','s')

    ,'í','i')

    ,'é','e')

    -- Return the result of the function

    RETURN @l_name

    END

    I do not intend to ACTUALLY replace the data, I just want this to have more "hits" on comparisons.

  • Julie Breutzmann (7/14/2010)


    ...

    What would be an efficient way to code this?

    ...

    If you need high-performance heavy string manipulation functions write them as CLR functions in c#. T-SQL UDF will always be slower...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Have you tried doing comparisons using the SOUNDEX function?

    Hope this helps,
    Rich

    [p]
    [/p]

  • rdouglas66 (7/14/2010)


    Have you tried doing comparisons using the SOUNDEX function?

    I have tried, have you? I don't think it will help much in what the OP trying to do...

    Soundex is just a pity attemp of "fuzzy" search

    Soundex('LLoyds') ! = Soundex('Loyds')

    but

    Soundex('Brighton') = Soundex('Bristol')

    No much use!

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Thanks for the suggestions.

    While I don't dispute that CLR functions in C# are better, I am currently restricted to TSQL.

    I tried using the soundex function and there were too many false positives.

  • Julie try this function agaisnt your data.

    this strips out everything that is not a number and not a character...so spaces, control characters, brackets or whatever get stripped out;

    that seems to be what you are after for your comparison you want to try.

    you might need to adapt it to substitute ,'ó','o', as it currently strips out the ,'ó'

    it's a scalar function, so it's not the fastest, but this sounds like a one time cleanup, and not something that needs to be converted to an ITVF:

    usage :

    select dbo.StripNonAlphaNumeric(FieldName) ,FieldName from SomeTable

    CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) as

    (SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    --ascii numbers are 48(for '0') thru 57 (for '9')

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    --ascii upper case letters A-Z is 65 thru 90

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    --ascii lower case letters a-z is 97 thru 122

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    FROM tally WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works.

    It's not something for "production" but we do occasionally get large files (50,000 records) to add to our records so avoiding duplication is important.

    Thanks for your suggestion.

  • Joe Celko (7/14/2010)


    SQL is not a text language. Get a copy of Mellisa Data or other address data scrubbing tools. They will get you to CASS and 9-1-1 Standards much quicker than trying to learn then code them in your own SQL.

    nah, if you are familiar with any tool, whether SQL or some programming language, there's no reason to buy/load/learn another application when you can use the existing tools to do the same work.

    regardless of whether SQL is a "text" language, the job can be done there fast and easy, leaving the resulting data in a place where you can use it or update it;

    external tools might scrub the data better , but the learning curve to use them, plus an additional cost, would not be worth it for the once in a while check for misspelled duplicates.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Eugene Elutin (7/14/2010)


    Julie Breutzmann (7/14/2010)


    ...

    What would be an efficient way to code this?

    ...

    If you need high-performance heavy string manipulation functions write them as CLR functions in c#. T-SQL UDF will always be slower...

    Maybe in this case... maybe not. I've beaten certain CLR functions.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Julie Breutzmann (7/14/2010)


    That looks promising, Lowell. I'd need to add a piece for special letters like those with accents, umlauts, etc. I've used tally tables before, thanks to Jeff Moden's well-written article, so I understand how this works.

    It's not something for "production" but we do occasionally get large files (50,000 records) to add to our records so avoiding duplication is important.

    Thanks for your suggestion.

    Hi Julie,

    First, thanks for the compliment. Glad to have been of service.

    Second, I agree with Joe Celko on this one. CASS Certification programs really don't cost that much (less than you'll spend trying to write your own code to some point of perfection) with the side effect of having fulling validated addresses. In other words, you'll not only be able to search for dupes, but you also be able to find out where customers have 1) flat out lied or 2) don't really know what their correct address is (and it happens more than you can believe) or 3) have phat phingered things so badly that there's no chance of determining if there's a dupe or not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I would love to purchase such a program. I have one in mind that is partnered with our vendor and would provide numerous benefits to our institution, but it's not gonna happen this year. I work in higher ed and our budget has been cut significantly. To make matters worse a significant portion of our budget is non-discretionary. 🙁

  • Julie Breutzmann (7/15/2010)


    I work in higher ed and our budget has been cut significantly. To make matters worse a significant portion of our budget is non-discretionary. 🙁

    Damn... I'll never understand politicians. :crazy:

    Does Lowell's function do the trick for you? I'm a little concerned about performance there because of the scalar nature of the function but I won't be near a machine with T-SQL on it until Monday so I can't do much about it for demonstration purposes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I work at a private college, but it's still politics to some degree. Even though my department head explained this to the cabinet, I'm not sure that they understand or maybe protecting their own budgets mattered more to them.

    I haven't had a chance to play with Lowell's suggestion, yet. Things have been too hectic and though important, it's currently not a high priority. I will certainly post to let people know how this works for me.

    Julie

  • I've modified Lowell's function and performed a time-comparison test. Ihaven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used and input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 mintes using the function. However, the function DID find 69 matches that the simple comparison did not.

    This should be useable for files this size or smaller. Not so sure about larger ones, even though we only deal with large ones a few times a year. I may consider adding a table to the database with the id, and the converted first name, last name, and address line 1, with triggers to keep it up-dated. That way the function only needs to run on the (relatively) smaller input file and should be significantly faster. I probably won't get to this for a bit.

    Here's the function I ended up with.

    CREATE FUNCTION wrt_Simple_name_2(@OriginalText VARCHAR(100))

    RETURNS VARCHAR(100)

    BEGIN

    DECLARE @CleanedText VARCHAR(100)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    --ascii numbers are 48(for '0') thru 57 (for '9')

    WHEN ASCII(SUBSTRING(@OriginalText,wrt_tally.N,1)) BETWEEN 48 AND 57

    THEN SUBSTRING(@OriginalText,wrt_tally.N,1)

    --ascii upper case letters A-Z is 65 thru 90

    WHEN ASCII(SUBSTRING(@OriginalText,wrt_tally.N,1)) BETWEEN 65 AND 90

    THEN SUBSTRING(@OriginalText,wrt_tally.N,1)

    --ascii lower case letters a-z is 97 thru 122

    WHEN ASCII(SUBSTRING(@OriginalText,wrt_tally.N,1)) BETWEEN 97 AND 122

    THEN SUBSTRING(@OriginalText,wrt_tally.N,1)

    --ascii "foreign" letters are 192 - 240

    WHEN ASCII(SUBSTRING(@OriginalText,wrt_tally.N,1)) BETWEEN 193 AND 254

    THEN table_detail.TBL_VALUE_ALT1

    ELSE '' END

    FROM wrt_tally

    left outer join table_detail on ASCII(SUBSTRING(@OriginalText,wrt_tally.N,1)) = table_detail.table_value

    and table_detail.column_name = 'wrt_cm_letter_translation'

    WHERE wrt_tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    A few comments:

    Any custom tables, views, functions, etc. need to start with 'wrt' per the rules of our database vendor.

    wrt_tally is our custom tally table.

    table_detail is a code table that I used to "convert" letters such as "a umlaut" to "a".

    I very much appreciate all the assistance offered.

    Julie

  • Julie Breutzmann (7/16/2010)


    I've modified Lowell's function and performed a time-comparison test. Ihaven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used and input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 mintes using the function. However, the function DID find 69 matches that the simple comparison did not.

    That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a Cross Apply that should speed things up by (possibly) and order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday and any code that I would write would be untested until then.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply