Using a parameter for substring length

  • Hello - I have a report used to compare records looking for duplicates. The sql creates two temp tables: one is the "base" table and the other contains the "base" information plus the probable duplicate record, and this is the data appearing in the report. I am using substrings to compare the data. I'd like to make the "length" portion of the substring a parameter for the first name, last name and address fields. I'd also like the user to be able to put in any value for the length.

    If I hard-code the substrings, all this works great. If I put a parameter into the length of the substring function and have a parameter definition of integer, it will not work. I'm looking for answers to the following questions (and I've looked at BOL and Googled, etc., and had no joy): Can I do what I'm looking to do? If so, then how do I set up the parameter so that the user can enter a value to be passed? Any good further reading/examples? Thank you very much for your time and any assistance! ๐Ÿ™‚

    Below is a sample of the code, showing all the things I have going on with this which work (obviously, this is a representation):

    create table

    #dup_universe (

    constituentid varchar(20)

    ,firstname varchar(50)

    ,lastname varchar(100)

    ,prefaddress varchar(150)

    ,prefzip varchar(12)

    )

    CREATE NONCLUSTERED INDEX [ix_dup_universe] ON #dup_universe

    (

    constituentid,

    firstname,

    lastname,

    prefaddress,

    prefzip

    )

    ------------------------------------------------------------------------------------------

    insert into #dup_universe

    SELECT distinct

    DATA

    from database

    CREATE TABLE -- need a table to measure the quality of dup detection

    #dup_rating ( --columns from a and from b

    dupid varchar(20)

    ,dupfirst varchar(50)

    ,duplast varchar(100)

    ,dupprefaddr varchar(150) NULL

    ,dupprefzip varchar(12) NULL --20

    ,matchid varchar(20) --dupid

    ,matchfirst varchar(50) --dupfirst

    ,matchlast varchar(100) --duplast

    ,matchaddress varchar(150) NULL --dupprefaddr

    ,matchprefzip varchar(12) NULL --dupprefzip --20

    )

    CREATE NONCLUSTERED INDEX [ix_dup_rating] ON #dup_rating

    (

    dupid,

    dupfirst,

    duplast,

    dupprefaddr,

    matchid,

    matchfirst,

    matchlast,

    matchaddress

    )

    insert into #dup_rating

    select distinct

    a.constituentid dupid

    ,a.firstname dupfirst

    ,a.lastname duplast

    ,a.prefaddress dupprefaddr

    ,a.prefzip dupzip

    ,b.constituentid matchid

    ,b.firstname matchfirst

    ,b.lastname matchlast

    ,b.prefaddress dupprefaddr

    ,b.prefzip dupzip

    from #dup_universe a

    JOIN #dup_universe b

    on substring(a.firstname,1,10) = substring(b.firstname,1,10) --@firstlen

    and substring(a.lastname,1,10) = substring(b.lastname,1,10) --@lastlen

    and substring(a.prefaddress,1,15) = substring(b.prefaddress,1,15) --@addrlen

    and a.constituentid <> b.constituentid

    where a.constituentid not in (select r.dupid from #dup_rating r)

    order by a.lastname

    SELECT * from #dup_rating

    DROP TABLE #DUP_UNIVERSE

    DROP TABLE #DUP_RATING

  • Y u r using Substr, if u want to compare firstname and lastname.. u can compare by removing leading and trailing spaces in the string... ltrim()... rtrim() can be used for the same...

    For complete matching of the prefaddress... u can use like...

    and substring(a.prefaddress,1,len(a.prefaddress)) = substring(b.prefaddress,1,len(b.prefaddress)) --@addrlen

    Let me know if that doesn't complete ur requirement....

  • Hi, Sumit - that's not actually what I'm looking for - I want to use a substring because when we do the name or address comparison, I always want to start at the beginning and then allow the user to select how many characters from there she wants to match on.

    UPDATE: I (of course) finally just set up the variables and then kicked off the code in Studio. With the substring lengths hard-coded, it takes 3-5 minutes to run. A long time, but not bad considering what it's doing. With the parameter variables, it was still running 2.5 hours later when I finally killed it. This is a huge decrease in performance. Any ideas as to why this is? Is it just not possible to do what I want to do with any expectation of reasonable performance?

    Thank you for your input --

    Donna

  • Donna

    Try comparing rows on zipcode, surname, LEFT(forename,1) and the first word of the first address line. This works well in the UK because postcode plus the first word of the first address line, i.e. house number, is pretty much unique, not sure how it will work for US zipcodes. Speed is good if postcode and surname are indexed, and false positives are low.

    Cheers

    ChrisM

    โ€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.โ€ - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply