Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL Expand / Collapse
Author
Message
Posted Wednesday, June 10, 2009 9:51 AM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Thursday, November 14, 2013 3:13 PM
Points: 306, Visits: 1,458
"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."

Nice, a quote from a Yogi and a quote from a Yoda.

Great article. Thanks for the code samples.

Now if we can just adjust western culture to shorten William to Will rather than Bill we'll all be better off.

Who decided that Robert should become BOB?!

-D
Post #732418
Posted Wednesday, June 10, 2009 11:13 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, March 28, 2013 12:21 PM
Points: 58, Visits: 199
We did this very same "string distance" matching algorithms 3-4 years ago at a place I worked. It works great for small datasets, but SSIS fuzzy matching is much faster and easier to use for large datasets.

_______________________________
[size="5"]Jody Claggett
SQL Server Reporting Analyst
[/size]
Post #732495
Posted Wednesday, June 10, 2009 11:48 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, July 18, 2014 8:56 AM
Points: 861, Visits: 2,355
And, if anyone cares, here is some U.S. centric address sanitization to add to the clean_string function or set of functions. Given that addresses vary significantly country to country, please review carefully before using in your country.

At the end is also primitive SQL to turn consecutive spaces into single spaces, which can be trivially separated from the PO Box replacements.

Unfortunately, this trivial but tedious method does result in very long SQL, and substantial CPU time for the cleansing function.

-- 238 nestings, ALLEY to LOOP plus " BY PASS " and " BL ", and " BLUFFS ", " BROOKS ", " BURGS ", " CENTERS ", " CLIFFS ", " CIRCLES ", " CORNERS ", " COURTS ", " COVES ", " DRIVES ", " ESTATES ", " EXTENSIONS ", " FALLS ", " FIELDS ", " FLATS ", " FORDS ", " FORGES ", " FORKS ", " GARDENS ", " GLENS ", " GREENS ", " GROVES ", " HARBORS ", " HILLS ", " ISLANDS ", " JUNCTIONS ", " KEYS ", " KNOLLS ", " LAKES ", " LIGHTS ", " LOCKS ", and so on to the singular abbreviation from the official plural abbreviations
-- spaces remain at this point to allow for word separations, created in bulk via Excel and global search and replace, meaningless self-replacements removed.
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' ALLEE ',' ALY '),' ALLEY ',' ALY '),' ALLY ',' ALY '),' ANEX ',' ANX '),' ANNEX ',' ANX '),' ANNX ',' ANX '),' ARCADE ',' ARC '),' AV ',' AVE '),' AVEN ',' AVE '),' AVENU ',' AVE '),' AVENUE ',' AVE '),' AVN ',' AVE '),' AVNUE ',' AVE '),' BAYOO ',' BYU '),' BAYOU ',' BYU '),' BEACH ',' BCH '),' BEND ',' BND '),' BLUF ',' BLF '),' BLUFF ',' BLF '),' BLUFFS ',' BLF '),' BLFS ',' BLF '),' BOT ',' BTM '),' BOTTM ',' BTM '),' BOTTOM ',' BTM '),' BL ',' BLVD '),' BOUL ',' BLVD '),' BOULEVARD ',' BLVD '),' BOULV ',' BLVD '),' BRANCH ',' BR '),' BRNCH ',' BR '),' BRDGE ',' BRG '),' BRIDGE ',' BRG '),' BROOK ',' BRK '),' BROOKS ',' BRK '),' BRKS ',' BRK '),' BURG ',' BG '),' BURGS ',' BG '),' BGS ',' BG '),' BYPA ',' BYP '),' BYPAS ',' BYP '),' BYPASS ',' BYP '),' BY PASS ',' BYP '),' BYPS ',' BYP '),' CAMP ',' CP '),' CMP ',' CP '),' CANYN ',' CYN '),' CANYON ',' CYN '),' CNYN ',' CYN '),' CAPE ',' CPE '),' CAUSEWAY ',' CSWY '),' CAUSWAY ',' CSWY ') ,' CEN ',' CTR '),' CENT ',' CTR '),' CENTER ',' CTR '),' CENTR ',' CTR '),' CENTRE ',' CTR '),' CNTER ',' CTR '),' CNTR ',' CTR '),' CENTERS ',' CTR '),' CTRS ',' CTR '),' CIRC ',' CIR '),' CIRCL ',' CIR '),' CIRCLE ',' CIR '),' CRCL ',' CIR '),' CRCLE ',' CIR '),' CIRCLES ',' CIR '),' CIRS ',' CIR '),' CLIFF ',' CLF '),' CLFS ',' CLF '),' CLIFFS ',' CLF '),' CLUB ',' CLB '),' COMMON ',' CMN '),' CORNER ',' COR '),' CORNERS ',' COR '),' CORS ',' COR '),' COURSE ',' CRSE '),' COURT ',' CT '),' CRT ',' CT '),' COURTS ',' CT '),' CTS ',' CT '),' COVE ',' CV '),' CVS ',' CV '),' COVES ',' CV '),' CK ',' CRK '),' CR ',' CRK '),' CREEK ',' CRK '),' CRECENT ',' CRES '),' CRESCENT ',' CRES '),' CRESENT ',' CRES '),' CRSCNT ',' CRES ') ,' CRSENT ',' CRES '),' CRSNT ',' CRES '),' CREST ',' CRST '),' CROSSING ',' XING '),' CRSSING ',' XING '),' CRSSNG ',' XING '),' CROSSROAD ',' XRD '),' CURVE ',' CURV '),' DALE ',' DL '),' DAM ',' DM '),' DIV ',' DV '),' DIVIDE ',' DV '),' DVD ',' DV '),' DRIV ',' DR '),' DRIVE ',' DR '),' DRV ',' DR '),' DRIVES ',' DR '),' DRS ',' DR '),' ESTATE ',' EST '),' ESTATES ',' EST '),' ESTS ',' EST '),' EXP ',' EXPY '),' EXPR ',' EXPY '),' EXPRESS ',' EXPY '),' EXPRESSWAY ',' EXPY '),' EXPW ',' EXPY '),' EXTENSION ',' EXT '),' EXTN ',' EXT '),' EXTNSN ',' EXT '),' EXTENSIONS ',' EXT '),' EXTS ',' EXT '),' FALLS ',' FALL '),' FLS ',' FALL '),' FERRY ',' FRY '),' FRRY ',' FRY '),' FIELD ',' FLD '),' FIELDS ',' FLD '),' FLDS ',' FLD '),' FLAT ',' FLT '),' FLATS ',' FLT '),' FLTS ',' FLT '),' FORD ',' FRD '),' FRDS ',' FRD '),' FORDS ',' FRD '),' FOREST ',' FRST '),' FORESTS ',' FRST '),' FORG ',' FRG '),' FORGE ',' FRG '),' FRGS ',' FRG '),' FORGES ',' FRG '),' FORK ',' FRK '),' FORKS ',' FRK '),' FRKS ',' FRK '),' FORT ',' FT '),' FRT ',' FT '),' FREEWAY ',' FWY '),' FREEWY ',' FWY '),' FRWAY ',' FWY '),' FRWY ',' FWY '),' GARDEN ',' GDN '),' GARDN ',' GDN '),' GRDEN ',' GDN '),' GRDN ',' GDN '),' GARDENS ',' GDN '),' GDNS ',' GDN '),' GRDNS ',' GDN '),' GATEWAY ',' GTWY '),' GATEWY ',' GTWY '),' GATWAY ',' GTWY '),' GTWAY ',' GTWY '),' GLEN ',' GLN '),' GLNS ',' GLN '),' GLENS ',' GLN '),' GREEN ',' GRN '),' GRNS ',' GRN '),' GREENS ',' GRN '),' GROV ',' GRV '),' GROVE ',' GRV '),' GRVS ',' GRV '),' GROVES ',' GRV '),' HARB ',' HBR '),' HARBOR ',' HBR '),' HARBR ',' HBR '),' HBRS ',' HBR ') ,' HRBOR ',' HBR ') ,' HARBORS ',' HBR '),' HAVEN ',' HVN '),' HAVN ',' HVN '),' HEIGHT ',' HTS '),' HEIGHTS ',' HTS '),' HGTS ',' HTS '),' HT ',' HTS '),' HIGHWAY ',' HWY '),' HIGHWY ',' HWY '),' HIWAY ',' HWY '),' HIWY ',' HWY '),' HWAY ',' HWY '),' HILL ',' HL '),' HILLS ',' HL '),' HLS ',' HL '),' HLLW ',' HOLW '),' HOLLOW ',' HOLW '),' HOLLOWS ',' HOLW '),' HOLWS ',' HOLW '),' INLET ',' INLT '),' ISLAND ',' IS '),' ISLND ',' IS '),' ISLANDS ',' IS '),' ISLNDS ',' IS '),' ISS ',' IS '),' ISLES ',' ISLE '),' JCTION ',' JCT '),' JCTN ',' JCT '),' JUNCTION ',' JCT '),' JUNCTN ',' JCT '),' JUNCTON ',' JCT '),' JCTNS ',' JCT '),' JCTS ',' JCT '),' JUNCTIONS ',' JCT '),' KEY ',' KY '),' KEYS ',' KY '),' KYS ',' KY '),' KNL ',' KNL '),' KNOL ',' KNL '),' KNOLL ',' KNL '),' KNLS ',' KNL '),' KNOLLS ',' KNL '),' LAKE ',' LK '),' LK ',' LK '),' LAKES ',' LK '),' LKS ',' LK '),' LAND ',' LAND '),' LANDING ',' LNDG '),' LNDNG ',' LNDG '),' LA ',' LN '),' LANE ',' LN '),' LANES ',' LN '),' LIGHT ',' LGT '),' LIGHTS ',' LGT '),' LGTS ',' LGT '),' LOAF ',' LF '),' LOCK ',' LCK '),' LCKS ',' LCK '),' LOCKS ',' LCK '),' LDGE ',' LDG '),' LODG ',' LDG '),' LODGE ',' LDG '),' LOOPS ',' LOOP ')


-- 201 nestings, MALL to WELLS plus " UNIT " then all secondary designators APARTMENT to UPPER, and " MANORS ", " MEADOWS ", " MILLS ", " MOUNTAINS ", " PINES ", " PLAINS ", " POINTS ", " PORTS ", " RAPIDS ", " RIDGES ", " ROADS ", " SHOALS ", " SHORES ", " SPRINGS ", " SQUARES ", " STREETS ", " UNIONS ", " VALLEYS ", " VIEWS ", " VILLAGES ", " WAYS ", " WELLS ", and so on to the singular abbreviation from the official plural abbreviations
-- spaces remain at this point to allow for word separations, created in bulk via Excel and global search and replace, meaningless self-replacements removed.
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' MANOR ',' MNR '),' MNR ',' MNR '),' MANORS ',' MNR '),' MNRS ',' MNR '),' MEADOW ',' MDW '),' MDWS ',' MDW '),' MEADOWS ',' MDW ') ,' MEDOWS ',' MDW '),' MEWS ',' MEWS '),' MILL ',' ML '),' MILLS ',' ML '),' MLS ',' ML ') ,' MISSION ',' MSN '),' MISSN ',' MSN '),' MSSN ',' MSN '),' MOTORWAY ',' MTWY '),' MNT ',' MT '),' MOUNT ',' MT '),' MNTAIN ',' MTN '),' MNTN ',' MTN '),' MOUNTAIN ',' MTN '),' MOUNTIN ',' MTN '),' MTIN ',' MTN '),' MTNS ',' MTN '),' MNTNS ',' MTN ') ,' MOUNTAINS ',' MTN '),' NECK ',' NCK '),' ORCHARD ',' ORCH '),' ORCHRD ',' ORCH '),' OVL ',' OVAL '),' OVERPASS ',' OPAS '),' PK ',' PARK '),' PRK ',' PARK '),' PARKS ',' PARK '),' PARKWAY ',' PKWY '),' PARKWY ',' PKWY '),' PKWAY ',' PKWY '),' PKY ',' PKWY '),' PARKWAYS ',' PKWY ') ,' PKWYS ',' PKWY '),' PASSAGE ',' PSGE '),' PATHS ',' PATH '),' PIKES ',' PIKE '),' PINE ',' PNE '),' PINES ',' PNE '),' PNES ',' PNE '),' PLACE ',' PL '),' PLAIN ',' PLN '),' PLAINES ',' PLN '),' PLAINS ',' PLN '),' PLNS ',' PLN '),' PLAZA ',' PLZ '),' PLZA ',' PLZ '),' POINT ',' PT '),' POINTS ',' PT '),' PTS ',' PT '),' PORT ',' PRT '),' PORTS ',' PRT '),' PRTS ',' PRT '),' PRAIRIE ',' PR '),' PRARIE ',' PR '),' PRR ',' PR '),' RAD ',' RADL '),' RADIAL ',' RADL '),' RADIEL ',' RADL '),' RANCH ',' RNCH '),' RANCHES ',' RNCH '),' RNCHS ',' RNCH '),' RAPID ',' RPD '),' RAPIDS ',' RPD '),' RPDS ',' RPD '),' REST ',' RST '),' RDGE ',' RDG '),' RIDGE ',' RDG '),' RDGS ',' RDG '),' RIDGES ',' RDG '),' RIVER ',' RIV '),' RIVR ',' RIV '),' RVR ',' RIV '),' ROAD ',' RD '),' RDS ',' RD '),' ROADS ',' RD '),' ROUTE ',' RTE '),' SHOAL ',' SHL '),' SHLS ',' SHL '),' SHOALS ',' SHL '),' SHOAR ',' SHR '),' SHORE ',' SHR '),' SHR ',' SHR '),' SHOARS ',' SHR '),' SHORES ',' SHR '),' SHRS ',' SHR '),' SKYWAY ',' SKWY '),' SPNG ',' SPG '),' SPRING ',' SPG '),' SPRNG ',' SPG '),' SPGS ',' SPG '),' SPNGS ',' SPG '),' SPRINGS ',' SPG '),' SPRNGS ',' SPG '),' SPURS ',' SPUR '),' SQR ',' SQ '),' SQRE ',' SQ '),' SQU ',' SQ '),' SQUARE ',' SQ '),' SQRS ',' SQ '),' SQUARES ',' SQ ') ,' SQS ',' SQ '),' STATION ',' STA '),' STATN ',' STA '),' STN ',' STA '),' STRAV ',' STRA '),' STRAVE ',' STRA '),' STRAVEN ',' STRA '),' STRAVENUE ',' STRA '),' STRAVN ',' STRA '),' STRVN ',' STRA '),' STRVNUE ',' STRA '),' STREAM ',' STRM '),' STREME ',' STRM '),' STR ',' ST '),' STREET ',' ST '),' STRT ',' ST '),' STREETS ',' ST '),' STS ',' ST '),' SUMIT ',' SMT '),' SUMITT ',' SMT '),' SUMMIT ',' SMT '),' TERR ',' TER '),' TERRACE ',' TER '),' THROUGHWAY ',' TRWY '),' TRACE ',' TRCE '),' TRACES ',' TRCE '),' TRACK ',' TRAK '),' TRACKS ',' TRAK '),' TRK ',' TRAK '),' TRKS ',' TRAK '),' TRAFFICWAY ',' TRFY '),' TR ',' TRL '),' TRAIL ',' TRL '),' TRAILS ',' TRL '),' TRLS ',' TRL '),' TUNEL ',' TUNL '),' TUNLS ',' TUNL '),' TUNNEL ',' TUNL '),' TUNNELS ',' TUNL '),' TUNNL ',' TUNL '),' TPK ',' TPKE '),' TRNPK ',' TPKE '),' TRPK ',' TPKE '),' TURNPIKE ',' TPKE '),' TURNPK ',' TPKE '),' UNDERPASS ',' UPAS '),' UNION ',' UN '),' UNIONS ',' UN '),' UNS ',' UN '),' VALLEY ',' VLY '),' VALLY ',' VLY '),' VLLY ',' VLY '),' VALLEYS ',' VLY ') ,' VLYS ',' VLY ') ,' VDCT ',' VIA '),' VIADCT ',' VIA '),' VIADUCT ',' VIA '),' VIEW ',' VW '),' VIEWS ',' VW '),' VWS ',' VW '),' VILL ',' VLG '),' VILLAG ',' VLG '),' VILLAGE ',' VLG '),' VILLG ',' VLG '),' VILLIAGE ',' VLG '),' VILLAGES ',' VLG '),' VLGS ',' VLG '),' VILLE ',' VL '),' VIST ',' VIS '),' VISTA ',' VIS '),' VST ',' VIS '),' VSTA ',' VIS '),' WALKS ',' WALK '),' WY ',' WAY '),' WAYS ',' WAY '),' WELL ',' WL '),' WELLS ',' WL '),' WLS ',' WL '),' APARTMENT ',' APT '),' BASEMENT ',' BSMT ') ,' BUILDING ',' BLDG '),' DEPARTMENT ',' DEPT '),' FLOOR ',' FL '),' FRONT ',' FRNT '),' HANGAR ',' HNGR '),' LOBBY ',' LBBY '),' LOWER ',' LOWR '),' OFFICE ',' OFC '),' PENTHOUSE ',' PH '),' ROOM ',' RM '),' SPACE ',' SPC '),' SUITE ',' STE '),' TRAILER ',' TRLR '),' UNT ',' UNIT '),' UPPER ',' UPPR ')


-- Turn any set of spaces into a single space. Why 13 2:1 replacements? Because 2^13 is 8192, just in excess of VARCHAR(8000), so we _know_ that any number of contiguous spaces is now one space.
-- Also, standardize the most common misspelled PO BOX types _after_ the spaces are collapsed.
-- single spaces remain at this point to allow for word separations
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '), ' po bx ',' PO BOX '), ' p o box ',' PO BOX '), ' P0 BOX ',' PO BOX ')



Post #732523
Posted Wednesday, June 10, 2009 1:46 PM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, July 18, 2014 8:56 AM
Points: 861, Visits: 2,355
Ira, I apologize, but it appears there is at least one error in the algorithm, most likely dealing with transpositions (or I accidentally changed something important when making what I thought were cosmetic changes).

Most importantly, I would not expect ('Yvette','Yevett'), ('Martha','Marhta'), and ('Shackleford','Shackelford') to have the 0.999999 scores they're getting.

Additionally, ('Massey','Massie') is also scored significantly lower by your T-SQL code than by the Winkler paper below (0.8888 vs. 0.9444).

Most other scores are either almost identical, or relatively similar.

Jaro-Winkler output validation against a paper by William E. Winkler and Yves Thibaudeau, U.S. Census: url=http://www.census.gov/srd/papers/pdf/rr91-9.pdf
Referenced from the U.S. National Institute of Standards and Technology: url=http://www.itl.nist.gov/div897/sqg/dads/HTML/jaroWinkler.html

SELECT [dbo].[JaroWinkler]('shackleford','shackelford') -- expect 0.9848
SELECT [dbo].[JaroWinkler]('cunningham','cunnigham') -- expect 0.9833
SELECT [dbo].[JaroWinkler]('campell','campbell') -- expect 0.9792
SELECT [dbo].[JaroWinkler]('nichleson','nichulson') -- expect 0.9630
SELECT [dbo].[JaroWinkler]('massey','massie') -- expect 0.9444
SELECT [dbo].[JaroWinkler]('abroms','abrams') -- expect 0.9333
SELECT [dbo].[JaroWinkler]('galloway','calloway') -- expect 0.9167
SELECT [dbo].[JaroWinkler]('lampley','campley') -- expect 0.9048
SELECT [dbo].[JaroWinkler]('dixon','dickson') -- expect 0.8533
SELECT [dbo].[JaroWinkler]('frederick','fredrick') -- expect 0.9815
SELECT [dbo].[JaroWinkler]('michele','michelle') -- expect 0.9792
SELECT [dbo].[JaroWinkler]('jesse','jessie') -- expect 0.9722
SELECT [dbo].[JaroWinkler]('marhta','martha') -- expect 0.9667
SELECT [dbo].[JaroWinkler]('jonathon','jonathan') -- expect 0.9583
SELECT [dbo].[JaroWinkler]('julies','juluis') -- expect 0.9333
SELECT [dbo].[JaroWinkler]('jeraldine','geraldine') -- expect 0.9246
SELECT [dbo].[JaroWinkler]('yvette','yevett') -- expect 0.9111
SELECT [dbo].[JaroWinkler]('tanya','tonya') -- expect 0.8933
SELECT [dbo].[JaroWinkler]('dwayne','duane') -- expect 0.8578

Post #732588
Posted Wednesday, June 10, 2009 3:06 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, April 14, 2014 1:31 AM
Points: 42, Visits: 289
Very interesting, I am stepping through the code, I am getting the same results. I will let you know.

Ira Warren Whiteside
Post #732628
Posted Sunday, June 14, 2009 10:35 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Monday, June 30, 2014 12:41 AM
Points: 79, Visits: 166
Any update on the potential error in the function?

Martin



Kindest Regards,

Martin
Post #734789
Posted Wednesday, June 17, 2009 7:37 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, January 29, 2014 8:13 PM
Points: 194, Visits: 1,096
Hi,
Article is very interesting.
I use to work with name and address matching for a while.
Never tried Jaro Winkler algorithm and want to thank Ira for that.
What striked my mind is that good matching itself is just a part of matching process.
Another interesting part of matching is data parsing.
Good parsing gives better matching results.
It would be interesting to hear about data parsing and it's common issues.
Regards
Anatoliy
Post #737100
Posted Thursday, June 18, 2009 1:19 PM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, July 18, 2014 8:56 AM
Points: 861, Visits: 2,355
I agree that parsing is critical. On the other hand, how to parse is also very much an "it depends" question.

In general, for my only slightly imperfect matching I tend to combine split fields (addr1, addr2, first name, last name), clean them up (nested REPLACE() for SQL), remove extraneous elements (professional designations, etc.), remove all remaining spaces, and then compare them. With at least some matching on another selective field (license number, for instance), this produces high quality results.

For later, fuzzier passes... well, that's where splitting pieces apart, cross-matching, Jaro-Winkler, Double Metaphone, and so on come in.

As far as splitting pieces apart, for the ' ' + addr1 + ' ' + addr2 + ' ' (etc.), I have code based on U.S. addresses that tries to find all leading digits, all suite/apartment indicators (whether they be first or last), whatever's in the middle, and the first "word" of whatever's in the middle. These are then used in various combinations with City+State and/or Zip code on the address portion of the matching.

In pure T-SQL, splitting the pieces apart is fairly resource heavy.
Post #737853
Posted Thursday, June 18, 2009 6:21 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, January 29, 2014 8:13 PM
Points: 194, Visits: 1,096
Hi,
In my case I had also Middle Name involved in matching process.
Also we could have initials instead of full names which actually makes it more complicated,
and even more complicated for Asian names.
Regarding addresses, I want to mention that from my poiny of view addressses are virtual.
There are different address databases around but there is no absolute one.
Addresses people use are another source of address virtualisation we dealing with.
And to build up to date address DB is quite complex task.
Post #738002
Posted Friday, June 19, 2009 11:51 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Friday, July 18, 2014 8:56 AM
Points: 861, Visits: 2,355
In my case, middle name/initial wasn't of any real value in and of itself.

First initials are often an issue.

One step I have not yet taken is splitting up names into individual words, so Guthridge-Henderson matches with either Guthridge or Henderson at a fuzzyish level of matching.

Likewise, matching on common nicknames (Dick vs. Richard, Bob vs. Rob vs. Robert) I haven't implemented yet.

Matching on the first initial of nicknames, however, I'd judge as a lost cause (D matching R).

Addresses take quite a lot of cleanup, and even then, they're hard to match. Sending them out to a good address cleanup service is probably the best bet, but failing that, we have to try ourselves.

Post #738538
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse