Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL


Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

Author
Message
Darren Wallace
Darren Wallace
Old Hand
Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)Old Hand (308 reputation)

Group: General Forum Members
Points: 308 Visits: 1459
"karo yaa na karo, koshish jaisa kuch nahi hai"
"Do, or do not. There is no try."

Nice, a quote from a Yogi and a quote from a Yoda.

Great article. Thanks for the code samples.

Now if we can just adjust western culture to shorten William to Will rather than Bill we'll all be better off.

Who decided that Robert should become BOB?!

-D
Jody Claggett-376930
Jody Claggett-376930
Valued Member
Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)Valued Member (58 reputation)

Group: General Forum Members
Points: 58 Visits: 199
We did this very same "string distance" matching algorithms 3-4 years ago at a place I worked. It works great for small datasets, but SSIS fuzzy matching is much faster and easier to use for large datasets.

_______________________________
[size="5"]Jody Claggett
SQL Server Reporting Analyst
[/size]
Nadrek
Nadrek
Ten Centuries
Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)

Group: General Forum Members
Points: 1033 Visits: 2673
And, if anyone cares, here is some U.S. centric address sanitization to add to the clean_string function or set of functions. Given that addresses vary significantly country to country, please review carefully before using in your country.

At the end is also primitive SQL to turn consecutive spaces into single spaces, which can be trivially separated from the PO Box replacements.

Unfortunately, this trivial but tedious method does result in very long SQL, and substantial CPU time for the cleansing function.


-- 238 nestings, ALLEY to LOOP plus " BY PASS " and " BL ", and " BLUFFS ", " BROOKS ", " BURGS ", " CENTERS ", " CLIFFS ", " CIRCLES ", " CORNERS ", " COURTS ", " COVES ", " DRIVES ", " ESTATES ", " EXTENSIONS ", " FALLS ", " FIELDS ", " FLATS ", " FORDS ", " FORGES ", " FORKS ", " GARDENS ", " GLENS ", " GREENS ", " GROVES ", " HARBORS ", " HILLS ", " ISLANDS ", " JUNCTIONS ", " KEYS ", " KNOLLS ", " LAKES ", " LIGHTS ", " LOCKS ", and so on to the singular abbreviation from the official plural abbreviations
-- spaces remain at this point to allow for word separations, created in bulk via Excel and global search and replace, meaningless self-replacements removed.
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' ALLEE ',' ALY '),' ALLEY ',' ALY '),' ALLY ',' ALY '),' ANEX ',' ANX '),' ANNEX ',' ANX '),' ANNX ',' ANX '),' ARCADE ',' ARC '),' AV ',' AVE '),' AVEN ',' AVE '),' AVENU ',' AVE '),' AVENUE ',' AVE '),' AVN ',' AVE '),' AVNUE ',' AVE '),' BAYOO ',' BYU '),' BAYOU ',' BYU '),' BEACH ',' BCH '),' BEND ',' BND '),' BLUF ',' BLF '),' BLUFF ',' BLF '),' BLUFFS ',' BLF '),' BLFS ',' BLF '),' BOT ',' BTM '),' BOTTM ',' BTM '),' BOTTOM ',' BTM '),' BL ',' BLVD '),' BOUL ',' BLVD '),' BOULEVARD ',' BLVD '),' BOULV ',' BLVD '),' BRANCH ',' BR '),' BRNCH ',' BR '),' BRDGE ',' BRG '),' BRIDGE ',' BRG '),' BROOK ',' BRK '),' BROOKS ',' BRK '),' BRKS ',' BRK '),' BURG ',' BG '),' BURGS ',' BG '),' BGS ',' BG '),' BYPA ',' BYP '),' BYPAS ',' BYP '),' BYPASS ',' BYP '),' BY PASS ',' BYP '),' BYPS ',' BYP '),' CAMP ',' CP '),' CMP ',' CP '),' CANYN ',' CYN '),' CANYON ',' CYN '),' CNYN ',' CYN '),' CAPE ',' CPE '),' CAUSEWAY ',' CSWY '),' CAUSWAY ',' CSWY ') ,' CEN ',' CTR '),' CENT ',' CTR '),' CENTER ',' CTR '),' CENTR ',' CTR '),' CENTRE ',' CTR '),' CNTER ',' CTR '),' CNTR ',' CTR '),' CENTERS ',' CTR '),' CTRS ',' CTR '),' CIRC ',' CIR '),' CIRCL ',' CIR '),' CIRCLE ',' CIR '),' CRCL ',' CIR '),' CRCLE ',' CIR '),' CIRCLES ',' CIR '),' CIRS ',' CIR '),' CLIFF ',' CLF '),' CLFS ',' CLF '),' CLIFFS ',' CLF '),' CLUB ',' CLB '),' COMMON ',' CMN '),' CORNER ',' COR '),' CORNERS ',' COR '),' CORS ',' COR '),' COURSE ',' CRSE '),' COURT ',' CT '),' CRT ',' CT '),' COURTS ',' CT '),' CTS ',' CT '),' COVE ',' CV '),' CVS ',' CV '),' COVES ',' CV '),' CK ',' CRK '),' CR ',' CRK '),' CREEK ',' CRK '),' CRECENT ',' CRES '),' CRESCENT ',' CRES '),' CRESENT ',' CRES '),' CRSCNT ',' CRES ') ,' CRSENT ',' CRES '),' CRSNT ',' CRES '),' CREST ',' CRST '),' CROSSING ',' XING '),' CRSSING ',' XING '),' CRSSNG ',' XING '),' CROSSROAD ',' XRD '),' CURVE ',' CURV '),' DALE ',' DL '),' DAM ',' DM '),' DIV ',' DV '),' DIVIDE ',' DV '),' DVD ',' DV '),' DRIV ',' DR '),' DRIVE ',' DR '),' DRV ',' DR '),' DRIVES ',' DR '),' DRS ',' DR '),' ESTATE ',' EST '),' ESTATES ',' EST '),' ESTS ',' EST '),' EXP ',' EXPY '),' EXPR ',' EXPY '),' EXPRESS ',' EXPY '),' EXPRESSWAY ',' EXPY '),' EXPW ',' EXPY '),' EXTENSION ',' EXT '),' EXTN ',' EXT '),' EXTNSN ',' EXT '),' EXTENSIONS ',' EXT '),' EXTS ',' EXT '),' FALLS ',' FALL '),' FLS ',' FALL '),' FERRY ',' FRY '),' FRRY ',' FRY '),' FIELD ',' FLD '),' FIELDS ',' FLD '),' FLDS ',' FLD '),' FLAT ',' FLT '),' FLATS ',' FLT '),' FLTS ',' FLT '),' FORD ',' FRD '),' FRDS ',' FRD '),' FORDS ',' FRD '),' FOREST ',' FRST '),' FORESTS ',' FRST '),' FORG ',' FRG '),' FORGE ',' FRG '),' FRGS ',' FRG '),' FORGES ',' FRG '),' FORK ',' FRK '),' FORKS ',' FRK '),' FRKS ',' FRK '),' FORT ',' FT '),' FRT ',' FT '),' FREEWAY ',' FWY '),' FREEWY ',' FWY '),' FRWAY ',' FWY '),' FRWY ',' FWY '),' GARDEN ',' GDN '),' GARDN ',' GDN '),' GRDEN ',' GDN '),' GRDN ',' GDN '),' GARDENS ',' GDN '),' GDNS ',' GDN '),' GRDNS ',' GDN '),' GATEWAY ',' GTWY '),' GATEWY ',' GTWY '),' GATWAY ',' GTWY '),' GTWAY ',' GTWY '),' GLEN ',' GLN '),' GLNS ',' GLN '),' GLENS ',' GLN '),' GREEN ',' GRN '),' GRNS ',' GRN '),' GREENS ',' GRN '),' GROV ',' GRV '),' GROVE ',' GRV '),' GRVS ',' GRV '),' GROVES ',' GRV '),' HARB ',' HBR '),' HARBOR ',' HBR '),' HARBR ',' HBR '),' HBRS ',' HBR ') ,' HRBOR ',' HBR ') ,' HARBORS ',' HBR '),' HAVEN ',' HVN '),' HAVN ',' HVN '),' HEIGHT ',' HTS '),' HEIGHTS ',' HTS '),' HGTS ',' HTS '),' HT ',' HTS '),' HIGHWAY ',' HWY '),' HIGHWY ',' HWY '),' HIWAY ',' HWY '),' HIWY ',' HWY '),' HWAY ',' HWY '),' HILL ',' HL '),' HILLS ',' HL '),' HLS ',' HL '),' HLLW ',' HOLW '),' HOLLOW ',' HOLW '),' HOLLOWS ',' HOLW '),' HOLWS ',' HOLW '),' INLET ',' INLT '),' ISLAND ',' IS '),' ISLND ',' IS '),' ISLANDS ',' IS '),' ISLNDS ',' IS '),' ISS ',' IS '),' ISLES ',' ISLE '),' JCTION ',' JCT '),' JCTN ',' JCT '),' JUNCTION ',' JCT '),' JUNCTN ',' JCT '),' JUNCTON ',' JCT '),' JCTNS ',' JCT '),' JCTS ',' JCT '),' JUNCTIONS ',' JCT '),' KEY ',' KY '),' KEYS ',' KY '),' KYS ',' KY '),' KNL ',' KNL '),' KNOL ',' KNL '),' KNOLL ',' KNL '),' KNLS ',' KNL '),' KNOLLS ',' KNL '),' LAKE ',' LK '),' LK ',' LK '),' LAKES ',' LK '),' LKS ',' LK '),' LAND ',' LAND '),' LANDING ',' LNDG '),' LNDNG ',' LNDG '),' LA ',' LN '),' LANE ',' LN '),' LANES ',' LN '),' LIGHT ',' LGT '),' LIGHTS ',' LGT '),' LGTS ',' LGT '),' LOAF ',' LF '),' LOCK ',' LCK '),' LCKS ',' LCK '),' LOCKS ',' LCK '),' LDGE ',' LDG '),' LODG ',' LDG '),' LODGE ',' LDG '),' LOOPS ',' LOOP ')


-- 201 nestings, MALL to WELLS plus " UNIT " then all secondary designators APARTMENT to UPPER, and " MANORS ", " MEADOWS ", " MILLS ", " MOUNTAINS ", " PINES ", " PLAINS ", " POINTS ", " PORTS ", " RAPIDS ", " RIDGES ", " ROADS ", " SHOALS ", " SHORES ", " SPRINGS ", " SQUARES ", " STREETS ", " UNIONS ", " VALLEYS ", " VIEWS ", " VILLAGES ", " WAYS ", " WELLS ", and so on to the singular abbreviation from the official plural abbreviations
-- spaces remain at this point to allow for word separations, created in bulk via Excel and global search and replace, meaningless self-replacements removed.
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' MANOR ',' MNR '),' MNR ',' MNR '),' MANORS ',' MNR '),' MNRS ',' MNR '),' MEADOW ',' MDW '),' MDWS ',' MDW '),' MEADOWS ',' MDW ') ,' MEDOWS ',' MDW '),' MEWS ',' MEWS '),' MILL ',' ML '),' MILLS ',' ML '),' MLS ',' ML ') ,' MISSION ',' MSN '),' MISSN ',' MSN '),' MSSN ',' MSN '),' MOTORWAY ',' MTWY '),' MNT ',' MT '),' MOUNT ',' MT '),' MNTAIN ',' MTN '),' MNTN ',' MTN '),' MOUNTAIN ',' MTN '),' MOUNTIN ',' MTN '),' MTIN ',' MTN '),' MTNS ',' MTN '),' MNTNS ',' MTN ') ,' MOUNTAINS ',' MTN '),' NECK ',' NCK '),' ORCHARD ',' ORCH '),' ORCHRD ',' ORCH '),' OVL ',' OVAL '),' OVERPASS ',' OPAS '),' PK ',' PARK '),' PRK ',' PARK '),' PARKS ',' PARK '),' PARKWAY ',' PKWY '),' PARKWY ',' PKWY '),' PKWAY ',' PKWY '),' PKY ',' PKWY '),' PARKWAYS ',' PKWY ') ,' PKWYS ',' PKWY '),' PASSAGE ',' PSGE '),' PATHS ',' PATH '),' PIKES ',' PIKE '),' PINE ',' PNE '),' PINES ',' PNE '),' PNES ',' PNE '),' PLACE ',' PL '),' PLAIN ',' PLN '),' PLAINES ',' PLN '),' PLAINS ',' PLN '),' PLNS ',' PLN '),' PLAZA ',' PLZ '),' PLZA ',' PLZ '),' POINT ',' PT '),' POINTS ',' PT '),' PTS ',' PT '),' PORT ',' PRT '),' PORTS ',' PRT '),' PRTS ',' PRT '),' PRAIRIE ',' PR '),' PRARIE ',' PR '),' PRR ',' PR '),' RAD ',' RADL '),' RADIAL ',' RADL '),' RADIEL ',' RADL '),' RANCH ',' RNCH '),' RANCHES ',' RNCH '),' RNCHS ',' RNCH '),' RAPID ',' RPD '),' RAPIDS ',' RPD '),' RPDS ',' RPD '),' REST ',' RST '),' RDGE ',' RDG '),' RIDGE ',' RDG '),' RDGS ',' RDG '),' RIDGES ',' RDG '),' RIVER ',' RIV '),' RIVR ',' RIV '),' RVR ',' RIV '),' ROAD ',' RD '),' RDS ',' RD '),' ROADS ',' RD '),' ROUTE ',' RTE '),' SHOAL ',' SHL '),' SHLS ',' SHL '),' SHOALS ',' SHL '),' SHOAR ',' SHR '),' SHORE ',' SHR '),' SHR ',' SHR '),' SHOARS ',' SHR '),' SHORES ',' SHR '),' SHRS ',' SHR '),' SKYWAY ',' SKWY '),' SPNG ',' SPG '),' SPRING ',' SPG '),' SPRNG ',' SPG '),' SPGS ',' SPG '),' SPNGS ',' SPG '),' SPRINGS ',' SPG '),' SPRNGS ',' SPG '),' SPURS ',' SPUR '),' SQR ',' SQ '),' SQRE ',' SQ '),' SQU ',' SQ '),' SQUARE ',' SQ '),' SQRS ',' SQ '),' SQUARES ',' SQ ') ,' SQS ',' SQ '),' STATION ',' STA '),' STATN ',' STA '),' STN ',' STA '),' STRAV ',' STRA '),' STRAVE ',' STRA '),' STRAVEN ',' STRA '),' STRAVENUE ',' STRA '),' STRAVN ',' STRA '),' STRVN ',' STRA '),' STRVNUE ',' STRA '),' STREAM ',' STRM '),' STREME ',' STRM '),' STR ',' ST '),' STREET ',' ST '),' STRT ',' ST '),' STREETS ',' ST '),' STS ',' ST '),' SUMIT ',' SMT '),' SUMITT ',' SMT '),' SUMMIT ',' SMT '),' TERR ',' TER '),' TERRACE ',' TER '),' THROUGHWAY ',' TRWY '),' TRACE ',' TRCE '),' TRACES ',' TRCE '),' TRACK ',' TRAK '),' TRACKS ',' TRAK '),' TRK ',' TRAK '),' TRKS ',' TRAK '),' TRAFFICWAY ',' TRFY '),' TR ',' TRL '),' TRAIL ',' TRL '),' TRAILS ',' TRL '),' TRLS ',' TRL '),' TUNEL ',' TUNL '),' TUNLS ',' TUNL '),' TUNNEL ',' TUNL '),' TUNNELS ',' TUNL '),' TUNNL ',' TUNL '),' TPK ',' TPKE '),' TRNPK ',' TPKE '),' TRPK ',' TPKE '),' TURNPIKE ',' TPKE '),' TURNPK ',' TPKE '),' UNDERPASS ',' UPAS '),' UNION ',' UN '),' UNIONS ',' UN '),' UNS ',' UN '),' VALLEY ',' VLY '),' VALLY ',' VLY '),' VLLY ',' VLY '),' VALLEYS ',' VLY ') ,' VLYS ',' VLY ') ,' VDCT ',' VIA '),' VIADCT ',' VIA '),' VIADUCT ',' VIA '),' VIEW ',' VW '),' VIEWS ',' VW '),' VWS ',' VW '),' VILL ',' VLG '),' VILLAG ',' VLG '),' VILLAGE ',' VLG '),' VILLG ',' VLG '),' VILLIAGE ',' VLG '),' VILLAGES ',' VLG '),' VLGS ',' VLG '),' VILLE ',' VL '),' VIST ',' VIS '),' VISTA ',' VIS '),' VST ',' VIS '),' VSTA ',' VIS '),' WALKS ',' WALK '),' WY ',' WAY '),' WAYS ',' WAY '),' WELL ',' WL '),' WELLS ',' WL '),' WLS ',' WL '),' APARTMENT ',' APT '),' BASEMENT ',' BSMT ') ,' BUILDING ',' BLDG '),' DEPARTMENT ',' DEPT '),' FLOOR ',' FL '),' FRONT ',' FRNT '),' HANGAR ',' HNGR '),' LOBBY ',' LBBY '),' LOWER ',' LOWR '),' OFFICE ',' OFC '),' PENTHOUSE ',' PH '),' ROOM ',' RM '),' SPACE ',' SPC '),' SUITE ',' STE '),' TRAILER ',' TRLR '),' UNT ',' UNIT '),' UPPER ',' UPPR ')


-- Turn any set of spaces into a single space. Why 13 2:1 replacements? Because 2^13 is 8192, just in excess of VARCHAR(8000), so we _know_ that any number of contiguous spaces is now one space.
-- Also, standardize the most common misspelled PO BOX types _after_ the spaces are collapsed.
-- single spaces remain at this point to allow for word separations
SET @String = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@String,' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '),' ',' '), ' po bx ',' PO BOX '), ' p o box ',' PO BOX '), ' P0 BOX ',' PO BOX ')




Nadrek
Nadrek
Ten Centuries
Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)

Group: General Forum Members
Points: 1033 Visits: 2673
Ira, I apologize, but it appears there is at least one error in the algorithm, most likely dealing with transpositions (or I accidentally changed something important when making what I thought were cosmetic changes).

Most importantly, I would not expect ('Yvette','Yevett'), ('Martha','Marhta'), and ('Shackleford','Shackelford') to have the 0.999999 scores they're getting.

Additionally, ('Massey','Massie') is also scored significantly lower by your T-SQL code than by the Winkler paper below (0.8888 vs. 0.9444).

Most other scores are either almost identical, or relatively similar.

Jaro-Winkler output validation against a paper by William E. Winkler and Yves Thibaudeau, U.S. Census: url=http://www.census.gov/srd/papers/pdf/rr91-9.pdf
Referenced from the U.S. National Institute of Standards and Technology: url=http://www.itl.nist.gov/div897/sqg/dads/HTML/jaroWinkler.html


SELECT [dbo].[JaroWinkler]('shackleford','shackelford') -- expect 0.9848
SELECT [dbo].[JaroWinkler]('cunningham','cunnigham') -- expect 0.9833
SELECT [dbo].[JaroWinkler]('campell','campbell') -- expect 0.9792
SELECT [dbo].[JaroWinkler]('nichleson','nichulson') -- expect 0.9630
SELECT [dbo].[JaroWinkler]('massey','massie') -- expect 0.9444
SELECT [dbo].[JaroWinkler]('abroms','abrams') -- expect 0.9333
SELECT [dbo].[JaroWinkler]('galloway','calloway') -- expect 0.9167
SELECT [dbo].[JaroWinkler]('lampley','campley') -- expect 0.9048
SELECT [dbo].[JaroWinkler]('dixon','dickson') -- expect 0.8533
SELECT [dbo].[JaroWinkler]('frederick','fredrick') -- expect 0.9815
SELECT [dbo].[JaroWinkler]('michele','michelle') -- expect 0.9792
SELECT [dbo].[JaroWinkler]('jesse','jessie') -- expect 0.9722
SELECT [dbo].[JaroWinkler]('marhta','martha') -- expect 0.9667
SELECT [dbo].[JaroWinkler]('jonathon','jonathan') -- expect 0.9583
SELECT [dbo].[JaroWinkler]('julies','juluis') -- expect 0.9333
SELECT [dbo].[JaroWinkler]('jeraldine','geraldine') -- expect 0.9246
SELECT [dbo].[JaroWinkler]('yvette','yevett') -- expect 0.9111
SELECT [dbo].[JaroWinkler]('tanya','tonya') -- expect 0.8933
SELECT [dbo].[JaroWinkler]('dwayne','duane') -- expect 0.8578


Ira Warren Whiteside
Ira Warren Whiteside
SSC Rookie
SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)SSC Rookie (42 reputation)

Group: General Forum Members
Points: 42 Visits: 296
Very interesting, I am stepping through the code, I am getting the same results. I will let you know.

Ira Warren Whiteside
Martin N
Martin N
SSC Journeyman
SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)SSC Journeyman (80 reputation)

Group: General Forum Members
Points: 80 Visits: 177
Any update on the potential error in the function?

Martin


Kindest Regards,

Martin
marat-oz
marat-oz
SSC-Enthusiastic
SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)

Group: General Forum Members
Points: 196 Visits: 1096
Hi,
Article is very interesting.
I use to work with name and address matching for a while.
Never tried Jaro Winkler algorithm and want to thank Ira for that.
What striked my mind is that good matching itself is just a part of matching process.
Another interesting part of matching is data parsing.
Good parsing gives better matching results.
It would be interesting to hear about data parsing and it's common issues.
Regards
Anatoliy
Nadrek
Nadrek
Ten Centuries
Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)

Group: General Forum Members
Points: 1033 Visits: 2673
I agree that parsing is critical. On the other hand, how to parse is also very much an "it depends" question.

In general, for my only slightly imperfect matching I tend to combine split fields (addr1, addr2, first name, last name), clean them up (nested REPLACE() for SQL), remove extraneous elements (professional designations, etc.), remove all remaining spaces, and then compare them. With at least some matching on another selective field (license number, for instance), this produces high quality results.

For later, fuzzier passes... well, that's where splitting pieces apart, cross-matching, Jaro-Winkler, Double Metaphone, and so on come in.

As far as splitting pieces apart, for the ' ' + addr1 + ' ' + addr2 + ' ' (etc.), I have code based on U.S. addresses that tries to find all leading digits, all suite/apartment indicators (whether they be first or last), whatever's in the middle, and the first "word" of whatever's in the middle. These are then used in various combinations with City+State and/or Zip code on the address portion of the matching.

In pure T-SQL, splitting the pieces apart is fairly resource heavy.
marat-oz
marat-oz
SSC-Enthusiastic
SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)SSC-Enthusiastic (196 reputation)

Group: General Forum Members
Points: 196 Visits: 1096
Hi,
In my case I had also Middle Name involved in matching process.
Also we could have initials instead of full names which actually makes it more complicated,
and even more complicated for Asian names.
Regarding addresses, I want to mention that from my poiny of view addressses are virtual.
There are different address databases around but there is no absolute one.
Addresses people use are another source of address virtualisation we dealing with.
And to build up to date address DB is quite complex task.
Nadrek
Nadrek
Ten Centuries
Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)Ten Centuries (1K reputation)

Group: General Forum Members
Points: 1033 Visits: 2673
In my case, middle name/initial wasn't of any real value in and of itself.

First initials are often an issue.

One step I have not yet taken is splitting up names into individual words, so Guthridge-Henderson matches with either Guthridge or Henderson at a fuzzyish level of matching.

Likewise, matching on common nicknames (Dick vs. Richard, Bob vs. Rob vs. Robert) I haven't implemented yet.

Matching on the first initial of nicknames, however, I'd judge as a lost cause (D matching R).

Addresses take quite a lot of cleanup, and even then, they're hard to match. Sending them out to a good address cleanup service is probably the best bet, but failing that, we have to try ourselves.
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search