If, like me, you have ever had to administer the
organisations marketing database then at some time or other you would probably
have had to deal the maintenance of the organisations contact/mailing list.
For me this involved cleaning it up, removing duplicate
names, providing updates and selection facilities for that list.
Maintaining the company mailing list was a good way of
introducing database discipline to the marketing department and gaining a
better understanding of the companies business.
From a database and programming perspective it represents a
good meaty challenge.
In Great Britain addresses can comprise of many lines of
information. Not all of these are
compulsory or necessary but may be used nonetheless.
- House/Building Name
- Street number
- Double-dependant locality
- Post Town
- Postal Code
The secret of deduplication is to identify correctly the
different lines of your address so that you are comparing post town with post
town, street with street etc.
If you can’t do this then you will end up matching an
address with a street element of Chester Street with an address with a post
town of Chester le Street.
The first thing I decided was that I was going to have a
specific dedupe database that would hold a table for each address file that I
was going to dedupe.
These would start off structurally identical to their
original source files and I would append my own fields to the end of each
In any address there can be only one postal code, county and
There can be between one and three lines that could be
called the street element, and between one and three that I could use as catch
alls for superfluous address information.
Next, I decided that I needed fields to hold both a SOUNDEX
code and a METAPHONE code for my street lines.
Soundex has been around for over a century, and was
originally used to help match addresses written on file cards.
It is fairly good at catering for basic miss-spellings such
as wrong vowels and double consonants. Where it falls down is that a miss-spelt word has to start at the very
least with the same letter as the correct word.
It breaks down text into 6 possible codes plus a preceding
letter. The rules are as follows:-
- Always retain the first character or number.
- Discard all vowels including the Welsh vowels H, W and Y.
- Discard any other non-alphabetic characters.
- Replace B,F,P, V with 1.
- Replace C, G, J, K, Q, S, X, Z with 2
- Replace D & T with 3
- Replace L with 4.
- Replace N & M with 5
- Replace R with 6.
- If any numbers occur in pairs or more only retain the first number.
In the SQL Server version only the first 4 characters of the
code are retained. Why, oh why didn’t
Microsoft put this limit on it.
There are some variations on this function.
- Replace the vowels with 0 and then strip, carry out all the steps above, then at
the end, remove all the zeros.
the text for duplicate letters and eliminate. This caters for Welsh place names such as Llangollen.
The metaphone function was defined by Lawrence
Phillips. It is too detailed to
document here, but click here for one of many good references on the web.
Metaphone breaks down text into 16 phonetic sounds and
caters for the sounds that a combination of letters can make and also
combinations that are silent i.e. the gh at the end of through.
You may see references to a function called
double-metaphone. This is a variation
that can handle transposition of letters within the text.
I wish that the metaphone function was included in SQL
Server, but alas, there aren’t that many deduplication packages that use it
Identifying the address elements
There is a database in Great Britain called the Post Office
Address File or PAF for short. It holds
every deliverable address in GB.
There are quite a few PC based systems such as Quick Address
that will match your internal databases against the PAF database to make sure
your addresses are postally correct and up-to-date, and also to identify the
address elements for you.
The systems themselves aren’t expensive, but subscribing to
the PAF file is. I didn’t have the
budget for this so I had to design my own system.
I decided that the best way to proceed was to work backward
through the elements within an address.
Firstly, 99.9% of my addresses were in GB so I knew I didn’t
have to worry about the country.
This is the single most important address element in a
British marketing database. There are
1.6 million of these covering 22 million households, averaging about 15
households per postal code. Therefore
identifying this puts you are within 15 households of your target.
Apart from its use in deduplication there are many datasets
of marketing information that can be bought that rely on the postal code as the
primary key for their data. These
include grid reference systems, geodemographics, census data etc.
A British postal code consists of two halves.
The first half identifies the sorting office and general
area of the address. The 2nd
have goes right down to the specific side of the street. If you are a business and receive a lot of
mail then you may have a postal code all to yourself.
From a database perspective you can identify postal codes
because they can be pattern matched.
The 2nd half is always [0-9][A-Z][A-Z] so this is
easy to look for.
The 1st half will generally match to either of
The letters identify the main sorting office i.e. SK =
Stockport. The numbers identify the
district covered by the Stockport sorting office.
I say generally because London (capital of England) breaks
these rules. Legend has it that when
WWII came along and all the postmen went off to fight, women took over
(officially) and decided to employ logic to London post codes so the letters
define an direction as well as an area within London. W= West 1, EC = East Central.
In addition to the patterns listed so far we have
Overall though extracting postal codes is a pattern matching
Counties are not needed in UK addresses as far as the post
office is concerned. It simply isn’t
necessary for their systems. Give them
a nice postal code and they’re happy.
Systems are one thing people are another. The English Civil War was between the House
of Lancaster and the House of York. As far as the inhabitants of Lancashire and Yorkshire are concerned we are merely
in a state cease-fire brought about when Edward VII (a Lancastrian) married
Elizabeth of York. That was over 500 years ago but if you miss off the county for their addresses or, heaven forbid,
mix them up then hostilities will resume.
From a database perspective there are a limited number of
counties and the sort of people who use them are not the sort to miss-spell
them. You can do an exact match against
a look-up table in order to extract these.
The only thing to bare in mind is that there are short names
for some counties.
Lancashire is often written as Lancs so the look-up table
needs to have these short entries as well.
When the counties are extracted from an address I always
recorded the long version of the name in my dedupe tables. This is so I had a standardised list of
Post Towns again are limited in number but more numerous
than counties. Again, you can do a
match against a lookup table.
For my purposes I defined street lines as
with a number or
with a word/abbreviation such as St or Street.
Now it is not enough to simply extract the street lines.
Consider St John Street.This could be written as
- St John St
- Saint John St
- St John Street
- Saint John Street.
What I did was build a substitution table with the following
||The abbreviation to search for.
||The full version of the abbreviation.
||S = Street.
B = Business.
N = Name.
A = Any.
||P = Prefix.
S = Suffix.
A = Any.
This table has two functions.
- Identify a street line from its contents.
- Replace the abbreviated version of a word with the full
Other Address Lines
Having gone through all the stages listed so far any
remaining address lines I classified as “General” and didn’t bother to extract
Names of businesses were always in their own field but on
extraction I used the same substitution table I used for my street addresses to
turn things Mitsubishi Ltd to Mitsubishi Limited. This is to make the comparison between the two simpler.
The actual dedupe
Firstly, note that I didn't use a limited character version of either SOUNDEX or METAPHONE.
As far as I was concerned I wanted the full phonetic encoding of the words as a starting point so that I could
tune the dedupe process later.
I had identified and extracted my address elements
and standardised terms within those elements. The next stage was to use my unlimited length SOUNDEX and METAPHONE to encode the extracted
street and business name lines.
I didn’t need to do this for postal code, county or town
because these conform to a standard list of limited values.
I now had encoded values that would enable me to compare two
business addresses and be fairly sure that I had a close match.
Points to bear in mind
No single query is going to identify all the dupes. That is why I used both SOUNDEX and
Data omissions also affect the results of a query so
different combinations of elements will help you whittle away at the list of
As stated earlier I could truncate my codes for comparison purposes to act as a tuning mechanism on the dedupe.
As I had three address lines I had an iterative procedure to
match any of the three lines to any of the other three lines in conjunction
with other elements.
The basic rule with deduplication is set the rules so they
are tightly defined i.e. you are 99% sure that a match is a
match, then slacken off the rules to make the matching more generic.
After I did the dedupe my company found that response rates
to direct mail actually dropped. Not
because the dedupe hadn’t worked but because of the way that the addresses were
classified for marketing purposes.
We used to mail to plumbers with plumber specific offerings
and to builders with building specific offerings. The problem came because the same name could appear once in
plumbing and one under building. The
deduplication process was effectively saying you are either one or the other,
not both and in the process of doing so limiting the range of products offered
to that person.
If you are aware of this pitfall then you can turn it to
your advantage in situations where you have a customer who has been buying
product ‘X’ off you for years, then all of a sudden their name and address
crops up on a list under a totally different classification.
Possible future amendments?
If SOUNDEX gets turned into a letter then one of six values
and METAPHONE gets turned into one of 16 letters then one thing I am keen to
try is to compress these values so that one byte holds two characters, thus a
single Int field could hold up to 4 characters of a METAPHONE word.
Because when you are deduping a large number of addresses
the encoded values could represent a large volume of data. It is going to be indexed but it is going to
be faster to compare two int values than two varchar(4)s.
I believe that Lawrence Phillips himself now works for Verity (the search engine people) and does something similar,
in addition he uses a version of metaphone that reduces the phonetic characters from 16 to 12 encoded values.
Beyond that I should like to try foreign language versions
of either SOUNDEX or METAPHONE on international data.