A Pattern for Email Address storage and retrieval

,

Have you ever looked at a database schema and found over 90 occurrences of the same string/text/character field/column being stored in many tables? Well I have, and it sent off red flags all over the place. Just basic data modeling drives us to store any particular entity only once.

Email addresses are a good example of an element that can be stored once and a surrogate key used to relate it to other entities. However, I have found taking an additional step is useful. Since it is so common to search based on the email address, and they can be up to 100 characters or more in length, I found it is beneficial to generate a (usually) unique integer representation of the string for searching. An easy way to do this is with the SQL Server CHECKSUM() function. I have also considered the possibility that more than one email address could generate the same checksum value. The indexing and the code shown below accommodate resolving that situation.

The database layer can still receive an email address to search, but by doing an easy CHECKSUM(), the resulting integer is used for the search against an integer indexed column. The index should include the email address for the final matching.

Below I show a design pattern structure to store email addresses (column: EmailAddress) in a single table (dbo.EmailAddress) and to retrieve them efficiently. The key items are it will store an NVARCHAR email address and a checksum value (column: EmailCheckSum), stored as an integer, used to improve the retrieval efficiency.


Here is some test data:

Regarding the test data:

  1. All email addresses are unique. This is enforced by the unique index ncuxEmailAddress_EmailAddress.
  2. I artificially made the EmailCheckSum for IDs 1 & 2 to be the same (More on this later)

To read the data, we want to only filter on EmailCheckSum, so that we are using the more efficient index on the integer column EmailCheckSum, instead of the less efficient EmailAddress NVARCHAR(255). Since the design allows for the potential of the same EmailCheckSum value for different EmailAddresses, we filter out any potential extra rows with the same checksum value, use the following two steps:

First, insert the data into a table variable, to allow for a multiple row result set.

Note: the WHERE clause for reading the table is only on EmailCheckSum. We can get the EmailAddress as well without a key look up, since we added EmailAddress in the INCLUDE clause on the index ncixEmailAddress_EmailCheckSum_incl

INSERT INTO @AddressTable (EmailAddressID, EmailAddress) 
SELECT EmailAddressID , EmailAddress
 FROM dbo.EmailAddress
 WHERE EmailCheckSum = CHECKSUM('TheMostInterestingMan@dosequis.com')

Second, query the data from the table variable, filtering the one true match on the email address

SELECT EmailAddressID , EmailAddress
 FROM @AddressTable
 WHERE EmailAddress = 'TheMostInterestingMan@dosequis.com'

Based on this sample code and logic, when passed the email address 'TheMostInterestingMan@dosequis.com', when the checksum is generated, it would be 1720775697. When step one is executed, EmailAddressIDs 1 and 2 would be returned.

Then when the second step is executed, filtering on the email address, the one true match will be returned.

Add (or Put) Logic

  1. Take the Email Address parameter supplied, perform a CHECKSUM, and read the EmailAddress table.
  2. If one or more rows match, return the one where the email address strings match and return the corresponding EmailAddressID.
  3. If no rows matched, add the EmailAddress and return the new EmailAddressID

Lookup Logic

  1. Take the Email Address parameter supplied, perform a CHECKSUM, and read the EmailAddress table.
  2. If one or more rows match, find the one where the strings match and return the EmailAddressID.
  3. If no rows matched, return NULL result set.

In conclusion, using the checksum as the only filtering column and including the nvarchar (full email address) column, the index and retrieval will be more efficient.

Please use the create and sample code and take it for a test drive. I have used this with good results, but would love to hear other ideas, comments and suggestions on the topic.

Resources

Rate

4.39 (33)

Share

Share

Rate

4.39 (33)