A Pattern for Email Address storage and retrieval

  • SQLMajor

    Mr or Mrs. 500

    Points: 576

    Comments posted to this topic are about the item A Pattern for Email Address storage and retrieval

  • Bojidar Alexandrov

    Old Hand

    Points: 378

    Nice exercise but you still have an unique index on EmailAddress so why not instead of the two indexes, make one unique over EmailCheckSum and EmailAddress and miss all the burden with the temp table - selecting directly rows matching them both...

  • samot-dwarf

    SSC Eights!

    Points: 984

    I think because it would need to compare both - the Checksum and the email if you would add both to an index / where condition, even if the checksum is unique

    And comparing a nvarchar(255) to another is slow (compared to comparing the checksum-INT).

    On the other hand it depends on the real szenario if the overhead (creating a table var, inserting into it, querying it) is faster.

    If the example would not be about email adresses but about files (e.g. attachments as varbinary(max)). In this case the compare of the CHECKSUM (or better HASHBYTES('SH1', file) + the filesize (for reducing the amount wrong positives)) would be faster than comparing some 20 MB files direct

  • gjpyne

    SSC Enthusiast

    Points: 127

    One observation... Email addresses are case insensitive in practice and some email systems even allow the dot to be ignored in matching the local part! A user may use different versions of the same email address. ABCname@Gmail.com is the same as abcname@gmail.com or AB.Cname@GMAIL.com. If your example uses the upper function and removing the dot in the local part (to the left of the @ sign) for both storing/comparing the email address and the calculation of the checksum it would be improved. Reference https://en.wikipedia.org/wiki/Email_address

  • ChrisM@Work

    SSC Guru

    Points: 186044

    Is there any particular reason for storing the email address as NVARCHAR?

    [font="Arial"]“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw[/font]


    For fast, accurate and documented assistance in answering your questions, please read this article[/url].
    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]
    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]
    [url

  • shoestringdba

    SSCertifiable

    Points: 6206

    I'd be curious to see how this would work to search and return a range of similar addresses, say from a given domain such as all the email addresses from @mydomain.com.

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • sknox

    SSChampion

    Points: 12227

    Did you try this single-step process?

    SELECT EmailAddressID , EmailAddress

    FROM dbo.EmailAddress

    WHERE EmailCheckSum = CHECKSUM('TheMostInterestingMan@dosequis.com')

    AND EmailAddress = 'TheMostInterestingMan@dosequis.com'

    Although there are no guarantees of the order of evaluation of conditionals, the fact that EmailCheckSum is the index key and EmailAddress is included in the index should lead the optimizer to an index seek, and allow an efficient exact search without employing a table variable.

  • akljfhnlaflkj

    SSC Guru

    Points: 76202

    First off our data is fairly well normalized so we don't have the same data strung out over multiple tables, with the exception of keys of course. Our EmailAddress table has an unique alternate index on the EmailAddress field. It seems efficient and not caused us any grief time-wise in managing the data. Your proposal may be faster but I would think for us it is just an interesting exercise. I certainly wouldn't propose our staff spend any time messing with this when we have so many other pressing issues. Thank you anyway. It was an interesting read nonetheless.

  • triman165910

    SSC Enthusiast

    Points: 140

    I apologize for asking ahead of time, but why the CHECKSUM()? yes I understand the unique number, most of the time, but why not a unique index on the actual email address? maybe a GUID if you need something unique? thanks.

  • mlandry

    SSC Veteran

    Points: 283

    Good point re UPPER()

    Also, why stop at one CHECKSUM() when you could use two -- one for the local part and one for the domain -- and a bigint. This would dramatically lower the collisions. And lose the '@' because all email addresses contain it so it adds nothing.

    select convert(bigint, checksum(upper(substring('John.Smith@example.com', 1, charindex('@', 'John.Smith@example.com')-1))))

    * power(convert(bigint, 2),32) | checksum(upper(substring('John.Smith@example.com', charindex('@', 'John.Smith@example.com')+1, 255)))

  • doug 88182

    Old Hand

    Points: 327

    It is not hard to create email addresses where Checksums overlap.

    if checksum('ks@gmail.com') = checksum('nc@gmail.com')

    print 'Checksum function is weak'

    If you really want to do something like this, I'd recommend hashbytes:

    if convert(bigint,hashbytes('md5','ks@gmail.com')) <> convert(bigint,hashbytes('md5','nc@gmail.com'))

    print 'Hashbytes creates a better number space'

    But really, I think the overhead that is introduced in this process is counter productive. Why not just index the column? SQL server will do the hashing for you in a way that matches your collation. SQL automatically computes, stores & maintains the hashes for you in a b-tree when you make an index on a varchar column.

  • roger.plowman

    SSChampion

    Points: 10147

    Are we actually sure NVarchar(255) searches are inefficient? Since string comparisons are left to right the first 5 (certainly the first 10) characters would almost certainly eliminate the vast majority of the search space.

    Given SQL Server is all about efficiency I'd be interested in seeing a performance breakdown between the Checksum() and index seek versions.

    Although I absolutely agree email addresses should be their own entity.

  • vinaysharma77

    SSC Journeyman

    Points: 76

    Guys,

    Just a quick note about CHECKSUM. If you are planning on using CHECKSUM in just saving the data and retrieving it, then I think you should be good.

    I’ve burnt my hands in using CHECKSUM for the purpose of comparison and expecting it to return a unique value every time for every string and ran into two strings generating the exact same value for CHECKSUM. This was back in 2010 and we were using SQL Server 2008, so I thought this might've been fixed in SQL Server 2012. Looks like apparently not..

    I ran the CHECKSUM function on SQL Server 2012 database for those two string values and sure enough, it still returns the same results:

    Select CHECKSUM('SQL CAL 6.5 WinNT All Lng MVL ISV Royalty Embed')

    Select CHECKSUM('SQL CAL Runtime 6.5 WinNT All Lng MVL ISV Royalty Embed Runtime')

  • doug 88182

    Old Hand

    Points: 327

    roger.plowman (9/24/2015)


    Given SQL Server is all about efficiency I'd be interested in seeing a performance breakdown between the Checksum() and index seek versions.

    I managed a database with > 10 Million email addresses for a while. We did have the unique emails pulled out into a separate table which is probably a good practice in most relational data use cases.

    The original developer had built a hashing process with an indexed computed column on the table, thinking that the numeric output had to perform better. In our usage case, it did not. It just used up extra resources, since we had to index the varchar representation of the email address anyway and I felt that it made the database relationships overly complicated.

    Researching how the performance of the bigint hash could be worse than the varchar column, I recall discovering that:

    1) the varchar index has some order, that can help some "like" queries

    2) the varchar indexes contain collation specific hashes that are more efficient than a checksum or binary hash.

  • ATXJon

    Valued Member

    Points: 71

    triman165910 (9/24/2015)


    I apologize for asking ahead of time, but why the CHECKSUM()? yes I understand the unique number, most of the time, but why not a unique index on the actual email address? maybe a GUID if you need something unique? thanks.

    Your two options give up the advantages of the CHECKSUM()- faster seek time on searches. The INT comparison is much faster than comparing text. Using the e-mail address as the unique index forces text comparison, while the GUID provides no value as a search parameter.

    Edit - I should add that my statement above is regarding the intent of the original article. Doug's experience seems to demonstrate that in practice, this might not be the optimal solution.

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply