A Pattern for Email Address storage and retrieval

  • There appears to be quite a misunderstanding of how an index seek works on character data. Characters are compared position by position; as soon as one differs, the comparison stops and, based on what the difference was, the search "jumps" to a new part of the index.

    If you had a lot of addresses which all start with the same characters, there might be a performance benefit in calculating the checksum and searching on that. I haven't run a test, but my guess is that any benefit on the seek would be negated by the additional mucking about with temporary/memory tables and resolving checksum clashes.

    Further, the structure does not lend itself well to creating views. The suggestion that the WHERE clause include both the checksum and the actual address is quite sensible (with the index arranged to have the checksum first, you can be pretty sure that a sensible execution plan is chosen).

    An index on just the checksum would be smaller and thus mean less disk activity, whereas an index on checksum plus the actual address will be bigger than one on the address alone and make disk activity worse. But if you do have a lot of email addresses all starting with the same 10 or more characters, there could be a performance benefit, albeit with a larger index (checksum + address).

    But then, there is no need to actually create an index with both: if only a few addresses resolve to the same checksum, a bit of scanning won't be a problem.
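    For concreteness, a minimal sketch of the lookup pattern under discussion (table, column and variable names are hypothetical, not from the article):

    DECLARE @EmailAddress varchar(254) = 'someone@example.com'

    SELECT EmailAddress
    FROM dbo.Email
    WHERE EmailChecksum = CHECKSUM(@EmailAddress) -- seek on the small int key first
    AND EmailAddress = @EmailAddress              -- then resolve any checksum clashes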

  • I guess it is possible that I misunderstood what SQL Server is doing with the keyhash values stored in the index pages. It is certainly hashing the strings and storing them in the index. I'd be interested in finding out more about that if anyone can share.

    https://www.mssqltips.com/sqlservertip/1578/using-dbcc-page-to-examine-sql-server-table-and-index-data/

    My practical experience was that looking for an exact indexed string in a large email table was faster than calculating a hash, looking up the indexed hash, and then double-checking to make sure that the string matched in the end. Perhaps that was a quirk of the query optimizer, the state of our stats, or the server platform. I agree that indexed strings definitely take more pages and gain index levels faster because the keys are bigger, which can be a problem; but if you are going to index the email address anyway, for a unique constraint or something, then that index is not going to be additional overhead.

    I found that adding things like an indexed computed column with the domain name of the email address was more useful for our particular workload.
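    For anyone who wants to poke at index pages themselves, the linked article uses the undocumented DBCC PAGE command; a rough sketch (the database name and page number are placeholders):

    DBCC TRACEON(3604)                  -- route DBCC output to the client
    DBCC PAGE ('MyDatabase', 1, 256, 3) -- database, file id, page id, print option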

  • Isn't this just an application-level implementation of a hash index?

  • You need to be very careful with your data types. If you have a variable that is varchar and a variable that is nvarchar, they produce different hashes. Your programmers need to be careful or they will get incorrect results.

    They don't make the same checksum:

    select checksum(N'Test'), checksum('Test')

    Another pitfall to avoid if you add a checksum or a hash to a table is keeping things in sync: consider using a computed column or a trigger. You are screwed if someone updates an email address without updating the hash.
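    A minimal sketch of the computed-column route for the checksum itself (illustrative names; CHECKSUM is deterministic, so the computed column can be indexed):

    ALTER TABLE dbo.Email ADD EmailChecksum AS CHECKSUM(EmailAddress) -- recalculated automatically on update
    GO
    CREATE INDEX IX_Email_Checksum ON dbo.Email (EmailChecksum)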

    I find it very helpful to add a domain column to the table for indexed searching. Here's an example where an index on a computed column allows indexed searching by domain name. Using a computed column ensures that the domain stays in sync with the email address; it also reduces the footprint of the base table.

    CREATE TABLE #Emails(
        EmailId int identity(1,1) not null
        ,EmailAddress varchar(254)
        ,Domain as right(EmailAddress, charindex('@', reverse(EmailAddress)) - 1)
        ,constraint PK_Emails primary key (EmailId)
        ,constraint UX_Emails$EmailAddress unique (EmailAddress)
        ,constraint CK_Emails$Format check (EmailAddress like '%@%.%')
    )

    CREATE INDEX IX_Emails$Domain on #Emails(Domain)

    insert #Emails(EmailAddress) select 'testemail@testemail.org'

    select * from #Emails
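    The payoff (my own follow-up query, using the table above) is that a search by domain can seek IX_Emails$Domain instead of scanning the addresses:

    SELECT EmailAddress FROM #Emails WHERE Domain = 'testemail.org'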

  • Interesting - I have tried a little (rough) test.

    CREATE TABLE [dbo].[TestEmail](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [email] [nvarchar](200) NOT NULL,
        [chk] [int] NOT NULL,
        CONSTRAINT [PK_TestEmail] PRIMARY KEY CLUSTERED
        (
            [Id] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    CREATE UNIQUE NONCLUSTERED INDEX IX_Email_TestEmail ON dbo.TestEmail
    (
        email
    ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE UNIQUE NONCLUSTERED INDEX IX_Chk_Email_TestEmail ON dbo.TestEmail
    (
        chk,
        email
    ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    I filled it with 10 million rows and tried two queries:

    SELECT email FROM TestEmail WHERE chk = @chk AND email = @email

    SELECT email FROM TestEmail WHERE email = @email

    @email is one of my generated emails, and @chk is the corresponding checksum.

  • Both queries used the appropriate index.
  • Both queries required 4 logical reads (the index does cover the query).
  • The first query took unmeasurable time (0 ms); the second took 31 ms (see the measurement sketch below). But other tests with fewer rows didn't show such a remarkable gain.
  • So using the checksum has helped the select.
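    Figures like these are typically gathered with SET STATISTICS; a sketch, since the post doesn't show how the measurements were taken:

    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    SELECT email FROM TestEmail WHERE chk = @chk AND email = @email
    SELECT email FROM TestEmail WHERE email = @email
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF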

    Space used:

    rows      reserved    data       index_size  unused
    1000000   530328 KB   177784 KB  352400 KB   144 KB
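    This looks like sp_spaceused output; the call isn't shown in the post, but it would be along the lines of:

    EXEC sp_spaceused 'dbo.TestEmail'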

    Indexes on small tables can be relatively big.

    So if all the caveats are considered:

  • ensure you use the same data types (NVARCHAR or VARCHAR)
  • either use UPPER or use a case-insensitive collation (see the sketch below)
  • probably best to use a computed column to ensure consistency
  • consider splitting the domain out so searches by domain are possible
  • there is a performance gain that is measurable - but I think 10 million addresses is going it some.
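    To illustrate the case-sensitivity caveat (my own example, not from the test): CHECKSUM respects the collation in effect, so under a case-sensitive collation the same address in different cases produces different checksums, while UPPER (or a CI collation) keeps them comparable.

    -- the first pair differs under a case-sensitive collation; the UPPER'd pair always matches
    SELECT CHECKSUM('Test@Example.com'), CHECKSUM('test@example.com')
    SELECT CHECKSUM(UPPER('Test@Example.com')), CHECKSUM(UPPER('test@example.com'))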

    Other observations

    I tried using:

    SELECT * FROM TestEmail WHERE chk = @chk AND email = @email

    SELECT * FROM TestEmail WHERE email = @email

    This made the second query slower still (7 logical reads and 78 ms) because additional work was needed to fetch the other row values, whilst not having a big effect on the first query (4 reads and 2 ms).

  • andrew_c_worth@yahoo.co.uk (9/24/2015)


    Interesting - I have tried a little (rough) test. [...]

    Why are you using NVARCHAR for an email address?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Well, partly because the OP was doing so, and partly because I assume that some languages might need it.

  • I'm working on a problem similar to this one and solved it using a similar technique. I have millions of PDF attachments, and there was a developer bug where many duplicates were inserted with different primary keys. PDFs all have the same header data, so a string search would take a while. What I did was to get a BINARY_CHECKSUM() on all PDFs and then do a slower compare on matched checksums to determine whether the files were in fact duplicates.

    SQL Server uses hashes internally all the time, but I rarely think to do it at the business level since it adds complexity and rarely speeds things up.
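    A sketch of that approach, assuming a hypothetical table dbo.Attachments(AttachmentId int, PdfData varbinary(max)); none of these names are from the post:

    -- first pass: cheap checksum fingerprint; second pass: exact compare only on collisions
    ;WITH Hashed AS (
        SELECT AttachmentId, BINARY_CHECKSUM(PdfData) AS Chk
        FROM dbo.Attachments
    )
    SELECT h1.AttachmentId AS KeepId, h2.AttachmentId AS DupId
    FROM Hashed h1
    JOIN Hashed h2
        ON h2.Chk = h1.Chk
        AND h2.AttachmentId > h1.AttachmentId -- report each pair once
    JOIN dbo.Attachments a1 ON a1.AttachmentId = h1.AttachmentId
    JOIN dbo.Attachments a2 ON a2.AttachmentId = h2.AttachmentId
    WHERE a1.PdfData = a2.PdfData             -- slow, exact check on matched checksums only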

  • I've never looked at the distribution of addresses for 10 million plus rows. Do you wind up with good, even stats, or do you hit big groups (webmaster@....) that might cause issues (unless you used the hash)?

  • Thanks to all for commenting and contributing ideas. I think it was many moons ago when I first used this technique. In a small-scale test of a little more than 1 million rows in the table, with an average email address length of 31, on SQL 2014 SP1 on a laptop, a couple of observations:

    1) Selecting by checksum and email address had twice the query cost and logical reads of selecting by email address alone.

    2) Also, the index on the checksum including the address was 74% of the size of the clustered index, whereas the index on just the email address was 130% of the size of the clustered index.

    So I would agree with most of you: it is likely the extra checksum isn't really helping improve retrieval efficiency, and it is also increasing table size and index size. It looks like SQL Server has improved its character indexing significantly. Of course, keeping email addresses in a single table with an INT or BIGINT surrogate key still makes a ton of sense.

    Thanks again for all of the discussion.

  • What also matters is the use case for creating such indexes. True, indexes add storage cost, and I have seen DBAs frown on creating extra indexes, as managing them during bulk inserts is a mess unless you resort to disabling or dropping them and later rebuilding or recreating them.

    Most of the time, end users prefer to do a 'LIKE' search - which means the cost of maintaining a checksum index versus an NVARCHAR index needs to be assessed, since the LIKE operator cannot use a checksum (a number).

    Only a performance test on a specific population of data, in the context of the search use cases, will help decide on the right type of index; there is no absolute guide or solution for this scenario.
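    To make the LIKE point concrete (an illustrative query against a hypothetical dbo.Email table): a checksum index cannot express a pattern search, though a LIKE with a leading literal can still seek the index on the string column, and an indexed computed domain column (as shown earlier in the thread) covers domain searches.

    SELECT EmailAddress
    FROM dbo.Email
    WHERE EmailAddress LIKE 'john.smith@%' -- can seek an index on EmailAddress
    -- LIKE '%@example.com' would scan unless a domain (or reversed) computed column is indexed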

  • ChrisM@Work (9/24/2015)


    Why are you using NVARCHAR for an email address?

    We now have Unicode URLs, so why not Unicode characters for email addresses? You can Google "Unicode email", and there's also this Wikipedia article: https://en.wikipedia.org/wiki/International_email

    There are some interesting issues that we're going to have to cope with at some point, especially if you're a multi-national org.

    -----
    Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson

  • Re the 10 Mill addresses and the stats.

    To be blunt, I generated random character strings for the addresses (there was an @ in the middle), so they hardly represent a real-world example - I was more interested in whether using a hash provided a performance improvement on selecting a record.

    The result was, to me, quite surprising - the hash made a big improvement to query performance. As others say, there is more to this - for instance, is the insert performance hit significant (multiple indexes to maintain)?

  • Wayne West (9/29/2015)


    ChrisM@Work (9/24/2015)


    Why are you using NVARCHAR for an email address?

    We now have Unicode URLs, so why not Unicode characters for email addresses? You can Google "Unicode email", and there's also this Wikipedia article: https://en.wikipedia.org/wiki/International_email

    There are some interesting issues that we're going to have to cope with at some point, especially if you're a multi-national org.

    It's on the way - possibly - but it isn't here yet. Now bearing in mind that the point of the article was efficiency, I'd still argue in favour of VARCHAR.


  • If you are doing single email look-ups, I don't think anyone doubts the hash lookup can give you a reduction in logical I/O for a slight increase in disk space, and probably memory usage and CPU load in most cases. This might be a worthwhile trade-off for some people. Generally I think it's about a wash, because my systems were not limited by single email look-ups.

    The scenarios where I was database-limited were the cases where I'd get a list of 1M unhashed email addresses and needed to look up marketing campaigns that used EmailId. The overhead of creating the hash on the 1M records in order to join on it made the benefit versus the programming complexity undesirable. If I had a chance to do it again, I would stick with the simpler schema and simpler programming and let the database solve the joins for me in the way that it wants to.

    Here's some sample code:

    /*
    !!!! Test requires about 1GB of tempdb space !!!!
    */

    -- make test tables
    CREATE TABLE #Emails(
        EmailId int identity(1,1) not null primary key
        ,EmailAddress varchar(254)
        ,constraint UX_Emails$EmailAddress unique (EmailAddress)
    )
    CREATE INDEX IX_Emails$EmailAddress on #Emails(EmailAddress)

    CREATE TABLE #HashedEmails(
        EmailId int identity(1,1) not null primary key
        ,EmailAddress varchar(254)
        ,Active bit
        ,HashVal as checksum(EmailAddress)
        ,constraint UX_HashedEmails$EmailAddress unique (EmailAddress)
    )
    CREATE INDEX IX_HashedEmails$EmailAddress on #HashedEmails(EmailAddress)

    -- make one million fakey emails
    insert #Emails(EmailAddress)
    select top 1000000 LEFT(newid(), convert(varbinary, newid()) & 5 + 4)
        + '@' + LEFT(newid(), convert(varbinary, newid()) & 8 + 10) + '.' + 'COM'
    from sys.columns c
    cross join sys.columns c2
    cross join sys.columns c3

    -- add the 1 mil fakey emails
    -- + 4 million more new fakey emails
    insert #HashedEmails(EmailAddress, Active)
    select EmailAddress, 1 from #Emails
    union all
    select top 4000000 LEFT(newid(), convert(varbinary, newid()) & 5 + 4)
        + '@' + LEFT(newid(), convert(varbinary, newid()) & 8 + 10) + '.' + 'COM'
        , 1
    from sys.columns c
    cross join sys.columns c2
    cross join sys.columns c3

    exec tempdb..sp_spaceused #HashedEmails
    -- 612MB

    CREATE INDEX IX_HashedEmails$HashVal on #HashedEmails(HashVal)

    exec tempdb..sp_spaceused #HashedEmails
    -- 682MB
    -- hash index adds 80MB or about ~16 bytes per row

    -- join on email alone is slightly cheaper
    select COUNT(*)
    from #Emails e
    join #HashedEmails he
        on he.EmailAddress = e.EmailAddress
    where he.Active = 1

    select COUNT(*)
    from #Emails e
    join #HashedEmails he
        on he.EmailAddress = e.EmailAddress
        and he.HashVal = CHECKSUM(e.EmailAddress)
    where he.Active = 1

    RFC 3696 says that emails are varchar(254) with a CI collation, and that apps that use email addresses should "xn--" encode non-ASCII characters to fit in ASCII.

    Applications should be prepared to accept and process the encoded form (those strings are consistent with the "LDH rule" (see Section 2) so should not raise any separate issues) and the use of local, and potentially other, characters as appropriate to local systems and circumstances.
