andrew_c_worth@yahoo.co.uk (9/24/2015)
Interesting - I have tried a little - rough - test.

CREATE TABLE [dbo].[TestEmail](
[Id] [int] IDENTITY(1,1) NOT NULL,
[email] [nvarchar](200) NOT NULL,
[chk] [int] NOT NULL,
CONSTRAINT [PK_TestEmail] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE UNIQUE NONCLUSTERED INDEX IX_Email_TestEmail ON dbo.TestEmail
(
email
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE UNIQUE NONCLUSTERED INDEX IX_Chk_Email_TestEmail ON dbo.TestEmail
(
chk,
email
) WITH( STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
I filled it with 10 million rows and tried two queries:
SELECT email FROM TestEmail WHERE chk = @chk AND email = @email
SELECT email FROM TestEmail WHERE email = @email
@email = one of my generated emails, and @chk is the corresponding checksum
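For reference, the intent is roughly this - chk is assumed to hold CHECKSUM of the address, so the lookups were run along these lines (a sketch, not the exact harness):

-- Assumption: chk was populated as CHECKSUM(email), e.g.
-- UPDATE dbo.TestEmail SET chk = CHECKSUM(email);

DECLARE @email nvarchar(200) = N'xyz@abc.com';  -- one of the generated addresses
DECLARE @chk int = CHECKSUM(@email);            -- the corresponding checksum

SELECT email FROM dbo.TestEmail WHERE chk = @chk AND email = @email;
SELECT email FROM dbo.TestEmail WHERE email = @email;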
Both queries used the appropriate index and both required 4 logical reads (the index does cover the query). The first query took an unmeasurable time (0 ms); the second took 31 ms. Other tests with fewer rows didn't show such a remarkable gain, so at this scale using the checksum has helped the select.
Space used
rows      reserved    data       index_size  unused
1000000   530328 KB   177784 KB  352400 KB   144 KB
Indexes on small tables can be relatively big.
So if all the caveats are considered:
- ensure you use the same data types (NVARCHAR or VARCHAR)
- either use UPPER or use a case-insensitive collation
- probably best to use a computed column to ensure consistency (see the sketch below)
- consider splitting the domain out so searches by domain are possible
- there is a performance gain that is measurable - but I think 10 million addresses is going it some.
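On the computed-column point, the test table could be defined along these lines instead, so the checksum can never drift out of step with the address (a sketch; the UPPER wrapper and the TestEmail2 name are just illustrative choices):

CREATE TABLE dbo.TestEmail2(
    Id int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    email nvarchar(200) NOT NULL,
    chk AS CHECKSUM(UPPER(email)) PERSISTED  -- computed, so it always matches email
);

CREATE UNIQUE NONCLUSTERED INDEX IX_Chk_Email_TestEmail2 ON dbo.TestEmail2 (chk, email);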
Other observations
I tried using:
SELECT * FROM TestEmail WHERE chk = @chk AND email = @email
SELECT * FROM TestEmail WHERE email = @email
This made the second query slower still (7 logical reads and 78 ms) because additional work was needed to fetch the other row values, whilst not having a big effect on the first query (4 reads and 2 ms).
Why are you using NVARCHAR for an email address?
Well, partly because the OP was doing so, and partly because I assume it is possible that some languages might need it.
I'm working on a problem similar to this one and solved it using a similar technique. I have millions of PDF attachments, and there was a developer bug where many duplicates were inserted with different primary keys. PDFs all have the same header data, so a string search would take a while. What I did was get a BINARY_CHECKSUM() on all PDFs and then do a slower compare on matched checksums to determine whether the files were in fact duplicates.
SQL Server uses hashes internally all the time, but I rarely think to do it at the business level since it adds complexity and rarely speeds things up.
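The rough shape of that pass was the following (table and column names here are invented for illustration):

-- Cheap checksum match first, then the slow byte-for-byte compare only on the matches.
SELECT a1.AttachmentId AS KeepId,
       a2.AttachmentId AS DuplicateId
FROM dbo.PdfAttachment AS a1
JOIN dbo.PdfAttachment AS a2
  ON  BINARY_CHECKSUM(a2.Content) = BINARY_CHECKSUM(a1.Content)
  AND a2.AttachmentId > a1.AttachmentId      -- avoid pairing a row with itself or twice
WHERE a1.Content = a2.Content;               -- the slower, definitive comparison

In practice you would persist BINARY_CHECKSUM(Content) in an indexed column rather than computing it on the fly, which is essentially the same idea as the chk column on the email table.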
I've never looked at the distribution of addresses for 10mil plus rows. Do you wind up with good even stats, or do you hit big groups (webmaster@....) that might cause issues (unless you used the hash)?
Thanks to all for commenting and contributing ideas. I think it was many moons ago when I first used this technique. In doing a small scale test of a little more than 1 million rows in the table, with an average length of 31 for email addresses, on SQL 2014 SP1 on a laptop, a couple of observations:
1) Selecting by checksum and email address had twice the query cost and logical reads as selecting by email address alone.
2) Also, the index size for the checksum and including the address was 74% of the size of the clustered index, whereas the index on just the email address was 130% of the size of the clustered index (a query for reproducing this comparison follows the list).
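For anyone who wants to reproduce that size comparison, per-index space can be pulled from sys.dm_db_partition_stats; a sketch against Andrew's dbo.TestEmail layout (swap in your own table name):

SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 AS used_kb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
WHERE ps.object_id = OBJECT_ID(N'dbo.TestEmail')
GROUP BY i.name;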
So I would agree with most of you: it is likely the extra checksum isn't really helping improve retrieval efficiency. It is also increasing table size and index size. It looks like SQL Server has improved its character indexing significantly. Of course, keeping email addresses in a single table with an INT or BIGINT surrogate key still makes a ton of sense.
Thanks again for all of the discussion.
What also matters is the use case for creating such indexes. True - indexes add storage cost, and I have seen DBAs frown on creating extra indexes, as managing indexes during bulk inserts is a mess unless you resort to disabling / dropping them and later rebuilding / recreating them.
Most of the time end users would prefer to do a 'LIKE' search - which means the cost of maintaining a checksum index vs an NVARCHAR index needs to be assessed, since the LIKE operator cannot use a checksum (a number).
Only a performance test against a specific population of data, in the context of the search use cases, will help decide the right type of index to create; there is no absolute guide / solution for such a scenario.
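To illustrate the LIKE point against Andrew's test table: a prefix search can still seek on the character index, but the checksum index is no help, because the CHECKSUM of a partial string bears no relation to the CHECKSUM of the full address.

-- Can seek on IX_Email_TestEmail (the index on email):
SELECT email FROM dbo.TestEmail WHERE email LIKE N'john.smith%';

-- No equivalent seek is possible on the chk index; you cannot derive a checksum
-- range from a partial string, so chk only helps exact-match lookups.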
ChrisM@Work (9/24/2015)
Why are you using NVARCHAR for an email address?
We now have Unicode URLs, so why not Unicode characters for email addresses? You can Google Unicode Email and there's also this Wikipedia article: https://en.wikipedia.org/wiki/International_email
There's some interesting issues that we're going to have to cope with at some point, especially if you're a multi-national org.
Re the 10 Mill addresses and the stats.
To be blunt, I generated random character strings for the addresses (with an @ in the middle), so they hardly represent a real-world example - I was more interested in whether using a hash provided a performance improvement on selecting a record.
The result was, to me, quite surprising - the hash made a big improvement to query performance. As others say, there is more to this - is the insert performance hit significant (multiple indexes to maintain)?
Wayne West (9/29/2015)
ChrisM@Work (9/24/2015)
Why are you using NVARCHAR for an email address?
We now have Unicode URLs, so why not Unicode characters for email addresses? You can Google Unicode Email and there's also this Wikipedia article: https://en.wikipedia.org/wiki/International_email
There's some interesting issues that we're going to have to cope with at some point, especially if you're a multi-national org.
It's on the way - possibly - but it isn't here yet. Now bearing in mind that the point of the article was efficiency, I'd still argue in favour of VARCHAR.
If you are doing single email look-ups, I don't think anyone doubts the hash lookup can give you a reduction in logical I/O for a slight increase in disk space, and probably memory usage & CPU load in most cases. This might be a worthwhile trade-off for some people. Generally I think it's about a wash, because my systems were not limited by single email look-ups.
The scenarios where I was database limited were the cases where I'd get a list of 1M unhashed email addresses and I needed to look up marketing campaigns that used EmailId. The overhead of creating the hash on the 1M records to join on the hash made the benefit vs programming complexity undesirable. If I had a chance to do it again, I would stick with the simpler schema and simpler programming and let the database solve the joins for me in the way that it wants to do it.
Here's some sample code:
/*
!!!! Test requires about 1GB of tempdb space !!!!
*/
-- make test tables
CREATE TABLE #Emails(
EmailId int identity(1,1) not null primary key
,EmailAddress varchar(254)
,constraint UX_Emails$EmailAddress unique (EmailAddress)
)
CREATE INDEX IX_Emails$EmailAddress on #Emails(EmailAddress)
CREATE TABLE #HashedEmails(
EmailId int identity(1,1) not null primary key
,EmailAddress varchar(254)
,Active bit
,HashVal as checksum(EmailAddress)
,constraint UX_HashedEmails$EmailAddress unique (EmailAddress)
)
CREATE INDEX IX_HashedEmails$EmailAddress on #HashedEmails(EmailAddress)
-- make one million fakey emails
insert #Emails(EmailAddress)
select top 1000000 LEFT(newid(), convert(varbinary,newid()) & 5 + 4)
+'@'+ LEFT(newid(), convert(varbinary,newid()) & 8 + 10 ) + '.' + 'COM'
from sys.columns c
cross join sys.columns c2
cross join sys.columns c3
-- add the 1 mil fakey emails
-- + 4 million more new fakey emails
insert #HashedEmails(EmailAddress, Active)
select EmailAddress, 1 from #Emails
union all
select top 4000000 LEFT(newid(), convert(varbinary,newid()) & 5 + 4)
+'@'+ LEFT(newid(), convert(varbinary,newid()) & 8 + 10 ) + '.' + 'COM'
, 1
from sys.columns c
cross join sys.columns c2
cross join sys.columns c3
exec tempdb..sp_spaceused #HashedEmails
-- 612MB
CREATE INDEX IX_HashedEmails$HashVal on #HashedEmails(HashVal)
exec tempdb..sp_spaceused #HashedEmails
-- 682MB
-- hash index adds 80MB or about ~16 bytes per row
-- join on email alone is slightly cheaper
select COUNT(*)
from #Emails e
join #HashedEmails he
on he.EmailAddress = e.EmailAddress
where he.Active = 1
select COUNT(*)
from #Emails e
join #HashedEmails he
on he.EmailAddress = e.EmailAddress
and he.HashVal = CHECKSUM(e.EmailAddress)
where he.Active = 1
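One way to see the difference between those two joins is to enable statistics around them (a standard technique, not part of the original script):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- ...run the two COUNT(*) queries above and compare logical reads / CPU time...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;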
RFC 3696 says that emails are varchar(254) with a CI collation, and that apps that use email addresses should "xn--" encode non-ASCII characters to fit in ASCII.
Applications should be prepared to accept and process the encoded form (those strings are consistent with the "LDH rule" (see Section 2) so should not raise any separate issues) and the use of local, and potentially other, characters as appropriate to local systems and circumstances.