Indexing URLs and eMail address

  • What is the best way to index on URLS's and email address? I looked into creating a CheckSum column but also read that problems may result from some characters found in domain names. Does SQL 2008 provide a better way of doing this?

    Thanks!

  • RGDavis396 (11/29/2009)


    What is the best way to index on URLS's and email address?

    Same way you'd index anything else. Normal index on the column. Is there some reason you don't want to do this?

    I looked into creating a CheckSum column but also read that problems may result from some characters found in domain names.

    I suppose you could do a checksum (or hash) then index and filter on that. Question is, why go to all the added complexity just to index a string column. Are there some considerations that you haven't mentioned?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Searches against string values tend to be slower than those against numerical values and we are looking for ways to speed up searches against commonly used fields such as an email address.

  • RGDavis396 (11/30/2009)


    Searches against string values tend to be slower than those against numerical values and we are looking for ways to speed up searches against commonly used fields such as an email address.

    Adding an index will speed up the search Email address are usually not that long so you should still get good results. I wouldn't bother using Checksum coumn.

    If your URLS are really long then you could possibly use Full-text indexing, but this does have a lot of limitations and is better at searching specific phrases in very large strings such as a specific word in a document.

  • RGDavis396 (11/30/2009)


    Searches against string values tend to be slower than those against numerical values

    Have you tested against large volumes of data under load? If so, are you seeing a measurable, significant performance difference? Is that performance difference significant enough to go to extreme lengths to solve it?

    I did a quick test on the Adventureworks database, this is what I got over 10 runs (test code follows)

    Runs

    1: ID = 18 ms, Email = 22 ms

    2: ID = 17 ms, Email = 29 ms

    3: ID = 18 ms, Email = 24 ms

    4: ID = 17 ms, Email = 25 ms

    5: ID = 17 ms, Email = 24 ms

    6: ID = 19 ms, Email = 23 ms

    7: ID = 20 ms, Email = 22 ms

    8: ID = 17 ms, Email = 28 ms

    9: ID = 17 ms, Email = 24 ms

    10: ID = 18 ms, Email = 24 ms

    So a max of 12ms difference over 1000 executions, 0.012ms difference. You'd probably spend more than that in calculating the checksum/hash

    Wasn't a completely fair test, the seek by ID is on the unique clustered index, I made the NC on the email column unique to make things a bit more even. Should have created a unique id column with NC index to be completely fair

    DECLARE @id INT, @Email VARCHAR(5000)

    SET NOCOUNT ON

    DECLARE @i INT =0

    DECLARE @StartTime DATETIME2

    SET @StartTime = SYSDATETIME()

    WHILE (@i<1000)

    BEGIN

    SELECT @id = BusinessEntityID, @Email = EmailAddress

    FROM person.EmailAddress

    WHERE BusinessEntityID = 42

    SET @i = @i+1

    END

    PRINT ' By ID. Time elapsed ' + CAST(DATEDIFF(ms, @StartTime, SYSDATETIME()) AS VARCHAR(20))

    GO

    DECLARE @id INT, @Email VARCHAR(5000)

    SET NOCOUNT ON

    DECLARE @i INT =0

    DECLARE @StartTime DATETIME2

    SET @StartTime = SYSDATETIME ()

    WHILE (@i<1000)

    BEGIN

    SELECT @id = BusinessEntityID, @Email = EmailAddress

    FROM person.EmailAddress

    WHERE EmailAddress = N'james0@adventure-works.com'

    SET @i = @i+1

    END

    PRINT ' By Email. Time elapsed ' + CAST(DATEDIFF(ms, @StartTime, SYSDATETIME()) AS VARCHAR(20))

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I tend to agree with Gail that I'd just index the fields. If there are specific values that you need to pull out, you could add a computed column that just pulls out those values, like the domain from an email address, and index that column.

  • Thanks for your help everyone! It appears that it is as good as it is going to get so I think that we are going to leave things as they are.

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply