search index

  • In a table with a million records, you want to search for all users who have an email "@gmail", what is the most convenient option?

    a. Email Index

    b. Index by Email that also includes the UserID

    c. Neither one is necessary.

  • I'm not going to answer your question.  YOU are.  Here's a test table that you can try all 3 answers on.  Don't let the "million" row thing scare you... this only takes a little over a second on my box.

    --===== Create a million row test table for this problem
    DROP TABLE IF EXISTS dbo.SimulatedUserTable;
    GO
    WITH cteGenUserID AS
    (
    SELECT TOP 1000000
    UserID = 10000000+ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
    )
    SELECT UserID = ISNULL(UserID,0) --This ISNULL makes this column NOT NULL in the resulting table.
    ,Email = REPLACE(CONCAT(RIGHT(NEWID(),CRYPT_GEN_RANDOM(1)%10+10),'@',CHAR(UserID%26+65),'mail'),'-','.')
    ,Fluff = CONVERT(CHAR(200),'X') --Just simulate 200 bytes of the bulk of other columns
    INTO dbo.SimulatedUserTable
    FROM cteGenUserID
    ;
    ALTER TABLE dbo.SimulatedUserTable
    ADD CONSTRAINT PK_SimulatedUserTable PRIMARY KEY CLUSTERED (UserID)
    ;
    --===== Let's see what we've got
    SELECT TOP 10 *
    FROM dbo.SimulatedUserTable
    ORDER BY UserID
    ;

    Here's what the first 10 rows look like.  They'll be different for you because it's random data for the Email addresses.

    Also, if you're going for an interview, be able to answer the question of "Why did you pick that one?" and be prepared to explain why the other ones are not the right answer.

    IMHO and as always, "It Depends".  If you have to do this often, then none of the answers are actually correct.  That means that you should also be prepared for the eventual question of "Can you think of a better way?"

    p.s. Actual tests with code are the only true way to learn from such "nifty" questions.  The thing I hate about these types of questions of this type is, what do they mean by the word "convenient" and convenient to who?  Convenient to a Developer may mean doing the least amount of work possible to get the answer. 😀  If they mean most convenient for the user, they're going to expect the best performance possible.  And that's why "It Depends".  This question is "nifty" for an in-person interview because it creates a what if environment where you can justify and of the 3 answers and a couple of  "cool" answers to get the best performance.  On an actual exam, it sucks as a question because of the words "most convenient".  They also didn't tell you how wide the rows are and that can also make a huge difference in what the answer should be.  Don't be afraid to change that "200" to "2000" to see what I mean.

    • This reply was modified 1 year, 8 months ago by  Jeff Moden. Reason: Added more suggestions/info

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Naturally b is the answer to that specific question.

    As to why:

    (1) the table has 1M rows; therefore, creating an index with only the columns you need will typically drastically cut the I/O, and thus the time, required to read all matching data

    (2) You need to include the UserID in the index because the query will need that data; note that the q says: "you want to search for all users".  If the UserID is not in the index, SQL will ignore the index and have to scan the whole table, which is what you're trying to avoid

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • "It Depends"on what "most convenient" means.  You've just added an index that's going to do page splits all over the place unless you use a Fill Factor, which most will call a waste of space but will help prevent page splits because of its random order but only if you maintain the index correctly and that won't be "convenient" for the DBA and if the DBA makes it "convenient" for themself and simply ignores the index, the slow down from pages splits won't be "convenient" for the end users and the space wasted won't be "convenient" for the people that are concerned about memory, etc.

    Yep... as written, "b" may be the most correct answer for the very reasons you state but, in that case, I'd have to say it's a really poorly written question to begin with. 😀

    And I was really hoping that, with the test data, the OP would have learned all of this and the supposed correct answer on their own. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff:  you text data was amazing!!! becouse that I undertood your point about the issue....and thanks to ScottPletcher for your explanation. My doubt was cleared.!

     

     

  • That is a simple q, I say give it a simple answer.  There's no need to over-analyze every q or nit pick its wording.  Of course interview qs don't contain every nuance of every possible condition, since an interview needs to completed sometime that day.

    I actually think "most convenient" is appropriate enough for that q (and thus avoiding "ideal" or "best-performing").

    If you really want to deep-dive the q, the proper thing would be to redesign the data to encode (convert to a numeric code) all email domains.  Because:

    (1) then the domain name is a constant physical length in the table (say 2 bytes, unless you need to store > ~64K domains)

    (2) if a domain changes, you just change the text name (like, say, [@]facebook to [@]meta) in the lookup table, you don't have to change strings throughout tables

    But, again, I wouldn't expect nor appreciate that as an answer to the original q.

    One could argue that you should also encode the name portion of an email address, but that's a much more complicated decision.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • marcelo salgado wrote:

    Jeff:  you text data was amazing!!! becouse that I undertood your point about the issue....and thanks to ScottPletcher for your explanation. My doubt was cleared.!

    Thank you for the kind feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ScottPletcher wrote:

    That is a simple q, I say give it a simple answer.  There's no need to over-analyze every q or nit pick its wording.

    You're missing the point, Scott.  I agree that the answer closest to being right is "b".  What if they author of the question say's it's actually "c" and you're looking for a job.  You wouldn't nit pick that at all? 😉

    And what's wrong with trying to get the OP to come to their own conclusion and THEN providing the answer as a part of the discussion if they don't get it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    ScottPletcher wrote:

    That is a simple q, I say give it a simple answer.  There's no need to over-analyze every q or nit pick its wording.

    You're missing the point, Scott.  I agree that the answer closest to being right is "b".  What if they author of the question say's it's actually "c" and you're looking for a job.  You wouldn't nit pick that at all? 😉

    And what's wrong with trying to get the OP to come to their own conclusion and THEN providing the answer as a part of the discussion if they don't get it?

    Of the choices given, to me the only possible answer is b.  I, myself, don't try to sharpshoot during an interview, since I believe it is counterproductive.  Also, I consider the wording actually correct, since the "most convenient" thing to do is slap an index on email and include UserID in that index (i.e. without worrying about the mechanics of how that index will function).

    As to the latter point, I don't object to just leading the person to the answer, although it wouldn't be my approach for something this simple.  However, I think you had an oopsie moment, since this index:

    ALTER TABLE dbo.SimulatedUserTable

    ADD CONSTRAINT PK_SimulatedUserTable PRIMARY KEY CLUSTERED (UserID)

    doesn't meet the conditions of the problem.  That would greatly confuse me if I were trying to learn from the example.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Who says?  It didn't specify anything about the table.  Consider it to be "most convenient". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    Who says?  It didn't specify anything about the table.  Consider it to be "most convenient". 😀

    The OP said:

    you want to search for all users who have an email "@gmail"'

    The table you posted does not support that using an index.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    Jeff Moden wrote:

    Who says?  It didn't specify anything about the table.  Consider it to be "most convenient". 😀

    The OP said:

    you want to search for all users who have an email "@gmail"'

    The table you posted does not support that using an index.

    That's another problem with the question.  It doesn't specifically say whether or not that's in a separate column or not and so, in the absence of any other information, you have to assume that part of email addresses are NOT in a separate column.

    That's another reason why the question has a major suck factor.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It doesn't.  You're wayyyy over-analyzing this.  Again, it's a simple q with a simple answer.  Save the big analysis for a more significant q.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    It doesn't.  You're wayyyy over-analyzing this.  Again, it's a simple q with a simple answer.  Save the big analysis for a more significant q.

    Says the guy with the critique on the simple test table. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply