Detect duplicate rows - with a twist

  • Jeff,

    I knew you'd have a bit to say here when you read it. I modified his table def for RawData to match the def in his cursor routine.

    Scott,

    Yes, your index key is limited to 900 bytes, not counting any included columns you might add for whatever reason.

  • Jeff Moden (4/27/2009)


    Just a parting thought. I recommended working on a staging table with separate columns and indexes on the couple of columns that actually determine a dupe CDR. The recommendation is not made lightly. Using the method I posted, I dupe check across 31 sets (1 for each day of the month) of 3 databases and a "recycle" database, for a total of (up to, but usually 1 or 2 fewer depending on the month) 94 databases with about 4 million rows in each database... in about 11 minutes.

    Thanks for your post, Jeff. My skepticism that RBAR can be completely eliminated is waning. I do have one other stored proc that uses cursors, and given the excellent set-based solutions here I'm motivated to try to re-write that one as well. If I get stuck (which is likely) I may ask for more help.

    I'm interested as well in your "staging table" comments. Would this be only for detecting duplicate records, or would you then create a view over all of the tables for use when data needs to be accessed across all tables? Also, I am completely unfamiliar with partitioning. Could similar performance gains be realized by partitioning the table by date? Or month?
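
    For concreteness, here is a minimal sketch of the set-based dupe check Jeff describes in the quote above. Every table and column name (Staging_CDR, FinalCDR, OrigNumber, TermNumber) is made up for illustration; the real dupe-determining columns come from the CDR format:

        -- Hypothetical staging table; in practice, split the raw record into
        -- separate columns so the dupe-determining ones can be indexed.
        CREATE TABLE dbo.Staging_CDR
        (
            StagingID  INT IDENTITY(1,1) PRIMARY KEY,
            CallDate   DATETIME     NOT NULL,
            OrigNumber VARCHAR(20)  NOT NULL,  -- assumed dupe-determining columns
            TermNumber VARCHAR(20)  NOT NULL,
            RawData    VARCHAR(400) NOT NULL   -- full record, kept for later processing
        );

        CREATE INDEX IX_Staging_CDR_Dupe
            ON dbo.Staging_CDR (CallDate, OrigNumber, TermNumber);

        -- Duplicates within the staged file itself, found in one set-based pass:
        SELECT CallDate, OrigNumber, TermNumber, COUNT(*) AS Copies
        FROM dbo.Staging_CDR
        GROUP BY CallDate, OrigNumber, TermNumber
        HAVING COUNT(*) > 1;

        -- Staged rows that already exist in the (hypothetical) final table:
        SELECT s.StagingID
        FROM dbo.Staging_CDR AS s
        WHERE EXISTS (SELECT 1
                        FROM dbo.FinalCDR AS f
                       WHERE f.CallDate   = s.CallDate
                         AND f.OrigNumber = s.OrigNumber
                         AND f.TermNumber = s.TermNumber);

    The GROUP BY catches dupes inside the load; the EXISTS catches rows already in the final table. Both are single set-based passes that can seek on the index, which is where the speed comes from.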

  • Lynn Pettis (4/27/2009)


    Jeff,

    I knew you'd have a bit to say here when you read it. I modified his table def for RawData to match the def in his cursor routine.

    Scott,

    Yes, your index key is limited to 900 bytes, not counting any included columns you might add for whatever reason.

    Sorry about the confusion over mis-matched data sizes for RawData. The RawData column used to be VARCHAR(300) but we encountered a new format that is longer than 300 bytes so it was expanded in our test database.

    But it brings up a question I have about VARCHAR columns that maybe you guys would be kind enough to explain to me. As I understand it, the VARCHAR column only occupies the amount of space required by the actual value(s) stored in the column. If that is the case, why not create all VARCHAR columns as VARCHAR(MAX)? I realize there are reasons involving business rules, but I'm asking more from a technical perspective.

  • The reason you wouldn't want to do that is simple. First of all, the data types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) are blob data types that can hold up to 2 GB of data. As such, SQL Server handles them differently than the regular data types VARCHAR(n), NVARCHAR(n), and VARBINARY(n). If a VARCHAR(MAX) value can be stored in row, SQL Server will do so, but if it would cause the row to exceed 8,060 bytes, SQL Server has to do additional work to store the data.

    Second, when you are defining your data, you should use the data type that best represents it. If you have a variable-length string that will not exceed 512 bytes, declare it as varchar(512), not varchar(1024) or even varchar(max).

    This is just a quick and very brief explanation, and I'm sure others will provide you with more info as well.
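
    A small demonstration of Lynn's second point, that the declared length is itself a cheap data-integrity check. The table and values are hypothetical:

        CREATE TABLE dbo.Demo
        (
            Code VARCHAR(10),    -- declared size doubles as a cheap integrity check
            Blob VARCHAR(MAX)
        );

        -- Fails by default (ANSI_WARNINGS on): "String or binary data would be
        -- truncated."  The declared length catches over-long data at the door.
        INSERT INTO dbo.Demo (Code) VALUES (REPLICATE('x', 50));

        -- The MAX column happily accepts the same value (and up to 2 GB),
        -- so a data-quality problem can slide straight into the table.
        INSERT INTO dbo.Demo (Blob) VALUES (REPLICATE('x', 50));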

  • Scott Roberts (4/29/2009)


    Sorry about the confusion over mis-matched data sizes for RawData. The RawData column used to be VARCHAR(300) but we encountered a new format that is longer than 300 bytes so it was expanded in our test database.

    But it brings up a question I have about VARCHAR columns that maybe you guys would be kind enough to explain to me. As I understand it, the VARCHAR column only occupies the amount of space required by the actual value(s) stored in the column. If that is the case, why not create all VARCHAR columns as VARCHAR(MAX)? I realize there are reasons involving business rules, but I'm asking more from a technical perspective.

    As Lynn stated, the "MAX" datatypes are BLOBs that are handled quite differently... they can end up being stored "out of row", and there can actually be quite a performance penalty for that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Scott Roberts (4/29/2009)


    I'm interested as well in your "staging table" comments. Would this be only for detecting duplicate records, or would you then create a view over all of the tables for use when data needs to be accessed across all tables? Also, I am completely unfamiliar with partitioning. Could similar performance gains be realized by partitioning the table by date? Or month?

    Truth be spoken, I never import directly from file to final table. I always go through a staging table to fully validate both the data and the integrity of the file (i.e., partial file, bad lines, etc.) before I insert the data into the final table(s).

    Partitioning the tables, whether using a partitioned view (deprecated) or "real" partitioned tables, has many performance advantages, and not just in the code at hand. The reduced size of the tables makes index maintenance a lot less "obtrusive", and both methods can be made to very quickly move older data to an "online archive" status, in a highly automated, virtually hands-free fashion. For example, I keep 120 days' worth of CDRs in an "active online" status. I keep a year's worth of CDRs in an "online archived" status in case any FBI or Homeland Security requests for call records come in. I keep 3 years' worth of CDRs in a "near online archive" on a separate, very limited-access server (obviously not part of a partitioned table on the main server, but partitioned on the archive server). Anything older than 3 years is available by month on tape.

    So far as the speed of 93-day (3-month) dupe checks goes... it doesn't matter whether the days are in separate databases/tables or in one big one; you still need to check against all CDRs, so no performance advantage there, although partitioning sure does make the code a lot easier to write as opposed to separate tables or databases.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
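
    For readers following along, here is a rough outline of the stage-validate-insert pattern Jeff describes above. All names, the file path, and the validation rule are made up; this is the shape of the process, not his actual code:

        -- Assumed two-column staging table: dbo.CDR_Stage(CallDate, RawData).
        TRUNCATE TABLE dbo.CDR_Stage;

        BULK INSERT dbo.CDR_Stage
        FROM 'C:\Imports\cdr_20090429.txt'   -- hypothetical file path
        WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

        -- Validate before anything touches the final table: reject the whole
        -- load if any line fails a basic sanity check (partial file, bad lines).
        IF EXISTS (SELECT 1 FROM dbo.CDR_Stage
                   WHERE CallDate IS NULL OR RawData = '')
        BEGIN
            RAISERROR('Import failed validation; final table untouched.', 16, 1);
        END
        ELSE
        BEGIN
            INSERT INTO dbo.FinalCDR (CallDate, RawData)
            SELECT CallDate, RawData
            FROM dbo.CDR_Stage;
        END

    The point of the pattern is that a bad file can be thrown away wholesale: the final table is never touched until the entire load has passed validation.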

  • Jeff Moden (4/29/2009)


    Scott Roberts (4/29/2009)


    Sorry about the confusion over mis-matched data sizes for RawData. The RawData column used to be VARCHAR(300) but we encountered a new format that is longer than 300 bytes so it was expanded in our test database.

    But it brings up a question I have about VARCHAR columns that maybe you guys would be kind enough to explain to me. As I understand it, the VARCHAR column only occupies the amount of space required by the actual value(s) stored in the column. If that is the case, why not create all VARCHAR columns as VARCHAR(MAX)? I realize there are reasons involving business rules, but I'm asking more from a technical perspective.

    As Lynn stated, the "MAX" datatypes are BLOBs that are handled quite differently... they can end up being stored "out of row", and there can actually be quite a performance penalty for that.

    MSDN states that, by default, "MAX" data types are stored in-row unless they exceed 8,000 bytes, in which case it appears that only the "remaining" data gets stored out-of-row.

    http://msdn.microsoft.com/en-us/library/ms173530.aspx

    Keep in mind that I'm not advocating defining everything as "MAX". I'm just wondering if there is any technical difference between VARCHAR(300), VARCHAR(5000), and VARCHAR(MAX) where the actual value of the data stored in the column is always less than 8,000 bytes. I'm getting the feeling that setting the number of characters for a VARCHAR column is really just a business rule definition and has no real technical significance.
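
    One concrete bit of evidence for the in-row default Scott cites: it is a per-table option that can be flipped with sp_tableoption, which in itself shows the engine treats the MAX types as a special case. This reuses the hypothetical dbo.Demo table from the sketch further up:

        -- Default: a MAX value that fits in the row (up to ~8,000 bytes) is
        -- stored in-row, exactly as the MSDN page says.
        EXEC sp_tableoption 'dbo.Demo', 'large value types out of row', 1;  -- force off-row
        EXEC sp_tableoption 'dbo.Demo', 'large value types out of row', 0;  -- back to the in-row default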

  • Scott Roberts (4/29/2009)


    Jeff Moden (4/29/2009)


    Scott Roberts (4/29/2009)


    Sorry about the confusion over mis-matched data sizes for RawData. The RawData column used to be VARCHAR(300) but we encountered a new format that is longer than 300 bytes so it was expanded in our test database.

    But it brings up a question I have about VARCHAR columns that maybe you guys would be kind enough to explain to me. As I understand it, the VARCHAR column only occupies the amount of space required by the actual value(s) stored in the column. If that is the case, why not create all VARCHAR columns as VARCHAR(MAX)? I realize there are reasons involving business rules, but I'm asking more from a technical perspective.

    As Lynn stated, the "MAX" datatypes are BLOBs that are handled quite differently... they can end up being stored "out of row", and there can actually be quite a performance penalty for that.

    MSDN states that, by default, "MAX" data types are stored in-row unless they exceed 8,000 bytes, in which case it appears that only the "remaining" data gets stored out-of-row.

    http://msdn.microsoft.com/en-us/library/ms173530.aspx

    Keep in mind that I'm not advocating defining everything as "MAX". I'm just wondering if there is any technical difference between VARCHAR(300), VARCHAR(5000), and VARCHAR(MAX) where the actual value of the data stored in the column is always less than 8,000 bytes. I'm getting the feeling that setting the number of characters for a VARCHAR column is really just a business rule definition and has no real technical significance.

    Yes, there are technical differences.

    From BOL:

    Columns that are of the large object (LOB) data types ntext, text, varchar(max), nvarchar(max), varbinary(max), xml, or image cannot be specified as key columns for an index. Also, a view definition cannot include ntext, text, or image columns, even if they are not referenced in the CREATE INDEX statement.

    Also, an index key is limited to 900 bytes, so you could index a table on a varchar(300) column but not on a varchar(5000) column.
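
    A quick script that demonstrates both of Lynn's points; the table and index names are made up:

        CREATE TABLE dbo.IndexDemo
        (
            Small VARCHAR(300),
            Big   VARCHAR(5000),
            Lob   VARCHAR(MAX)
        );

        -- Succeeds: 300 bytes fits comfortably under the 900-byte key limit.
        CREATE INDEX IX_Small ON dbo.IndexDemo (Small);

        -- Created, but with a warning: any INSERT or UPDATE that pushes the
        -- key past 900 bytes will fail at run time.
        CREATE INDEX IX_Big ON dbo.IndexDemo (Big);

        -- Fails outright: LOB types cannot be index key columns
        -- (on SQL Server 2005+ they can appear in an INCLUDE list, though).
        CREATE INDEX IX_Lob ON dbo.IndexDemo (Lob);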

  • Lynn Pettis (4/29/2009)


    Yes, there are technical differences.

    From BOL:

    Columns that are of the large object (LOB) data types ntext, text, varchar(max), nvarchar(max), varbinary(max), xml, or image cannot be specified as key columns for an index. Also, a view definition cannot include ntext, text, or image columns, even if they are not referenced in the CREATE INDEX statement.

    Also, an index key is limited to 900 bytes, so you could index a table on a varchar(300) column but not on a varchar(5000) column.

    You mentioned that earlier, sorry for not catching it.

    I've been side-tracked by more pressing projects, but I intend to try all 3 solutions provided here and I'll try to remember to post the results so you guys can see *exactly* how much you helped. I'm sure all 3 solutions are infinitely faster than my old RBAR procedure.

    I'm also going to take a hard look at partitioning the table, but that will require a little more education on my part before I dive into it.

    Thanks again for the help from all of you. I really, really appreciate it. Not only did I get a better solution to my problem, but I learned a few things in the process.

  • Scott Roberts (4/29/2009)


    MSDN states that, by default, "MAX" data types are stored in-row unless they exceed 8,000 bytes, in which case it appears that only the "remaining" data gets stored out-of-row.

    Understood... but there is a performance penalty (depending on what you want to do, of course) even when they stay in row. We found that out during some "splitter" tests on another thread.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm also going to take a hard look at partitioning the table, but that will require a little more education on my part before I dive into it.

    Word of advice: You will find the process of partitioning a table to be fairly straightforward. But you really need to do some analysis and thinking about the distribution of values within your table before deciding on a partition scheme and function. Good luck.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
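
    For anyone following along, here is a sketch of the partition function/scheme/table trio Bob's advice is aimed at. The boundary dates and all names are made up; deciding where those boundaries should fall is exactly the analysis he's describing:

        -- Made-up monthly boundaries; RANGE RIGHT puts each boundary date in
        -- the partition to its right, so each month of CDRs lands together.
        CREATE PARTITION FUNCTION pfMonthly (DATETIME)
        AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301', '20090401');

        -- Simplest mapping: every partition on one filegroup.  Real systems
        -- often spread partitions across filegroups for maintenance reasons.
        CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

        -- The table is then created (or rebuilt) on the scheme, keyed on the date.
        CREATE TABLE dbo.CDR_Partitioned
        (
            CallDate DATETIME     NOT NULL,
            RawData  VARCHAR(400) NOT NULL
        ) ON psMonthly (CallDate);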
