how to find duplicate rows, dynamically, with a loop

  • Sorry, but I'm not familiar with any of that. At the end of the day, I want a process that scans all tables in all databases and identifies duplicates. Obviously it would be crazy to create a bunch of individual queries and enter the column names for each table in each DB. That's the hardest part for me: the dynamic searching.
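
    For what it's worth, here is a minimal sketch of the dynamic part, assuming SQL Server 2017 or later (for STRING_AGG). It loops over every user table in the current database, builds a GROUP BY over all of the table's columns, and returns any rows that occur more than once. It only covers one database, so you would still need an outer loop (or sp_MSforeachdb) to cover them all, and column types that can't appear in a GROUP BY are simply skipped.

        -- Sketch only: loop over every user table in the current database and
        -- report rows that are exact duplicates across all (groupable) columns.
        DECLARE @SchemaName  sysname
               ,@TableName   sysname
               ,@ColumnList  nvarchar(MAX)
               ,@Sql         nvarchar(MAX);

        DECLARE cTables CURSOR LOCAL FAST_FORWARD FOR
            SELECT s.name, t.name
              FROM sys.tables t
              JOIN sys.schemas s ON s.schema_id = t.schema_id;

        OPEN cTables;
        FETCH NEXT FROM cTables INTO @SchemaName, @TableName;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Build a comma-separated list of every column that can be grouped on.
            SELECT @ColumnList = STRING_AGG(CAST(QUOTENAME(c.name) AS nvarchar(MAX)), N',')
              FROM sys.columns c
             WHERE c.object_id = OBJECT_ID(QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName))
               AND TYPE_NAME(c.user_type_id) NOT IN (N'text', N'ntext', N'image', N'xml', N'geography', N'geometry', N'hierarchyid');

            IF @ColumnList IS NOT NULL
            BEGIN
                -- Group on every column; any group with more than one row is a duplicate.
                SET @Sql = N'SELECT ' + @ColumnList + N', DupeCount = COUNT(*)'
                         + N' FROM '     + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
                         + N' GROUP BY ' + @ColumnList
                         + N' HAVING COUNT(*) > 1;';

                EXEC sys.sp_executesql @Sql;
            END;

            FETCH NEXT FROM cTables INTO @SchemaName, @TableName;
        END;

        CLOSE cTables;
        DEALLOCATE cTables;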

  • Jonathan AC Roberts wrote:

    Jeff Moden wrote:

    Jonathan AC Roberts wrote:

    Jeff Moden wrote:

    Of course, that recommendation also comes with the warning that if the hashbytes are different, then the rows are guaranteed to be different. However (and you know this, but I have to say it out loud for others who may read this), if the hashbytes are the same, there is no guarantee that the rows are identical, because of the collisions that can occur with any hashing algorithm. After you've used hashbytes to weed out the true differences, you still have to do the kind of check you're doing to prove the remaining rows are identical.

    If you use SHA2_256 or SHA2_512, the probability of a hash collision between any two different rows in a table is as good as zero. For example, if you are using SHA2_256 and have 4.8×10^(29) rows in your table, there is a probability of 10^(-18) that there will be at least one collision. That row count is way outside the realm of the possible. If you are using SHA2_512 it's even more unlikely: you would need 1.6×10^(68) rows in your table for there to be a 10^(-18) probability of at least one collision.

    https://en.wikipedia.org/wiki/Birthday_problem#Probability_table
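
    As a rough sanity check on where those figures come from, the birthday-problem approximation for the number of rows n that gives a collision probability p with a b-bit hash is:

        n ≈ sqrt(2 × 2^(b) × p)

        SHA2_256 (b = 256):  n ≈ sqrt(2 × 2^(256) × 10^(-18)) ≈ 4.8×10^(29) rows
        SHA2_512 (b = 512):  n ≈ sqrt(2 × 2^(512) × 10^(-18)) ≈ 1.6×10^(68) rows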

    Yes.  Totally agreed about the probability.  If someone wants to trust that ultra-low probability, it'll probably never be wrong but... it's not a guarantee and I just want people to know that.

    To wit, I also put unique constraints on random GUID columns that are supposed to contain unique values in SQL Server, even though the probability of a dupe is also incredibly low, because they're actually Type 4 UUIDs, which are not guaranteed to be globally unique.

    How do you hash all the columns together in a row? What do you do about null values in columns?

    I went back to our production code to double-check and it's a good thing I did, because I remembered it incorrectly. We're not using HASHBYTES; we're using BINARY_CHECKSUM. You can do it dynamically or you can hard-code it to do a checksum on only the columns that actually determine whether rows are duplicated. Here's the snippet of code from one of our production stored procedures.

            ,DupeCheck = BINARY_CHECKSUM
    (
    CollateralDescription
    ,CollateralTypeCode
    ,PropertyAddressBlock
    ,property_type_code
    ,AppraisedValue
    ,AppraisedDate
    ,InsurableValue
    ,RequiredInsuranceCoverage
    ,FloodRequiredFlag
    ,FloodZone
    ,FloodWaiverFlag
    ,LOMAOnFileFlag
    ,WindEscrowStatus
    ,HazardEscrowStatus
    ,FloodEscrowStatus
    ,OccupancyStatus
    ,RequiredFloodCoverageAmount
    ,PropertyGrandfatheredCode
    ,FloodDeterminationDate
    ,CollateralLienPositionCode
    ,PropertyCounty
    ,SalePrice
    ,FloodPartialCode
    ,FloodProgramCode
    ,FloodMapPanelNumber
    ,FloodMapPanelEffectiveDate
    ,FloodCommunityNumber
    ,ConstructionDate
    ,FloodDeterminationReferenceNumber
    ,FloodLOMCEffectiveDate
    ,FloodLOMCReceivedDate
    ,FloodCBRADate
    ,FloodLOMCCaseNumber
    ,CollateralOccupancyStatusCode
    ,SeniorLiensTotalAmount
    ,LastSeniorLiensVerificationDate
    ,OriginalPrincipalAmount
    ,CurrentPrincipalAmount
    ,OriginationDate
    ,TerminationDate
    ,PfiHazardPropertyType
    ,PfiHazardPropertyTypeDate
    ,PfiFloodPropertyType
    ,PfiFloodPropertyTypeDate
    ,PfiWindPropertyType
    ,PfiWindPropertyTypeDate
    ,PfiLiabilityPropertyType
    ,PfiLiabilityPropertyTypeDate
    ,PfiEarthquakePropertyType
    ,PfiEarthquakePropertyTypeDate
    )

    You're correct about HASHBYTES having problems with NULLs (which is what triggered me to go back and look at our actual code), and that's why we had to go the BINARY_CHECKSUM route. Of course, that has a much higher potential for collisions, but it will very quickly disqualify most of the rows that are guaranteed not to be duplicates, either in a table comparison or within a single table.
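
    For anyone wondering how a checksum like that typically gets used to find dupes, here is a hypothetical sketch (the dbo.Collateral table, the CollateralID key, and the short column list are made-up stand-ins, not the production procedure above): join the table to itself on matching checksums to throw away the guaranteed non-duplicates cheaply, then prove the survivors really are identical column by column. The EXISTS/INTERSECT test at the end treats NULLs as equal, which sidesteps the NULL problem for the final comparison.

        -- Hypothetical example: find candidate duplicate pairs by checksum,
        -- then confirm them column by column. Names are illustrative only.
        WITH cteChecksum AS
        (
            SELECT  CollateralID
                   ,DupeCheck = BINARY_CHECKSUM(CollateralDescription, CollateralTypeCode, AppraisedValue)
              FROM  dbo.Collateral
        )
        SELECT  FirstRowID  = c1.CollateralID
               ,SecondRowID = c2.CollateralID
          FROM  cteChecksum c1
          JOIN  cteChecksum c2
            ON  c2.DupeCheck    = c1.DupeCheck      -- cheap filter: identical checksums
           AND  c2.CollateralID > c1.CollateralID   -- report each pair only once
          JOIN  dbo.Collateral t1 ON t1.CollateralID = c1.CollateralID
          JOIN  dbo.Collateral t2 ON t2.CollateralID = c2.CollateralID
         WHERE  EXISTS (SELECT t1.CollateralDescription, t1.CollateralTypeCode, t1.AppraisedValue
                        INTERSECT
                        SELECT t2.CollateralDescription, t2.CollateralTypeCode, t2.AppraisedValue);
        -- INTERSECT compares NULLs as equal, so a matching checksum is only
        -- reported after the columns themselves have been proven identical.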

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
