Handling the Duplication of Records

  • I cannot count the number of times our developers have stated that "business logic belongs in the application". Our developers know there are multiple instances of our application, fronted by multiple stateless web servers. They know (or should know) that multiple app instances can easily insert duplicate data. It does not matter if I insist on unique constraints or natural primary keys - I get blank stares or weary sighs. Changing the application binaries is too much work, it seems. As a result, Support sees many tickets to "remove duplicates", even though the database server has no formal declaration for the word "duplicate".

    Even if I lower my standards and suggest IGNORE_DUP_KEY as a temporary fix, I still occasionally face fear and loathing. To allay that fear, I explain that IGNORE_DUP_KEY raises a warning, not an error. Not that I see our developers doing anything with errors (outside of fatal ones), other than punting them into a log file (thus they are permissibly ignored, at least until an end user complains and Support goes dumpster diving). Such behavior must be the Agile way, I suppose. When I do get buy-in to use IGNORE_DUP_KEY, I let the developers know that IGNORE_DUP_KEY is a temporary fix, and that they are (at least) wasting IO. The funny thing is, in some cases IGNORE_DUP_KEY is still there after years.
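
    For reference, a minimal sketch of such an index, using hypothetical table and column names - the only point here is the WITH (IGNORE_DUP_KEY = ON) option:

    -- Hypothetical table, for illustration only.
    CREATE TABLE dbo.CustomerEmail
    (
        CustomerEmailId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerId      int           NOT NULL,
        EmailAddress    nvarchar(256) NOT NULL
    );

    -- With IGNORE_DUP_KEY = ON, an insert that would violate the unique index
    -- discards the offending row and raises the warning
    -- "Duplicate key was ignored." instead of an error.
    CREATE UNIQUE NONCLUSTERED INDEX UX_CustomerEmail_CustomerId_EmailAddress
        ON dbo.CustomerEmail (CustomerId, EmailAddress)
        WITH (IGNORE_DUP_KEY = ON);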

    Yes, there are more rigorous approaches I could take, such as negotiating with two or more developer teams, each using ORMs, who vaguely point at each other as being "the" root cause. I can only hope one of those teams decides to design their shared table correctly. Being on the operations side, I have no say in how many points a development team assigns such issues in their sprints (which I suspect is why IGNORE_DUP_KEY is still present).

  • Why don't the developers just lobby to use only JSON files? They are a lot cheaper. The developers can write the application to do what they usually ask the database to do, which is not much. I am sure there are some good development teams that actually use the database to provide data integrity. The only places I see this are shops where the client supports MSSS and Oracle.

  • "Detailed analysis", for me, is when the root causes are analyzed and solutions are provided to address them (e.g. validation on the database side, uniqueness constraints, etc.). You partially described the root causes in the introduction, so the formulation afterwards doesn't make sense. The solution you provide corrects the effects, which needs to be done periodically if the problem is not addressed by design. Otherwise, the post covers the most important aspects, though it can be improved.

    You make a good point that this needs to be addressed by design, either at the database or the UI level. The solution also depends on the business scenario and/or application type. In contrast with your example, when the records come from another system (e.g. an integration between two systems), one can still allow duplicates to be saved into the staging table while importing only the latest record. There is also the possibility to overwrite the existing record or to version the records. In some scenarios one can let users manually indicate the correct records (e.g. data cleaning systems), while the implemented logic can highlight the correct record or make a recommendation, especially when further logic is used in the background (e.g. address validation).
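
    A minimal sketch of that staging pattern, with hypothetical table and column names (dbo.Staging_Compliance, dbo.Compliance, LoadedOn, StagingId and SomeValue are assumptions) - duplicates stay in staging, and only the latest version per key is merged into the target:

    ;WITH LatestPerKey AS
    (
        SELECT  s.ComplianceId,
                s.SomeValue,
                ROW_NUMBER() OVER (PARTITION BY s.ComplianceId
                                   ORDER BY s.LoadedOn DESC, s.StagingId DESC) AS rn
        FROM    dbo.Staging_Compliance AS s
    )
    MERGE dbo.Compliance AS tgt
    USING (SELECT ComplianceId, SomeValue FROM LatestPerKey WHERE rn = 1) AS src
        ON tgt.ComplianceId = src.ComplianceId
    WHEN MATCHED THEN
        UPDATE SET tgt.SomeValue = src.SomeValue   -- overwrite with the latest version
    WHEN NOT MATCHED THEN
        INSERT (ComplianceId, SomeValue)
        VALUES (src.ComplianceId, src.SomeValue);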

    Regarding the code:

    - An artificial primary key makes it easier to delete the duplicates and, alongside a date column, it helps identify the sequence in which the records were inserted. It is generally recommended to have a primary key on every table.

    - You can use a combination of the COUNT and RANK window functions to highlight the duplicates and, respectively, the record to keep (see the sketch after this list).

    - Your logic for deleting the records is not "deterministic": it does not guarantee that the same record will be deleted between consecutive runs, because the order in which the records are retrieved is not guaranteed. You need to specify further columns in the ORDER BY clause to make the logic deterministic (e.g. the primary key, a timestamp, or a combination of both). Further discussion could be had on which record to keep (typically the last).

    - Duplicates can refer to whole records or only to groups of columns. It's useful to make explicit upfront which columns are supposed to form a unique key.

    - As you used two methods to identify the duplicates, some might find it useful to understand the key differences between them, which one you would recommend, and why.

    - I think it would be interesting to show how one could delete the records via GROUP BY, as there are one or two tricks one could use, especially when the last record needs to be selected (a GROUP BY variant is included in the sketch after this list). This reminds me of the first/last window functions recently introduced in SQL Server, where the handling of NULLs can involve some challenges.

    - Some systems don’t delete the records but move them to special tables or use statuses.

    I do wonder how the Compliance Id is generated; something is fishy if it led to duplicates.

    Best regards,

    sql-troubles.blogspot.com

  • I like the approach illustrated here. In the past I have used the dedup-and-rename method to clean up the records, i.e.:

    SELECT <fieldlist> INTO <tmp_tablename> FROM <tablename> GROUP BY <fieldlist>

    EXEC sp_rename 'tablename', 'dump_tablename';

    EXEC sp_rename 'tmp_tablename', 'tablename';

    Of course, you have to suspend any foreign key constraints to flip it over, so this can only be done during a maintenance window (one way to do that is sketched below).
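
    A minimal sketch of handling the constraints, with hypothetical table and constraint names; note that constraints stay attached to the original objects through sp_rename, so foreign keys defined on or referencing the old table may instead need to be dropped and recreated against the new one:

    -- Disable checking on a hypothetical referencing table before the swap.
    ALTER TABLE dbo.OrderLine NOCHECK CONSTRAINT FK_OrderLine_Order;

    -- ... SELECT ... INTO and the sp_rename swap happen here ...

    -- Re-enable WITH CHECK so existing rows are re-validated and the
    -- constraint is marked as trusted again.
    ALTER TABLE dbo.OrderLine WITH CHECK CHECK CONSTRAINT FK_OrderLine_Order;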
