"detailed analysis" for me is when the root causes are analyzed and solutions are provided to address them (e.g. using validation on the database side, uniqueness constraints, etc.) . You partially described the root causes in the introduction, therefore the formulation afterwards doesn't make sense. The solution you provide is to correct the effects, which needs to be done periodically if the problem is not addressed by design. Otherwise, the post covers the most important aspects, though can be improved.
You make a good point that this needs to be addressed by design, either at database or UI level. The solution depends also on the business scenario and/or application type. In contrast with your example, in case the records come from other system (e.g integration between two systems), then one can still allow for duplicates to be saved into the staging table, while importing the latest record. There is also the possibility to override the existing record or version the records. In some scenarios one can allow users to indicate manually the correct records (e.g. data cleaning systems), while the implemented logic can highlight the correct record or make a recommendation especially when further logic is used in the background (e.g. address validation).
In what concerns the code:
- An artificial primary key can allow to easier delete the duplicates and, besides a date, it can allow to identify in which sequence the records were inserted. It’s in general recommended to have a primary key on tables.
- You can use a combination of count and rank windows functions to highlight the duplicates, respectively the correct record.
- Your logic for deleting the records is not “deterministic” – it doesn’t make sure that between consecutive runs the same record will be deleted. This because the order in which the records are retrieved is not guaranteed. You need to specify further columns in the ORDER BY clause to make the logic deterministic (e.g. primary key, timestamp or a combination of both). Further discussions could be made on which record to keep (typically the last).
- Duplicates can refer to whole records or groups of columns. It’s useful to make explicit upfront which columns are supposed to form a unique key.
- As you used two methods to identify the duplicates, some might find it useful to understand the key differences between them and which one would you recommend and why.
- I think it would be interesting to show how one could delete the records via the GROUP BY, as there are 1-2 tricks one could use, especially when is needed to select the last record. This reminds me about the first/last windows functions recently introduced in SQL Server, where the handling of NULLs can involve some challenges.
- Some systems don’t delete the records but move them to special tables or use statuses.
I ask myself how the Compliance Id is generated. Something is fishy if this led to duplicates.