need to find duplicate rows like same entries more than one time

  • i have a table like below

    create table staff_attendance

    (

    attendance_id int,

    attendace_date datetime,

    staff_id int,

    working_year int,

    hours int

    )

    values like

    1 2014-06-30 00:00:00.0ST10121

    2 2014-06-30 00:00:00.0ST10122

    3 2014-06-30 00:00:00.0ST10122 ----same entry like previous one

    4 2014-07-01 00:00:00.0ST10121

    5 2014-07-01 00:00:00.0ST10122

    6 2014-07-02 00:00:00.0ST10121

    7 2014-07-02 00:00:00.0ST10122

    8 2014-06-30 00:00:00.0ST10221

    9 2014-06-30 00:00:00.0ST10222

    10 2014-07-01 00:00:00.0ST1022 1

    11 2014-07-01 00:00:00.0ST102 22

    12 2014-07-02 00:00:00.0ST102 21

    13 2014-07-02 00:00:00.0ST102 22

    I Need to find the duplicate rows like same entries which is having more than 1 rows.... how do i find???

  • Quick code to get you passed the hurdle

    😎

    USE tempdb;

    GO

    IF OBJECT_ID('dbo.staff_attendance') IS NOT NULL

    DROP TABLE dbo.staff_attendance;

    create table dbo.staff_attendance

    (

    attendance_id int

    ,attendace_date datetime

    ,staff_id VARCHAR(6)

    ,working_year int

    ,[hours] int

    );

    --attendance_id,attendace_date,staff_id,working_year,[hours]

    INSERT INTO dbo.staff_attendance (attendance_id,attendace_date,staff_id,working_year,[hours])

    VALUES

    (1 ,'2014-06-30 00:00:00.0','ST101',2,1)

    ,(2 ,'2014-06-30 00:00:00.0','ST101',2,2)

    ,(3 ,'2014-06-30 00:00:00.0','ST101',2,2) ----same entry like previous one

    ,(4 ,'2014-07-01 00:00:00.0','ST101',2,1)

    ,(5 ,'2014-07-01 00:00:00.0','ST101',2,2)

    ,(6 ,'2014-07-02 00:00:00.0','ST101',2,1)

    ,(7 ,'2014-07-02 00:00:00.0','ST101',2,2)

    ,(8 ,'2014-06-30 00:00:00.0','ST102',2,1)

    ,(9 ,'2014-06-30 00:00:00.0','ST102',2,2)

    ,(10 ,'2014-07-01 00:00:00.0','ST102',2,1)

    ,(11 ,'2014-07-01 00:00:00.0','ST102',2,2)

    ,(12 ,'2014-07-02 00:00:00.0','ST102',2,1)

    ,(13 ,'2014-07-02 00:00:00.0','ST102',2,2);

    SELECT

    SA.attendance_id

    ,SA.attendace_date

    ,SA.staff_id

    ,SA.working_year

    ,SA.[hours]

    ,COUNT(SA.attendance_id) OVER

    (

    PARTITION BY SA.attendace_date

    ,SA.staff_id

    ,SA.working_year

    ,SA.[hours]

    ) AS INSTANCE_COUNT

    ,ROW_NUMBER() OVER

    (

    PARTITION BY SA.attendace_date

    ,SA.staff_id

    ,SA.working_year

    ,SA.[hours]

    ORDER BY (SA.attendance_id)

    ) AS INSTANCE_RID

    FROM dbo.staff_attendance SA

    Results

    attendance_id attendace_date staff_id working_year hours INSTANCE_COUNT INSTANCE_RID

    ------------- ----------------------- -------- ------------ ----------- -------------- --------------

    1 2014-06-30 00:00:00.000 ST101 2 1 1 1

    2 2014-06-30 00:00:00.000 ST101 2 2 2 1

    3 2014-06-30 00:00:00.000 ST101 2 2 2 2

    8 2014-06-30 00:00:00.000 ST102 2 1 1 1

    9 2014-06-30 00:00:00.000 ST102 2 2 1 1

    4 2014-07-01 00:00:00.000 ST101 2 1 1 1

    5 2014-07-01 00:00:00.000 ST101 2 2 1 1

    10 2014-07-01 00:00:00.000 ST102 2 1 1 1

    11 2014-07-01 00:00:00.000 ST102 2 2 1 1

    6 2014-07-02 00:00:00.000 ST101 2 1 1 1

    7 2014-07-02 00:00:00.000 ST101 2 2 1 1

    12 2014-07-02 00:00:00.000 ST102 2 1 1 1

    13 2014-07-02 00:00:00.000 ST102 2 2 1 1

  • I think you're looking for something like this:

    SELECT

    attendace_date

    , staff_id

    , working_year

    , hours

    , COUNT(1)

    FROM

    staff_attendance

    GROUP BY

    attendace_date

    , staff_id

    , working_year

    , hours

    HAVING

    COUNT(1) > 1

    ;

    This is the standard query for this type of thing.

    Cheers!

  • If we define duplicate rows as rows where the following attributes are equal: attendace_date, staff_id, working_year, [hours] and that the original row within them is the row with the smallest attendance_id, the following statement identifies duplicate rows (you see the attendance_id, so you can easily remove them if needed):

    ;WITH cte AS

    (SELECT attendance_id, attendace_date, staff_id, working_year, [hours],

    ROW_NUMBER() OVER(PARTITION BY attendace_date, staff_id, working_year, [hours]

    ORDER BY attendance_id ASC) rn

    FROM dbo.staff_attendance

    )

    SELECT * FROM cte WHERE rn > 1;

    Note: the statement uses the table defined by Eirikur Eiriksson. Thanks to Erikur and next time please use the same apporach to define and populate tables by yourself.

    Another note: When you want to remove duplicates you need to define them first. This sounds maybe trivial, but sometimes your perception of duplicates does not correspond to the one of developers or business guys. Simple define when two rows are duplicates (i.e. which columns must be equal) and also what is the original row within them (the smallest Id, or highest Id, timestamp whatever). This will take not more than a minute but can avoid potential misunderstandings and misinterpretations.

    ___________________________
    Do Not Optimize for Exceptions!

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply