How to Remove Duplicates from my table

  • Hi All,

    Hope you are doing great. Here I have a problem and I am sure you will help me out.

    I have a table with full of duplicates. I have the following fields in my table.

    Contactid (Unique), FirstName, LastName, Officerid, PhoneNo email, cellphone, address1

    select contactid,FirstName,lastname,officerid,

    email,,HomeAddress1

    from contacts

    where firstname='Niladri' and lastname='Biswas'

    and officerid='jfortw'

    order by contactid desc

    If I run the above query I am getting 6 records which is fine. I am getting the 6 unique contact id also. I need to remove the all the 5 duplicates except the first one. As the first contactid is newest one which is inserted into the table.

    Before I remove the records I need to update the other info like phoneno, email, address1 from other IDs.

    Now I have 6 contact Ids like 1, 2,3,4,5 and 6. I will show only the reocrd 6 but it will update phoneno, email, address1 from other ids. for example if id 1 has the phone no then it will update the phone no of contact 6, if id 2 has the address then it will update the address of contact 6 and go on...

    Lastly it will keep the records only 6 contact id and then delete all the other records.

    The records are unique based on OfficerID, LastName,FirstName

    I need to do same task for all the unique records. what is the best way to clean the table.

    Please guide me

    Thanks in advance

    Niladri

  • niladri.primalink (3/4/2016)


    Hi All,

    Hope you are doing great. Here I have a problem and I am sure you will help me out.

    I have a table with full of duplicates. I have the following fields in my table.

    Contactid (Unique), FirstName, LastName, Officerid, PhoneNo email, cellphone, address1

    select contactid,FirstName,lastname,officerid,

    email,,HomeAddress1

    from contacts

    where firstname='Niladri' and lastname='Biswas'

    and officerid='jfortw'

    order by contactid desc

    If I run the above query I am getting 6 records which is fine. I am getting the 6 unique contact id also. I need to remove the all the 5 duplicates except the first one. As the first contactid is newest one which is inserted into the table.

    Before I remove the records I need to update the other info like phoneno, email, address1 from other IDs.

    Now I have 6 contact Ids like 1, 2,3,4,5 and 6. I will show only the reocrd 6 but it will update phoneno, email, address1 from other ids. for example if id 1 has the phone no then it will update the phone no of contact 6, if id 2 has the address then it will update the address of contact 6 and go on...

    Lastly it will keep the records only 6 contact id and then delete all the other records.

    The records are unique based on OfficerID, LastName,FirstName

    I need to do same task for all the unique records. what is the best way to clean the table.

    Please guide me

    Thanks in advance

    Niladri

    When you post questions like this it is much appreciated when you help us help you by providing DDL for your table in the form of a CREATE TABLE statement plus all indexes and constraints -and- DML to create test data in that table in the form of INSERT INTO statements so folks can easily test in their sandbox environment and provide you with a working and tested code-solution that matches your environment.

    Is it safe to assume there are other address columns in the table, e.g. city, state, zip? And will those all need to be moved forward together? For example if row 6 in your example has a HomeAddress value but not City but Row 3 has City would you want to blend the City from one row with the HomeAddress of another risking creating an incorrect address in row 6 instead of just having an incomplete one? What other columns in the table would you want this solution to handle bringing data forward for?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Hi,

    Thanks fro you reply. Next time I will be careful. Here are all the columns the table I have

    contactid,FirstName,lastname,officerid,

    email,Homephone,workphone,Fax,HomeAddress1,HomeAddress2,Homecity,HomeState,

    HomeZip,Title,workaddress1,workaddress2,workcity,workcity,workstate,workzip

    Hope it will help you. Please guide me.

    Also let me know if you need any other info

  • niladri.primalink (3/4/2016)


    Hi,

    Thanks fro you reply. Next time I will be careful. Here are all the columns the table I have

    contactid,FirstName,lastname,officerid,

    email,Homephone,workphone,Fax,HomeAddress1,HomeAddress2,Homecity,HomeState,

    HomeZip,Title,workaddress1,workaddress2,workcity,workcity,workstate,workzip

    Hope it will help you. Please guide me.

    Also let me know if you need any other info

    Nice try, but please re-read my post. You can get the DDL from SSMS Object Explorer. Plus we will need some INSERT statements to create some test data, from you.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • OK..Thanks got it. Here is the query which will help

    /****** Object: Table [dbo].[contacts2] Script Date: 3/4/2016 3:26:55 PM ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    SET ANSI_PADDING ON

    GO

    CREATE TABLE [dbo].[contacts2](

    [ContactID] [int] IDENTITY(1,1) NOT NULL,

    [OfficerID] [varchar](8) NULL,

    [FirstName] [varchar](50) NULL,

    [LastName] [varchar](75) NULL,

    [Email] [varchar](80) NULL,

    [HomePhone] [varchar](25) NULL,

    [CellPhone] [varchar](25) NULL,

    [WorkPhone] [varchar](25) NULL,

    [HomeAddress1] [varchar](80) NULL,

    [HomeAddress2] [varchar](50) NULL,

    [HomeCity] [varchar](50) NULL,

    [HomeState] [varchar](2) NULL,

    [HomeZip] [varchar](10) NULL,

    [WorkAddress1] [varchar](80) NULL,

    [WorkAddress2] [varchar](50) NULL,

    [WorkCity] [varchar](50) NULL,

    [WorkState] [varchar](2) NULL,

    [WorkZip] [varchar](10) NULL,

    CONSTRAINT [PK_Contacts2] PRIMARY KEY CLUSTERED

    (

    [ContactID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    GO

    SET IDENTITY_INSERT [dbo].[contacts2] ON

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (1, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (2, N'jfortq', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (3, N'jfortm', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (4, N'jfortn', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', N'Irvine', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (5, N'jfortt', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618', NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (6, N'jfortp', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'', N'', NULL, N'', NULL, NULL, NULL, NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618')

    GO

    SET IDENTITY_INSERT [dbo].[contacts2] OFF

    GO

    let me know if you need anything from me

    Again thanks for your reply

    Niladri

  • Great, thanks. From my earlier post on how to handle address data, using your sample data what do you want retained in the HomeAddress of the surviving row? Would you simply want the complete address from Contact 5 or would you want HomeAddress1, HomeCity, HomeState and HomeZip from Contact 5 and HomeAddress2 from Contact 4 blended together? Or something else?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • if there is address is Available in last row (in this case row 6) we don't need to update. But if the row no 6 has not the address and the row no 2, 4, 5 has the address then we can update from any of the value. if the the row 2 has the city name but not the address and row no 5 has the address not the city name then we need the city name from row 2 and address from row 5. We need the complete info for row no 6 if the other row has the value . if there is nothing we can omit. After the updating it will delete all the rows except the latest one (in that case row no 6)

    we need to go other example after that.

    Thanks a lot for using your valuable time for me.

  • niladri.primalink (3/4/2016)


    if there is address is Available in last row (in this case row 6) we don't need to update. But if the row no 6 has not the address and the row no 2, 4, 5 has the address then we can update from any of the value. if the the row 2 has the city name but not the address and row no 5 has the address not the city name then we need the city name from row 2 and address from row 5. We need the complete info for row no 6 if the other row has the value . if there is nothing we can omit. After the updating it will delete all the rows except the latest one (in that case row no 6)

    we need to go other example after that.

    Thanks a lot for using your valuable time for me.

    Your sample data is throwing me off. All OfficerID values are different. I think for a valid example you need a set where they are all the same, along with FirstName and LastName. Correct?

    Can you please post a new set of INSERT statements along with the expected results from that sample data?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • My Sample data is looking like below. Sorry image is not working in my system.

    contactidFirstNamelastnameofficeridemailcellphoneHomephoneworkphoneHomeAddress1HomeAddress2HomecityHomeStateHomeZipworkaddress1workaddress2workcityworkcityworkstateworkzip

    1NiladriBiswasjfortwniladrri@gmail.comNULL257-098-3456NULLNULLNULLNULLNULLNULLNULLNULLNULLNULLNULLNULL

    2NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL

    3NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLNULLNULLNULLNULLNULLNULLNULLNULLNULL

    4NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120IrvineNULLNULLNULLNULLNULLNULLNULLNULLNULL

    5NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120NULLIrvineCA92618NULLNULLNULLNULLNULLNULL

    6NiladriBiswasjfortwniladrri@gmail.comNULLNULLNULLNULLNULL114 Pacifica Ste 120NULLIrvineIrvineCA92618

    My output should like this

    contactidFirstNamelastnameofficeridemailcellphoneHomephoneworkphoneHomeAddress1HomeAddress2HomecityHomeStateHomeZipworkaddress1workaddress2workcityworkcityworkstateworkzip

    6NiladriBiswasjfortwniladrri@gmail.com232-567-879257-098-3456NULL114 Pacifica Ste 120IrvineIrvineCA92618114 Pacifica Ste 120NULLIrvineIrvineCA92618

    Here is the insert query

    SET IDENTITY_INSERT [dbo].[contacts2] ON

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (1, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (2, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (3, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (4, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', N'Irvine', NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (5, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'257-098-3456', N'232-567-879', NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618', NULL, NULL, NULL, NULL, NULL)

    GO

    INSERT [dbo].[contacts2] ([ContactID], [OfficerID], [FirstName], [LastName], [Email], [HomePhone], [CellPhone], [WorkPhone], [HomeAddress1], [HomeAddress2], [HomeCity], [HomeState], [HomeZip], [WorkAddress1], [WorkAddress2], [WorkCity], [WorkState], [WorkZip]) VALUES (6, N'jfortw', N'Niladri', N'Biswas', N'niladrri@gmail.com', N'', N'', NULL, N'', N'', NULL, NULL, NULL, N'114 Pacifica Ste 120', NULL, N'Irvine', N'CA', N'92618')

    GO

    SET IDENTITY_INSERT [dbo].[contacts2] OFF

    GO

    Yes you are right we need to consolidate records. the records are unique based on OfficerID, LastName,FirstName

    Again thanks.

    Niladri

  • A word of caution, this query will not perform well over large datasets. Your data needs to be processed in a partitioned fashion (by OfficerID, FirstName and LastName) and then within that row-by-row, per-column. What I handed over is basically a cursor masquerading as set-based code but I could not come up with another way to get there short of writing an actual cursor (I tried a variation of OUTER APPLY as well but this actually did better over a large dataset). Maybe someone else will come along and provide a better way to solve this one. I think we did a good job of getting to the bottom of your requirement at least.

    You did not provide any indexes with your table definition. Does that mean there are none or that you just did not provide them? If there is no index where the leading edge consists of OfficerID, FirstName and LastName in some order you should consider creating one to support this query.

    [font="Courier New"]WITH cte AS (SELECT

                    MAX(ContactID) AS ContactID,

                    OfficerID,

                    FirstName,

                    LastName

                 FROM dbo.contacts2

                 GROUP BY OfficerID,

                    FirstName,

                    LastName)

        SELECT ContactID,

               OfficerID,

               FirstName,

               LastName,

               (SELECT TOP 1 HomePhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomePhone > '' ORDER BY ContactID DESC) AS HomePhone,

               (SELECT TOP 1 CellPhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND CellPhone > '' ORDER BY ContactID DESC) AS CellPhone,

               (SELECT TOP 1 WorkPhone FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND WorkPhone > '' ORDER BY ContactID DESC) AS WorkPhone,

               (SELECT TOP 1 HomeAddress1 FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeAddress1 > '' ORDER BY ContactID DESC) AS HomeAddress1,

               (SELECT TOP 1 HomeAddress2 FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeAddress2 > '' ORDER BY ContactID DESC) AS HomeAddress2,

               (SELECT TOP 1 HomeCity FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeCity > '' ORDER BY ContactID DESC) AS HomeCity,

               (SELECT TOP 1 HomeState FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeState > '' ORDER BY ContactID DESC) AS HomeState,

               (SELECT TOP 1 HomeZip FROM dbo.contacts2 WHERE OfficerID = cte.OfficerID AND FirstName = cte.FirstName AND LastName = cte.LastName AND HomeZip > '' ORDER BY ContactID DESC) AS HomeZip

        FROM cte;

    [/font]

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • I'm very cautious about automatically updating records like this, because there is so much variation for entering the same information. For instance, addresses with apartment numbers. If you have records like

    100 Main St, Apt 2A | | Anywhere | PA | 191xx

    100 Main St | Apt 2A | Anywhere | PA | 191xx

    After merging, you will get

    100 Main St, Apt 2A | Apt 2A | Anywhere | PA | 191xx

    with the apartment number appear in both lines 1 and 2 of the address.

    I also had one person sign his name in each of the following ways:

    First Middle Last

    F. Middle Last

    F. M. Last

    I would constantly have to merge his records together, because people would reintroduce records with one of the abbreviated forms.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (3/4/2016)


    I'm very cautious about automatically updating records like this, because there is so much variation for entering the same information. For instance, addresses with apartment numbers. If you have records like

    100 Main St, Apt 2A | | Anywhere | PA | 191xx

    100 Main St | Apt 2A | Anywhere | PA | 191xx

    After merging, you will get

    100 Main St, Apt 2A | Apt 2A | Anywhere | PA | 191xx

    with the apartment number appear in both lines 1 and 2 of the address.

    I also had one person sign his name in each of the following ways:

    First Middle Last

    F. Middle Last

    F. M. Last

    I would constantly have to merge his records together, because people would reintroduce records with one of the abbreviated forms.

    Drew

    I had the same concern in my earlier line of questions but it seems it is not a problem for the OP to blend the columns across rows. Even in the sample data you can see where HomeAddress2 has a data quality issue that will result in a city name appearing twice on a printed mailing label. Luckily this requirement does not involve names but it has the same type of pitfalls as you point out.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Hi Thanks for your reply...

    It is absolutely the same I am looking for. Thanks a lot. I will update the table from cte. Is there any chance to delete the other records ? I want to automate this through SSIS.

    But like you I am also wondering. there are almost 43 lakhs records in that table and unique is almost 31 lakhs (comination of firstname,lastname and officerid ).

    Again thanks a lot.

  • niladri.primalink (3/4/2016)


    Hi Thanks for your reply...

    It is absolutely the same I am looking for. Thanks a lot. I will update the table from cte. Is there any chance to delete the other records ? I want to automate this through SSIS.

    But like you I am also wondering. there are almost 43 lakhs records in that table and unique is almost 31 lakhs (comination of firstname,lastname and officerid ).

    Again thanks a lot.

    You might want to persist the results from the CTE to a #temp table so you can not only use it to update what will be the surviving rows but also to delete all other rows where the ContactID is not in the #temp table. You may also want to consider only deleting rows with a ContactID less than the max value in the #temp table in case new rows enter the permanent table after you start processing.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Hi Orlando,

    your query works like a magic. Thanks a lot. What I have done is that I have taken the records into a new table and then match the records with the contacts and then Updated the base table and after that I have deleted the duplicates.

    Thanks a lot. Definitely I will love to take help from you later.

    Regards,

    Niladri

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply