Help with de-duplicating records in a table and updating the records to reflect eliminated dups

  • Hello everybody,

    I have a set of roughly 1.2 million records in a table. The records came from multiple sources. It is a list of automotive businesses in the US. The problem is that some of duplicates appear on surface as unique records because they have a different categoryid. Example - a shop can be listed in a tire shop category and car repair category with all attributes being exactly the same.

    I have identified all the duplicates using a CTE. I would like to eliminate the duplicates, but mark the record that is left over with both categories (sometimes it is more than just a duplicate - I have triplicates and quadruplicates as well).

    DDL

    CREATE TABLE [dbo].[AYP](

    [Company] [nvarchar](255) NULL,

    [Business Street] [nvarchar](255) NULL,

    [Business City] [nvarchar](255) NULL,

    [Business State] [nvarchar](255) NULL,

    [Business Postal Code] [nvarchar](50) NULL,

    [Business Country] [nvarchar](255) NULL,

    [Business Phone] [nvarchar](255) NULL,

    [Business Fax] [nvarchar](255) NULL,

    [E-mail Address] [nvarchar](255) NULL,

    [Web Page] [nvarchar](255) NULL,

    [CategoryID] [int] NULL

    ) ON [PRIMARY]

    GO

    CREATE TABLE [dbo].[CT_Category](

    [CategoryID] [int] NOT NULL,

    [CategoryName] [varchar](50) NULL,

    CONSTRAINT [PK_CT_Category] PRIMARY KEY CLUSTERED

    (

    [CategoryID] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    SET ANSI_PADDING OFF

    GO

    ALTER TABLE [dbo].[CT_Category] WITH CHECK ADD CONSTRAINT [FK_CT_Category_CT_Category] FOREIGN KEY([CategoryID])

    REFERENCES [dbo].[CT_Category] ([CategoryID])

    GO

    ALTER TABLE [dbo].[CT_Category] CHECK CONSTRAINT [FK_CT_Category_CT_Category]

    GO

    Data example:

    CompanyBusiness StreetBusiness CityBusiness StateBusiness Postal CodeBusiness CountryBusiness PhoneBusiness FaxE-mail AddressWeb PageCategoryIDRow

    C & C J'S Car Wash4031 1st Avenue NorthBirminghamAL35222USA(205) 202-4095NULLNULLNULL72

    C & C J'S Car Wash4031 1st Avenue NorthBirminghamAL35222USA(205) 202-4095NULLNULLNULL23

    I would like eliminate row 3 and alter the categoryID to 2,7 on row 2 to account for the fact that shop belongs to two categories - categoryID 2 and categoryID 7. My thought is to use XML path, but I am having trouble coding it.

    thanks for input/insight,

    Petr

  • Petr

    This isn't possible - CategoryID is int, so you can't change it to 2,7. Even if you could, I'd advise you not to - you're breaking the laws of normalisation and it'll cause problems in the future. What you should do instead is this:

    (1) Add a column to Company (CompanyID?) that uniquely identifies the rows

    (2) Take the CategoryID column out of Company

    (3) Create a new table called CompanyCategory with columns CategoryID and CompanyID and the appropriate foreign key constraints

    John

  • I hear you and had the same thoughts. The problem is that once this data is cleaned up, it is going into Salesforce and the only way to use their multiple selection picklists is to serialize the choices in one field. I can convert the int categoryid into let's say varchar(50) to accomodate the comma separated values.

    I have very little to no control of what I can do inside of salesforce, so I need to fit/massage the data outside of SF and then upload it into the Salesforce application,

    thanks,

    Petr

  • Petr

    In that case, yes, change CategoryID to varchar something, and then use something like this page[/url] for tips on how to perform your concatenation.

    John

  • thank you John, you have been most helpful,

    this community is one of the greatest out there. I always find tons of great and knowledgeable folks ready to help on this site.

    thanks again,

    Petr

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply