Avoiding duplicates from file import

  • Hi

    I am building an automated process to import delimited text file data into a master table. As part of the process I can archive the file, but I can't prevent a user adding the same file again to the source file location. In fact, a file with an updated name could be uploaded and contain the same data. My table contains an identity field, the text file data does not, so the Id is created at import. The files don't contain a large amount of data: fewer than 100 columns and a couple of hundred records in each.

    What would be a valid approach to prevent duplicate records being added to the master data table? I could allow them to be added and then remove them, but I would prefer to prevent them being added in the first place.

    Any suggestions welcome

    Gerry

  • There's not much to go on, but is there any form of natural key in the incoming data? Not your identity, but maybe a 4-column combination that identifies a unique record, or something similar?

    The reason I ask is that if there is, you can build a checksum as a persisted computed column on the master table. When new data comes in, you check the natural key and the checksum from the file against what's already in the table. If they match, you have a duplicate and can send it down a different logic path.

    One way or another, though, you end up checking the inbound row against the existing data. You need a way to identify a unique row in the inbound file, or you have to do full column-to-column checks for the entire row.
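
    A minimal sketch of the checksum idea, assuming hypothetical table names (tbl_Master, tbl_Staging) and natural-key columns (ColA to ColD) - swap in whatever identifies a unique record in your data:

    -- Hypothetical names for illustration only.
    -- Persisted computed checksum over the natural key on the master table.
    ALTER TABLE dbo.tbl_Master
        ADD NaturalKeyChecksum AS CHECKSUM(ColA, ColB, ColC, ColD) PERSISTED;

    CREATE NONCLUSTERED INDEX IX_Master_NaturalKeyChecksum
        ON dbo.tbl_Master (NaturalKeyChecksum);

    -- Flag inbound rows that already exist: match on the checksum first (cheap),
    -- then confirm on the actual key columns (CHECKSUM can collide).
    SELECT s.*
    FROM   dbo.tbl_Staging AS s
    JOIN   dbo.tbl_Master  AS m
      ON   m.NaturalKeyChecksum = CHECKSUM(s.ColA, s.ColB, s.ColC, s.ColD)
     AND   m.ColA = s.ColA
     AND   m.ColB = s.ColB
     AND   m.ColC = s.ColC
     AND   m.ColD = s.ColD;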


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Cheers Craig, thanks for the quick response.

    Logically, I figured as much. Since I can't control the data coming in, I'd have to check against each existing record.

  • I'd import the text file into a staging table and then run a merge statement to copy any new data into the live table.

    Maybe you can get the SSIS package to read in just the first line from the text file, query the live table to see if it's there, and then abandon the import process, but I can only see this providing a real benefit if the text file is huge and takes a significant amount of time to import.
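
    A minimal sketch of the staging-plus-MERGE approach, assuming hypothetical table names (tbl_Staging, tbl_Live) and key columns (ColA, ColB) - adjust to whatever uniquely identifies a row in your data:

    -- Hypothetical names; only rows not already in the live table get inserted.
    MERGE dbo.tbl_Live AS tgt
    USING dbo.tbl_Staging AS src
        ON  tgt.ColA = src.ColA
        AND tgt.ColB = src.ColB
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ColA, ColB, ColC)
        VALUES (src.ColA, src.ColB, src.ColC);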

  • I'd definitely use a staging table whenever importing from a text file, but the question of duplicates still remains. I'd do exactly as Craig suggested - find a combination of columns that makes a row unique and then insert those rows from the staging table into the production table where they don't already exist.

  • Cheers Daniel,

    A holding table is the way to go for sure. I was going to need that for other validation anyway.

    Craig's idea of combining fields to make a key field is also going to work. I think I can restructure the data so there are only 4-6 key fields in the import file.

    Here's the process then:

    1. Import to a dump table.

    2. Create the index and move to the holding table.

    3. Do a SELECT join between the holding and data tables on the new index, to see whether any records are already in the data table.

    4. Delete the ones which already exist from the data table.

    5. Append from the holding table (this ensures the most current data is in the data file).

    6. Empty the holding table.

    I think this can be done with only SQL (no iterating through individual records!)

  • What's the difference between the dump table and the holding table? It looks like you're importing the data into one table just so you can index it and then move it somewhere else.

    You may be over-thinking this a little. Just have a holding table with a surrogate key field and use that for the clustered index (the data is stored in the order of the clustered index; if your clustered index is made up of several fields then SQL Server has to keep reordering the data as it imports the records). You can have more indexes on the holding table for the actual key fields.

    Once you have the holding table and indexes created, you can leave it there and just truncate it as and when you need to import data. There's no need to drop and recreate the indexes; SQL Server will keep them up to date as it goes along. For a few hundred records this isn't a problem; if there were thousands of records I'd consider disabling the index and rebuilding it after the import has completed.

    The rest of your procedure (steps 3 to 6) seems fine to me. You can combine steps 3 and 4 into one SQL query once you have the logic worked out.
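
    As a very rough sketch of combining steps 3 and 4 into a single statement (placeholder table and column names only - tbl_Data, tbl_Holding and D1 to D5 are stand-ins for whatever your key fields end up being):

    -- Steps 3 and 4 combined: delete rows from the data table that also
    -- appear in the holding table, in one statement.
    DELETE d
    FROM   dbo.tbl_Data    AS d
    JOIN   dbo.tbl_Holding AS h
      ON   h.D1 = d.D1
     AND   h.D2 = d.D2
     AND   h.D3 = d.D3
     AND   h.D4 = d.D4
     AND   h.D5 = d.D5;

    -- Step 5: append everything left in the holding table.
    INSERT INTO dbo.tbl_Data (D1, D2, D3, D4, D5)
    SELECT D1, D2, D3, D4, D5
    FROM   dbo.tbl_Holding;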

  • daniel.freedman 80164 (11/12/2014)


    You may be over-thinking this a little. Just have a holding table with a surrogate key field and use that for the clustered index (the data is stored in the order of the clustered index; if your clustered index is made up of several fields then SQL Server has to keep reordering the data as it imports the records). You can have more indexes on the holding table for the actual key fields.

    Once again thanks for the input Daniel.

    I could easily be overcomplicating this. Using the key fields (natural key) allows me to test for duplicates on import. I'm not sure I can do that as easily if there's a surrogate key.

    I was considering making a new field with the key fields concatenated (that was step 2 above), but only to avoid complex joins later (since the data won't change). This would be the clustered index in the data table.

  • You can have several indexes on a table, but the data is physically stored on the disc in the order of the clustered index. This is why using a surrogate key as the clustered index is convenient - it ensures that new data is always written to the end of the file instead of making the server do more work by inserting it into the middle of the file.

    Having said that, if your text files are small (and by that I mean no more than a few thousand rows) then you probably won't experience much of a performance hit. If it's only a few hundred rows then we're talking about nanoseconds probably! But it's something worth bearing in mind if you're dealing with larger tables later on down the line.

    When you write a query the server will work out for itself the best way to execute that query. If there's an index it can use, then it'll use it. If not then it'll just scan through all the records one by one until it finds what it wants. You'll need to look at the execution plan for the query to see how it'll be executed, but the point I'm trying to make is that even though there's an index on the table that you think should do the job, SQL Server may decide not to use it at all.

    I don't really see much benefit in creating a new field that contains all the key fields concatenated together, other than making your queries shorter and easier to read. Bear in mind you'd also need the same concatenated field on the live data so you can compare like with like.
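
    If you did go down that route, a rough sketch of "comparing like with like" would be the same persisted computed key defined on both tables (placeholder table and column names only):

    -- Illustrative only: identical concatenated key on both tables,
    -- so holding and live rows can be compared on a single column.
    ALTER TABLE dbo.tbl_Holding
        ADD ConcatKey AS CONCAT(D1, '|', D2, '|', D3, '|', D4, '|', D5) PERSISTED;

    ALTER TABLE dbo.tbl_Data
        ADD ConcatKey AS CONCAT(D1, '|', D2, '|', D3, '|', D4, '|', D5) PERSISTED;

    SELECT h.*
    FROM   dbo.tbl_Holding AS h
    JOIN   dbo.tbl_Data    AS d
      ON   d.ConcatKey = h.ConcatKey;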

  • FishD (11/12/2014)


    Craig's idea of combining fields to make a key field is also going to work.

    Errr, you don't have to 'combine' them. You just use a multicolumn key. IE:

    tableA a JOIN tableB b ON a.ColA = b.ColA AND a.ColB = b.ColB...

    Functionally, a single giant column is no better than multiple individual columns. You only need this for the insertion check, so there's no reason to jump through hoops; your surrogate key (identity) will be carried through to the rest of the data for reference. All you'd be doing is using up more space and ending up with fewer rows per page in storage.

    Here's the process then:

    1. Import to a dump table.

    2. Create the index and move to the holding table.

    3. Do a SELECT join between the holding and data tables on the new index, to see whether any records are already in the data table.

    4. Delete the ones which already exist from the data table.

    5. Append from the holding table (this ensures the most current data is in the data file).

    6. Empty the holding table.

    I think this can be done with only SQL (no iterating through individual records!)

    Once you've imported to the dump table (staging table, holding table, whatever you want to call it) you can simply run a MERGE statement (check it out, it's neat). That will let you choose what to do when things match or are missing, i.e. matches get ignored and missing rows get inserted.

    The big question here is whether rows are ever updated by this process, or whether it's drop once and done. My concern is that you start ignoring updates coming in for data changes from the external source for these records. If they're one-and-done dropoffs (like a check register or something) then you're fine with a simple version. If you have to check versions, you're going to need something more complex.
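
    A rough sketch of what that more complex version might look like with MERGE, assuming (purely for illustration) that D1 and D2 form the natural key and D3 to D5 are payload columns that can change:

    -- Sketch only, assumed column roles. New rows are inserted; rows whose key
    -- already exists but whose payload has changed are updated instead of ignored.
    MERGE dbo.tbl_Data AS tgt
    USING dbo.tbl_Holding AS src
        ON  tgt.D1 = src.D1
        AND tgt.D2 = src.D2
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (D1, D2, D3, D4, D5)
        VALUES (src.D1, src.D2, src.D3, src.D4, src.D5)
    WHEN MATCHED AND (tgt.D3 <> src.D3 OR tgt.D4 <> src.D4 OR tgt.D5 <> src.D5) THEN
        UPDATE SET D3 = src.D3,
                   D4 = src.D4,
                   D5 = src.D5;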


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Thanks guys for the help. I think I've now got something that will work. I still don't fully understand the surrogate key, so I've got to research that a bit more. I've also ignored MERGE for the moment, though it looks like a very powerful feature. I will post back if I can improve on this.

    Here's what I've done.

    Table structures...

    CREATE TABLE [tbl_Holding](
        [D1] [nchar](10) NOT NULL,
        [D2] [nchar](10) NOT NULL,
        [D3] [nchar](10) NOT NULL,
        [D4] [nchar](10) NOT NULL,
        [D5] [nchar](10) NOT NULL,
        CONSTRAINT [PK_tbl_Holding] PRIMARY KEY CLUSTERED
            ([D1] ASC,
             [D2] ASC,
             [D3] ASC,
             [D4] ASC,
             [D5] ASC))

    CREATE TABLE [tbl_Data](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [D1] [nchar](10) NOT NULL,
        [D2] [nchar](10) NOT NULL,
        [D3] [nchar](10) NOT NULL,
        [D4] [nchar](10) NOT NULL,
        [D5] [nchar](10) NOT NULL
    ) ON [PRIMARY]

    Remove duplicates and append...

    BULK INSERT tbl_Holding
    FROM '\\................\test.csv'
    WITH
    (
        FIELDTERMINATOR = '|',   -- Field separator
        ROWTERMINATOR = '\n',    -- Row ending character
        FIRSTROW = 2,            -- Skip the header row; data starts at row 2
        KEEPNULLS,               -- Brings null values through
        ROWS_PER_BATCH = 2000    -- Should be the approximate size of the files
    )
    GO

    DELETE a
    FROM tbl_Holding a
    INNER JOIN tbl_Data b ON
        a.D1 = b.D1 AND
        a.D2 = b.D2 AND
        a.D3 = b.D3 AND
        a.D4 = b.D4 AND
        a.D5 = b.D5
    GO

    INSERT INTO tbl_Data (D1, D2, D3, D4, D5)
    SELECT D1, D2, D3, D4, D5
    FROM tbl_Holding;
    GO

    TRUNCATE TABLE tbl_Holding
    GO

  • Just as a point of information, the Id field in tbl_Data is what we're calling a surrogate key. It's called a surrogate because the value of the Id has nothing to do with the values of the other fields; it's a completely separate piece of information solely designed to uniquely identify the row.

    You could put an Id field into the holding table as well and set that as the primary key, and then create a separate index on fields D1 to D5, i.e. CREATE INDEX IX_HoldingData ON tbl_Holding (D1 ASC, D2 ASC, D3 ASC, D4 ASC, D5 ASC).

    Other than that minor point, the rest of it is fine and should work as intended. If you do decide to put an Id field into the holding table you don't need to change anything else; in other words, when you insert from the holding table into the main table it will allocate a new Id number, so you only need to copy the data over, not the Id.
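
    For illustration, a sketch of that alternative holding table layout, keeping the column names from the post above:

    -- Sketch of the suggested alternative: surrogate key as the clustered primary key,
    -- with a separate nonclustered index on the natural key fields D1-D5.
    CREATE TABLE [tbl_Holding](
        [Id] [int] IDENTITY(1,1) NOT NULL,
        [D1] [nchar](10) NOT NULL,
        [D2] [nchar](10) NOT NULL,
        [D3] [nchar](10) NOT NULL,
        [D4] [nchar](10) NOT NULL,
        [D5] [nchar](10) NOT NULL,
        CONSTRAINT [PK_tbl_Holding] PRIMARY KEY CLUSTERED ([Id] ASC)
    )

    CREATE NONCLUSTERED INDEX IX_HoldingData
        ON [tbl_Holding] ([D1] ASC, [D2] ASC, [D3] ASC, [D4] ASC, [D5] ASC)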

  • Your approach doesn't account for the case where you have a valid reason to insert the data again, such as when the original file arrived with some or all of its data erroneous. So the question becomes whether you should delete at the source or at the destination; I think you should think this through.

    I think you are fine with the clustered primary key on the fields you are joining on in this temp table. I don't think you need to cluster a separate Id field; it just gets wiped at the next batch and never used.

    ----------------------------------------------------
