Use SQL to dedupe CSV data

  • Hi all

    To keep the example simple, let's say for the sake of argument I have a large CSV file of, say, 500,000 rows and 80 columns of mostly text data.

    I have been working on a process to create an XLSX file using SpreadsheetML. This works fine until the amount of data ramps up and I run out of memory.

    The main problem is creating a unique array of all of the text data for sharedStrings.xml.

    I just wondered if there was an efficient way to use SQL to dedupe those 40 million values (500,000 rows × 80 columns).

    I can't simply use COM automation and get Excel to do the work for me, as this process has to run unattended on a server, and Microsoft's recommendation for this sort of thing is to use OpenXML.
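
    For anyone not familiar with the format: sharedStrings.xml stores each distinct string exactly once, and the worksheet cells reference those strings by zero-based index, which is why I need the full dedupe. A stripped-down sketch (a real file also carries count and uniqueCount attributes on <sst>):

        <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
          <si><t>apple</t></si>
          <si><t>banana</t></si>
        </sst>

        <!-- a worksheet cell then points at entry 0 like this: -->
        <c r="A1" t="s"><v>0</v></c>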

    Thanks in advance for any ideas.

  • Oh, just to add: I have tried using the Microsoft Scripting.Dictionary class, but it is so sloooow and of course still consumes large amounts of memory.

  • You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • ThomasRushton - Thursday, February 23, 2017 9:38 AM

    You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?

    The former, as I need to build the sharedStrings.xml file from it.

    https://msdn.microsoft.com/en-us/library/office/gg278314.aspx

  • andrew 67979 - Thursday, February 23, 2017 9:42 AM

    The former, as I need to build the sharedStrings.xml file from it.

    https://msdn.microsoft.com/en-us/library/office/gg278314.aspx

    Okay... so WHY are you "re-inventing the wheel" when you could import the CSV file using SSIS and write it out in spreadsheet form, without having to programmatically write out all the actual bytes on your own? When you said "the former", I interpret that to mean you were hoping for a way to remove duplicate values from a set composed of all values in all columns. As SSIS would have to do this in order to create the spreadsheet, why do you think you need to do it manually? I'd be seeking to understand the larger objective... What's the purpose of doing this on your own? What have you already tried, and why?
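
    That said, if the SQL route wins out, staging the file is the easy part. A minimal sketch using BULK INSERT instead of SSIS (the path, table, and column names are hypothetical, only 3 of the 80 columns are shown, and this simple form assumes no quoted fields with embedded commas):

        -- Staging table: one NVARCHAR column per CSV column
        CREATE TABLE dbo.CsvStaging (
            Col01 NVARCHAR(4000) NULL,
            Col02 NVARCHAR(4000) NULL,
            Col03 NVARCHAR(4000) NULL
            -- ... Col04 through Col80 ...
        );

        BULK INSERT dbo.CsvStaging
        FROM 'C:\data\bigfile.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            FIRSTROW        = 2,   -- skip the header row
            TABLOCK
        );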

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If it were me, I'd just import the file into a table, hit it with the SQL hammer, convert the deduplicated data to XML, and export it.
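
    Continuing with the hypothetical dbo.CsvStaging table from the staging sketch above, the "SQL hammer" and XML steps could look something like this (again, only 3 of the 80 columns are listed):

        -- Unpivot every column into a single value-per-row stream, then dedupe it
        SELECT DISTINCT v.StringValue
        INTO   dbo.SharedStrings
        FROM   dbo.CsvStaging AS s
        CROSS APPLY (VALUES (s.Col01), (s.Col02), (s.Col03) /* ... (s.Col80) */)
               AS v(StringValue)
        WHERE  v.StringValue IS NOT NULL;

        -- Cells reference shared strings by zero-based position, so pin down an
        -- order and emit the <si><t>...</t></si> entries; the real file also needs
        -- the SpreadsheetML namespace and count/uniqueCount attributes on <sst>
        SELECT StringValue AS t
        FROM   dbo.SharedStrings
        ORDER  BY StringValue
        FOR XML PATH('si'), ROOT('sst');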

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
