Use SQL to dedupe CSV data

  • Hi all

    To keep the example simple, let's say for the sake of argument I have a large CSV file of, say, 500,000 rows and 80 columns of mostly text data.

    I have been working on a process to create an XLSX file using SpreadsheetML. This works fine until the amount of data ramps up and I run out of memory.

    The main problem is creating a unique array of all of the text data for sharedStrings.xml.

    I just wondered if there was an efficient way to use SQL to dedupe those 40 million values (500,000 rows × 80 columns).

    I can't simply use COM automation and get Excel to do the work for me, as this process has to run unattended on a server, and Microsoft's recommendation for this sort of thing is to use OpenXML.
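
    For anyone not familiar with the format: sharedStrings.xml stores each distinct string exactly once, and the worksheet cells reference those strings by zero-based index, which is why I need the full dedupe. A stripped-down sketch (a real file also carries count and uniqueCount attributes on <sst>):

        <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
          <si><t>apple</t></si>
          <si><t>banana</t></si>
        </sst>

        <!-- a worksheet cell then points at entry 0 like this: -->
        <c r="A1" t="s"><v>0</v></c>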

    Thanks in advance for any ideas.

  • Oh, just to add: I have tried using the Microsoft Scripting.Dictionary class, but it is so sloooow and of course still consumes large amounts of memory.

  • You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • ThomasRushton - Thursday, February 23, 2017 9:38 AM

    You say "dedupe these 40 million values"; does this mean that the 500,000 rows of data / 80 columns are to be deduped horizontally as well as vertically? Or that each of those 40M values is a distinct item to be considered for deduplication? Or are you just (hah!) looking at it by unique rows?

    The former, as I need to build the sharedStrings.xml file from it.

    https://msdn.microsoft.com/en-us/library/office/gg278314.aspx

  • andrew 67979 - Thursday, February 23, 2017 9:42 AM

    The former, as I need to build the sharedStrings.xml file from it.

    https://msdn.microsoft.com/en-us/library/office/gg278314.aspx

    Okay... so WHY are you "re-inventing the wheel" when you could import the CSV file using SSIS and write it out in spreadsheet form, without having to programmatically write out all the actual bytes on your own? When you said "the former", I interpret that to mean you were hoping for a way to remove duplicate values from a set composed of all values in all columns. As SSIS would have to do this in order to create the spreadsheet, why do you think you need to do it manually? I'd be seeking to understand the larger objective... What's the purpose of doing this on your own? What have you already tried, and why?
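
    That said, if the SQL route wins out, staging the file is the easy part. A minimal sketch using BULK INSERT instead of SSIS (the path, table, and column names are hypothetical, only 3 of the 80 columns are shown, and this simple form assumes no quoted fields with embedded commas):

        -- Staging table: one NVARCHAR column per CSV column
        CREATE TABLE dbo.CsvStaging (
            Col01 NVARCHAR(4000) NULL,
            Col02 NVARCHAR(4000) NULL,
            Col03 NVARCHAR(4000) NULL
            -- ... Col04 through Col80 ...
        );

        BULK INSERT dbo.CsvStaging
        FROM 'C:\data\bigfile.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            FIRSTROW        = 2,   -- skip the header row
            TABLOCK
        );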

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • If it were me, I'd just import the file into a table, hit it with the SQL hammer, convert the deduplicated data to XML, and export it.
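
    Continuing with the hypothetical dbo.CsvStaging table from the staging sketch above, the "SQL hammer" and XML steps could look something like this (again, only 3 of the 80 columns are listed):

        -- Unpivot every column into a single value-per-row stream, then dedupe it
        SELECT DISTINCT v.StringValue
        INTO   dbo.SharedStrings
        FROM   dbo.CsvStaging AS s
        CROSS APPLY (VALUES (s.Col01), (s.Col02), (s.Col03) /* ... (s.Col80) */)
               AS v(StringValue)
        WHERE  v.StringValue IS NOT NULL;

        -- Cells reference shared strings by zero-based position, so pin down an
        -- order and emit the <si><t>...</t></si> entries; the real file also needs
        -- the SpreadsheetML namespace and count/uniqueCount attributes on <sst>
        SELECT StringValue AS t
        FROM   dbo.SharedStrings
        ORDER  BY StringValue
        FOR XML PATH('si'), ROOT('sst');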

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
