Get rid of commas in quoted text, then Bulk Insert

  • I was trying to use DTS to solve this problem, since DTS can deal with this issue, but I'm a hackboy and automating DTS is way over my head.

    I could use bulk insert if I could get past one problem.  My CSV data files are garbage, and I have no control over the vendor.

    Let's say it is a 6-field table; here is what my data looks like:

    ,dog,,,$100.00,

    ,cat,small,,"$1,000,000.00",

    ,mouse,big,ugly,"$3,000.00",

    The data has numerous null fields, but the killer is the formatting of the money fields.  If a value is $1,000 or more, they quote it and include the commas.  BULK INSERT lets you pick the delimiter, but does not let you indicate quoted text.

    If I could get rid of the commas in the money fields, I'd be in good shape.  To add to the insanity, there are about 10 transaction types in each set of records, and some have more than one money field, although I can identify the record type (you can't see that in my example).  Here is what I need the data to look like:

    ,dog,,,$100.00,

    ,cat,small,,"$1000000.00",

    ,mouse,big,ugly,"$3000.00",

    I should probably post this as a separate question, with a different header.
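    One aside, in case it helps anyone finding this later: on SQL Server 2017 and up, BULK INSERT can parse quoted CSV fields natively via FORMAT = 'CSV' and FIELDQUOTE, which would make the embedded commas a non-issue.  The table and file names below are made up:

    ```sql
    -- SQL Server 2017+ only; dbo.VendorImport and the path are hypothetical.
    BULK INSERT dbo.VendorImport
    FROM 'C:\data\vendor.csv'
    WITH (
        FORMAT = 'CSV',        -- RFC 4180-style parsing
        FIELDQUOTE = '"',      -- commas inside "..." are not delimiters
        FIELDTERMINATOR = ',',
        ROWTERMINATOR = '\n'
    );
    ```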

  • I have no easy answer for this (and would be interested to know if there is one), but this is what I tend to do when data-to-be-imported is inconsistent in the manner you've described:

    1. bulk load the raw records into a staging table as one big varchar field.

    2. use TSQL to parse the rows while moving them into a second staging table.

    3. move directly from that staging table to the final destination tables.

    Step 2 can be done in multiple passes for simpler code, or in a single pass (insert/update) using many CASE statements or even user-defined functions to parse/interpret.
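    A rough sketch of those three steps (all object names are made up, and the step-2 parse shown handles just the second CSV field, to give the idea):

    ```sql
    -- 1. Bulk load each raw line into one wide varchar column.
    CREATE TABLE dbo.RawStage (RawLine varchar(2000));

    BULK INSERT dbo.RawStage
    FROM 'C:\data\vendor.csv'        -- made-up path
    WITH (ROWTERMINATOR = '\n');     -- default tab field terminator never
                                     -- fires, so the whole line lands in
                                     -- the one column

    -- 2. Parse the rows with TSQL while moving them into a typed staging
    --    table.  Example: pull the 2nd CSV field (',dog,,,$100.00,' -> 'dog').
    CREATE TABLE dbo.ParsedStage (Animal varchar(50));

    INSERT INTO dbo.ParsedStage (Animal)
    SELECT SUBSTRING(RawLine, 2, CHARINDEX(',', RawLine, 2) - 2)
    FROM dbo.RawStage;

    -- 3. Move directly from staging to the final destination table.
    INSERT INTO dbo.Animals (Animal)
    SELECT Animal FROM dbo.ParsedStage;
    ```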

  • I was thinking, using CHARINDEX I could locate the first instance of the double quote, then use CHARINDEX again to find the second instance.

    0000000001111111111222222222233333333334

    1234567890123456789012345678901234567890

    xx,"1,000",xxxxxxx,"2,000",xxxxxxxxxxxxx

    So using CHARINDEX I'd get that position 4 was my first double quote.  Then I could do a CHARINDEX on SUBSTRING(mydata,4+1,40), which would return 6, putting the second quote at position 4 + 6 = 10.

    So now I know there is a quoted text field between position 4 and 10, and there MAY BE (not for sure) a comma.  They use spaces, and sometimes the numbers are in the millions.

    Can I do a REPLACE of comma with '' (nothing, remove the comma) between positions 4 and 10?

    If that is possible, that fixes the field, and then I can use the same logic on the next money field in the record.
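    Sketching that idea in TSQL (variable names are mine): find the two quote positions with CHARINDEX as described, then use STUFF to splice a comma-free copy of that span back over itself:

    ```sql
    DECLARE @mydata varchar(40), @q1 int, @q2 int

    SET @mydata = 'xx,"1,000",xxxxxxx,"2,000",xxxxxxxxxxxxx'
    SET @q1 = CHARINDEX('"', @mydata)                               -- 4
    SET @q2 = @q1 + CHARINDEX('"', SUBSTRING(@mydata, @q1 + 1, 40)) -- 10

    -- Overwrite positions @q1..@q2 with the same span minus its commas.
    SELECT STUFF(@mydata, @q1, @q2 - @q1 + 1,
                 REPLACE(SUBSTRING(@mydata, @q1, @q2 - @q1 + 1), ',', ''))
    -- xx,"1000",xxxxxxx,"2,000",xxxxxxxxxxxxx
    ```

    Then repeat the same CHARINDEX/STUFF pass for the next quoted field in the record.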

     

  • Here is a solution to my problem.  Basically I find the location of the first and second quotes, delete all commas between the two, then delete the first set of quotes.  I set this up in a stored procedure and just call it about 5 times, to get every possible instance of this in the data.

    I'm sure there is a better way to do this, but considering the hack I am, this works, LOL.

    TABLE with three fields: ExcelData, Quote1a, Quote1b

    -- Test data
    UPDATE DataDump
    SET ExcelData = '12345"12,,5"12345'

    -- Locate the first quote, and the second quote's offset from the first
    UPDATE DataDump
    SET Quote1a = CHARINDEX('"', ExcelData),
        Quote1b = CHARINDEX('"', SUBSTRING(ExcelData, CHARINDEX('"', ExcelData) + 1, 1000))

    -- Get rid of all commas between the two quote indicators
    -- (this step also drops the second quote, since the trailing
    -- SUBSTRING starts one character past it)
    UPDATE DataDump
    SET ExcelData =
        LEFT(ExcelData, Quote1a - 1) +
        REPLACE(SUBSTRING(ExcelData, Quote1a, Quote1b), CHAR(44), '') +
        SUBSTRING(ExcelData, Quote1a + Quote1b + 1, 2000)
    WHERE Quote1a > 0

    -- Get rid of the first quote indicator
    UPDATE DataDump
    SET ExcelData = LEFT(ExcelData, Quote1a - 1) + SUBSTRING(ExcelData, Quote1a + 1, 2000)
    WHERE Quote1a > 0
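    If you'd rather not call the procedure a fixed number of times, one possible cleanup (an untested sketch, assuming DataDump holds a single row as in the example above) is to loop until no quote pairs remain, stripping each pair and the commas between them in one shot:

    ```sql
    DECLARE @q1 int, @q2 int

    SELECT @q1 = CHARINDEX('"', ExcelData) FROM DataDump

    WHILE @q1 > 0
    BEGIN
        SELECT @q2 = CHARINDEX('"', ExcelData, @q1 + 1) FROM DataDump
        IF @q2 = 0 BREAK   -- unmatched quote; leave it alone

        UPDATE DataDump
        SET ExcelData =
            LEFT(ExcelData, @q1 - 1) +
            REPLACE(SUBSTRING(ExcelData, @q1 + 1, @q2 - @q1 - 1), ',', '') +
            SUBSTRING(ExcelData, @q2 + 1, 2000)

        SELECT @q1 = CHARINDEX('"', ExcelData) FROM DataDump
    END
    -- '12345"12,,5"12345' ends up as '1234512512345'
    ```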

     
