DTS - loading mixed data types from text file.

  • Hi all,

    I have a text file to import, it has several NUMERIC columns.

    However, the data provider sends the data with some ALPHA data in the columns.

    For example, in a Dollar Value column, they send me NULL as text, not a null value.

    I try to replace this value with null, but I get type mismatch errors.

    I have tried:

    ' ActiveX transformation (VBScript)
    Function Main()
        If DTSSource("Value") > 1 Then
            DTSDestination("Value") = DTSSource("Value")
            Main = DTSTransformStat_OK
        End If
    End Function

    But as soon as the source data has ALPHA data, I get an error.

    I just want to replace invalid data with NULL.

    Please help if you have done this before!

    Thanks.

  • I import all columns as VARCHAR and evaluate from there. You should only import to a staging table... there you can make such decisions before transferring to the final table.
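
    For instance (just a sketch; the table and column names here are made up), the evaluation step might look something like this:

        -- Everything lands in the staging table as VARCHAR first...
        CREATE TABLE dbo.StageImport
        (
            DollarValue VARCHAR(50) NULL
            -- ...plus the rest of the columns, all VARCHAR
        )

        -- ...then decide what's really numeric on the way to the final table,
        -- turning junk like the literal text 'NULL' into a real NULL. ISNUMERIC
        -- is a loose check, so a stricter test may be needed for some feeds.
        INSERT INTO dbo.FinalTable (DollarValue)
        SELECT CASE
                   WHEN ISNUMERIC(DollarValue) = 1 THEN CAST(DollarValue AS MONEY)
                   ELSE NULL
               END
        FROM dbo.StageImport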

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    I understand the staging table, but in this case I truncate the destination each time I load the data.

    The other problem is log growth.

    How do you handle log growth? I end up doing a full backup and shrinking/defragging the DB and log files after every load.

    This is a fairly static DB. The main tables are updated once a week.

    The other thing is... I actually figured it out. I'd have to go back and look at exactly how I did it, but it was along the lines of nested date parts concatenated together and checked with ISDATE. But this was a few days ago and I just took some pain pills for my back about an hour ago! LOL
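
    Something along those lines, anyway (the column names here are just placeholders):

        -- Glue the pieces together and only convert when ISDATE says the result
        -- is a real date; otherwise load a NULL instead of blowing up the insert.
        SELECT CASE
                   WHEN ISDATE(LTRIM(RTRIM(DateCol)) + ' ' + LTRIM(RTRIM(TimeCol))) = 1
                   THEN CONVERT(DATETIME, LTRIM(RTRIM(DateCol)) + ' ' + LTRIM(RTRIM(TimeCol)))
                   ELSE NULL
               END AS LoadDateTime
        FROM dbo.StageImport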

    I may start staging everything. I generally hope for consistent, fairly clean data... at least when it's provided by bigger-NAME sources.

    But it is amazing the CR@P data I have seen in my travels.

    I don't know how it can even be queried in some cases!

    See ya,

    Gene

  • Wow,

    I was a bit out of it when I replied last night! LOL

    I did have some crazy DATE data: the date stored in one VARCHAR column and the time stored in another.

    But when the date was blank, it was one or two spaces, then " / / ", and another space or two, with no consistency.

    Why date and time are not in the same column, and why they would be CHAR data at all, I have no idea.

    Okay, as far as the crazy alpha chars in the numeric fields:

    They changed the data feed. NULL is now an empty column, and the other values are calculated correctly, so no overflow-type messages are in the data anymore.

    Thus, my problem is solved.

    I still would love to have the log not grow!!! I am loading millions of rows in several tables.

    I would love to BULK load. Do you bulk insert from your staging tables?

    This is more of a warehouse DB for the most part. I have played with building indexes before and after. But I MUST have constraints on duplicate rows. I do sometimes get dupes in the feeds.

    Yeah, staging could be used for this as well. Hmmm. I will need to read a warehousing book!

    Have a great weekend,

    Gene

  • intvfan (4/26/2008)


    I still would love to have the log not grow!!! I am loading millions of rows in several tables.

    I would love to BULK load. Do you bulk insert from your staging tables?

    Yes... I'll typically use BULK INSERT or BCP with a format file. Using BCP with a correctly set up format file, with the -m (max errors) parameter set to a high number and the -e (error file name) parameter pointing to an error file, will "pre-screen" for a lot of different errors and store the offending rows in a file for repair without stopping the rest of the load. Of course, not all problems (such as dupes) will be handled by this, and that's why it's important to use a staging table...

    ...both methods usually beat the tar out of similar DTS loads.
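
    For example, the staging-table load might look something like this (the file, format file, and table names here are just placeholders):

        -- Bulk load the raw feed into a staging table. MAXERRORS lets the load keep
        -- going past a reasonable number of bad rows instead of failing outright,
        -- and TABLOCK/BATCHSIZE help keep the load fast.
        BULK INSERT dbo.StageImport
        FROM 'C:\feeds\weekly_feed.txt'
        WITH
        (
            FORMATFILE = 'C:\feeds\weekly_feed.fmt',
            MAXERRORS  = 1000,
            BATCHSIZE  = 50000,
            TABLOCK
        )

        -- The BCP equivalent from the command line would be along the lines of:
        -- bcp MyDB.dbo.StageImport in C:\feeds\weekly_feed.txt -f C:\feeds\weekly_feed.fmt
        --     -m 1000 -e C:\feeds\weekly_feed.err -S MyServer -T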

    Depending on the client, I'll usually set up a separate database with the SIMPLE recovery model so that I can take full advantage of the speed of bulk loads. That also means the client doesn't have to do backups on that database, and it doesn't cause undue log growth for the client.
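
    Switching that work database over is a one-liner (the database name is a placeholder):

        -- Put the separate load/staging database into the SIMPLE recovery model so
        -- the bulk loads can be minimally logged and the log doesn't balloon on
        -- every weekly load.
        ALTER DATABASE StagingWorkDB SET RECOVERY SIMPLE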

    This is more of a warehouse DB for the most part. I have played with building indexes before and after. But I MUST have constraints on duplicate rows. I do sometimes get dupes in the feeds.

    Just a suggestion... nope... no need for such constraints. It's a data warehouse, and to keep load performance up, it's the load program that should do the duplicate checks BEFORE the data is inserted into the warehouse. The load program should mark the inserts/updates in the staging table and then operate according to those marks. That includes duplicates... a correctly written dupe check can find and delete hundreds, even thousands, of dupes in a correctly formed million-row staging table in a matter of seconds. Or, you can just mark them as dupes and ignore them during the final transfer to the warehouse table. The only indexes/constraints you need on the final warehouse tables (provided that humans don't ever enter data directly into them) are the ones that support reporting queries, plus the correct clustered index to keep the tables from being heaps. Choose the clustered indexes carefully... they can really slow down INSERTs and some UPDATEs if built on columns that cause inserts into the "middle" of the table. That might be a necessary evil in some cases... see the "views" and "synonyms" discussion below if you need a clustered index that will allow inserts into the "middle" of the table.
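
    A dupe check on the staging table can be as simple as this (the StageID IDENTITY column and the key columns are placeholders):

        -- Keep the lowest StageID for each natural key and delete the rest as dupes.
        -- Works on SQL 2000 (no ROW_NUMBER needed); assumes an IDENTITY column was
        -- added to the staging table for exactly this purpose.
        DELETE s
        FROM dbo.StageImport s
        INNER JOIN
        (
            SELECT MIN(StageID) AS KeepID, KeyCol1, KeyCol2
            FROM dbo.StageImport
            GROUP BY KeyCol1, KeyCol2
        ) k
            ON  s.KeyCol1  = k.KeyCol1
            AND s.KeyCol2  = k.KeyCol2
            AND s.StageID <> k.KeepID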

    Keep in mind that unless you drop or truncate the warehouse tables and replace them with staging area tables/data, there's not much you can do to keep the log from growing in the presence of indexes. Read up about the requirements for high speed bulk loads in Books Online.

    If the staging tables contain full replacement data for a whole table (all rows replaced), consider using a view in SQL 2000, or a synonym in SQL Server 2005, to point to the "active" table (both tables would likely live in the same DB in this case). This means you can leisurely build and clean the data in a table without interfering with the current warehouse data. When you're all done cleaning the data in the staging table and rebuilding all the indexes, simply ALTER the view (or drop and re-create the synonym) to repoint it to the newly cleaned table. The total "downtime" for such a swap is rarely more than 65 milliseconds. Also, for these types of "replacement" tables, if no one but the load program adds data to the table, consider using a FILLFACTOR of 100 for maximum query performance... you'll also have time to do full index rebuilds without interfering with warehouse table usage on full table replacements.
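
    The swap itself is about as small as it gets (the view and table names are placeholders):

        -- Reporting queries always hit dbo.CurrentSales; the load process repoints
        -- it once the freshly loaded, cleaned, and indexed table is ready.
        ALTER VIEW dbo.CurrentSales
        AS
        SELECT SaleID, State, Amount   -- list the columns explicitly in practice
        FROM dbo.Sales_B               -- was dbo.Sales_A before the swap

        -- SQL Server 2005 synonym flavor: there is no ALTER SYNONYM, so drop and
        -- re-create it instead.
        -- DROP SYNONYM dbo.CurrentSales
        -- CREATE SYNONYM dbo.CurrentSales FOR dbo.Sales_B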

    --Jeff Moden



  • Thanks Jeff,

    I had considered using views and pointing to the tables for minimum down time.

    I totally overlooked the maintenance-time benefits! Just too obvious, I guess! LOL

    I was also considering splitting the 10-million-plus row tables and UNIONing them back together through a view. I can split by STATE; searches do not generally cross state lines. A virtual PARTITION, something like the sketch below. This is SQL 2000.

    I wish it was 2008. Partitions seem to work very well in 2008.

    However, I have only read it works well!
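
    A rough sketch of that kind of SQL 2000 partitioned view (the table, column, and state values are made up):

        -- Each member table holds one state's rows, enforced by a CHECK constraint
        -- on the partitioning column so the optimizer can skip member tables that
        -- can't match the WHERE clause.
        CREATE TABLE dbo.Sales_CA
        (
            SaleID  INT     NOT NULL,
            State   CHAR(2) NOT NULL CHECK (State = 'CA'),
            Amount  MONEY   NULL,
            CONSTRAINT PK_Sales_CA PRIMARY KEY (State, SaleID)
        )

        CREATE TABLE dbo.Sales_NY
        (
            SaleID  INT     NOT NULL,
            State   CHAR(2) NOT NULL CHECK (State = 'NY'),
            Amount  MONEY   NULL,
            CONSTRAINT PK_Sales_NY PRIMARY KEY (State, SaleID)
        )
        GO

        -- The view unions the members back into one logical table; a query that
        -- filters on State only touches the table(s) that can contain those rows.
        CREATE VIEW dbo.Sales
        AS
        SELECT SaleID, State, Amount FROM dbo.Sales_CA
        UNION ALL
        SELECT SaleID, State, Amount FROM dbo.Sales_NY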

    Take care,

    Gene

  • Thanks for the feedback, Gene, and good luck on your efforts! Let us know if anything comes up... maybe give us some feedback on what you ended up doing. Shoot, you might even want to write an article on what you ended up doing.

    --Jeff Moden


