Error during data import/cleansing using SSIS packages (urgent!!)

  • Thanks Journeyman

    That made sense! Just wondering though.....

If, for table A (the one with the duplicates), I just run a DISTINCT query that returns only one record per duplicate group (based on name, id, email, etc.),

    for table B, I take the remaining IDs that are unique (i.e. not in the DISTINCT result),

    and join tables A and B into a new table C,

    I get a new table free of duplicates.
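    In T-SQL, that whole A/B/C dance can usually be collapsed into a single SELECT DISTINCT ... INTO. A minimal sketch, where the table names A and C and the columns are just placeholders for the actual schema:

    ```sql
    -- A is the table with duplicates; C is the new, deduplicated table.
    -- One row per distinct (name, id, email) combination lands in C.
    SELECT DISTINCT name, id, email
    INTO C
    FROM A;
    ```

    If you also need table B (the leftover rows), a NOT EXISTS check against C would give you those.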

I'm currently trying out these queries, though.

    Thanks for your suggestions ...

• I've deduped data before, but only manually (I was pressed for time...).

    I'd like to use SSIS for what it's really meant for...

• Hello,

    I did what you recommended, but I got a warning message that said:

    "The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 16 buffers were considered and 16 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked."

    My server runs SQL Server 2003:

    MEMORY: 1 GB

    PENTIUM III 512 MHz

    I know it's an old version, but my company uses it for testing; unfortunately, I can't change it.

    I tried limiting the memory in my package (to 256 MB), but I got the same warning message.

    Do you have any idea how I can solve my problem?

    Thank you 😉

• Could you give a little more detail about the issue you're facing, i.e. how many tables, how much data, etc.? What does your SSIS package do?

• Hello,

    I created a table, TempClient, which contains duplicate records. To clean it, I built a process composed of:

    - a query with aggregation on LastName, FirstName, Adress, ZipCode and City

    - a Fuzzy Lookup whose reference table is "aggClient", which contains only 3 deduplicated fields: LastName, FirstName, ZipCode

    - then a Conditional Split on Similarity >= 0.8 to retrieve the unique clients

    - then a Fuzzy Grouping followed by a Conditional Split whose condition is key_in == key_out

    - finally an aggregation to retrieve the unique clients

    - an insert into my Client table
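    The first, aggregation step above can be sketched in T-SQL (TempClient and the column names are taken from the post; the target table name is just a placeholder):

    ```sql
    -- Collapse exact duplicates before the fuzzy steps:
    -- one row per (LastName, FirstName, Adress, ZipCode, City) group.
    SELECT LastName, FirstName, Adress, ZipCode, City
    INTO aggTempClient   -- placeholder name for the staging table
    FROM TempClient
    GROUP BY LastName, FirstName, Adress, ZipCode, City;
    ```

    Collapsing exact duplicates up front shrinks the input that the Fuzzy Lookup and Fuzzy Grouping steps have to buffer.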

When I ran that process on 1,000 rows, or on 4 months of data, it worked, but when I ran it on all the data it reported a memory allocation problem. My TempClient table contains more than 1,300,000 rows.

    Do you have any idea about my problem?

    Thank you

    hani 😉

Let me see if I've understood correctly: basically, you're having a problem when you run a Fuzzy Lookup on a larger dataset.

    One option would be to keep the buffer size as small as possible and make sure the buffers are spooled to a drive with plenty of free space.

    Secondly, this is what worked for me: I used T-SQL procedures to do the initial cleaning of the dataset (when I first began, the data was about 4 years old and there were 45 million records!). After cleaning the data and automating the routines, I now operate on a much smaller dataset (only about 60K records or so) that I pull every week.
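    For example, one common T-SQL pattern for that initial cleaning (a sketch only: it needs SQL Server 2005 or later for ROW_NUMBER(), and ClientId is a placeholder for whatever key your table has) keeps just one row per duplicate group:

    ```sql
    -- Delete all but one row per (LastName, FirstName, ZipCode) group,
    -- keeping the row with the lowest ClientId in each group.
    WITH Ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY LastName, FirstName, ZipCode
                   ORDER BY ClientId
               ) AS rn
        FROM TempClient
    )
    DELETE FROM Ranked
    WHERE rn > 1;
    ```

    Deleting through the CTE like this removes the duplicates in place, so the fuzzy steps then run on a much smaller input.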

I've started to use SSIS, but it's still slow.

    Hope that helps...
