Error during data import/cleansing using SSIS packages (urgent!!)

  • Hi!

    I get the following error message when I run my data import task.

    [font="Arial Black"]Information: 0x4004800C at Data Flow Task, DTS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 8 buffers were considered and 8 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

    [/font]

    [font="Arial"]What my import task does is basically[/font]----------------

    1] Import data from a DB table (about 140 million rows)

    2] Fuzzy grouping of the data on the basis of Customer ID, first name, last name, and email, with a similarity level of 1

    3] Sort the data, deleting duplicate rows

    4] Store the deduplicated data in a DB (on the same server)

    The package works fine for smaller sets of data, but I know the basic problem is the huge volume of data. What I've done so far is:

    1] Reduce DefaultBufferMaxRows from 10000 to 2500

    2] Increase DefaultBufferSize to 20 MB

    Neither seems to work.

    I need to import all the data and deduplicate it, but the system cannot find enough memory buffers. My system admin said he cannot increase the size of the paging/virtual memory.

    Maybe running the package in batches would help... but I have NO IDEA how to do this.

    Could someone please help me with this? It's my first data import project and this is really bogging me down.

    Any advice would be greatly appreciated!

  • Sounds like you have some memory contention πŸ™‚ Try decreasing the DefaultBufferMaxRows. It will slow the package down but should alleviate the symptom.

    More info here about how SSIS determines the size of the buffers (worthwhile read):

    http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx
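    As a rough rule, SSIS fills each buffer with about min(DefaultBufferMaxRows, DefaultBufferSize / estimated row width) rows, where the row width comes from the declared column metadata. Here is a minimal T-SQL sketch to ballpark that for the columns you actually send down the pipeline - the table dbo.SourceCustomers and the column names are placeholders, and actual data lengths only approximate what SSIS estimates:

    -- Estimate average row width for the pipeline columns and rows per 20 MB buffer
    -- (dbo.SourceCustomers and the column names are hypothetical).
    SELECT
        AVG(w.RowBytes)                      AS AvgRowBytes,
        (20 * 1024 * 1024) / AVG(w.RowBytes) AS RowsPer20MBBuffer
    FROM dbo.SourceCustomers AS s
    CROSS APPLY (SELECT CAST(ISNULL(DATALENGTH(s.CustomerID), 0)
                           + ISNULL(DATALENGTH(s.FirstName), 0)
                           + ISNULL(DATALENGTH(s.LastName), 0)
                           + ISNULL(DATALENGTH(s.Email), 0) AS bigint) AS RowBytes) AS w;

    If AvgRowBytes times DefaultBufferMaxRows is much larger than DefaultBufferSize, SSIS scales the rows per buffer down anyway, so slimming the row (dropping unused columns) usually buys more than tweaking the two settings.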

    Good luck πŸ™‚

    It was an excellent read. I tried reducing the max row count in each buffer to 1000 (versus the default of 10000). Still no luck; I get the same message:

    Information: 0x4004800C at Data Flow Task, DTS.Pipeline: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 4 buffers were considered and 4 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.

    However, I notice that when the import task begins I get the following warning message:

    [font="Arial Black"]Warning: Warning: 0x80047076 at Data Flow Task, DTS.Pipeline: The output column "_key_in" (1046) on output "Sort Output" (40) and component "Sort" (38) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance.( There are a lot more fields not used because the fuzzy grouping as i mentioned uses only about 4 fields out of 25 in the source table)[/font]

    Should I:

    a) Try to get rid of these excess columns, or

    b) Split the data flow to take a couple of thousand rows at a time? But that doesn't really make sense either, since I would need the entire recordset of 140 million rows to fuzzy group it, sort it, and then dedupe.

    Do let me know if I'm on the right track...

    Thanks

    W

  • You seem to be on the right track - remove any unused columns from the data flow. Kirk Haselden has a great blog with some tips to improve data flow performance. Take a look at some of his recommendations and see if you can apply them in your environment. Good luck 🙂

    http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/

  • Thanks so far for the advice and links.

    I proceeded by trying to filter the input columns, setting the columns that were not going to be used to "Pass Through" in the Fuzzy Grouping component.

    During the fuzzy grouping, however, I constantly get a warning message:

    Validation error. Data Flow Task: Fuzzy Grouping [16]: Fuzzy Grouping does not support input columns of type IMAGE, TEXT, or NTEXT. Package.dtsx

    There are no columns with these data types in the source DB.

    I'm wondering where this is coming from.
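    One way to double-check is to query the catalog for columns that SSIS maps to DT_TEXT / DT_NTEXT / DT_IMAGE - the text, ntext and image types, plus the (max) types, which the OLE DB source exposes as text/image streams. A sketch only; the table name dbo.SourceCustomers is a placeholder:

    SELECT c.name AS column_name, t.name AS type_name, c.max_length
    FROM sys.columns AS c
    JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = OBJECT_ID('dbo.SourceCustomers')
      AND (t.name IN ('text', 'ntext', 'image')
           OR (t.name IN ('varchar', 'nvarchar', 'varbinary') AND c.max_length = -1));

    If anything shows up here and it is checked as an input (even just as pass-through) in the Fuzzy Grouping editor, that may be what the validation is complaining about.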

    Regards

    Jude

  • Interesting - I'll try to reproduce it tomorrow. What build are you on - 3186, 3152? Take a look at this article as well: http://msdn2.microsoft.com/en-us/library/ms345128.aspx

  • Thanks again,

    I was able to work around it: I deselected the address column, which had a data type of varchar(max). It was not subsequently used in the transformation.

    I decided to break up the import/fuzzy grouping package into two tasks:

    A]

    1] Read the data from the source table into a fuzzy grouping transformation

    2] Store it in the destination DB (this has the sorted data)

    B] Dedupe the sorted data (using a Sort) and then insert it into a DB table.

    This would physically reduce the number of buffers needed.

    I have attached a screenshot of the first task (I ran into an error again!).

    The second attachment (dedupe_package) contains a screenshot of what I was trying to do initially.

    Do let me know what you think... I'm going to start looking at the log files today to try and decipher where the problem occurred.

    Thanks!

  • Anyone, any tips on how to optimize the buffer storage?

    Here is the deal: approximately 50 GB of data from the source DB (about 140 million rows) ends up using almost 170 GB of buffer space while undergoing sorting (and deduplication).

    It doesn't end there - it reaches that point while sorting only around 70-75 million rows.

    I just wanted to know if such a thing is normal. I have modified the settings on the data flow: the rows-in-buffer count is set to 10000 and the max buffer size to around 2 MB.

    Do let me know; otherwise I'll end up having to buy more HDD space.

    Thanks... any advice, folks? 🙂

    Jude

  • Have a look at this blog - pay particular attention to the perf counters. Any leaks?

    http://mohammedu.spaces.live.com/blog/cns!6699CF8ADD3D4F67!256.entry

  • Hi there,

    We had a conference call with someone from the SQL Server Escalation Support group two days ago on the issue of Fuzzy Grouping performance; we have been struggling with this performance issue as well. We have about 1.7 million rows that go through this process. All works well up to 70%, after which each additional 1% takes 2 hours to complete. Fuzzy Grouping eats every bit of VM available. Here are some items that were pointed out during our conference call with SQL Server Escalation Support:

    1) Hardware requirements up to Microsoft standards

    2) They have yet to get back to us on whether running the process on a 32-bit or 64-bit server makes any difference. Ours is 32-bit.

    3) Tempdb grows as available memory gets reduced, therefore set tempdb to autogrow (see the sketch after this list)

    4) Paging to be considered

    5) sp_configure to set max memory usage

    6) Enable full-text searching, if required

    7) AWE enabled

    8) Location of log and data files

    9) They suggest enhancing the package using code
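    For reference, here is a rough T-SQL sketch of items 3, 5 and 7 above. The logical file name, growth increment and memory cap are hypothetical - adjust them for your own server - and 'awe enabled' only applies to 32-bit builds:

    -- 3) Let tempdb autogrow (tempdev is the default logical name of the tempdb data file)
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILEGROWTH = 512MB);

    -- 5) Cap SQL Server's memory so the SSIS process has some room of its own
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 4096;
    RECONFIGURE;

    -- 7) Enable AWE on a 32-bit server (also requires the Lock Pages in Memory right)
    EXEC sp_configure 'awe enabled', 1;
    RECONFIGURE;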

    We have yet to hear from them whether there is any performance difference between 32-bit and 64-bit. I shall share it once known. For now we have ended up finding exact matches first and running the dedup on those records.

    I'm sharing all the above points based on our discussion with the SQL team; some of these points might not be relevant for all business requirements and systems. Consider your environment settings before taking any action.

    [font="TimesNewRoman"] β€œI haven't failed, I've found 10,000 ways that don't work”........Thomas Alva Edison[/font]

  • Hey, thanks for the post - it helps to know that someone else is facing the same problem.

    Now, the issue here is deduplication of records - 140 million rows of data. I just want to know if modifying my process would make things more efficient. Currently this is what I am doing:

    Retrieving data from the source DB

    Passing it through a Sort component (I check the option to delete duplicate rows); it sorts on the basis of cust-id, fname, lname and email

    Inserting the data into the destination via a SQL Server OLE DB destination.

    Can modifying the above process reduce the memory load?

    Do let me know. I have seen the other options you gave, but it looks like I would have to handle those with my IT manager, since I do not know much about SQL Server settings (nor do I have the privileges).

    Thanks, looking forward to hearing from you...

  • Man... you have lots of data... :w00t: Anyway, let me get this right: you need to remove duplicate rows based on Cust-id, f-name, l-name & email address, so there is only 1 transformation currently involved: [Source > Fuzzy > Destination].

    My suggestion would be to write a query that moves the duplicate rows - those with an exact match on the 4 columns - into a table (a rough sketch follows below). That should reduce your record count, and then you run Fuzzy on the new table. Place a Conditional Split after Fuzzy to split duplicate, unique and invalid records into 3 different tables. This did reduce the overhead for us.
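    A minimal T-SQL sketch of that first step, with placeholder table and column names (dbo.SourceCustomers as the source, dbo.Table_A_Duplicate_Records as the staging table, and the four match columns):

    -- Copy every row that has at least one exact-match twin on the 4 columns
    -- into a staging table, so only those rows go through Fuzzy Grouping.
    INSERT INTO dbo.Table_A_Duplicate_Records (CustID, FName, LName, Email)
    SELECT s.CustID, s.FName, s.LName, s.Email
    FROM dbo.SourceCustomers AS s
    JOIN (SELECT CustID, FName, LName, Email
          FROM dbo.SourceCustomers
          GROUP BY CustID, FName, LName, Email
          HAVING COUNT(*) > 1) AS d
      ON  d.CustID = s.CustID
      AND d.FName  = s.FName
      AND d.LName  = s.LName
      AND d.Email  = s.Email;

    Rows with NULLs in any of the match columns will not join here, so those would need to be handled separately.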

    Have you tried setting MaxMemoryUsage in the Fuzzy Grouping properties?

    There is no point splitting the 140 million records into parts, as you might still end up with duplicate records across the batches.

    Hope this helps πŸ™‚

    [font="TimesNewRoman"] β€œI haven't failed, I've found 10,000 ways that don't work”........Thomas Alva Edison[/font]

  • You bet! Almost 3 years' worth of it... Anyway, I get what you're saying: select all duplicates into a new table...

    And then do a sort (or do you mean a fuzzy lookup?) on that table.

    Yes, it could reduce the overhead... but I did not get the part after the Sort/Fuzzy. Since this table would contain only one set of the previous duplicate data, it wouldn't be possible to do a conditional split.

    I guess I would have to somehow pull all the non-duplicates during the sorting into a new table and then join all the recordsets.

    So (I may have misunderstood), I would go:

    Source ----> Select duplicates ----> Insert into DB ----> Sort ----> A] All uniques into one table

    B] Select only one of the duplicates in the sort ----> Do a join ----> Finally write to the DB

    Do let me know if this is efficient, or if I have misunderstood your suggestion, please point that out to me.

    All this is a first for me... thanks for your input, it's really helping 🙂

    Jude

  • I think I'm getting closer; it would be easy to write a query to delete duplicate records.

    Someone posted a nice deduplicating query:

    DELETE
    FROM MyTable
    WHERE ID NOT IN
        (SELECT MAX(ID)
         FROM MyTable
         GROUP BY DuplicateValueColumn1, DuplicateValueColumn2)

    But this would delete the duplicates and keep only the most recent record of each group, and I'm not sure how it treats the unique records.

    My table needs to keep at most one copy of each duplicated record, as well as all the unique records.
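    For what it's worth, a ROW_NUMBER-based variant (available from SQL Server 2005 on) makes the "keep exactly one row per group, uniques included" behaviour explicit. A sketch only - the ID column and the grouping columns are placeholders:

    -- Every group (duplicate or unique) keeps the row with rn = 1; the extras are removed.
    WITH Ranked AS (
        SELECT ID,
               ROW_NUMBER() OVER (PARTITION BY CustID, FName, LName, Email
                                  ORDER BY ID DESC) AS rn
        FROM MyTable
    )
    DELETE FROM Ranked WHERE rn > 1;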

  • Steps to follow...

    1) Create 4 tables: a) Table_A_Duplicate_Records, b) Duplicate_Records, c) Unique_Records, d) Invalid_Records (records whose selection columns are all NULL).

    2) Insert into Table_A_Duplicate_Records all records from Table_A [the original table] that have a count of more than 1 based on the columns [F_name/L_name etc.].

    3) Insert into Invalid_Records all records that have NULL values in those columns [F_name/L_name etc.]. This way you reduce the data that runs through the Fuzzy process.

    4) Do Fuzzy Grouping on Table_A_Duplicate_Records. Pick your columns to compare and set your Similarity Threshold value.

    5) Add a Conditional Split below the Fuzzy Grouping.

    6) Define the condition [_key_in == _key_out] for unique records.

    7) Define the condition [_key_in != _key_out] for duplicate records.

    8) Add an OLE DB destination and set the table name to Duplicate_Records.

    9) Add an OLE DB destination and set the table name to Unique_Records.

    10) Direct the duplicate output from the Conditional Split to the Duplicate_Records table.

    11) Direct the unique output from the Conditional Split to the Unique_Records table.

    This process runs Fuzzy only on the records from Table_A that have an exact match (count > 1) on all the selected columns.

    After step 11 you will have all the duplicate records in the Duplicate_Records table. Now do the following:

    12) Delete from Table_A [the original table] where the primary key value matches a record in the Duplicate_Records table (a rough T-SQL sketch is at the end of this post).

    Now your original table [Table_A] is free of duplicates.

    Do keep in mind: if you plan to run this process more than once, you will have to add an OLE DB Command to truncate all 4 tables from step 1 before each Fuzzy run.

    I know there are lots of steps involved, but this does seem to work for us. Do give it a try. Good luck!
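    As a rough T-SQL sketch of step 12 and of the reset note above - this assumes Duplicate_Records carries the primary key of the original row, and the key and table names are placeholders:

    -- Step 12: remove from the original table every row that landed in Duplicate_Records.
    DELETE a
    FROM dbo.Table_A AS a
    WHERE EXISTS (SELECT 1
                  FROM dbo.Duplicate_Records AS d
                  WHERE d.CustomerKey = a.CustomerKey);

    -- Reset the four staging tables from step 1 before each new run.
    TRUNCATE TABLE dbo.Table_A_Duplicate_Records;
    TRUNCATE TABLE dbo.Duplicate_Records;
    TRUNCATE TABLE dbo.Unique_Records;
    TRUNCATE TABLE dbo.Invalid_Records;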

    [font="TimesNewRoman"] β€œI haven't failed, I've found 10,000 ways that don't work”........Thomas Alva Edison[/font]
