Fuzzy Grouping Slow

  • Hi All,

    I have ran out of ideas on the following problem:crying:.Maybe someone would be able to help me out.I have a package to remove duplicate records from 1 table and move the unique records to a new table .Below is the flow of my SSIS package

    Oledb source = 1 table [2 million records]

    Transformation =Fuzzy grouping on [based on 6 columns] [score is 70%]

    The min similarity of each column is 0.7%

    Conditional Split = Unique & Duplicate

    Oledb destination =2 tables [Unique & Duplicate table ]

    Source table ---> Fuzzy Grouping---->Conditional Split ---->Unique

    ---- >Duplicate

    My Server information is as below

    Microsoft SQL Server 2005 - 9.00.3042.00 SP2

    Windows Server 2003 SP2

    Memory 4.00GB

    My Problem is , My package execution goes on for 2 continuous days and still is at [Fuzzy Grouping Inner Data Flow] Progress: Finding similar records - 78 percent complete.

    1)Is it suppose to take so long to complete ?

    2)How can I improve the performance ?

    3)Upon checking, the memory used by the process is 400,000 K

    There is no other application or process running on my server except for this package.It take up to 3 hours to complete 5%.Do help me out as I am not able to go on like this for 2 days and still not completed.This package is to run on every month end.

    Thank you in advance.

    [font="TimesNewRoman"] “I haven't failed, I've found 10,000 ways that don't work”........Thomas Alva Edison[/font]

  • Hey there,

    Fuzzy Group by design is very resource intensive. However, there are a few general things you can do to ease up on the problem of performance.

    1.) Make sure SQL Server has access to enough memory. Often you will find it's limited to only a chunk of memory. If SQL is the only thing running on the server can open up most of the system RAM to SQL and leave say 512 to the Operating System. Also make SQL Server has access to all available processors. This can be adjusted through Management Studio under Server Properties.

    2.) Limit the size of the data. This can make a tremendous difference. I am doing a Fuzzy grouping where staff submit lists of prospects that I run against our custom CRM system to identify which prospects are already clients. We have a huge client list, but when I filter that list down to only active clients the list becomes much more workable. Went from taking +20 minutes down to 2-3 minutes.

    3.) Avoid fuzzy grouping on large strings if possible. The more text it has to comb thru the longer it will take. Also consider whether your need to match against so many columns. By sampling you can see if the results from individual columns are really providing useful matches. If not then the column could possibly be eliminated.

    4.) Sample your results. Fuzzy grouping isn't an exact sience. I often found that when it found matches either they were very very close like 80% or better, or no where near. So setting your matching threshold higher can increase performance as well. This option will very greatly depending on your current situation and data requirements.

    5.) Then there are the common sense items. Make sure your hardware is up to the task, and that you don't have other things running that could result in resource contention. Always remember that Fuzzy Datasets are stored in memory while they are being processed, if the server is fighting for memory it will end up paging and will take a tremendous performance hit.

    Hope this was helpful...Happy Holidays,

    Eric

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply