Data Cleansing With SSIS

  • i have converted many text,acess,acess,oracle data to sql now i have to perform data cleansing.

    i ,m new to SSIS and i don`t know how to use SSIS to perform my data cleansing before ETL. i have used some queries in my SQL to check dirty data,

    for gender

    SELECT [Gender]

    FROM [Isb_practice].[dbo].[BS_Students]

    Where (Gender <>'0') AND ( Gender <>'1')

    i found 8 values with (00,11,01,10,001,M) .

    similarly, for date of birth

    SELECT *

    FROM Student

    WHERE ([Date of Birth] NOT LIKE '%_-Jan-__')

    AND ([Date of Birth] NOT LIKE '%_-Feb-__')

    AND ([Date of Birth] NOT LIKE '%_-Mar-__')

    AND ([Date of Birth] NOT LIKE '%_-Apr-__')

    AND ([Date of Birth] NOT LIKE '%_-May-__')

    AND ([Date of Birth] NOT LIKE '%_-Jun-__')

    AND ([Date of Birth] NOT LIKE '%_-Jul-__')

    AND ([Date of Birth] NOT LIKE '%_-Aug-__')

    AND ([Date of Birth] NOT LIKE '%_-Sep-__')

    AND ([Date of Birth] NOT LIKE '%_-Oct-__')

    AND ([Date of Birth] NOT LIKE '%_-Nov-__')

    AND ([Date of Birth] NOT LIKE '%_-Dec-__')

    AND ([Date of Birth] <> '') AND ([Date of Birth] IS NOT NULL)

    The output of the above query shows 3 invalid dates. 22-Jal-75,1/27/75,27-Apl-77.

    SELECT * FROM Student

    WHERE ([Date of Birth] LIKE '29-Feb-%') OR

    ([Date of Birth] LIKE '3_-%')

    The output of the above query shows 9 invalid dates. 29-Feb-75,31-jun-89 etc.

    WHAT I WANT????

    1)first SSIS read data.

    2)if don`t find 0 or 1 in gender send erronous data(00,11,01,10,001,M) to "error table".

    3)in "error table", dirty data (00,11,01,10,001,M) should be converted in 0,1

    4)after standarization, error table data records should go to original table at its place.

    5)Now i will change 0 to male and 1 to female.

    6)now i will send complete table of student(stid,name,father,gender,adress,last degree,reg data,dob) to a new table STD_INFO.

    7) i will repeat same 1-6 steps for all databases(lahore,karachi,islamabad,peshawar) and then i will integrate STD_INFO tables of all databases into one.

    i hope now you can understand what i have to do. Similar case with DOB first dirty data to erronous table after conversion there to original table and then i will check all records at the same time.

  • My suggestion to you is create two package instead of single package trying to do everything in one package

    two packages are

    1) First package will identify the error data and move error data(error data means not just a column but entire row) to error table and process the correct data

    2) Second package will do the data cleanising and call the first package for processing the correct data(u can put a condtion to call the first package when error table has more than one row )

  • mhassanshahbaz (11/2/2010)


    3)in "error table", dirty data (00,11,01,10,001,M) should be converted in 0,1

    How do you know the true value of your dirty data? For example, how do you know if "01" is a 0 or a 1? Do you have business requirements that tell you the true value of the erroneous records? I would certainly be sure before you start altering the data and sending it to a table with good records.

    I agree with the poster above that this should be split out into two separate packages.

    If you have a lot of possible "dirty values" for a column, I would suggest using a decode table to cleanse your data as opposed to a derived column component. You would need to create a table that has a column that contains the column name, a column with your "dirty data" values", and a column with the corresponding "clean" values. The column labeled "Column Name" will hold values such as BirthDate, Gender, and any other columns that you need to check for "dirty data." Then do a lookup specifying the column you want to decode in your where clause and if the data in your data flow matches your dirty data values, then return the cleansed value. Updating a decode table will be a lot easier from a maintanence perspective as opposed to always going into your package to add "dirty" values. Also, the derived column component can be a nightmare to go through and keep track of what values you are decoding - not the case with a decode table.

  • How do you know the true value of your dirty data? For example, how do you know if "01" is a 0 or a 1?

    i checked data in Sql server,n found 8 dirty values only,as far as your answer is concerned,yeah there is no exact info about "01" is a 0 or a 1,i have to do it manually by names.

    I would suggest using a decode table to cleanse your data as opposed to a derived column component.

    what is decode table in SSIS plzz elaborate???

  • the decode table is supposed to refer to a "parameterlist table" where you correlate certain values to keys or other values eg.

    ParameterName Code Key Value

    Gender U 1 Unassigned

    Gender M 2 Male

    Gender F 3 Female

    Gender G 4 Group

    MaritalStatus U 1 Unknown

    MaritalStatus M 2 Married

    MaritalStatus S 3 Single

    .

    .

    you can use a lookup component on a parameterlist table based on parameter name and codes instead of setting multiple combinations in a derived column to get the desired value and you would still have your data grouped, organized and readily available.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply