Data cleansing and Data filtering

  • What is data cleansing and data filtering, and how can we do data cleansing and data filtering from a SQL Server 2005 database to a Maximo database?

  • ...or to put it more succinctly "how long is a piece of string?"


    Sorry I couldn't resist!

    I'm no forum expert, but surely you can't expect a decent answer if you post a question like that? Data cleansing/filtering, or basically ETL (extract, transform, load), is not a fixed set of procedures or rules. It depends on the data, on the source database (application), and on the destination database (application).

  • What is data cleansing?

    What is data filtering?

    How can we do data cleansing and data filtering from the SQL Server 2005 database to the Maximo database?

    Set up a connection to the SQL Server 2005 database (simple and easy; see connection managers in SSIS) and Extract the data. Then process the data the way you want; SSIS has lots of tools and processes for that. That is Transforming the data. Then set up a connection to Maximo. I have seen a post doing this with an MS Access DB, so an OLE DB connection will probably let you write the data out to the tables in Maximo: Loading the data.
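    The three steps above can also be sketched in plain T-SQL, outside SSIS. This is only an illustration, not a working script: it assumes a linked server named MAXIMO has already been configured, and every object and column name here is hypothetical.

    -- Extract: pull rows from the SQL 2005 source into a working table
    SELECT SourceID, RawValue
    INTO #Working
    FROM SourceDb.dbo.SourceTable

    -- Transform: clean up the data, e.g. trim stray whitespace
    UPDATE #Working
    SET RawValue = LTRIM(RTRIM(RawValue))

    -- Load: write the cleaned rows to the target over the linked server
    INSERT INTO MAXIMO.MaximoDb.dbo.TargetTable (SourceID, CleanValue)
    SELECT SourceID, RawValue
    FROM #Working

    In practice SSIS does the same thing with data flow components instead of temp tables, and handles batching and error rows for you.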

    SSIS is an ETL (Extract, Transform, Load) tool.

    How you do these steps depends on the databases, business rules, data, etc. that you are working with. Start playing with SSIS, then Google and ask questions.


  • The need to cleanse data comes from accepting dirty data in the first place. It happens, and there is apparently nothing we can do about it.

    There are two cases of data loading:

    1. Raw incoming data is loaded into an independent, isolated table.
    2. Raw data is loaded into a set of tables in a more or less properly designed relational database.

    Case 1), the independent table, may be the case when loading Excel, XML, or similar data. If you know, or can infer, what makes the data dirty, do your best to follow the advice from the previous posts. Over time you will get a better idea of what could be wrong with the data. Mind you, it is impossible to cleanse the data 100% this way, but with time and experience you will get better and come closer to 100%.

    Case 2) should be easier, apparently, because a properly designed relational database will have constraints, and nonconforming data will simply be rejected at the gate. You don't want to keep trying to load the data and each time get a message about a violation of yet another constraint. Often the incoming data is not consistent with the column data types. For each column in the incoming data, check whether the values match the expected type:

    SELECT individual_column -- expected to be numeric
    FROM IncomingRawData
    WHERE ISNUMERIC(individual_column) = 0


    So, to avoid the system telling you that something is wrong, you can find the definitions of all constraints of interest (those that affect the target table(s)) and simply run a SELECT query for each constraint.

    For the constraint CHECK (StartDate <= EndDate), something like this would do:

    SELECT SomeColumns
    FROM IncomingRawData
    WHERE NOT (StartDate <= EndDate)
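    If several constraints need checking, it can save round trips to run them in one pass and label each row with the reason it failed. A sketch, assuming a hypothetical staging table with StartDate and EndDate columns plus an Amount column that should be numeric:

    SELECT SomeColumns,
           CASE
               WHEN NOT (StartDate <= EndDate) THEN 'CHECK (StartDate <= EndDate) violated'
               WHEN ISNUMERIC(Amount) = 0      THEN 'Amount is not numeric'
           END AS FailureReason
    FROM IncomingRawData
    WHERE NOT (StartDate <= EndDate)
       OR ISNUMERIC(Amount) = 0

    The FailureReason column makes it easy to send the rejected rows back to the sender with an explanation.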

    Once you find all the dirty data, you need a decision: "fix it" or "throw it in the trash". "Fix it" is OK if the errors are obvious. If not, leave the data alone and send a nasty letter to the data sender. "Throw it in the trash" may make sense, but it can make loading impossible due to foreign key constraints.

    The best approach is none of the above. Find out in advance what the incoming data should look like and design a table: column names, data types, data sizes, NULLability (sigh! there should be none, but nothing we can do about that), and lots of constraints; the more the merrier. Then require the sender to provide the data with your design in mind. If possible, see whether the data on their end can live in the same type of database as your target, so you can pass them the constraints in advance. They should make sure the data is clean (follows the data types, nullability, and constraints) before it is sent.
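    As an illustration of "lots of constraints", a staging table for such an agreement might look like this. All names, types, and rules here are made up; note that SQL Server 2005 has no date type, so datetime is used:

    CREATE TABLE IncomingStaging (
        RecordID   int           NOT NULL PRIMARY KEY,
        StartDate  datetime      NOT NULL,
        EndDate    datetime      NOT NULL,
        Amount     decimal(10,2) NOT NULL CHECK (Amount >= 0),
        StatusCode char(1)       NOT NULL CHECK (StatusCode IN ('A', 'I')),
        CHECK (StartDate <= EndDate)
    )

    Anything the sender produces that violates these rules is rejected before it ever reaches the real tables, which is exactly the point.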

    Of course, you could be getting data from PDF files or screenshots of web pages. If that is the case, the entire operation should be questioned.

    You know it already: garbage in, garbage out. A DBA or a developer should not be the only person dealing with dirty incoming data.

    That is all, good luck


    Zidar's Theorem: The best code is no code at all...
