Organizing and cleaning up AS400 DB2 data in SQL Server

  • I've been dealing with data migration out of an ancient AS400, piece by piece, through a linked server connection, and I'm having a hard time figuring out the best way to structure everything in a more or less decent way.  I'm joining multiple tables into views to get the piece of data I'm looking for and then creating tables out of those views.

    I have tables that hold information about different parties, like attorney, commissioner, claimant (id, names, contact info, etc.), and those tables also contain the addresses of the corresponding party. What I would like to do is have only one Address table and collect all of the addresses there, but I'm having a hard time keeping up with referential integrity since each table has over 1.5 million records.

    What would be the right way to achieve that?

    Claimant(Id, ClaimNo, First/LastName, Phone, email, Street, City, State, Zip)

    Attorney(AttCode, First/LastName, FirmId, Phone, email, Street, City, State, Zip)

    Insurer(InsCode, Name, PolicyNo, Phone, email, Street, City, State, Zip) etc
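
    Roughly, what I'm aiming for is a single shared Address table that the party tables reference, something like this (column names and types are just a sketch):

    CREATE TABLE dbo.[Address]
    (
        AddressId int IDENTITY(1,1) PRIMARY KEY,
        Street    varchar(100),
        City      varchar(60),
        State     char(2),
        Zip       varchar(10)
    );

    CREATE TABLE dbo.Claimant
    (
        Id        int PRIMARY KEY,
        ClaimNo   varchar(20),
        FirstName varchar(50),
        LastName  varchar(50),
        Phone     varchar(20),
        Email     varchar(100),
        AddressId int NOT NULL REFERENCES dbo.[Address](AddressId)
    );

    -- Attorney and Insurer would get the same AddressId foreign key
    -- instead of carrying their own Street/City/State/Zip columns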


  • I recommend getting the data out of the AS400 data source directly into staging tables in SQL Server first.  If there is any question about the data typing coming from the AS400, go with all character-based columns in the staging tables (e.g. varchar, nvarchar, etc.).  You can use the MERGE statement for the pull if you will be repeating it over time, or just do a "truncate and load" if you only need to bring everything over once.

    NOTE:  Be sure you do NOT apply any clustered indexes, and preferably no indexes at all, on the staging tables while they are being loaded.  You can apply indexes after they are loaded for the next step if desired/necessary.
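
    A minimal sketch of that kind of staging load, assuming a linked server named AS400SRV and a source library/file called CLMLIB.CLAIMANT (both names, and the column names, are placeholders for whatever you actually have):

    -- Staging table: all character columns, no indexes (a plain heap)
    CREATE TABLE stg.Claimant_Raw
    (
        Id      varchar(20),
        ClaimNo varchar(20),
        FirstNm varchar(50),
        LastNm  varchar(50),
        Phone   varchar(30),
        Email   varchar(100),
        Street  varchar(100),
        City    varchar(60),
        State   varchar(10),
        Zip     varchar(15)
    );

    -- "Truncate and load" pull through the linked server
    TRUNCATE TABLE stg.Claimant_Raw;

    INSERT INTO stg.Claimant_Raw
    SELECT *
    FROM OPENQUERY(AS400SRV,
        'SELECT ID, CLAIMNO, FIRSTNM, LASTNM, PHONE, EMAIL, STREET, CITY, STATE, ZIP
         FROM CLMLIB.CLAIMANT');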

    Once you have your raw data in SQL Server, there are a number of options depending on how you will associate your entity records to the address records.  For the address records themselves, using MERGE into the shared address table will allow you to insert new records and get the ID, or retrieve the existing address ID if a match is found.  You will have to determine the address record matching criteria.
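
    As a rough sketch of that pattern, assuming an Address table with an identity key and using Street/City/State/Zip as the match criteria (which you may well want to make stricter):

    -- 1) Add any addresses that are not already in the shared table
    MERGE dbo.[Address] AS tgt
    USING (SELECT DISTINCT Street, City, State, Zip
           FROM   stg.Claimant_Raw) AS src
       ON  tgt.Street = src.Street
       AND tgt.City   = src.City
       AND tgt.State  = src.State
       AND tgt.Zip    = src.Zip
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Street, City, State, Zip)
        VALUES (src.Street, src.City, src.State, src.Zip);

    -- 2) Load the entity table, resolving AddressId by joining back
    --    on the same match criteria (new or pre-existing, it is there now)
    INSERT INTO dbo.Claimant (Id, ClaimNo, FirstName, LastName, Phone, Email, AddressId)
    SELECT CAST(s.Id AS int), s.ClaimNo, s.FirstNm, s.LastNm, s.Phone, s.Email, a.AddressId
    FROM   stg.Claimant_Raw AS s
    JOIN   dbo.[Address]    AS a
      ON   a.Street = s.Street
      AND  a.City   = s.City
      AND  a.State  = s.State
      AND  a.Zip    = s.Zip;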

    I recommend doing one entity at a time.  For example, Claimant.  Then do each additional one.  Once you figure out the address data MERGE statement format, it will be easy to replicate for the others.  Wrap them all in a stored procedure with input parameters for things like verbose logging, single-entity loading, etc. for a robust and repeatable solution.
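
    Something along these lines, for example (the procedure and parameter names are only illustrative):

    CREATE PROCEDURE dbo.usp_MigrateLegacyEntity
        @EntityName sysname,      -- e.g. 'Claimant', 'Attorney', 'Insurer'
        @Verbose    bit = 0
    AS
    BEGIN
        SET NOCOUNT ON;

        IF @EntityName = N'Claimant'
        BEGIN
            -- address MERGE + Claimant load from the sketch above goes here
            IF @Verbose = 1 PRINT 'Claimant migrated';
        END;

        -- ...one block per entity, or drive the whole thing from a control table...
    END;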

    J Pratt

  • Thank you for a detailed answer, I appreciate it

  • When cleaning data, one of the first things to do is to have a 'model' in your head of how the data works: a logical model of which data is present and how it relates.

    A suggestion for an implementation of this type could be:

    Have a table with 'objects'; these can be persons, institutions, etc. Each object has a row in this table.

    A second table would be an identification table, where there is at least one identification of each object, but an object can have multiple identifications. For example, some people are identified by their social security number, others are identified by their employee number, and maybe some objects are identified by an ISBN number. Each row should have at least an identification value, an identification_type (for example social_security, or drivers_number), and the key of the objects table.

    Now you can have multiple identifications for a multitude of objects. In general, within the identification_type you can also specify the object type, but this is not necessary.

    Then you can have plenty of tables which contain information from or about the 'objects'. So a fireman can have specific information about his capabilities where an Attorney does not have capabilities 🙂 . And a book can have other properties.

    One set of properties can be an address. Some objects or institutions can have multiple addresses, like a correspondence address and/or a visiting address.

    Relations:

    The 'Object' table is fairly central and has a totally artificial (surrogate) key, which is NOT used outside the database. The identification table uses this key to identify the object, and the address table or any other property table uses this key to identify the object as well.

    Software should first use the identification table to get this artificial key and then use this artificial key for all the other tables which have something to do with the object. So even if different identifications (ISBN/social security etc.) are used, the first step is to get the internal identification and then work with that.
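
    A minimal DDL sketch of that kind of model (table and column names are just placeholders):

    CREATE TABLE dbo.Objects
    (
        ObjectId   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, never exposed outside the database
        ObjectType varchar(30) NOT NULL            -- 'Person', 'Institution', 'Book', ...
    );

    CREATE TABLE dbo.Identification
    (
        Identification     varchar(50) NOT NULL,   -- the external value, e.g. an SSN or ISBN
        IdentificationType varchar(30) NOT NULL,   -- 'social_security', 'drivers_number', 'isbn', ...
        ObjectId           int NOT NULL REFERENCES dbo.Objects(ObjectId),
        PRIMARY KEY (IdentificationType, Identification)
    );

    CREATE TABLE dbo.ObjectAddress
    (
        AddressId   int IDENTITY(1,1) PRIMARY KEY,
        ObjectId    int NOT NULL REFERENCES dbo.Objects(ObjectId),
        AddressType varchar(30) NOT NULL,          -- 'correspondence', 'visiting', ...
        Street      varchar(100),
        City        varchar(60),
        State       varchar(10),
        Zip         varchar(15)
    );

    Looking something up then always goes through the identification table first, for example:

    SELECT a.*
    FROM   dbo.Identification AS i
    JOIN   dbo.ObjectAddress  AS a ON a.ObjectId = i.ObjectId
    WHERE  i.IdentificationType = 'social_security'
    AND    i.Identification     = '123-45-6789';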

    Often an object can be only of one kind, so it is a book or it is a person. In your model you can incorporate that. But it is also possible that a person has different roles. In that case the object can be the person, or you could choose to have the object represent one role and have a model where the roles are the objects (and not the person).

    Once you have got this in place, cleaning can start. Often some objects have multiple entries with different identifications; this can be 'resolved' and cleaned by keeping the identifications but merging the two objects into one in all the other tables.
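
    For example, merging a duplicate object could look roughly like this (assuming @KeepId and @DupId hold the two surrogate keys involved):

    -- Re-point everything that hangs off the duplicate to the object we keep
    UPDATE dbo.Identification SET ObjectId = @KeepId WHERE ObjectId = @DupId;
    UPDATE dbo.ObjectAddress  SET ObjectId = @KeepId WHERE ObjectId = @DupId;
    -- ...repeat for every other property table...

    -- Then the duplicate object row itself can go
    DELETE dbo.Objects WHERE ObjectId = @DupId;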

    If you want to have the constraint that a person cannot be a book, it is possible to create a constraint for that. (It goes a bit far to go into super- and subtype constraints here, because this message would become a bit longer.)

    If you have any questions, it might be that we (or I) can answer them. The important thing is to think first of the model (in your head) and after that think of the implementation in a database. Trying to solve problems without the model (in your head) will often lead to more complex and less appropriate solutions.

    Ben

  • A Claimant can have multiple claims.

    An Attorney can represent multiple clients.

    And an Insurer can have ties to both.

    Each address might be only one type, but there could be overlap. For example, Attorney A insures his business with Insurer B and also has a claim.

    Think in terms of normalizing the data and the end goal. You have supplied one small part of what appears to be a much larger picture. The work put in here will be work well spent down the road. I assume that some of the relationships between the data exist today, and you want not only the addresses, but to maintain these relationships as well.
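
    The kind of relationship tables being described might look something like this (purely illustrative names and types, keyed off the columns in your first post):

    CREATE TABLE dbo.Claim
    (
        ClaimNo    varchar(20) PRIMARY KEY,
        ClaimantId int NOT NULL REFERENCES dbo.Claimant(Id),            -- a claimant can have many claims
        InsCode    varchar(20) REFERENCES dbo.Insurer(InsCode)
    );

    CREATE TABLE dbo.ClaimAttorney
    (
        ClaimNo varchar(20) NOT NULL REFERENCES dbo.Claim(ClaimNo),
        AttCode varchar(20) NOT NULL REFERENCES dbo.Attorney(AttCode),  -- an attorney can represent many clients
        PRIMARY KEY (ClaimNo, AttCode)
    );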

    This is just a different way of stating Ben's thoughts. There still might be some other data clean-up issues - names and addresses change over time, which you may have effective dates in the core system to deal with.

  • Thank you, that's definitely useful information.

    As of right now, my goal is to have the legacy claim data in a datastore separate from the functioning database.  The software that we are using for the new system is Bizagi, and it has tons of limitations in terms of touching the database.  So when someone needs to access older claim data, Bizagi is supposed to get it from the datastore through an API.

    So what I'm doing now is collecting all the pieces of claim data and case parties, getting rid of duplicates, and transforming everything into a more acceptable, structured form.
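
    For getting rid of the duplicates, the usual ROW_NUMBER() pattern is one way to do it, for example (the table and partition columns are only illustrative):

    ;WITH Ranked AS
    (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY ClaimNo, LastNm, FirstNm
                                  ORDER BY Id) AS rn
        FROM   stg.Claimant_Raw
    )
    DELETE FROM Ranked
    WHERE  rn > 1;   -- keep one row per claimant/claim, drop the rest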

  • Normalizing data takes care of duplication.

    Sometimes history can be modeled so it integrates into production, sometimes not. The big thing is getting a clean cutoff. At times, although it would require more data entry, we ran in parallel for a week. Even good testing sometimes would miss something important.

    Hope all goes well - sounds like you now have what you need for a smooth transition.
