Refactoring ERP loading to data warehouse...

  • I have inherited (isn't that always the case) a data warehouse whose sole source of data is a legacy ERP.

    The ERP has no 'last modified' timestamps in its data, so pulling data is only practical for the few tables that happen to have useful dates. I surmise that, as a result, my predecessor (who was obviously a programmer) built a push mechanism (from ERP to DW): a central collection procedure in the ERP captures all 'writes' to the ERP tables and pushes them in near-real-time to the DW. That doesn't sound so horrible, except that on the DW side a central stored proc loops over this 'incoming data' (in a table) row by row, looking for the oldest unprocessed record. This 'incoming table' contains all the pushed data (as fixed-width strings in a varchar(max) field) from the ERP, across all (> 200) data types. Once the proc finds the next record to process, it looks up what type of data it is (based on a datatype dimension), finds the associated stored proc, and then builds a dynamic SQL string to execute that proc for the specific datatype. This is a very simplistic explanation but hopefully demonstrates the point.

    The 'incoming table' data looks something like this:

    sequence, datatype, data, processed
    -----------------------------------------
    1, datatype1, fixed-width data string, true
    2, datatype1, fixed-width data string, true
    3, datatype2, fixed-width data string, true
    4, datatype2, fixed-width data string, true
    5, datatype1, fixed-width data string, false
    6, datatype3, fixed-width data string, false
    7, datatype2, fixed-width data string, false
    8, datatype3, fixed-width data string, false
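
    Stripped down, the DW-side dispatcher behaves roughly like this (a heavily simplified sketch; the table, dimension, and proc names are invented, not the real ones):

    DECLARE @seq int, @datatype varchar(50), @data varchar(max), @sql nvarchar(max);

    WHILE EXISTS (SELECT 1 FROM dbo.Incoming WHERE processed = 0)
    BEGIN
        -- find the single oldest unprocessed record
        SELECT TOP (1) @seq = [sequence], @datatype = [datatype], @data = [data]
        FROM dbo.Incoming
        WHERE processed = 0
        ORDER BY [sequence];

        -- look up the proc registered for this datatype and build a dynamic call for one row
        SELECT @sql = N'EXEC ' + proc_name + N' @data'
        FROM dbo.DatatypeDim
        WHERE [datatype] = @datatype;

        EXEC sp_executesql @sql, N'@data varchar(max)', @data = @data;

        UPDATE dbo.Incoming SET processed = 1 WHERE [sequence] = @seq;
    END;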

    Problems (opportunities) I see:

    1) no timestamps in ERP limits my opportunities to 'pull'

    2) row-by-row ignores the fact that it's in a SQL database

    3) one stored proc per fixed-width datatype has me a bit cornered

    4) dynamic sp execution can't be good for query plans, right?

    So, any ideas, without affecting the source system, on a better way to deal with many types of data in the same 'incoming table' such that I can get rid of the row-by-row and take advantage of set-based processing?

    My first thought was to assign a priority to each datatype and make the scope of the loop 'priority/datatype' instead of row. So: loop over the datatypes in priority order, and within each iteration select the top X unprocessed rows for that datatype, ordered by creation sequence, into a global temp table (or table variable?). Then invoke the stored proc for that datatype (modified to work on the set in the temp table). Loop through each datatype and then repeat.

    This idea doesn't get rid of loops completely, but at least I would get the advantage of working on large sets of tunable size. It does mean a lot of sps to modify, though.
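
    In code I'm picturing something along these lines (the table, dimension, and proc names are just placeholders; I've used a local temp table because it is visible inside the called proc, though a global temp table would work too):

    DECLARE @datatype varchar(50), @proc sysname, @batchsize int = 10000;  -- tunable batch size

    DECLARE dt CURSOR LOCAL FAST_FORWARD FOR
        SELECT [datatype], proc_name
        FROM dbo.DatatypeDim            -- assumed priority/datatype dimension
        ORDER BY [priority];

    OPEN dt;
    FETCH NEXT FROM dt INTO @datatype, @proc;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        IF OBJECT_ID('tempdb..#Batch') IS NOT NULL DROP TABLE #Batch;

        -- pull the oldest X unprocessed rows for this datatype as one set
        SELECT TOP (@batchsize) [sequence], [data]
        INTO #Batch
        FROM dbo.Incoming
        WHERE [datatype] = @datatype
          AND processed = 0
        ORDER BY [sequence];

        IF EXISTS (SELECT 1 FROM #Batch)
        BEGIN
            EXEC @proc;                 -- per-datatype proc, rewritten to consume the set in #Batch

            UPDATE i
            SET processed = 1
            FROM dbo.Incoming AS i
            JOIN #Batch AS b ON b.[sequence] = i.[sequence];
        END;

        FETCH NEXT FROM dt INTO @datatype, @proc;
    END;

    CLOSE dt;
    DEALLOCATE dt;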

    Can anyone shoot holes in this or provide any other ideas?

  • It looks like a simple message brokering pattern.

    For me, the first thing I would look to do is refactor the messages to XML. This will increase the size of each message, but it has two advantages. Firstly, the data becomes self-describing (i.e. you can see by looking at the message what the data represents). Secondly, XML allows you to collect all of the related data together into a single message. For example, assume datatype 1 is an order header, datatype 2 is the order lines, and datatype 3 is the payment data.

    Currently you have to process the datatype 1 record first or the relationship between the lines and the header does not exist. You then have to process each line separately; if one line fails, what do you do with the other lines and the header? And if you try to process the lines before the header, do you fail them or assume late-arriving data?

    If the data is in XML, the whole order comes through as a single message:

    <ORDER>
      <ID>12356</ID>
      <CUSTOMER>XYZ123</CUSTOMER>
      <ORDERLINES>
        <ORDERLINE>
          <LINEID>1</LINEID>
          <ITEMID>FFGG55</ITEMID>
          <QTY>3</QTY>
          <PRICE>9.99</PRICE>
        </ORDERLINE>
        <ORDERLINE>
          <LINEID>2</LINEID>
          <ITEMID>WIDGET7</ITEMID>
          <QTY>1</QTY>
          <PRICE>100.00</PRICE>
        </ORDERLINE>
      </ORDERLINES>
      <PAYMENTS>
        <PAYMENT>
          <METHOD>AMEX</METHOD>
          <AMOUNT>109.99</AMOUNT>
          <AUTHCODE>123765</AUTHCODE>
        </PAYMENT>
      </PAYMENTS>
    </ORDER>

    Now you can validate the message against a DTD or XSD schema and process the entire order as a single transaction, with a commit or rollback if there is a fatal error.
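
    On the DW side that shredding can then be done set-based in one shot. A rough sketch only (the staging table, target tables, and data types are assumptions to adjust; THROW needs SQL Server 2012+):

    DECLARE @msg xml;

    SELECT TOP (1) @msg = [message]
    FROM dbo.IncomingXml                -- assumed: pushed messages now land here as xml
    WHERE processed = 0
    ORDER BY [sequence];

    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO dbo.OrderHeader (OrderID, CustomerID)
        SELECT o.value('(ID)[1]',       'int'),
               o.value('(CUSTOMER)[1]', 'varchar(20)')
        FROM @msg.nodes('/ORDER') AS h(o);

        INSERT INTO dbo.OrderLine (OrderID, LineID, ItemID, Qty, Price)
        SELECT @msg.value('(/ORDER/ID)[1]', 'int'),
               l.value('(LINEID)[1]', 'int'),
               l.value('(ITEMID)[1]', 'varchar(20)'),
               l.value('(QTY)[1]',    'int'),
               l.value('(PRICE)[1]',  'decimal(10,2)')
        FROM @msg.nodes('/ORDER/ORDERLINES/ORDERLINE') AS t(l);

        INSERT INTO dbo.OrderPayment (OrderID, Method, Amount, AuthCode)
        SELECT @msg.value('(/ORDER/ID)[1]', 'int'),
               p.value('(METHOD)[1]',   'varchar(10)'),
               p.value('(AMOUNT)[1]',   'decimal(10,2)'),
               p.value('(AUTHCODE)[1]', 'varchar(10)')
        FROM @msg.nodes('/ORDER/PAYMENTS/PAYMENT') AS t(p);

        COMMIT TRANSACTION;             -- the whole order succeeds together
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW;                          -- nothing is half-loaded; log or re-queue the message
    END CATCH;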

    Failing that, each SP should have static code and receive the data as parameters. If the SSIS package is not already deciding which SP to call, then I would make that decision in SSIS rather than have a master sproc decide which child sproc or INSERT routine is going to run. This is a rare case where I would try to put it into SSIS even though it could be done by the database engine: you get better visibility and logging of what conditions are being applied. If there is no dependency between the messages, and therefore the order of processing is not critical, you would even get performance increases from parallel processing of records by putting the procedure calls on the outputs of a Conditional Split transformation.
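
    To illustrate the 'static code with parameters' point: each child routine becomes an ordinary parameterized proc (the name and columns below are hypothetical), so the engine can cache and reuse one plan instead of compiling a dynamic string for every record:

    CREATE PROCEDURE dbo.usp_Load_OrderHeader     -- hypothetical per-datatype proc
        @OrderID    int,
        @CustomerID varchar(20)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- static, parameterized statement: one cached plan, no per-row string building
        INSERT INTO dbo.OrderHeader (OrderID, CustomerID)
        VALUES (@OrderID, @CustomerID);
    END;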
