Self-Healing ETL

  • Comments posted to this topic are about the item Self-Healing ETL

Back in the days of DTS it was possible to write a DTS package that altered itself dynamically at run time.  We tended to use this for things that are now standard components in SSIS, but still, the ability to adjust to situations at run time is a great boon.

I have to ingest JSON documents into our data warehouse, and one of the challenges is detecting when a new key:value pair has appeared in the files supplied.  Technically, everything I need to do is possible; the sad bit is that I am building a technical solution to cope with a lack of communication.
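The detection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's actual pipeline: the `KNOWN_KEYS` set and the sample document are hypothetical stand-ins for whatever schema registry the warehouse load already maintains.

```python
import json

# Hypothetical known schema: the flattened key paths the load already handles.
KNOWN_KEYS = {"customer.id", "customer.name", "order.total"}

def flatten_keys(doc, prefix=""):
    """Yield dotted key paths for every key in a (possibly nested) JSON object."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten_keys(value, prefix=path + ".")
        else:
            yield path

def new_keys(raw_json):
    """Return key paths present in the supplied document but absent from the known schema."""
    doc = json.loads(raw_json)
    return set(flatten_keys(doc)) - KNOWN_KEYS

sample = '{"customer": {"id": 1, "name": "A", "loyalty_tier": "gold"}, "order": {"total": 9.5}}'
print(sorted(new_keys(sample)))  # ['customer.loyalty_tier']
```

Anything the function returns can be logged or quarantined rather than silently dropped, which at least turns the communication gap into an alert.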

Another useful facility is the ability to detect column order in files, and even to have a thesaurus for column names.  Third parties supplying data may be sending Excel spreadsheets as a stop-gap until a formal solution is put in place.  The thesaurus facility allows for CustomerNumber, CustomerNo, Customer_Number etc.  The column detector allows for the CustomerNumber field (or any other, for that matter) appearing in the first column in one supply and in some other column in subsequent supplies.
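Both facilities can be combined by mapping the incoming header row to canonical names before reading any data, so column position stops mattering. A minimal sketch, assuming a hand-maintained thesaurus (the alias sets below are invented for illustration):

```python
# Hypothetical thesaurus: canonical column name -> normalised known aliases.
THESAURUS = {
    "CustomerNumber": {"customernumber", "customerno", "custno"},
    "OrderDate": {"orderdate", "dateoforder"},
}

def canonicalise(header):
    """Map an incoming header row to canonical names, wherever each column sits.

    Unrecognised columns come back as None so they can be flagged for review.
    """
    def canonical(name):
        # Normalise away case, spaces and underscores before the lookup.
        key = name.strip().lower().replace("_", "").replace(" ", "")
        for canon, aliases in THESAURUS.items():
            if key in aliases:
                return canon
        return None
    return [canonical(col) for col in header]

print(canonicalise(["Order_Date", "Customer_Number"]))
# ['OrderDate', 'CustomerNumber'] -- regardless of which position each column arrived in
```

Because the lookup runs on the header rather than on ordinal position, the same code copes with CustomerNumber arriving first in one supply and last in the next.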

  • I think this would be an excellent use of AI.

I suspect a lot of people here have worked up their own tools to speed up the tedious task of matching things like columns, or normalizing up or down, to assist in writing the necessary updates and appends. I am not completely familiar with the products available for this, mainly because it is an infrequent task for me and it seemed as easy to write my own functions as to buy an actual product.

    For me large scale ETL is still a somewhat rare request and any AI data manipulation automation is being directed at daily tasks at present.

    cloudydatablog.net

This is why I switched primarily to Python for all of my ETL, so we can do things like this on top of easily doing dynamic pivots and headers. At this stage, though, I have been moving away from headers altogether.

I am not sure if this applies to the subject matter, but I have found through experience that moving data across set boundaries (customer systems to supplier systems) can be accomplished using existing tools such as regular expressions. In my case I have to decipher information provided by the customer and convert it to data that the supplier systems can identify. In most cases it requires working with the engineers who have to use the data for existing processes. Terminology is the biggest hurdle to overcome in these situations. However, if a normality can be identified in the format of the data when presented as a string, regular expressions can be used to identify potential gaps in the deciphering process, which can then be presented to the engineers for investigation and resolution. Sometimes issues can be resolved on the code-development side; other times collaboration with the customer systems engineers is needed. These situations usually arise when process changes occur (either on the customer's systems or the supplier's systems) and the normality of the formatted data has been affected. Not every data system is designed the same, obviously, but the intended purpose of the data must exist.
    For example: the intended purpose of the data is to define an assembly process for an axle. The data must express a part, a quantity and a description.  So if my description contains "WHEEL", then we can say a normality is that every axle gets a wheel, and so on. Is this self-healing? Most likely not, but preventing the need to self-heal is, to me, the more logical choice.
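The "normality" idea above can be sketched with a single pattern: lines that match are parsed, lines that do not are set aside as gaps for the engineers. The part-number and quantity formats below are invented examples, not the poster's actual data.

```python
import re

# Hypothetical normality: every axle line carries a part, a quantity and a
# description, e.g. "AX-1001  QTY:2  WHEEL ASSEMBLY LH".
LINE_PATTERN = re.compile(r"^(?P<part>[A-Z]{2}-\d{4})\s+QTY:(?P<qty>\d+)\s+(?P<desc>.+)$")

def check_lines(lines):
    """Split supplied lines into parsed rows and gaps needing engineer review."""
    parsed, gaps = [], []
    for line in lines:
        m = LINE_PATTERN.match(line.strip())
        if m:
            parsed.append(m.groupdict())
        else:
            gaps.append(line)  # deviation from the normality: hand to the engineers
    return parsed, gaps

rows, gaps = check_lines(["AX-1001  QTY:2  WHEEL ASSEMBLY LH", "mystery record"])
print(len(rows), gaps)  # 1 ['mystery record']
```

When a process change upstream alters the format, the gaps list grows immediately, which surfaces the break instead of letting bad data flow through.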

  • xsevensinzx - Thursday, October 11, 2018 5:37 AM

    This is why I switched primarily to Python for all of my ETL, so we can do things like this on top of easily doing dynamic pivots and headers. At this stage, though, I have been moving away from headers altogether.

    What do you use to co-ordinate the transformations?  Apache Airflow or similar?

  • Dalkeith - Thursday, October 11, 2018 1:43 AM

    I think this would be an excellent use of AI.

    I suspect a lot in here have worked up their own tools to speed up the tedious task of matching things like columns or normalizing up or down to assist in writing necessary updates and appends. I am not completely familiar with the products out there available for this mainly because it is an infrequent task for me and it seemed as easy to write my own functions as buy an actual product.

    For me large scale ETL is still a somewhat rare request and any AI data manipulation automation is being directed at daily tasks at present.

    It's really not AI at all - it's about introducing reflection and introspection techniques into your ETL.  Programming languages have made this available for some time now, and while it does take some care to know how to match things up (i.e. find the existing elements and make sure that they are still "the same") versus knowing how to detect new elements and deciding what to do with them, this is an old concept in a new setting.

    As Steve mentioned - while it is possible, the first question is whether you can really afford to make some guesses "on the fly":  what is the fallout of making the wrong choice/a bad matchup?  what is the performance hit?  For example - I might do this on demographic/background info (little to no risk), but I would probably lean towards "failing" the job if this were financial or regulatory data.
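The risk-based choice described above - adapt on low-risk feeds, fail fast on high-risk ones - can be sketched like this. The table schema, flag and column names are all hypothetical; the point is only the decision structure:

```python
# Hypothetical sketch: introspect an incoming file's columns against the
# target table's known schema, then decide per data class whether to adapt or fail.
TARGET_COLUMNS = {"customer_id", "customer_name", "region"}
LOW_RISK = True  # e.g. demographic data; set False for financial/regulatory feeds

def reconcile(incoming_columns):
    """Return (loadable columns, quarantined new columns), or fail on a high-risk feed."""
    incoming = set(incoming_columns)
    new = incoming - TARGET_COLUMNS
    missing = TARGET_COLUMNS - incoming
    if (new or missing) and not LOW_RISK:
        # High-risk data: refuse to guess on the fly; fail the job instead.
        raise ValueError(f"Schema drift on high-risk feed: new={new}, missing={missing}")
    return sorted(incoming & TARGET_COLUMNS), sorted(new)

loadable, quarantined = reconcile(["customer_id", "customer_name", "loyalty_tier"])
print(loadable, quarantined)
# ['customer_id', 'customer_name'] ['loyalty_tier']
```

With `LOW_RISK` flipped to `False`, the same drift raises instead of loading, which is exactly the "fail the job" behaviour suggested for financial or regulatory data.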

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • David.Poole - Friday, October 12, 2018 1:17 AM

    xsevensinzx - Thursday, October 11, 2018 5:37 AM

    This is why I switched primarily to Python for all of my ETL, so we can do things like this on top of easily doing dynamic pivots and headers. At this stage, though, I have been moving away from headers altogether.

    What do you use to co-ordinate the transformations?  Apache Airflow or similar?

    What I built is basically Airflow from scratch. This is because Airflow was not really built with support for Azure at the time.
