Dynamic Cleanse Commands and Check Queries - Design Ideas

  • I am looking at automating a repetitive task (using SSIS) and would be interested to know if anybody has alternatives or pros/cons on the design I have so far.

    External parties send us data in Excel sheets based on 20+ templates, and because of the bespoke nature of the system a new template could be requested at any time.  A LOT of time is spent manually importing the data into SQL, then cleansing and checking it, only to realise there is a problem; the import request is rejected and the process starts again 🙁.  The rub is that once this system is set up there will be limited developer support to create new packages when new templates are introduced, so there is a real risk it will become less useful as time goes by.  The types of check are always the same: validate length, validate against a list of values, validate against another column in another sheet.

    I get around the differing templates by importing the data into a staging table with columns [F1] to [F26], all nvarchar(255).  The Excel template structure is modelled in T-SQL so we know which columns need to be validated, e.g. Template 1 uses [F1] to [F7].
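
    A minimal sketch of what that might look like (the table and column names here are illustrative, not our actual schema):

        -- Generic staging table: every template lands in the same wide nvarchar table
        CREATE TABLE dbo.RawData
        (
            RawDataID  int IDENTITY(1,1) PRIMARY KEY,
            TemplateID int NOT NULL,
            F1  nvarchar(255) NULL,
            F2  nvarchar(255) NULL,
            F3  nvarchar(255) NULL,
            -- F4 to F25 omitted here for brevity
            F26 nvarchar(255) NULL
        );

        -- Template metadata: which staging columns a given template actually uses,
        -- plus the parameters the standard checks need
        CREATE TABLE dbo.TemplateColumn
        (
            TemplateID    int           NOT NULL,
            StagingColumn sysname       NOT NULL,  -- e.g. N'F1'
            SourceHeader  nvarchar(128) NOT NULL,  -- header text in the Excel sheet
            MaxLength     int           NULL,      -- used by the length check
            ListOfValues  sysname       NULL,      -- lookup table for the LOV check, if any
            CONSTRAINT PK_TemplateColumn PRIMARY KEY (TemplateID, StagingColumn)
        );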

    My plan is to store column-specific cleanse commands and check queries in the config database itself and substitute the column names at runtime, allowing the user to maintain all template info and checks without having to create specific ETL packages for each scenario.  E.g. if template = X and column = X, the column needs LTRIM(F1), RTRIM(F1) and REPLACE(F1, ',', '') applied, plus a check such as "SELECT Cols FROM RawData WHERE [F1] NOT IN (SELECT Col FROM ListOfValuesCheckTable)".  This would really be SSIS executing SQL commands in batch rather than using native SSIS transformations.
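
    As a rough sketch of how the config-driven substitution could work (the {COL} placeholder, the dbo.ColumnRule table and the rule text below are made up for illustration, not an agreed design):

        -- Config table holding parameterised cleanse/check statements per template column.
        -- {COL} is a placeholder that is swapped for the real staging column at runtime.
        CREATE TABLE dbo.ColumnRule
        (
            RuleID        int IDENTITY(1,1) PRIMARY KEY,
            TemplateID    int           NOT NULL,
            StagingColumn sysname       NOT NULL,
            RuleType      varchar(10)   NOT NULL,  -- 'CLEANSE' or 'CHECK'
            SqlTemplate   nvarchar(max) NOT NULL
        );

        INSERT dbo.ColumnRule (TemplateID, StagingColumn, RuleType, SqlTemplate)
        VALUES
            (1, N'F1', 'CLEANSE',
             N'UPDATE dbo.RawData SET {COL} = REPLACE(LTRIM(RTRIM({COL})), '','', '''') WHERE TemplateID = 1;'),
            (1, N'F1', 'CHECK',
             N'SELECT RawDataID, {COL} FROM dbo.RawData WHERE TemplateID = 1 AND {COL} NOT IN (SELECT Col FROM dbo.ListOfValuesCheckTable);');

        -- Expand the placeholders and run the cleanse batch, e.g. from an SSIS Execute SQL Task
        DECLARE @sql nvarchar(max) = N'';

        SELECT @sql = @sql + REPLACE(SqlTemplate, N'{COL}', QUOTENAME(StagingColumn)) + NCHAR(10)
        FROM dbo.ColumnRule
        WHERE TemplateID = 1
          AND RuleType   = 'CLEANSE';

        EXEC sys.sp_executesql @sql;

    The obvious thing to watch with this pattern is that whatever sits in SqlTemplate gets executed verbatim, so the config rows themselves need to be locked down and validated, otherwise a bad rule (or a malicious one) breaks or compromises the whole run.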

    Any suggestions/pitfalls/alternatives for implementing such a system?

    Thanks in advance.

  • "Excel, I love ya, but we need to talk about your personal habits..."  Overall it could be a workable system although it is based (heavily) on dynamic sql.  To expect it to work over time without dev support is a bit optimistic imo.

    "External parties send us data in Excel sheets"  Could you set up containers in blob storage to receive the files?  Emailing files is usually quite error prone in my experience.

    SSIS could be a good choice to import the data and apply rule-based transformations.  I would check out Azure Data Factory pipelines too; ADF pipelines can be scheduled or triggered by an event.  The Excel files would have to be converted to CSV or some other delimited format, afaik.

    No one shall expel us from the paradise that Cantor has created for us.

