In this week's Database Weekly, Nigel Rivett's ETL from Hell article deserves the attention of anyone for whom the title resonates.
Many SQL Server Extract, Transform, and Load (ETL) processes share a similar goal. We must extract data from various external sources, transform it in various ways, including running data cleansing routines, and load it into a database, often an OLAP data model, to generate business reports.
What often starts out as a simple, quick process gradually expands over time, as developers use SQL Server Agent to schedule new data feeds and add more complex transformations. Eventually, errors start to creep into the data. The time required for each 'sausage' to pass through the ETL machine, one after the other, goes up and up. An error in one task typically means that no further tasks can run, even those with no dependency on the failed task. Very soon, it is common that the business reports are not ready when the business needs them, because the data is not ready.
Nigel's solution starts by gathering into 'threads' each set of dependent tasks that must run in sequence to complete the ETL process for a given business function (loading the accounting data, the sales data, and so on). A table in a central repository database logs details of every task, its start and end times, and how long each thread as a whole takes to execute, and records the dependencies between threads and tasks.
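To make the idea concrete, here is a minimal sketch of what one row of such a task-log repository might look like, with a helper that derives a thread's overall duration. The field names and the `EtlTask`/`thread_duration` names are my own assumptions for illustration, not Nigel's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class EtlTask:
    """One row in a hypothetical central task-log table."""
    task_id: int
    thread: str                  # business function, e.g. 'Sales' or 'Accounting'
    name: str
    depends_on: list = field(default_factory=list)  # task_ids that must finish first
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    status: str = "pending"      # pending | running | succeeded | failed

tasks = [
    EtlTask(1, "Sales", "extract_sales"),
    EtlTask(2, "Sales", "cleanse_sales", depends_on=[1]),
    EtlTask(3, "Sales", "load_sales", depends_on=[2]),
    EtlTask(4, "Accounting", "extract_gl"),
]

def thread_duration(thread, tasks):
    """Elapsed time for a whole thread, from its first start to its last end."""
    done = [t for t in tasks if t.thread == thread and t.start_time and t.end_time]
    if not done:
        return None
    return max(t.end_time for t in done) - min(t.start_time for t in done)
```

In the real solution this would live in a SQL Server table, but the essentials are the same: per-task timings, explicit dependencies, and thread-level durations derived from them.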
A controller pulls a task from the table and assigns it to a scheduler to run. The moment it completes, the next dependent task runs. The controller can raise an alert for any failed or overrunning task. The developer can then analyze and tune a long-running task, perhaps by splitting it into several 'atomic' tasks.
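The controller loop described above can be sketched in a few lines. This is an illustrative simplification, assuming a dictionary of tasks and stub `run_task`/alert hooks in place of the real SQL Server machinery; note how a failure blocks only its dependents, not unrelated tasks:

```python
# Hypothetical task set: status and dependency fields are assumptions.
tasks = {
    "extract_sales": {"depends_on": [], "status": "pending"},
    "cleanse_sales": {"depends_on": ["extract_sales"], "status": "pending"},
    "load_sales":    {"depends_on": ["cleanse_sales"], "status": "pending"},
}

alerts = []

def run_task(name):
    """Stand-in for executing the real ETL step; returns True on success."""
    return True

def controller(tasks):
    """Repeatedly pick a task whose dependencies have all succeeded and run it."""
    while True:
        ready = next(
            (n for n, t in tasks.items()
             if t["status"] == "pending"
             and all(tasks[d]["status"] == "succeeded" for d in t["depends_on"])),
            None,
        )
        if ready is None:
            break  # nothing runnable: either all done, or blocked by a failure
        ok = run_task(ready)
        tasks[ready]["status"] = "succeeded" if ok else "failed"
        if not ok:
            alerts.append(f"Task {ready} failed")

controller(tasks)
```

Because the controller only ever asks "whose dependencies have succeeded?", a failed task stalls its own thread while leaving independent threads free to proceed.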
At the next stage, the developer can assign resource ratings to each task, as well as time windows and priorities. The scheduler can select the task with the highest priority to run. If a negotiated time slot for a task has passed, the scheduler can give precedence to another task, so a rogue thread can no longer hold up the whole system. The scheduler can run side-by-side any tasks from separate threads that will not 'fight' for the same resource (such as a data source), up to the maximum capacity of the system.
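The selection rule in that stage might look something like the following sketch, where the field names (`priority`, `resources`, `window_start`, `window_end`) and the capacity model are my own assumptions for illustration: among tasks still inside their time window, pick the highest-priority ones whose resources do not clash, up to a capacity limit.

```python
from datetime import time

def select_tasks(candidates, now, capacity):
    """Pick non-conflicting tasks to run side by side, highest priority first."""
    busy_resources = set()
    chosen = []
    # A task whose negotiated time slot has passed simply drops out of contention.
    in_window = [t for t in candidates
                 if t["window_start"] <= now <= t["window_end"]]
    for task in sorted(in_window, key=lambda t: t["priority"], reverse=True):
        if len(chosen) >= capacity:
            break
        # Only run tasks together if they won't 'fight' over the same resource.
        if busy_resources.isdisjoint(task["resources"]):
            chosen.append(task["name"])
            busy_resources.update(task["resources"])
    return chosen
```

With this rule, two tasks that both read the sales database never run concurrently, a task from the accounting thread can run alongside them, and an overrunning thread cannot monopolize the system once its window closes.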
Sounds simple, right? That's its beauty. Much has been written about the intricacies and difficulties of ETL processes, delving deep into the crevices of tools such as SSIS, in order to wring maximum performance from each part of the process. Occasionally, however, someone presents an idea, a solution, of such elegance and simplicity that it immediately seems like common sense, and that everyone must be doing it this way already, though I suspect they aren't. Am I wrong?