The first ETL pipeline I ever built in production was a single stored procedure. It pulled data from three source tables, applied some transformations, and loaded it into a reporting table. It ran in 40 seconds. I was proud of it.
Eighteen months later, that same pipeline had grown to 14 stored procedures chained together by a SQL Agent job with 22 steps. Source data volume had increased tenfold. The job ran for 4 hours on a good night and 7 hours on a bad one. When it failed — and it failed two or three times a month — finding which step broke and what data was left in an inconsistent state took longer than the actual fix. Nobody wanted to touch it. I was no longer proud of it.
That experience taught me that the difference between ETL pipelines that work and ETL pipelines that survive at scale is not better T-SQL. It is how you structure the pipeline itself — how you handle failures, how you manage dependencies, how you make the pipeline observable, and how you design it so that adding a new source or transformation does not require rearchitecting everything.
This article covers the patterns I have settled on after building and maintaining ETL systems that process hundreds of millions of rows per day in SQL Server. These are not theoretical best practices. They are decisions that directly reduced pipeline failures, cut debugging time, and made the systems maintainable by people other than the person who built them.
The Staging Layer Is Not Optional
The most consequential design decision in any SQL Server ETL pipeline is whether to stage incoming data before transforming it. The answer is always yes. I have never regretted having a staging layer. I have regretted not having one more times than I want to admit.
Here is what I mean by staging. Raw data from every source lands in a dedicated staging schema, in tables that mirror the source structure as closely as possible. No transformations, no joins, no business logic. Just the raw data, plus metadata columns that track when and how it arrived.
CREATE TABLE staging.orders_raw (
source_order_id VARCHAR(50) NOT NULL,
customer_code VARCHAR(20),
order_date VARCHAR(30), -- intentionally VARCHAR
total_amount VARCHAR(20), -- intentionally VARCHAR
status_code VARCHAR(10),
_load_batch_id INT NOT NULL,
_loaded_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
_source_file NVARCHAR(500) NULL
);
Notice that order_date and total_amount are VARCHAR, not DATE and DECIMAL. That is deliberate. When you load raw data directly into typed columns, a single malformed date or a non-numeric amount causes the entire batch to fail. By staging as VARCHAR first, you land everything successfully and then handle data quality issues in a controlled transformation step where you can log, count, and route bad records instead of losing the whole load.
Here is what the validation step looks like after staging:
-- Identify and log records that will fail type conversion
INSERT INTO etl.data_quality_log (
batch_id, table_name, column_name,
source_key, raw_value, error_type, logged_at
)
SELECT
_load_batch_id,
'orders_raw',
'order_date',
source_order_id,
order_date,
'INVALID_DATE',
SYSUTCDATETIME()
FROM staging.orders_raw
WHERE TRY_CAST(order_date AS DATE) IS NULL
AND order_date IS NOT NULL;
-- Report the count before proceeding
DECLARE @bad_dates INT = @@ROWCOUNT;
IF @bad_dates > 0
PRINT CONCAT('Warning: ', @bad_dates,
' records with invalid dates in batch ',
@batch_id, '. Routed to quality log.');
In one system I maintained, this pattern caught an upstream schema change that silently switched a date format from YYYY-MM-DD to MM/DD/YYYY. Without the VARCHAR staging layer, the entire nightly load would have failed. Instead, the first batch landed successfully, the validation step flagged 100% of records as having unparseable dates, and the on-call engineer saw the alert before business hours. We adjusted the parsing logic, reprocessed from staging, and nothing downstream was affected.
The cost of staging is minimal. You need additional storage for temporary data and an extra load step. The staging tables are truncated after each successful pipeline run. On a system processing 200 million rows per night, the staging layer added roughly 40 GB of temporary storage and 8 minutes to the total pipeline time. That is a trivial price for the ability to diagnose, replay, and recover from any data issue without going back to the source.
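The cleanup itself is a short final step. A minimal sketch, assuming the staging and control tables shown in this article (the `IF EXISTS` guard is my convention, not a requirement — it just ensures staging survives a failed run for diagnosis):

```sql
-- Final pipeline step: clear staging only once the run has been
-- marked COMPLETED, so a failed run leaves its raw data in place
-- for diagnosis and replay.
IF EXISTS (
    SELECT 1 FROM etl.batch_control
    WHERE batch_id = @batch_id
      AND batch_status = 'COMPLETED'
)
BEGIN
    TRUNCATE TABLE staging.orders_raw;
    TRUNCATE TABLE staging.customers_raw;
END
```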
Idempotent Loads: Design for Re-Runnability
Every transformation step in the pipeline should be safely re-runnable. If step 5 of 10 fails, I want to fix the issue and restart from step 5 without worrying about duplicate data or missing records.
The pattern I use is MERGE for dimension loads and DELETE-INSERT for fact loads. Here is a concrete example of each.
Here is a dimension load using MERGE:
MERGE dbo.dim_customer AS target
USING (
SELECT
customer_code,
customer_name,
COALESCE(segment, 'Unknown') AS segment,
region,
TRY_CAST(registration_date AS DATE) AS registration_date
FROM staging.customers_raw
WHERE _load_batch_id = @batch_id
AND TRY_CAST(registration_date AS DATE) IS NOT NULL
) AS source
ON target.customer_code = source.customer_code
WHEN MATCHED AND (
target.customer_name <> source.customer_name
OR target.segment <> source.segment
OR target.region <> source.region
) THEN UPDATE SET
target.customer_name = source.customer_name,
target.segment = source.segment,
target.region = source.region,
target.modified_at = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN INSERT (
customer_code, customer_name, segment,
region, registration_date, created_at, modified_at
) VALUES (
source.customer_code, source.customer_name, source.segment,
source.region, source.registration_date,
SYSUTCDATETIME(), SYSUTCDATETIME()
);
DECLARE @merged_rows INT = @@ROWCOUNT; -- total rows inserted or updated by the MERGE
-- (to count inserts and updates separately, add an OUTPUT $action clause)
A note on MERGE: there are well-documented issues with MERGE in SQL Server — a long history of bugs in older builds, occasional deadlocks under high concurrency, and a few syntactic quirks, such as the requirement to terminate the statement with a semicolon. For dimension tables that load during a dedicated ETL window with no concurrent writes, these issues are manageable. If your dimensions receive concurrent updates from multiple sources, consider using separate INSERT and UPDATE statements with explicit locking hints instead.
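For that concurrent case, here is a sketch of the split approach, mirroring the dim_customer columns above. The UPDLOCK and HOLDLOCK hints keep the existence check and the subsequent write atomic so two sessions cannot insert the same customer_code:

```sql
BEGIN TRANSACTION;

-- Update rows that already exist and have changed
UPDATE t WITH (UPDLOCK, HOLDLOCK)
SET t.customer_name = s.customer_name,
    t.segment       = COALESCE(s.segment, 'Unknown'),
    t.region        = s.region,
    t.modified_at   = SYSUTCDATETIME()
FROM dbo.dim_customer t
JOIN staging.customers_raw s
    ON t.customer_code = s.customer_code
WHERE s._load_batch_id = @batch_id
  AND (t.customer_name <> s.customer_name
       OR t.segment <> COALESCE(s.segment, 'Unknown')
       OR t.region <> s.region);

-- Insert rows that do not exist yet
INSERT INTO dbo.dim_customer (
    customer_code, customer_name, segment,
    region, registration_date, created_at, modified_at
)
SELECT s.customer_code, s.customer_name, COALESCE(s.segment, 'Unknown'),
       s.region, TRY_CAST(s.registration_date AS DATE),
       SYSUTCDATETIME(), SYSUTCDATETIME()
FROM staging.customers_raw s
WHERE s._load_batch_id = @batch_id
  AND TRY_CAST(s.registration_date AS DATE) IS NOT NULL
  AND NOT EXISTS (
      SELECT 1 FROM dbo.dim_customer t WITH (UPDLOCK, HOLDLOCK)
      WHERE t.customer_code = s.customer_code
  );

COMMIT TRANSACTION;
```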
Here is a fact load using DELETE-INSERT by batch window:
-- Delete any previously loaded data for this date range
-- This makes the step re-runnable
DELETE FROM dbo.fact_orders
WHERE order_date >= @window_start
AND order_date < @window_end;
-- Insert the fresh data
INSERT INTO dbo.fact_orders (
order_date, customer_key, product_key,
quantity, amount, discount_amount,
load_batch_id, loaded_at
)
SELECT
TRY_CAST(s.order_date AS DATE),
d.customer_key,
p.product_key,
TRY_CAST(s.quantity AS INT),
TRY_CAST(s.total_amount AS DECIMAL(18,2)),
TRY_CAST(s.discount AS DECIMAL(18,2)),
@batch_id,
SYSUTCDATETIME()
FROM staging.orders_raw s
JOIN dbo.dim_customer d ON s.customer_code = d.customer_code
JOIN dbo.dim_product p ON s.product_code = p.product_code
WHERE s._load_batch_id = @batch_id
AND TRY_CAST(s.order_date AS DATE) IS NOT NULL
AND TRY_CAST(s.total_amount AS DECIMAL(18,2)) IS NOT NULL;
The DELETE-INSERT pattern is simpler than MERGE for fact tables, and it guarantees idempotency — run it five times and you get the same result. The key is defining the reload window clearly. I always use the natural business key (order_date in this case) rather than the batch ID for the delete, because if the pipeline is rerun with a different batch ID, the old data still gets cleaned up.
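One refinement worth making explicit: wrap the delete and the insert in a single transaction, so a failure partway through the insert rolls the delete back too and never leaves the window empty. A sketch of the wrapper (the step bodies are elided — they are the two statements shown above):

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Clear the reload window
    DELETE FROM dbo.fact_orders
    WHERE order_date >= @window_start
      AND order_date < @window_end;

    -- Reload it from staging (columns and transforms as above)
    INSERT INTO dbo.fact_orders (/* columns */)
    SELECT /* columns and transforms */
    FROM staging.orders_raw s
    WHERE s._load_batch_id = @batch_id;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Roll back so the window still holds the previous load
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW; -- surface the error to the batch control logic
END CATCH
```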
Actual impact of idempotent design: in the system I referenced earlier, failures still happened two or three times a month — bad source data, network timeouts, occasional deadlocks — but average recovery time dropped from 2 hours to 12 minutes. Recovery became "fix the root cause, rerun from the failed step." No data cleanup, no inconsistency investigations.
The Batch Control Table: Making Pipelines Observable
The single most underrated component of a production ETL pipeline is a batch control table. It sounds mundane. It changes everything about how you operate the system.
CREATE TABLE etl.batch_control (
batch_id INT IDENTITY(1,1) PRIMARY KEY,
pipeline_name VARCHAR(100) NOT NULL,
batch_status VARCHAR(20) NOT NULL
DEFAULT 'RUNNING',
-- RUNNING, COMPLETED, FAILED, REPROCESSING
started_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
completed_at DATETIME2 NULL,
rows_staged INT NULL,
rows_rejected INT NULL,
rows_loaded INT NULL,
error_message NVARCHAR(4000) NULL,
window_start DATE NULL,
window_end DATE NULL,
triggered_by VARCHAR(50) NOT NULL DEFAULT SYSTEM_USER
);
Every pipeline run starts by inserting a row into this table and ends by updating it with the final status and row counts:
-- At pipeline start
INSERT INTO etl.batch_control (pipeline_name, window_start, window_end)
VALUES ('daily_orders', @window_start, @window_end);
DECLARE @batch_id INT = SCOPE_IDENTITY();
-- ... pipeline steps execute here ...
-- At pipeline end (success path)
UPDATE etl.batch_control
SET batch_status = 'COMPLETED',
completed_at = SYSUTCDATETIME(),
rows_staged = @staged_count,
rows_rejected = @rejected_count,
rows_loaded = @loaded_count
WHERE batch_id = @batch_id;
And the failure path, wrapped in TRY-CATCH:
BEGIN TRY
-- ... pipeline steps ...
END TRY
BEGIN CATCH
UPDATE etl.batch_control
SET batch_status = 'FAILED',
completed_at = SYSUTCDATETIME(),
error_message = CONCAT(
'Error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE(),
' at line ', ERROR_LINE(),
' in ', ERROR_PROCEDURE()
)
WHERE batch_id = @batch_id;
THROW; -- re-raise so SQL Agent marks the step as failed
END CATCH
Why this matters in practice
Six months after implementing the batch control table, I could answer questions that used to require digging through SQL Agent job history and transaction logs:
-- How often does each pipeline fail?
SELECT
pipeline_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN batch_status = 'FAILED' THEN 1 ELSE 0 END) AS failures,
CAST(SUM(CASE WHEN batch_status = 'FAILED' THEN 1 ELSE 0 END) * 100.0
/ COUNT(*) AS DECIMAL(5,2)) AS failure_pct,
AVG(DATEDIFF(SECOND, started_at, completed_at)) AS avg_duration_sec
FROM etl.batch_control
WHERE started_at >= DATEADD(MONTH, -3, SYSUTCDATETIME()) -- timestamps are stored in UTC
GROUP BY pipeline_name
ORDER BY failure_pct DESC;
Sample output from one of our systems:
pipeline_name     total_runs  failures  failure_pct  avg_duration_sec
----------------- ----------  --------  -----------  ----------------
daily_orders      90          4         4.44         1,847
customer_refresh  90          1         1.11         423
product_sync      90          0         0.00         187
inventory_snap    90          7         7.78         2,291
That inventory_snap pipeline at 7.78% failure rate immediately became my next optimization target. Without the batch control table, I would not have known it was failing that often because the on-call engineer was silently rerunning it each time.
Row count trending is equally valuable. When the daily_orders pipeline usually loads 1.2 million rows and suddenly loads 200,000, something is wrong upstream — even if the pipeline itself completed successfully. A simple query against the batch control table catches this:
-- Detect unusual row counts (more than 40% deviation from 7-day average)
SELECT
b.batch_id,
b.pipeline_name,
b.rows_loaded,
b.window_start,
avg_recent.avg_rows,
CAST((b.rows_loaded - avg_recent.avg_rows) * 100.0
/ NULLIF(avg_recent.avg_rows, 0) AS DECIMAL(5,1)) AS pct_deviation
FROM etl.batch_control b
CROSS APPLY (
SELECT AVG(rows_loaded * 1.0) AS avg_rows
FROM etl.batch_control b2
WHERE b2.pipeline_name = b.pipeline_name
AND b2.batch_status = 'COMPLETED'
AND b2.started_at >= DATEADD(DAY, -7, b.started_at)
AND b2.batch_id <> b.batch_id
) avg_recent
WHERE b.batch_status = 'COMPLETED'
AND b.started_at >= DATEADD(DAY, -1, SYSUTCDATETIME()) -- timestamps are stored in UTC
AND ABS(b.rows_loaded - avg_recent.avg_rows) > avg_recent.avg_rows * 0.4;
This query has caught three production issues in our systems that would have otherwise gone undetected until a business user noticed missing data in a report — usually days later.
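Wiring the deviation check into an actual notification takes only a few more lines. A sketch using Database Mail — the profile name and recipient are placeholders, and the EXISTS body is the deviation query above reduced to a yes/no check:

```sql
-- Alert when any completed run in the last day deviates > 40%
-- from its pipeline's 7-day average row count.
IF EXISTS (
    SELECT 1
    FROM etl.batch_control b
    CROSS APPLY (
        SELECT AVG(rows_loaded * 1.0) AS avg_rows
        FROM etl.batch_control b2
        WHERE b2.pipeline_name = b.pipeline_name
          AND b2.batch_status = 'COMPLETED'
          AND b2.started_at >= DATEADD(DAY, -7, b.started_at)
          AND b2.batch_id <> b.batch_id
    ) a
    WHERE b.batch_status = 'COMPLETED'
      AND b.started_at >= DATEADD(DAY, -1, SYSUTCDATETIME())
      AND ABS(b.rows_loaded - a.avg_rows) > a.avg_rows * 0.4
)
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'etl_alerts',               -- placeholder mail profile
    @recipients   = 'data-oncall@example.com',  -- placeholder recipient
    @subject      = 'ETL row count deviation detected',
    @body         = 'A completed run deviated more than 40% from its 7-day average row count. Check etl.batch_control.';
```

Schedule it as a SQL Agent job right after the nightly pipelines finish, and the deviation gets seen before the business does.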
Handling Dependencies Between Steps
When your pipeline grows beyond five or six steps, the dependencies between them start to matter. Step 4 cannot run until Steps 2 and 3 are both complete. Step 7 depends on Step 4 but not Step 5 or 6. SQL Agent jobs model this poorly — they support linear sequences and simple branching, but expressing parallel dependencies is awkward.
The pattern I use is a step control table that tracks individual step completion within a batch:
CREATE TABLE etl.step_control (
batch_id INT NOT NULL,
step_name VARCHAR(100) NOT NULL,
step_status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
depends_on VARCHAR(500) NULL, -- comma-separated step names
started_at DATETIME2 NULL,
completed_at DATETIME2 NULL,
rows_affected INT NULL,
error_message NVARCHAR(4000) NULL,
CONSTRAINT PK_step_control PRIMARY KEY (batch_id, step_name)
);
At the start of each batch, I populate the step control table with all steps and their dependencies:
INSERT INTO etl.step_control (batch_id, step_name, depends_on)
VALUES
(@batch_id, 'stage_orders', NULL),
(@batch_id, 'stage_customers', NULL),
(@batch_id, 'validate_orders', 'stage_orders'),
(@batch_id, 'validate_customers', 'stage_customers'),
(@batch_id, 'load_dim_customer', 'validate_customers'),
(@batch_id, 'load_fact_orders', 'validate_orders,load_dim_customer'),
(@batch_id, 'update_aggregates', 'load_fact_orders');
Each step checks its dependencies before executing:
CREATE PROCEDURE etl.execute_step
@batch_id INT,
@step_name VARCHAR(100)
AS
BEGIN
-- Check all dependencies are completed
DECLARE @depends_on VARCHAR(500);
SELECT @depends_on = depends_on
FROM etl.step_control
WHERE batch_id = @batch_id AND step_name = @step_name;
IF @depends_on IS NOT NULL
BEGIN
DECLARE @pending_deps INT;
SELECT @pending_deps = COUNT(*)
FROM etl.step_control sc
JOIN STRING_SPLIT(@depends_on, ',') deps
ON sc.step_name = TRIM(deps.value)
WHERE sc.batch_id = @batch_id
AND sc.step_status <> 'COMPLETED';
IF @pending_deps > 0
BEGIN
RAISERROR('Step %s cannot run: %d dependencies not completed',
16, 1, @step_name, @pending_deps);
RETURN;
END
END
-- Mark step as running
UPDATE etl.step_control
SET step_status = 'RUNNING', started_at = SYSUTCDATETIME()
WHERE batch_id = @batch_id AND step_name = @step_name;
-- Execute the step's procedure
-- (dynamic SQL or a CASE-based dispatch)
END
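The dispatch and completion bookkeeping that close out etl.execute_step can be sketched like this. The convention of one procedure per step, named etl.step_&lt;name&gt;, is my assumption here — any dispatch mechanism that maps step names to procedures works:

```sql
-- Dispatch: each step's logic lives in a procedure named after the step,
-- e.g. etl.step_load_fact_orders (naming convention is an assumption)
DECLARE @proc_name NVARCHAR(200) = CONCAT('etl.step_', @step_name);
DECLARE @rows INT;

BEGIN TRY
    EXEC @proc_name @batch_id = @batch_id, @rows_affected = @rows OUTPUT;

    UPDATE etl.step_control
    SET step_status = 'COMPLETED',
        completed_at = SYSUTCDATETIME(),
        rows_affected = @rows
    WHERE batch_id = @batch_id AND step_name = @step_name;
END TRY
BEGIN CATCH
    UPDATE etl.step_control
    SET step_status = 'FAILED',
        completed_at = SYSUTCDATETIME(),
        error_message = ERROR_MESSAGE()
    WHERE batch_id = @batch_id AND step_name = @step_name;
    THROW; -- let the orchestrator and batch control see the failure
END CATCH
```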
What this solves in practice
In a 12-step pipeline, the step control pattern reduced our average pipeline runtime by 35% because independent steps (like staging orders and staging customers) could run in parallel. More importantly, when Step 6 failed, the on-call engineer could see instantly which steps had completed, which were pending, and exactly where the failure occurred — all from a single query:
SELECT step_name, step_status,
DATEDIFF(SECOND, started_at, COALESCE(completed_at, SYSUTCDATETIME())) AS duration_sec,
rows_affected, error_message
FROM etl.step_control
WHERE batch_id = @batch_id
ORDER BY started_at;
step_name           step_status  duration_sec  rows_affected  error_message
------------------- -----------  ------------  -------------  --------------------------
stage_orders        COMPLETED    124           1,248,331      NULL
stage_customers     COMPLETED    18            42,109         NULL
validate_orders     COMPLETED    31            NULL           NULL
validate_customers  COMPLETED    4             NULL           NULL
load_dim_customer   COMPLETED    7             1,204          NULL
load_fact_orders    FAILED       89            NULL           Error 547: FK violation...
update_aggregates   PENDING      NULL          NULL           NULL
That output tells the entire story. No log file digging. No guesswork.
Performance: Batch Size and Minimal Logging
Two performance decisions make or break ETL speed in SQL Server: batch sizing and recovery model management.
Batch sizing for large loads
Loading 200 million rows in a single INSERT INTO statement works technically, but it holds locks for the entire duration, generates a massive transaction log entry, and if it fails at row 190 million, you start over from zero.
I break large loads into batches of 500,000 to 1 million rows, tracking progress as we go:
DECLARE @batch_size INT = 500000;
DECLARE @rows_loaded INT = 0;
DECLARE @rows_this_batch INT = 1;
WHILE @rows_this_batch > 0
BEGIN
INSERT INTO dbo.fact_orders (/* columns */)
SELECT TOP (@batch_size) /* columns and transforms */
FROM staging.orders_raw s
JOIN dbo.dim_customer d ON s.customer_code = d.customer_code
JOIN dbo.dim_product p ON s.product_code = p.product_code
WHERE s._load_batch_id = @batch_id
-- Anti-join skips rows already loaded; assumes fact_orders
-- carries source_order_id for exactly this purpose
AND s.source_order_id NOT IN (
SELECT source_order_id FROM dbo.fact_orders
WHERE order_date >= @window_start AND order_date < @window_end
);
-- Capture the insert's row count immediately; any later statement
-- (including the checkpoint UPDATE below) overwrites @@ROWCOUNT
SET @rows_this_batch = @@ROWCOUNT;
SET @rows_loaded += @rows_this_batch;
-- Checkpoint progress; loop exits once an insert moves zero rows
UPDATE etl.step_control
SET rows_affected = @rows_loaded
WHERE batch_id = @batch_id AND step_name = 'load_fact_orders';
END
On one system, switching from a single 200-million-row insert to batches of 500,000 reduced the transaction log usage from a peak of 68 GB to a steady 2.1 GB because the log could be reused between batches (in SIMPLE recovery model) or backed up between batches (in FULL recovery model).
Minimal logging for bulk operations
If your warehouse database is in SIMPLE or BULK_LOGGED recovery model during the ETL window, certain operations can be minimally logged — meaning SQL Server writes only enough to the transaction log to support rollback, not full row-by-row redo information.
The requirements for minimal logging with INSERT INTO ... SELECT are:
- The database must be in SIMPLE or BULK_LOGGED recovery model
- Use the TABLOCK hint on the target table
- The target must be a heap, or a table whose clustered index is empty when the load starts (inserts into a non-empty B-tree are fully logged on older versions, though SQL Server 2016 relaxed some of these rules, and bulk loads of 102,400 or more rows into a clustered columnstore compress directly into columnstore rowgroups)
-- Minimally logged insert (BULK_LOGGED recovery model)
INSERT INTO dbo.fact_orders WITH (TABLOCK)
(/* columns */)
SELECT /* columns */FROM staging.orders_raw s
/* joins and filters */;
On a 50-million-row load into a table with a clustered columnstore index, switching to minimally logged inserts reduced the load time from 22 minutes to 8 minutes and the transaction log usage from 31 GB to 4 GB. The trade-off is that you cannot do point-in-time recovery through a minimally logged operation — you need a full backup after the load completes. For most dedicated ETL windows, this is an acceptable trade-off.
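If the warehouse normally runs in FULL recovery, the switch can be scripted around the ETL window. A sketch — the database name and backup path are placeholders:

```sql
-- Before the bulk load window: allow minimal logging
ALTER DATABASE WarehouseDB SET RECOVERY BULK_LOGGED;

-- ... bulk loads run here ...

-- After the window: restore FULL recovery, then take a log backup
-- so the point-in-time recovery chain resumes past the minimally
-- logged operations
ALTER DATABASE WarehouseDB SET RECOVERY FULL;
BACKUP LOG WarehouseDB TO DISK = N'\\backup\WarehouseDB_postload.trn';
```

Note that a log backup taken while a minimally logged operation is in the backup interval cannot be restored to a point in time within that interval, which is why the backup belongs immediately after the load, not during it.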
What I Would Build Differently Today
After several generations of these pipelines, a few things stand out:
I would add data quality metrics from day one, not after the first production incident. The data quality log and the batch control table should be in the initial design, not bolted on when something breaks. Every pipeline I have built that started without observability eventually had it added under pressure after an outage. The cost of building it upfront is two or three hours. The cost of adding it after a production incident includes the incident itself.
I would standardize the error handling pattern across all steps from the start. In early pipelines, some steps had TRY-CATCH and others did not. Some logged errors to the batch control table and others relied on SQL Agent to capture the error. Inconsistency makes debugging harder. A template stored procedure that every step follows eliminates this.
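What that template looks like in practice — a skeleton every step procedure follows; names and the exact logging columns are illustrative:

```sql
CREATE PROCEDURE etl.step_template
    @batch_id INT
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- any runtime error aborts and rolls back the step

    BEGIN TRY
        BEGIN TRANSACTION;

        -- step-specific work goes here

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        -- Every step logs failures the same way, to the same place
        UPDATE etl.batch_control
        SET batch_status = 'FAILED',
            completed_at = SYSUTCDATETIME(),
            error_message = CONCAT('Error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE())
        WHERE batch_id = @batch_id;

        THROW; -- re-raise so the orchestrator sees the failure
    END CATCH
END
```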
I would build the row count deviation alerting immediately. The query I showed earlier — comparing today's row count against the 7-day average — has caught more real issues than any other single piece of monitoring. A 40% deviation threshold has been the right balance between sensitivity and noise in every system I have used it on. Set it up on day one.
Closing
The difference between an ETL pipeline that works and one that works at scale is not cleverness. It is discipline. Stage your data before transforming it. Make every step re-runnable. Track every batch. Log every rejection. Design so that when something fails — and it will — the path to recovery is obvious and fast.
These patterns are not exotic. Staging layers, batch control tables, idempotent loads, and dependency tracking are straightforward to implement in SQL Server with nothing more than T-SQL and a few metadata tables. The hard part is having the discipline to build them before you think you need them.
Every production ETL system I have seen fail at scale failed for the same reason: it was designed for the happy path. The pipelines that survive are the ones designed for the 2 AM failure that nobody is awake to babysit. Build for that scenario, and the happy path takes care of itself.