The first ETL pipeline I ever built in production was a single stored procedure. It pulled data from three source tables, applied some transformations, and loaded it into a reporting table. It ran in 40 seconds. I was proud of it.
Eighteen months later, that same pipeline had grown to 14 stored procedures chained together by a SQL Agent job with 22 steps. Source data volume had increased tenfold. The job ran for 4 hours on a good night and 7 hours on a bad one. When it failed — and it failed two or three times a month — finding which step broke and what data was left in an inconsistent state took longer than the actual fix. Nobody wanted to touch it. I was no longer proud of it.
That experience taught me that the difference between ETL pipelines that work and ETL pipelines that survive at scale is not better T-SQL. It is how you structure the pipeline itself — how you handle failures, how you manage dependencies, how you make the pipeline observable, and how you design it so that adding a new source or transformation does not require rearchitecting everything.
This article covers the patterns I have settled on after building and maintaining ETL systems that process hundreds of millions of rows per day in SQL Server. These are not theoretical best practices. They are decisions that directly reduced pipeline failures, cut debugging time, and made the systems maintainable by people other than the person who built them.
The Staging Layer Is Not Optional
The most consequential design decision in any SQL Server ETL pipeline is whether to stage incoming data before transforming it. The answer is always yes. I have never regretted having a staging layer. I have regretted not having one more times than I want to admit.
Here is what I mean by staging. Raw data from every source lands in a dedicated staging schema, in tables that mirror the source structure as closely as possible. No transformations, no joins, no business logic. Just the raw data, plus metadata columns that track when and how it arrived.
CREATE TABLE staging.orders_raw (
source_order_id VARCHAR(50) NOT NULL,
customer_code VARCHAR(20),
order_date VARCHAR(30), -- intentionally VARCHAR
total_amount VARCHAR(20), -- intentionally VARCHAR
status_code VARCHAR(10),
_load_batch_id INT NOT NULL,
_loaded_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
_source_file NVARCHAR(500) NULL
);
Notice that order_date and total_amount are VARCHAR, not DATE and DECIMAL. That is deliberate. When you load raw data directly into typed columns, a single malformed date or a non-numeric amount causes the entire batch to fail. By staging as VARCHAR first, you land everything successfully and then handle data quality issues in a controlled transformation step where you can log, count, and route bad records instead of losing the whole load.
Here is what the validation step looks like after staging:
-- Identify and log records that will fail type conversion
INSERT INTO etl.data_quality_log (
batch_id, table_name, column_name,
source_key, raw_value, error_type, logged_at
)
SELECT
_load_batch_id,
'orders_raw',
'order_date',
source_order_id,
order_date,
'INVALID_DATE',
SYSUTCDATETIME()
FROM staging.orders_raw
WHERE TRY_CAST(order_date AS DATE) IS NULL
AND order_date IS NOT NULL;
-- Report the count before proceeding
DECLARE @bad_dates INT = @@ROWCOUNT;
IF @bad_dates > 0
PRINT CONCAT('Warning: ', @bad_dates,
' records with invalid dates in batch ',
@batch_id, '. Routed to quality log.');
In one system I maintained, this pattern caught an upstream schema change that silently switched a date format from YYYY-MM-DD to MM/DD/YYYY. Without the VARCHAR staging layer, the entire nightly load would have failed. Instead, the first batch landed successfully, the validation step flagged 100% of records as having unparseable dates, and the on-call engineer saw the alert before business hours. We adjusted the parsing logic, reprocessed from staging, and nothing downstream was affected.
The cost of staging is minimal. You need additional storage for temporary data and an extra load step. The staging tables are truncated after each successful pipeline run. On a system processing 200 million rows per night, the staging layer added roughly 40 GB of temporary storage and 8 minutes to the total pipeline time. That is a trivial price for the ability to diagnose, replay, and recover from any data issue without going back to the source.
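The cleanup itself is a short final step. A minimal sketch, assuming the staging and control tables shown in this article (the `IF EXISTS` guard is my convention, not a requirement — it just ensures staging survives a failed run for diagnosis):

```sql
-- Final pipeline step: clear staging only once the run has been
-- marked COMPLETED, so a failed run leaves its raw data in place
-- for diagnosis and replay.
IF EXISTS (
    SELECT 1 FROM etl.batch_control
    WHERE batch_id = @batch_id
      AND batch_status = 'COMPLETED'
)
BEGIN
    TRUNCATE TABLE staging.orders_raw;
    TRUNCATE TABLE staging.customers_raw;
END
```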
Idempotent Loads: Design for Re-Runnability
Every transformation step in the pipeline should be safely re-runnable. If step 5 of 10 fails, I want to fix the issue and restart from step 5 without worrying about duplicate data or missing records.
The pattern I use is MERGE for dimension loads and DELETE-INSERT for fact loads. Here is a concrete example of each.
Here is a dimension load using MERGE:
MERGE dbo.dim_customer AS target
USING (
SELECT
customer_code,
customer_name,
COALESCE(segment, 'Unknown') AS segment,
region,
TRY_CAST(registration_date AS DATE) AS registration_date
FROM staging.customers_raw
WHERE _load_batch_id = @batch_id
AND TRY_CAST(registration_date AS DATE) IS NOT NULL
) AS source
ON target.customer_code = source.customer_code
WHEN MATCHED AND (
target.customer_name <> source.customer_name
OR target.segment <> source.segment
OR target.region <> source.region
) THEN UPDATE SET
target.customer_name = source.customer_name,
target.segment = source.segment,
target.region = source.region,
target.modified_at = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN INSERT (
customer_code, customer_name, segment,
region, registration_date, created_at, modified_at
) VALUES (
source.customer_code, source.customer_name, source.segment,
source.region, source.registration_date,
SYSUTCDATETIME(), SYSUTCDATETIME()
);
DECLARE @merged_rows INT = @@ROWCOUNT; -- total rows inserted or updated by the MERGE
-- (to count inserts and updates separately, add an OUTPUT $action clause)
A note on MERGE: there are well-documented issues with MERGE in SQL Server — a long history of bugs in older builds, occasional deadlocks under high concurrency, and a few syntactic quirks, such as the requirement to terminate the statement with a semicolon. For dimension tables that load during a dedicated ETL window with no concurrent writes, these issues are manageable. If your dimensions receive concurrent updates from multiple sources, consider using separate INSERT and UPDATE statements with explicit locking hints instead.
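For that concurrent case, here is a sketch of the split approach, mirroring the dim_customer columns above. The UPDLOCK and HOLDLOCK hints keep the existence check and the subsequent write atomic so two sessions cannot insert the same customer_code:

```sql
BEGIN TRANSACTION;

-- Update rows that already exist and have changed
UPDATE t WITH (UPDLOCK, HOLDLOCK)
SET t.customer_name = s.customer_name,
    t.segment       = COALESCE(s.segment, 'Unknown'),
    t.region        = s.region,
    t.modified_at   = SYSUTCDATETIME()
FROM dbo.dim_customer t
JOIN staging.customers_raw s
    ON t.customer_code = s.customer_code
WHERE s._load_batch_id = @batch_id
  AND (t.customer_name <> s.customer_name
       OR t.segment <> COALESCE(s.segment, 'Unknown')
       OR t.region <> s.region);

-- Insert rows that do not exist yet
INSERT INTO dbo.dim_customer (
    customer_code, customer_name, segment,
    region, registration_date, created_at, modified_at
)
SELECT s.customer_code, s.customer_name, COALESCE(s.segment, 'Unknown'),
       s.region, TRY_CAST(s.registration_date AS DATE),
       SYSUTCDATETIME(), SYSUTCDATETIME()
FROM staging.customers_raw s
WHERE s._load_batch_id = @batch_id
  AND TRY_CAST(s.registration_date AS DATE) IS NOT NULL
  AND NOT EXISTS (
      SELECT 1 FROM dbo.dim_customer t WITH (UPDLOCK, HOLDLOCK)
      WHERE t.customer_code = s.customer_code
  );

COMMIT TRANSACTION;
```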
Here is a fact load using DELETE-INSERT by batch window:
-- Delete any previously loaded data for this date range
-- This makes the step re-runnable
DELETE FROM dbo.fact_orders
WHERE order_date >= @window_start
AND order_date < @window_end;
-- Insert the fresh data
INSERT INTO dbo.fact_orders (
order_date, customer_key, product_key,
quantity, amount, discount_amount,
load_batch_id, loaded_at
)
SELECT
TRY_CAST(s.order_date AS DATE),
d.customer_key,
p.product_key,
TRY_CAST(s.quantity AS INT),
TRY_CAST(s.total_amount AS DECIMAL(18,2)),
TRY_CAST(s.discount AS DECIMAL(18,2)),
@batch_id,
SYSUTCDATETIME()
FROM staging.orders_raw s
JOIN dbo.dim_customer d ON s.customer_code = d.customer_code
JOIN dbo.dim_product p ON s.product_code = p.product_code
WHERE s._load_batch_id = @batch_id
AND TRY_CAST(s.order_date AS DATE) IS NOT NULL
AND TRY_CAST(s.total_amount AS DECIMAL(18,2)) IS NOT NULL;
The DELETE-INSERT pattern is simpler than MERGE for fact tables, and it guarantees idempotency — run it five times and you get the same result. The key is defining the reload window clearly. I always use the natural business key (order_date in this case) rather than the batch ID for the delete, because if the pipeline is rerun with a different batch ID, the old data still gets cleaned up.
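One refinement worth making explicit: wrap the delete and the insert in a single transaction, so a failure partway through the insert rolls the delete back too and never leaves the window empty. A sketch of the wrapper (the step bodies are elided — they are the two statements shown above):

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    -- Clear the reload window
    DELETE FROM dbo.fact_orders
    WHERE order_date >= @window_start
      AND order_date < @window_end;

    -- Reload it from staging (columns and transforms as above)
    INSERT INTO dbo.fact_orders (/* columns */)
    SELECT /* columns and transforms */
    FROM staging.orders_raw s
    WHERE s._load_batch_id = @batch_id;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Roll back so the window still holds the previous load
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW; -- surface the error to the batch control logic
END CATCH
```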
Actual impact of idempotent design: in the system I referenced earlier, failures still happened two or three times a month — bad source data, network timeouts, occasional deadlocks — but average recovery time dropped from 2 hours to 12 minutes. Recovery became "fix the root cause, rerun from the failed step." No data cleanup, no inconsistency investigations.
The Batch Control Table: Making Pipelines Observable
The single most underrated component of a production ETL pipeline is a batch control table. It sounds mundane. It changes everything about how you operate the system.
CREATE TABLE etl.batch_control (
batch_id INT IDENTITY(1,1) PRIMARY KEY,
pipeline_name VARCHAR(100) NOT NULL,
batch_status VARCHAR(20) NOT NULL
DEFAULT 'RUNNING',
-- RUNNING, COMPLETED, FAILED, REPROCESSING
started_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
completed_at DATETIME2 NULL,
rows_staged INT NULL,
rows_rejected INT NULL,
rows_loaded INT NULL,
error_message NVARCHAR(4000) NULL,
window_start DATE NULL,
window_end DATE NULL,
triggered_by VARCHAR(50) NOT NULL DEFAULT SYSTEM_USER
);
Every pipeline run starts by inserting a row into this table and ends by updating it with the final status and row counts:
-- At pipeline start
INSERT INTO etl.batch_control (pipeline_name, window_start, window_end)
VALUES ('daily_orders', @window_start, @window_end);
DECLARE @batch_id INT = SCOPE_IDENTITY();
-- ... pipeline steps execute here ...
-- At pipeline end (success path)
UPDATE etl.batch_control
SET batch_status = 'COMPLETED',
completed_at = SYSUTCDATETIME(),
rows_staged = @staged_count,
rows_rejected = @rejected_count,
rows_loaded = @loaded_count
WHERE batch_id = @batch_id;
And the failure path, wrapped in TRY-CATCH:
BEGIN TRY
-- ... pipeline steps ...
END TRY
BEGIN CATCH
UPDATE etl.batch_control
SET batch_status = 'FAILED',
completed_at = SYSUTCDATETIME(),
error_message = CONCAT(
'Error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE(),
' at line ', ERROR_LINE(),
' in ', ERROR_PROCEDURE()
)
WHERE batch_id = @batch_id;
THROW; -- re-raise so SQL Agent marks the step as failed
END CATCH
Why this matters in practice
Six months after implementing the batch control table, I could answer questions that used to require digging through SQL Agent job history and transaction logs:
-- How often does each pipeline fail?
SELECT
pipeline_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN batch_status = 'FAILED' THEN 1 ELSE 0 END) AS failures,
CAST(SUM(CASE WHEN batch_status = 'FAILED' THEN 1 ELSE 0 END) * 100.0
/ COUNT(*) AS DECIMAL(5,2)) AS failure_pct,
AVG(DATEDIFF(SECOND, started_at, completed_at)) AS avg_duration_sec
FROM etl.batch_control
WHERE started_at >= DATEADD(MONTH, -3, SYSUTCDATETIME()) -- timestamps are stored in UTC
GROUP BY pipeline_name
ORDER BY failure_pct DESC;
Sample output from one of our systems:
pipeline_name     total_runs  failures  failure_pct  avg_duration_sec
----------------- ----------  --------  -----------  ----------------
daily_orders      90          4         4.44         1,847
customer_refresh  90          1         1.11         423
product_sync      90          0         0.00         187
inventory_snap    90          7         7.78         2,291
That inventory_snap pipeline at 7.78% failure rate immediately became my next optimization target. Without the batch control table, I would not have known it was failing that often because the on-call engineer was silently rerunning it each time.
Row count trending is equally valuable. When the daily_orders pipeline usually loads 1.2 million rows and suddenly loads 200,000, something is wrong upstream — even if the pipeline itself completed successfully. A simple query against the batch control table catches this:
-- Detect unusual row counts (more than 40% deviation from 7-day average)
SELECT
b.batch_id,
b.pipeline_name,
b.rows_loaded,
b.window_start,
avg_recent.avg_rows,
CAST((b.rows_loaded - avg_recent.avg_rows) * 100.0
/ NULLIF(avg_recent.avg_rows, 0) AS DECIMAL(5,1)) AS pct_deviation
FROM etl.batch_control b
CROSS APPLY (
SELECT AVG(rows_loaded * 1.0) AS avg_rows
FROM etl.batch_control b2
WHERE b2.pipeline_name = b.pipeline_name
AND b2.batch_status = 'COMPLETED'
AND b2.started_at >= DATEADD(DAY, -7, b.started_at)
AND b2.batch_id <> b.batch_id
) avg_recent
WHERE b.batch_status = 'COMPLETED'
AND b.started_at >= DATEADD(DAY, -1, SYSUTCDATETIME()) -- timestamps are stored in UTC
AND ABS(b.rows_loaded - avg_recent.avg_rows) > avg_recent.avg_rows * 0.4;
This query has caught three production issues in our systems that would have otherwise gone undetected until a business user noticed missing data in a report — usually days later.
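Wiring the deviation check into an actual notification takes only a few more lines. A sketch using Database Mail — the profile name and recipient are placeholders, and the EXISTS body is the deviation query above reduced to a yes/no check:

```sql
-- Alert when any completed run in the last day deviates > 40%
-- from its pipeline's 7-day average row count.
IF EXISTS (
    SELECT 1
    FROM etl.batch_control b
    CROSS APPLY (
        SELECT AVG(rows_loaded * 1.0) AS avg_rows
        FROM etl.batch_control b2
        WHERE b2.pipeline_name = b.pipeline_name
          AND b2.batch_status = 'COMPLETED'
          AND b2.started_at >= DATEADD(DAY, -7, b.started_at)
          AND b2.batch_id <> b.batch_id
    ) a
    WHERE b.batch_status = 'COMPLETED'
      AND b.started_at >= DATEADD(DAY, -1, SYSUTCDATETIME())
      AND ABS(b.rows_loaded - a.avg_rows) > a.avg_rows * 0.4
)
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'etl_alerts',               -- placeholder mail profile
    @recipients   = 'data-oncall@example.com',  -- placeholder recipient
    @subject      = 'ETL row count deviation detected',
    @body         = 'A completed run deviated more than 40% from its 7-day average row count. Check etl.batch_control.';
```

Schedule it as a SQL Agent job right after the nightly pipelines finish, and the deviation gets seen before the business does.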
Handling Dependencies Between Steps
When your pipeline grows beyond five or six steps, the dependencies between them start to matter. Step 4 cannot run until Steps 2 and 3 are both complete. Step 7 depends on Step 4 but not Step 5 or 6. SQL Agent jobs model this poorly — they support linear sequences and simple branching, but expressing parallel dependencies is awkward.
The pattern I use is a step control table that tracks individual step completion within a batch:
CREATE TABLE etl.step_control (
batch_id INT NOT NULL,
step_name VARCHAR(100) NOT NULL,
step_status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
depends_on VARCHAR(500) NULL, -- comma-separated step names
started_at DATETIME2 NULL,
completed_at DATETIME2 NULL,
rows_affected INT NULL,
error_message NVARCHAR(4000) NULL,
CONSTRAINT PK_step_control PRIMARY KEY (batch_id, step_name)
);
At the start of each batch, I populate the step control table with all steps and their dependencies:
INSERT INTO etl.step_control (batch_id, step_name, depends_on)
VALUES
(@batch_id, 'stage_orders', NULL),
(@batch_id, 'stage_customers', NULL),
(@batch_id, 'validate_orders', 'stage_orders'),
(@batch_id, 'validate_customers', 'stage_customers'),
(@batch_id, 'load_dim_customer', 'validate_customers'),
(@batch_id, 'load_fact_orders', 'validate_orders,load_dim_customer'),
(@batch_id, 'update_aggregates', 'load_fact_orders');
Each step checks its dependencies before executing:
CREATE PROCEDURE etl.execute_step
@batch_id INT,
@step_name VARCHAR(100)
AS
BEGIN
-- Check all dependencies are completed
DECLARE @depends_on VARCHAR(500);
SELECT @depends_on = depends_on
FROM etl.step_control
WHERE batch_id = @batch_id AND step_name = @step_name;
IF @depends_on IS NOT NULL
BEGIN
DECLARE @pending_deps INT;
SELECT @pending_deps = COUNT(*)
FROM etl.step_control sc
JOIN STRING_SPLIT(@depends_on, ',') deps
ON sc.step_name = TRIM(deps.value)
WHERE sc.batch_id = @batch_id
AND sc.step_status <> 'COMPLETED';
IF @pending_deps > 0
BEGIN
RAISERROR('Step %s cannot run: %d dependencies not completed',
16, 1, @step_name, @pending_deps);
RETURN;
END
END
-- Mark step as running
UPDATE etl.step_control
SET step_status = 'RUNNING', started_at = SYSUTCDATETIME()
WHERE batch_id = @batch_id AND step_name = @step_name;
-- Execute the step's procedure
-- (dynamic SQL or a CASE-based dispatch)
END
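The dispatch and completion bookkeeping that close out etl.execute_step can be sketched like this. The convention of one procedure per step, named etl.step_&lt;name&gt;, is my assumption here — any dispatch mechanism that maps step names to procedures works:

```sql
-- Dispatch: each step's logic lives in a procedure named after the step,
-- e.g. etl.step_load_fact_orders (naming convention is an assumption)
DECLARE @proc_name NVARCHAR(200) = CONCAT('etl.step_', @step_name);
DECLARE @rows INT;

BEGIN TRY
    EXEC @proc_name @batch_id = @batch_id, @rows_affected = @rows OUTPUT;

    UPDATE etl.step_control
    SET step_status = 'COMPLETED',
        completed_at = SYSUTCDATETIME(),
        rows_affected = @rows
    WHERE batch_id = @batch_id AND step_name = @step_name;
END TRY
BEGIN CATCH
    UPDATE etl.step_control
    SET step_status = 'FAILED',
        completed_at = SYSUTCDATETIME(),
        error_message = ERROR_MESSAGE()
    WHERE batch_id = @batch_id AND step_name = @step_name;
    THROW; -- let the orchestrator and batch control see the failure
END CATCH
```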
What this solves in practice
In a 12-step pipeline, the step control pattern reduced our average pipeline runtime by 35% because independent steps (like staging orders and staging customers) could run in parallel. More importantly, when Step 6 failed, the on-call engineer could see instantly which steps had completed, which were pending, and exactly where the failure occurred — all from a single query:
SELECT step_name, step_status,
DATEDIFF(SECOND, started_at, COALESCE(completed_at, SYSUTCDATETIME())) AS duration_sec,
rows_affected, error_message
FROM etl.step_control
WHERE batch_id = @batch_id
ORDER BY started_at;
step_name           step_status  duration_sec  rows_affected  error_message
------------------- -----------  ------------  -------------  --------------------------
stage_orders        COMPLETED    124           1,248,331      NULL
stage_customers     COMPLETED    18            42,109         NULL
validate_orders     COMPLETED    31            NULL           NULL
validate_customers  COMPLETED    4             NULL           NULL
load_dim_customer   COMPLETED    7             1,204          NULL
load_fact_orders    FAILED       89            NULL           Error 547: FK violation...
update_aggregates   PENDING      NULL          NULL           NULL
That output tells the entire story. No log file digging. No guesswork.
Performance: Batch Size and Minimal Logging
Two performance decisions make or break ETL speed in SQL Server: batch sizing and recovery model management.
Batch sizing for large loads
Loading 200 million rows in a single INSERT INTO statement works technically, but it holds locks for the entire duration, generates a massive transaction log entry, and if it fails at row 190 million, you start over from zero.
I break large loads into batches of 500,000 to 1 million rows, tracking progress as we go:
DECLARE @batch_size INT = 500000;
DECLARE @rows_loaded INT = 0;
DECLARE @rows_this_batch INT = 1;
WHILE @rows_this_batch > 0
BEGIN
INSERT INTO dbo.fact_orders (/* columns */)
SELECT TOP (@batch_size) /* columns and transforms */
FROM staging.orders_raw s
JOIN dbo.dim_customer d ON s.customer_code = d.customer_code
JOIN dbo.dim_product p ON s.product_code = p.product_code
WHERE s._load_batch_id = @batch_id
-- Anti-join skips rows already loaded; assumes fact_orders
-- carries source_order_id for exactly this purpose
AND s.source_order_id NOT IN (
SELECT source_order_id FROM dbo.fact_orders
WHERE order_date >= @window_start AND order_date < @window_end
);
-- Capture the insert's row count immediately; any later statement
-- (including the checkpoint UPDATE below) overwrites @@ROWCOUNT
SET @rows_this_batch = @@ROWCOUNT;
SET @rows_loaded += @rows_this_batch;
-- Checkpoint progress; loop exits once an insert moves zero rows
UPDATE etl.step_control
SET rows_affected = @rows_loaded
WHERE batch_id = @batch_id AND step_name = 'load_fact_orders';
END
On one system, switching from a single 200-million-row insert to batches of 500,000 reduced the transaction log usage from a peak of 68 GB to a steady 2.1 GB because the log could be reused between batches (in SIMPLE recovery model) or backed up between batches (in FULL recovery model).
Minimal logging for bulk operations
If your warehouse database is in SIMPLE or BULK_LOGGED recovery model during the ETL window, certain operations can be minimally logged — meaning SQL Server writes only enough to the transaction log to support rollback, not full row-by-row redo information.
The requirements for minimal logging with INSERT INTO ... SELECT are:
- The database must be in SIMPLE or BULK_LOGGED recovery model
- Use the TABLOCK hint on the target table
- The target must be a heap, or a table whose clustered index is empty when the load starts (inserts into a non-empty B-tree are fully logged on older versions, though SQL Server 2016 relaxed some of these rules, and bulk loads of 102,400 or more rows into a clustered columnstore compress directly into columnstore rowgroups)
-- Minimally logged insert (BULK_LOGGED recovery model)
INSERT INTO dbo.fact_orders WITH (TABLOCK)
(/* columns */)
SELECT /* columns */FROM staging.orders_raw s
/* joins and filters */;
On a 50-million-row load into a table with a clustered columnstore index, switching to minimally logged inserts reduced the load time from 22 minutes to 8 minutes and the transaction log usage from 31 GB to 4 GB. The trade-off is that you cannot do point-in-time recovery through a minimally logged operation — you need a full backup after the load completes. For most dedicated ETL windows, this is an acceptable trade-off.
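If the warehouse normally runs in FULL recovery, the switch can be scripted around the ETL window. A sketch — the database name and backup path are placeholders:

```sql
-- Before the bulk load window: allow minimal logging
ALTER DATABASE WarehouseDB SET RECOVERY BULK_LOGGED;

-- ... bulk loads run here ...

-- After the window: restore FULL recovery, then take a log backup
-- so the point-in-time recovery chain resumes past the minimally
-- logged operations
ALTER DATABASE WarehouseDB SET RECOVERY FULL;
BACKUP LOG WarehouseDB TO DISK = N'\\backup\WarehouseDB_postload.trn';
```

Note that a log backup taken while a minimally logged operation is in the backup interval cannot be restored to a point in time within that interval, which is why the backup belongs immediately after the load, not during it.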
What I Would Build Differently Today
After several generations of these pipelines, a few things stand out:
I would add data quality metrics from day one, not after the first production incident. The data quality log and the batch control table should be in the initial design, not bolted on when something breaks. Every pipeline I have built that started without observability eventually had it added under pressure after an outage. The cost of building it upfront is two or three hours. The cost of adding it after a production incident includes the incident itself.
I would standardize the error handling pattern across all steps from the start. In early pipelines, some steps had TRY-CATCH and others did not. Some logged errors to the batch control table and others relied on SQL Agent to capture the error. Inconsistency makes debugging harder. A template stored procedure that every step follows eliminates this.
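What that template looks like in practice — a skeleton every step procedure follows; names and the exact logging columns are illustrative:

```sql
CREATE PROCEDURE etl.step_template
    @batch_id INT
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- any runtime error aborts and rolls back the step

    BEGIN TRY
        BEGIN TRANSACTION;

        -- step-specific work goes here

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        -- Every step logs failures the same way, to the same place
        UPDATE etl.batch_control
        SET batch_status = 'FAILED',
            completed_at = SYSUTCDATETIME(),
            error_message = CONCAT('Error ', ERROR_NUMBER(), ': ', ERROR_MESSAGE())
        WHERE batch_id = @batch_id;

        THROW; -- re-raise so the orchestrator sees the failure
    END CATCH
END
```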
I would build the row count deviation alerting immediately. The query I showed earlier — comparing today's row count against the 7-day average — has caught more real issues than any other single piece of monitoring. A 40% deviation threshold has been the right balance between sensitivity and noise in every system I have used it on. Set it up on day one.
Closing
The difference between an ETL pipeline that works and one that works at scale is not cleverness. It is discipline. Stage your data before transforming it. Make every step re-runnable. Track every batch. Log every rejection. Design so that when something fails — and it will — the path to recovery is obvious and fast.
These patterns are not exotic. Staging layers, batch control tables, idempotent loads, and dependency tracking are straightforward to implement in SQL Server with nothing more than T-SQL and a few metadata tables. The hard part is having the discipline to build them before you think you need them.
Every production ETL system I have seen fail at scale failed for the same reason: it was designed for the happy path. The pipelines that survive are the ones designed for the 2 AM failure that nobody is awake to babysit. Build for that scenario, and the happy path takes care of itself.