Designing SQL Server ETL Pipelines That Don't Break at Scale

Question

Designing SQL Server ETL Pipelines That Don't Break at Scale

SQL Expert

SSC-Addicted

Points: 481
More actions
May 15, 2026 at 12:00 am

#4782610
Comments posted to this topic are about the item Designing SQL Server ETL Pipelines That Don't Break at Scale
Ankush Rastogi
"Turning data into insights, one query at a time."
LinkedIn | GitHub
Got a SQL question? Feel free to ask!SELECT 'Have a great day!' AS Greeting;

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

timothyjultican Newbie Points: 7 More actions · Answer 1

I do agree having an ETL control framework to log success and failure steps is extremely important and saves a massive amount of time. One thing that you might consider doing is breaking your Merge step up into two possibly up to four steps... Here's an example. Note you could use temp tables or physical tables. The physical tables give your etl engineer another level of troubleshooting but it does require that you truncate those tables before each ETL job. Also, note this approach requires you to store the Hash on the core table.

The

-- Step 1: Hash staging rows, apply batch filter
SELECT  customer_code,
        customer_name,
        COALESCE(segment, 'Unknown') AS segment,
        region,
        TRY_CAST(registration_date AS DATE) AS registration_date,
        CONVERT(CHAR(64), HASHBYTES('SHA2_256',
            CONCAT_WS('|', customer_code, customer_name,
                           COALESCE(segment,''), ISNULL(region,''))), 2) AS HashCheck
INTO #Step1
FROM staging.customers_raw
WHERE _load_batch_id = @batch_id
  AND TRY_CAST(registration_date AS DATE) IS NOT NULL;

CREATE CLUSTERED INDEX CIX_Step1 ON #Step1(customer_code);

-- Step 2: Classify rows, filter out unchanged
SELECT  s.customer_code,
        s.customer_name,
        s.segment,
        s.region,
        s.registration_date,
        CASE 
            WHEN t.customer_code IS NULL THEN 'I'
            WHEN s.HashCheck <> t.HashCheck THEN 'U'  -- requires stored hash on target
            ELSE 'F'
        END AS IDFU_FLAG
INTO #Step2
FROM #Step1 s
LEFT JOIN dbo.dim_customer t ON s.customer_code = t.customer_code
WHERE t.customer_code IS NULL          -- net new
   OR s.HashCheck <> t.HashCheck;      -- changed only; 'F' rows never materialize

CREATE CLUSTERED INDEX CIX_Step2 ON #Step2(customer_code);

-- MERGE: working set is now inserts + updates only
MERGE dbo.dim_customer AS trgt
USING #Step2 AS src
ON trgt.customer_code = src.customer_code
WHEN MATCHED AND src.IDFU_FLAG = 'U' THEN UPDATE SET ...
WHEN NOT MATCHED AND src.IDFU_FLAG = 'I' THEN INSERT ...;

Then after each step you can use your ETL framework to log success, failure and the number of rows processed. Finally, this example filters out records that haven't changed prior to the merge step. Depending on how your source system sends data, this can have a huge impact on overall performance.

SQL Expert SSC-Addicted Points: 481 More actions · Answer 2

Really appreciate this; the hash-based change detection plus pre-filter deserves its own section in the article. A few thoughts:

Where I've seen this shine brightest is when the source sends full snapshots every load rather than deltas. If 90%+ of rows are unchanged each run, common for customer/product dimensions sourced from operational systems without CDC, filtering them out before the MERGE is a huge win for both lock duration and transaction log volume. For SCD2 dimensions, it can be the difference between a 5-minute load and a 45-minute one.

The split also pairs nicely with the step_control pattern. Logging "Step 1: 1.2M hashed | Step 2: 12K classified I/U (1.0%) | MERGE: 12K affected" tells you instantly whether the source is behaving normally. A sudden jump to 50% changed rows is almost always an upstream issue worth catching before downstream consumers do.

One implementation gotcha worth flagging for anyone trying this: CONCAT_WS skips NULLs entirely, so hashing (A, B, NULL, D) gives a different result from (A, B, '', D) unless you wrap consistently. I've seen this cause phantom updates after a hash expression got tweaked on one side but not the other; every row suddenly looks "changed" on the next run. Worth committing to a strict convention (always ISNULL to a fixed sentinel) and documenting it right next to the hash column definition.

And yes, physical tables over temp for anything running unattended at 2 AM. The TRUNCATE at the start of each run costs nothing compared to having the intermediate results still sitting there when you wake up to a failure. Thanks for the thoughtful addition.

Ankush Rastogi

"Turning data into insights, one query at a time."

Database Enthusiast

SQL Server Specialist

Performance Tuning Expert

LinkedIn | GitHub

Got a SQL question? Feel free to ask!SELECT 'Have a great day!' AS Greeting;

timothyjultican Newbie Points: 7 More actions · Answer 3

Great points!! That's a great call out on the concat with NULLS unfortunately, I've had to resort to a case statement to set a default value '@@' for example.

Tom LeBlanc SSC Rookie Points: 44 More actions · Answer 4

I could have used this article in 2018! I will use it today.

Thanks!

SQL Expert SSC-Addicted Points: 481 More actions · Answer 5

I am glad the article is adding value