Designing SQL Server ETL Pipelines That Don't Break at Scale

  • Comments posted to this topic are about the item Designing SQL Server ETL Pipelines That Don't Break at Scale

    Ankush Rastogi

    "Turning data into insights, one query at a time."

      [*] Database Enthusiast
      [*] SQL Server Specialist
      [*] Performance Tuning Expert

    LinkedIn | GitHub

    Got a SQL question? Feel free to ask!SELECT 'Have a great day!' AS Greeting;
  • I do agree having an ETL control framework to log success and failure steps is extremely important and saves a massive amount of time.  One thing that you might consider doing is breaking your Merge step up into two possibly up to four steps... Here's an example.  Note you could use temp tables or physical tables. The physical tables give your etl engineer another level of troubleshooting but it does require that you truncate those tables before each ETL job.  Also, note this approach requires you to store the Hash on the core table.

    The

    -- Step 1: Hash staging rows, apply batch filter
    SELECT customer_code,
    customer_name,
    COALESCE(segment, 'Unknown') AS segment,
    region,
    TRY_CAST(registration_date AS DATE) AS registration_date,
    CONVERT(CHAR(64), HASHBYTES('SHA2_256',
    CONCAT_WS('|', customer_code, customer_name,
    COALESCE(segment,''), ISNULL(region,''))), 2) AS HashCheck
    INTO #Step1
    FROM staging.customers_raw
    WHERE _load_batch_id = @batch_id
    AND TRY_CAST(registration_date AS DATE) IS NOT NULL;

    CREATE CLUSTERED INDEX CIX_Step1 ON #Step1(customer_code);

    -- Step 2: Classify rows, filter out unchanged
    SELECT s.customer_code,
    s.customer_name,
    s.segment,
    s.region,
    s.registration_date,
    CASE
    WHEN t.customer_code IS NULL THEN 'I'
    WHEN s.HashCheck <> t.HashCheck THEN 'U' -- requires stored hash on target
    ELSE 'F'
    END AS IDFU_FLAG
    INTO #Step2
    FROM #Step1 s
    LEFT JOIN dbo.dim_customer t ON s.customer_code = t.customer_code
    WHERE t.customer_code IS NULL -- net new
    OR s.HashCheck <> t.HashCheck; -- changed only; 'F' rows never materialize

    CREATE CLUSTERED INDEX CIX_Step2 ON #Step2(customer_code);

    -- MERGE: working set is now inserts + updates only
    MERGE dbo.dim_customer AS trgt
    USING #Step2 AS src
    ON trgt.customer_code = src.customer_code
    WHEN MATCHED AND src.IDFU_FLAG = 'U' THEN UPDATE SET ...
    WHEN NOT MATCHED AND src.IDFU_FLAG = 'I' THEN INSERT ...;

    Then after each step you can use your ETL framework to log success, failure and the number of rows processed.  Finally, this example filters out records that haven't changed prior to the merge step.  Depending on how your source system sends data, this can have a huge impact on overall performance.

  • Really appreciate this; the hash-based change detection plus pre-filter deserves its own section in the article. A few thoughts:

    Where I've seen this shine brightest is when the source sends full snapshots every load rather than deltas. If 90%+ of rows are unchanged each run, common for customer/product dimensions sourced from operational systems without CDC, filtering them out before the MERGE is a huge win for both lock duration and transaction log volume. For SCD2 dimensions, it can be the difference between a 5-minute load and a 45-minute one.

    The split also pairs nicely with the step_control pattern. Logging "Step 1: 1.2M hashed | Step 2: 12K classified I/U (1.0%) | MERGE: 12K affected" tells you instantly whether the source is behaving normally. A sudden jump to 50% changed rows is almost always an upstream issue worth catching before downstream consumers do.

    One implementation gotcha worth flagging for anyone trying this: CONCAT_WS skips NULLs entirely, so hashing (A, B, NULL, D) gives a different result from (A, B, '', D) unless you wrap consistently. I've seen this cause phantom updates after a hash expression got tweaked on one side but not the other; every row suddenly looks "changed" on the next run. Worth committing to a strict convention (always ISNULL to a fixed sentinel) and documenting it right next to the hash column definition.

    And yes, physical tables over temp for anything running unattended at 2 AM. The TRUNCATE at the start of each run costs nothing compared to having the intermediate results still sitting there when you wake up to a failure. Thanks for the thoughtful addition.

    Ankush Rastogi

    "Turning data into insights, one query at a time."

      [*] Database Enthusiast
      [*] SQL Server Specialist
      [*] Performance Tuning Expert

    LinkedIn | GitHub

    Got a SQL question? Feel free to ask!SELECT 'Have a great day!' AS Greeting;
  • Great points!!  That's a great call out on the concat with NULLS unfortunately, I've had to resort to a case statement to set a default value '@@' for example.

  • I could have used this article in 2018! I will use it today.

    Thanks!

  • I am glad the article is adding value

    Ankush Rastogi

    "Turning data into insights, one query at a time."

      [*] Database Enthusiast
      [*] SQL Server Specialist
      [*] Performance Tuning Expert

    LinkedIn | GitHub

    Got a SQL question? Feel free to ask!SELECT 'Have a great day!' AS Greeting;

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply