Using T-SQL MERGE for fact table loading

  • Hi All,

    I'm using the T-SQL MERGE statement below to load data into one of the fact tables in our DWH database. The initial full load will be around 30 million records, and going forward each delta load will be less than 50k records.

    I'm dealing with an accumulating-snapshot type of fact table where we always insert: if it's a new record, insert it into the table; if the record already exists but with different fact values, insert a new row with a different load date.

    To handle this requirement I have created a staging (temp) table in the same database as the target fact table, with the same structure, and then used the MERGE query below.

    INSERT INTO dbo.[Fct_Prod_Mthly_Actuals]
        ([Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],[Load_Date],[Update_Date])
    SELECT [Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],
           CAST(CONVERT(VARCHAR(8), GETDATE(), 112) AS DATE) AS [Load_Date],[Update_Date]
    FROM
    (
        MERGE dbo.[Fct_Prod_Mthly_Actuals] AS DST
        USING dbo.[Fct_Prod_Mthly_Actuals_Temp] AS SRC
            ON (SRC.[Well_Skey] = DST.[Well_Skey]
                AND SRC.[DateKey] = DST.[DateKey]
                AND SRC.[Wghtd_WI_Key] = DST.[Wghtd_WI_Key]
                AND SRC.[Item_Cd] = DST.[Item_Cd]
                AND SRC.[Metric_Desc] = DST.[Metric_Desc])
        WHEN NOT MATCHED THEN
            INSERT ([Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],[Load_Date],[Update_Date])
            VALUES (SRC.[Well_Skey],SRC.[DateKey],SRC.[Cost_Center],SRC.[Wghtd_WI_Key],SRC.[Item_Cd],SRC.[Prd_Lease_Vent_Cd],SRC.[Rev_Lease_Vent_Cd],SRC.[PHA_Flg],SRC.Days_Prodcd,SRC.Days_Injctd,SRC.[Pressure_Base],SRC.[Metric_Desc],SRC.[Metric_Vol],SRC.[Load_Date],SRC.[Update_Date])
        WHEN MATCHED AND
            (   ISNULL(DST.[Metric_Vol],'')      <> ISNULL(SRC.[Metric_Vol],'')
             OR ISNULL(DST.Pressure_Base,'')     <> ISNULL(SRC.Pressure_Base,'')
             OR ISNULL(DST.Days_Injctd,'')       <> ISNULL(SRC.Days_Injctd,'')
             OR ISNULL(DST.Days_Prodcd,'')       <> ISNULL(SRC.Days_Prodcd,'')
             OR ISNULL(DST.Rev_Lease_Vent_Cd,'') <> ISNULL(SRC.Rev_Lease_Vent_Cd,'')
             OR ISNULL(DST.Prd_Lease_Vent_Cd,'') <> ISNULL(SRC.Prd_Lease_Vent_Cd,'')
             OR ISNULL(DST.[PHA_Flg],'')         <> ISNULL(SRC.[PHA_Flg],'')
            ) THEN
            UPDATE SET DST.[Load_Date] = DST.[Load_Date]
        OUTPUT SRC.[Well_Skey],SRC.[DateKey],SRC.[Cost_Center],SRC.[Wghtd_WI_Key],SRC.[Item_Cd],SRC.[Prd_Lease_Vent_Cd],SRC.[Rev_Lease_Vent_Cd],SRC.[PHA_Flg],SRC.Days_Prodcd,SRC.Days_Injctd,SRC.[Pressure_Base],SRC.[Metric_Desc],SRC.[Metric_Vol],SRC.[Update_Date],$Action AS MergeAction
    ) AS MRG
    WHERE MRG.MergeAction = 'UPDATE';

    So my questions are:

    1) Can the above query handle such a large volume of data?

    2) Do I need to create any indexes on either the target fact table or the temp table, or should I consider partitioning the fact table?

    Please suggest/advise.

    Thanks in advance

    Sam

  • I've never seen the two inserts done like that; it looks very clever.

    Am I right in thinking that you are not actually updating anything in the fact table? DST.[LOAD_DATE] = DST.[LOAD_DATE] doesn't change anything.

    This means you are doing two inserts:

    When not matched then insert.

    When matched, but different, then insert.

    Is the LOAD_DATE column the only difference between the two inserts?

    As for your questions: an index on the join columns, with the other columns in an INCLUDE, might help (see the sketch at the end of this post). I think the ISNULL will prevent the use of an index on the comparison columns.

    If you are always loading a finite date range, and its not feasible to index everything, you might be able to index the DateKey column and filter the target using a common table expression.

    30 million is pretty small, so I doubt partitioning is worthwhile unless you expect the table to get more than thirty times the size.
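
    Just as a sketch of what I mean (the index name is made up, and I'm assuming all the non-key columns are small enough to sit in an INCLUDE, so adjust for your actual data types):

    -- Hypothetical covering index on the MERGE join columns of the target fact table.
    -- The key matches the ON clause; the compared columns go in INCLUDE so the
    -- change check can be answered without key lookups.
    CREATE NONCLUSTERED INDEX IX_Fct_Prod_Mthly_Actuals_MergeKeys
        ON dbo.Fct_Prod_Mthly_Actuals (Well_Skey, DateKey, Wghtd_WI_Key, Item_Cd, Metric_Desc)
        INCLUDE (Cost_Center, Prd_Lease_Vent_Cd, Rev_Lease_Vent_Cd, PHA_Flg,
                 Days_Prodcd, Days_Injctd, Pressure_Base, Metric_Vol, Load_Date, Update_Date);

    A similar index (or the clustered index) on the same key columns of Fct_Prod_Mthly_Actuals_Temp is worth testing too, since MERGE tends to perform best when both sides are sorted on the join keys.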

  • If you are always loading a finite date range, and its not feasible to index everything, you might be able to index the DateKey column and filter the target using a common table expression.

    Ed,

    could you please explain more about "filtering the target using a CTE"? Also, should I create a clustered index on the update date column (which comes from the source table) on both the target fact table and the temp fact table?

  • sam 55243 (4/29/2013)


    If you are always loading a finite date range, and its not feasible to index everything, you might be able to index the DateKey column and filter the target using a common table expression.

    Ed,

    could you please explain more about "filtering the target using a CTE"? Also, should I create a clustered index on the update date column (which comes from the source table) on both the target fact table and the temp fact table?

    I don't know if it will work with the extra select from the output table, but something like this can work to limit the scope of the merge target. This is also useful if you want to delete when not matched by source and you only have a subset of source data.

    I don't know about a clustered index on update date. I usually put a clustered index on something unique and ever increasing.

    DECLARE @MinDateKey INT, @MaxDateKey INT;

    WITH cte_Fct_Prod_Mthly_Actual AS
    (
        SELECT *
        FROM dbo.Fct_Prod_Mthly_Actuals
        WHERE DateKey >= @MinDateKey
          AND DateKey <= @MaxDateKey
    )
    MERGE cte_Fct_Prod_Mthly_Actual AS DST
    USING dbo.[Fct_Prod_Mthly_Actuals_Temp] AS SRC
        ON (SRC.[Well_Skey] = DST.[Well_Skey]
            AND SRC.[DateKey] = DST.[DateKey]
            AND SRC.[Wghtd_WI_Key] = DST.[Wghtd_WI_Key]
            AND SRC.[Item_Cd] = DST.[Item_Cd]
            AND SRC.[Metric_Desc] = DST.[Metric_Desc])
    WHEN NOT MATCHED THEN
        INSERT ([Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],[Load_Date],[Update_Date])
        VALUES (SRC.[Well_Skey],SRC.[DateKey],SRC.[Cost_Center],SRC.[Wghtd_WI_Key],SRC.[Item_Cd],SRC.[Prd_Lease_Vent_Cd],SRC.[Rev_Lease_Vent_Cd],SRC.[PHA_Flg],SRC.Days_Prodcd,SRC.Days_Injctd,SRC.[Pressure_Base],SRC.[Metric_Desc],SRC.[Metric_Vol],SRC.[Load_Date],SRC.[Update_Date])
    WHEN MATCHED AND
        (   ISNULL(DST.[Metric_Vol],'')      <> ISNULL(SRC.[Metric_Vol],'')
         OR ISNULL(DST.Pressure_Base,'')     <> ISNULL(SRC.Pressure_Base,'')
         OR ISNULL(DST.Days_Injctd,'')       <> ISNULL(SRC.Days_Injctd,'')
         OR ISNULL(DST.Days_Prodcd,'')       <> ISNULL(SRC.Days_Prodcd,'')
         OR ISNULL(DST.Rev_Lease_Vent_Cd,'') <> ISNULL(SRC.Rev_Lease_Vent_Cd,'')
         OR ISNULL(DST.Prd_Lease_Vent_Cd,'') <> ISNULL(SRC.Prd_Lease_Vent_Cd,'')
         OR ISNULL(DST.[PHA_Flg],'')         <> ISNULL(SRC.[PHA_Flg],'')
        ) THEN
        UPDATE SET DST.[Load_Date] = DST.[Load_Date]
    OUTPUT SRC.[Well_Skey],SRC.[DateKey],SRC.[Cost_Center],SRC.[Wghtd_WI_Key],SRC.[Item_Cd],SRC.[Prd_Lease_Vent_Cd],SRC.[Rev_Lease_Vent_Cd],SRC.[PHA_Flg],SRC.Days_Prodcd,SRC.Days_Injctd,SRC.[Pressure_Base],SRC.[Metric_Desc],SRC.[Metric_Vol],SRC.[Update_Date],$Action AS MergeAction;
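
    Just to complete the picture (the snippet above leaves the two variables unassigned), the date-key range could be taken from the incoming delta before the MERGE runs, and a nonclustered index on DateKey on the target would let that filter seek rather than scan:

    -- Illustrative only: derive the target filter range from the staging table
    SELECT @MinDateKey = MIN(DateKey),
           @MaxDateKey = MAX(DateKey)
    FROM dbo.Fct_Prod_Mthly_Actuals_Temp;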

  • You might hit an issue with MERGE on a large record set.

    This is a handy article: http://technet.microsoft.com/en-us/library/cc879317.aspx. It gives some advice on the indexes and statistics to look at when tuning the MERGE.
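
    As a simple example of the statistics side of that advice (just an illustration on my part, not something taken from the article), refreshing stats on both tables before the delta run is cheap to try:

    -- Refresh statistics on source and target before the incremental MERGE
    UPDATE STATISTICS dbo.Fct_Prod_Mthly_Actuals_Temp;
    UPDATE STATISTICS dbo.Fct_Prod_Mthly_Actuals;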


  • Thanks Jason, I'll look into that.

  • We use a similar MERGE statement to insert into the target table, but with a 3-way insert for inserts, updates and deletes.

    We have a clustered index on the match columns, i.e. Well_Skey, DateKey, Wghtd_WI_Key, Item_Cd, Metric_Desc.

    It works, but performance is not great. Filtering the source using a date_created column will help if you can do it, like:

    MERGE dbo.[Fct_Prod_Mthly_Actuals] AS DST
    USING (SELECT * FROM dbo.[Fct_Prod_Mthly_Actuals_Temp] WHERE date_created > @load_from) AS SRC

    As for loading the initial 30 million rows, you will be better off using a simple INSERT statement, as sketched below.
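
    Something along these lines should do it (a sketch only; the TABLOCK hint is my assumption to allow minimal logging under the simple or bulk-logged recovery model):

    -- Initial full load: plain INSERT...SELECT from the staging table, no MERGE needed
    INSERT INTO dbo.Fct_Prod_Mthly_Actuals WITH (TABLOCK)
        ([Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],[Load_Date],[Update_Date])
    SELECT [Well_Skey],[DateKey],[Cost_Center],[Wghtd_WI_Key],[Item_Cd],[Prd_Lease_Vent_Cd],[Rev_Lease_Vent_Cd],[PHA_Flg],Days_Prodcd,Days_Injctd,[Pressure_Base],[Metric_Desc],[Metric_Vol],
           CAST(CONVERT(VARCHAR(8), GETDATE(), 112) AS DATE) AS [Load_Date],[Update_Date]
    FROM dbo.Fct_Prod_Mthly_Actuals_Temp;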
