Upsert Dimension Table

  • The surrograte key is an identity field, how can it have zero in it?

    What I usually do is loading all the data in a staging table.  Then using left join to insert the new records into the dimension table. The dimension table does not need to have the surrogate key.

    INSERT INTO Dim(Natural_key, attribute1, attribute2, load_time)

    SELECT s.natural_key,  s.attribute1, s.attribute2, GETDATE()

    FROM staging_table s

    LEFT JOIN dw.dbo.dimension dim ON s.natural_key = dim.natural_key WHERE dim.natural_key is NULL

    BTW, your step 2 insert statement, did you forget to insert the natural_key?

    If the records are already in the dimension table, then update the record with the most recent information.

    Update d set d.attitute1=s.attitute1, d.attitute2 = s.attitute2, d.load_time=GETDATE() FROM dw.dbo.dimension d inner join staging_table s where (d.natural_key = s.natural_key) and (d.attitute1 <> s.attitute1 or d.attitute2 <> s.attitute2)

    I don't know what kind of benefit to do SCD3.  I think the effective date and expiration date is good enough in dimension table.  In SCD3 you are changing the dimension table structure.  What if the attribute changes 10 times, you would have a big table.

     

     

  • In SSIS you could use the <Slowly Changing Dimension> component which basically does the UPSERT.

  • Is there a reason you chose to use a list of "OR" vs CHECKSUM in your update statement? I realize CHECKSUM is not the same as CRC but it works just fine if comparing a single row.

    Example:

    UPDATE dim

    SET dim.attribute1 = tmp.attribute1,

    dim.attribute1 = tmp.attribute2,

    dim.load_time = tmp.load_time

    FROM dw.dbo.dimension dim

    INNER JOIN #dim tmp

    ON tmp.surrogate_key = dim.surrogate_key

    WHERE tmp.surrogate_key <> 0

    AND CHECKSUM(dim.attribute1,dim.attribute2) <> CHECKSUM(tmp.attribute1,tmp.attribute2)

    This is cleaner to read for SCD Type 1 changes and, at least in our environment, seems to be a bit quicker.

  • I tried using the Slowly Changing Dimension transormation. I basically need two paths - One to insert the new record ('New Output Path' to OLE DB Destination) and one to update changed attributes in an existing record based on a specified business key ('Changed Attribute Updates Output' to OLE DB Destination).

    The insertion works fine but when i modify the same record in source data (retaining the same primary key) the data gets routed in the Changed Attribute path but fails with a primary key violation (it has attempted to insert into the destination table). Am i missing a setting or something that forces update rather than insert?

  • cliffb,

    I could use checksum, I was just in a hurry so I wrote it in using 'OR' especially there were only two values to check.  However I would not use surrograte key to compare when I update the dimension table, I would use the natural key.

  • It all depends on the SCD type. If you chose TYPE I (overwrite) then in SSIS SCD wizard you chose <changing attributes> which basically means update existing record. However if you are dealing with TYPE II then you chose <Historical Attribute> which means add a new record. Here you must have a way to distinguish the current record from the old records. Usually we design the dimension table with fields such as <current_flag> and configure SCD component to set the flag to “y” or “n”. Now here the assumption is that your businees key is not your primary key. If it is then you can only use Type I SCD and have to configure the SCD component to <changing attributes> and not <Historical Attribute>. Hope this helps.

  • Hi everybody, apologise I came late to the discussion.

    Calvin, I'm not sure why the link is missing but the article is at http://www.sqlservercentral.com/columnists/vRainardi/2732.asp.

    Steve, the link to the article is still missing today (7th Jan).

    Uday, as Rasool said, to upsert a dimension in SSIS you can use SCD transformation such as http://www.microsoft.com/library/media/1033/technet/images/prodtechnol/sql/2005/s4_big.png, or as Alex and Robert said we can also use Lookup transformation.

    About your last post where an update is recognised as an insert, in the SSIS SCD Wizard did you choose the Change Type for that column as 'Changing Attribute' or as 'Historical Attribute'? And did you tick the 'Change all the matching records, including outdated records, when changes are detected in a changing attribute' check box?

    Loner, the surrogate_key column which has 0 value in it is a temporary table, not the target dimension table. Yes you are right I forgot to include the natural key column. Thank you for your correction. The correct insert statement is

    INSERT INTO #dim (surrogate_key, natural_key, attribute1, attribute2, load_time)

    SELECT ISNULL(dim.surrogate_key,0), src.natural_key, src.attribute1, src.attribute2, @current_load_time

    FROM stg.dbo.source src

    LEFT JOIN dw.dbo.dimension dim ON src.natural_key = dim.natural_key

    Yes we can also do it using the approach that you described, Loner, i.e. the temp table doesn't have to have the surrogate key and we leave the key assignment to identity of the target dimension table.

    SCD3 is normally use if we only require certain level of history, for example, if we want to keep only the last 2 region for a particular branch. So if branch A is allocated to region 1 in 2001, then to region 2 in 2002 (perhaps because of reorganisation or restructurisation), then to region 3 in 2003, and we only want to keep the last 2 regions, then SCD3 could be a technique that we could use. So we don't normally add a column every time. If we want to keep the last 2 values for example, then we created 3 sets of columns since the beginning (1 set is for the current value). But you are right, if we want to keep the last 10 values then the structure would be a big table.

    Cliff, I agree with Loner that we can use checksum instead of or, which is cleaner as you said. I am not sure that it is quicker though. Loner, you are right, we should use natural key when comparing rows in dimension table. The purpose of 'WHERE tmp.surrogate_key <> 0' is not for comparing, but to update the rows where the tmp table’s surrogate key is not 0, i.e. the rows already exist on the dimension table. For the rows where the surrogate key is 0 (not exist on the dimension table), we will insert them into the dimension table later on.

    Ah, Uday, I saw that Rasool already explained 'Changing Attribute' or as 'Historical Attribute' and how to tag the current records, in his last posting.

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply