SQLServerCentral is supported by Red Gate Software Ltd.
What happens if new records are inserted in a source table during a package execution?
Posted Friday, May 3, 2013 5:12 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, August 18, 2014 5:26 AM
Points: 119, Visits: 483
Hi Folks,

I have a package that loads records into a CRM destination using a web service. The execution takes, let's say, around 3 hours. I would like to know: what happens if new records are inserted into the source table during the package execution?
Suppose there's no filter in the query that gets the source records.
I know I can program the loading process in another way, but what I really want to know is what would happen in this scenario. Are the new records loaded? Maybe it depends on the buffer, i.e. whether all the records had already been fetched when the new ones came in?
EDIT: suppose that the OLE DB source that I'm using is still yellow when the new records arrive.
Any comment would be appreciated.
Regards,


Paul Hernández
http://hernandezpaul.wordpress.com/
https://twitter.com/paul_eng
Post #1449133
Posted Monday, May 6, 2013 12:34 PM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 7:34 AM
Points: 7,098, Visits: 12,606
Paul Hernández (5/3/2013)
I would like to know: what happens if new records are inserted into the source table during the package execution? ... EDIT: suppose that the OLE DB source that I'm using is still yellow when the new records arrive.

The window is small after you issue the SELECT, but it's possible for more rows to sneak into the result set even in READ COMMITTED mode.

When the source is yellow, what is the wait type of the SPID selecting the rows? Async network I/O or something else? Or is SSIS wrong, and has the SPID already been cleared out?

This is probably what you meant by "I can program the loading process in another way, I know", but to remove most of the doubt I like to code extracts like this off an ever-increasing key, i.e. an IDENTITY column set up as a unique index or primary key. This way I select the maximum ID into a variable, then select all rows that are less than or equal to that ID.
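A minimal sketch of that high-water-mark pattern (table and column names here are illustrative, not from the original post):

```sql
-- Capture the high-water mark first, then extract only rows at or below it.
-- Rows inserted after this point get higher IDs and are picked up on the next run.
DECLARE @MaxID INT;

SELECT @MaxID = MAX(SourceID)      -- SourceID: IDENTITY, unique index or PK
FROM dbo.SourceTable;

SELECT SourceID, Col1, Col2
FROM dbo.SourceTable
WHERE SourceID <= @MaxID;
```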

Are you sweeping the rows, i.e. deleting them from the source after you push them through to the web service?


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1449839
Posted Tuesday, May 7, 2013 5:58 AM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 11:23 AM
Points: 5,047, Visits: 11,799
This is probably what you meant by "I can program the loading process in another way, I know", but to remove most of the doubt I like to code extracts like this off an ever-increasing key, i.e. an IDENTITY column set up as a unique index or primary key. This way I select the maximum ID into a variable, then select all rows that are less than or equal to that ID.


+1 to this, me too.



Help us to help you. For better, quicker and more-focused answers to your questions, consider following the advice in this link.

When you ask a question (and please do ask a question: "My T-SQL does not work" just doesn't cut it), please provide enough information for us to understand its context.
Post #1450087
Posted Tuesday, May 7, 2013 9:19 AM


Ten Centuries


Group: General Forum Members
Last Login: Friday, August 22, 2014 8:05 AM
Points: 1,222, Visits: 2,545
Phil Parkin (5/7/2013)
This is probably what you meant by "I can program the loading process in another way, I know", but to remove most of the doubt I like to code extracts like this off an ever-increasing key, i.e. an IDENTITY column set up as a unique index or primary key. This way I select the maximum ID into a variable, then select all rows that are less than or equal to that ID.


+1 to this, me too.


Ditto on the +1. If you want to do only an incremental extraction (instead of extracting ALL rows where ID <= MAX(ID)), create a log table with a column called extractedThroughID or something similar. When you begin the extraction, read the most recent extractedThroughID from the log table into a variable (lastExtractEndID, for example) and the MAX(ID) into another variable (e.g., currentExtractEndID), and extract only rows WHERE ID > lastExtractEndID AND ID <= currentExtractEndID. When the extraction process completes successfully, insert the value of currentExtractEndID into the log table as extractedThroughID so the extraction will start from there next time.
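A rough sketch of that log-table approach (all object and variable names are illustrative):

```sql
-- One-time setup: a log table holding the high-water mark of each completed extract.
-- CREATE TABLE dbo.ExtractLog (extractedThroughID INT NOT NULL, extractDate DATETIME NOT NULL);

DECLARE @lastExtractEndID INT, @currentExtractEndID INT;

-- Where the previous successful extract stopped (0 on the first run).
SELECT @lastExtractEndID = ISNULL(MAX(extractedThroughID), 0)
FROM dbo.ExtractLog;

-- Where this extract will stop.
SELECT @currentExtractEndID = MAX(ID)
FROM dbo.SourceTable;

-- Extract only the new slice.
SELECT ID, Col1
FROM dbo.SourceTable
WHERE ID > @lastExtractEndID
  AND ID <= @currentExtractEndID;

-- After the load completes successfully, record the new high-water mark.
INSERT INTO dbo.ExtractLog (extractedThroughID, extractDate)
VALUES (@currentExtractEndID, GETDATE());
```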


Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Post #1450213
Posted Friday, May 10, 2013 6:16 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Monday, August 18, 2014 5:26 AM
Points: 119, Visits: 483
Thanks so much for your answers,

Thanks to opc.three's answer I was able to learn about SPIDs, wait types, packet size, etc.

I have already implemented the IDENTITY column (Idx) in the staging table. After a record is successfully imported into the CRM, I write the CRM uniqueidentifier back to the table, together with the datetime of the update, in a column called UpdateDT.

Then I created a second package to execute this package. I pass the number of records to load as a parameter and query the MAX(Idx); then I have something like this in the source query:
WHERE Idx > MAX(Idx) AND Idx < MAX(Idx) + RowsToLoad

Of course, I'm using some variables and other names.

I created a loop that keeps the loading process active until there are no more records to load. It has been running since Monday (4 days), and 1.3 million customers out of 2.2 million have already been imported.
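In other words, each iteration's source query looks something like this (variable and column names are illustrative; the real package uses different names, per the post above):

```sql
DECLARE @LastIdx INT, @RowsToLoad INT = 10000;

-- Highest Idx already confirmed in CRM (UpdateDT is written back after a successful import).
SELECT @LastIdx = ISNULL(MAX(Idx), 0)
FROM dbo.StagingTable
WHERE UpdateDT IS NOT NULL;

-- Next batch of at most @RowsToLoad rows; the loop repeats until this returns nothing.
SELECT Idx, Col1
FROM dbo.StagingTable
WHERE Idx > @LastIdx
  AND Idx <= @LastIdx + @RowsToLoad;
```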

Kind Regards,



Paul Hernández
http://hernandezpaul.wordpress.com/
https://twitter.com/paul_eng
Post #1451548
Posted Friday, May 10, 2013 4:53 PM


SSC-Dedicated


Group: General Forum Members
Last Login: Yesterday @ 9:58 AM
Points: 36,995, Visits: 31,517
wolfkillj (5/7/2013)
Ditto on the +1. If you want to do only an incremental extraction (instead of extracting ALL rows where ID <= MAX(ID)), create a log table with a column called extractedThroughID or something similar. When you begin the extraction, read the most recent extractedThroughID from the log table into a variable (lastExtractEndID, for example) and the MAX(ID) into another variable (e.g., currentExtractEndID), and extract only rows WHERE ID > lastExtractEndID AND ID <= currentExtractEndID. When the extraction process completes successfully, insert the value of currentExtractEndID into the log table as extractedThroughID so the extraction will start from there next time.


I guess I'm the odd ball. I've had some situations (high-speed staging tables fed by triggers on multiple other tables) where rows with lower numbers than the current MAX(ID) were still in the process of being committed, and it caused some real headaches. I don't rely on MAX(ID). Instead, I read all of the rows that have no locks on them by using the READPAST hint, and I store the IDs in a temp table to control what I work on, delete, etc.
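A rough sketch of that READPAST pattern (object names are illustrative):

```sql
-- Capture only rows that are not locked by in-flight transactions; locked rows
-- are skipped rather than waited on, and will be picked up on a later sweep.
CREATE TABLE #WorkList (ID INT NOT NULL PRIMARY KEY);

INSERT INTO #WorkList (ID)
SELECT ID
FROM dbo.StagingTable WITH (READPAST);

-- Process only the captured IDs, then sweep exactly those rows from staging.
DELETE s
FROM dbo.StagingTable AS s
JOIN #WorkList AS w
  ON w.ID = s.ID;
```

Note that READPAST is only allowed under the READ COMMITTED or REPEATABLE READ isolation levels.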


--Jeff Moden
RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1451779
Posted Monday, May 13, 2013 8:31 AM


Ten Centuries


Group: General Forum Members
Last Login: Friday, August 22, 2014 8:05 AM
Points: 1,222, Visits: 2,545
Jeff Moden (5/10/2013)
I guess I'm the odd ball. I've had some situations (high-speed staging tables fed by triggers on multiple other tables) where rows with lower numbers than the current MAX(ID) were still in the process of being committed, and it caused some real headaches. I don't rely on MAX(ID). Instead, I read all of the rows that have no locks on them by using the READPAST hint, and I store the IDs in a temp table to control what I work on, delete, etc.


Jeff -

Thanks for mentioning this. It prompted me to think about how INSERT/UPDATE transactions that have been initiated but not committed could cause some rows to fall through the cracks of ETL processes that select rows for extraction by comparing some ever-increasing column with a "cutoff value". Thanks for suggesting your method - I'll be looking into it.

Jason


Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Post #1452135
Posted Monday, May 13, 2013 11:33 AM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 7:34 AM
Points: 7,098, Visits: 12,606
wolfkillj (5/13/2013)
Thanks for mentioning this. It prompted me to think about how INSERT/UPDATE transactions that have been initiated but not committed could cause some rows to fall through the cracks of ETL processes that select rows for extraction by comparing some ever-increasing column with a "cutoff value". Thanks for suggesting your method - I'll be looking into it.

Jason

UPDATEs could be missed, but that's an entirely different ballgame and would require a different approach than "less than or equal to IDENTITY". For INSERTs, I do not think they will fall through the cracks, thanks to ACID properties.

I like Jeff's technique for the scenario he mentioned (high-speed staging tables fed by triggers on multiple other tables), but it could be deemed over-engineered in some environments. I have added it to my toolkit though - thanks, Jeff!


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1452246
Posted Monday, May 13, 2013 1:10 PM


Ten Centuries


Group: General Forum Members
Last Login: Friday, August 22, 2014 8:05 AM
Points: 1,222, Visits: 2,545
opc.three (5/13/2013)
UPDATEs could be missed, but that's an entirely different ballgame and would require a different approach than "less than or equal to IDENTITY". For INSERTs, I do not think they will fall through the cracks, thanks to ACID properties.

I like Jeff's technique for the scenario he mentioned (high-speed staging tables fed by triggers on multiple other tables), but it could be deemed over-engineered in some environments. I have added it to my toolkit though - thanks, Jeff!


I haven't examined it fully, but in some brief testing I did, I observed some interesting behavior that leads me to believe there could be issues with the "less than or equal to IDENTITY" approach with inserts.

I created a table named testTable with an identity column named rowNbr (and an irrelevant col1 varchar(10)).

I initiated a transaction to insert a thousand rows, with a WAITFOR DELAY '00:00:10' after the INSERT but before the COMMIT.

While this transaction was open, I opened another connection and executed in immediate succession an INSERT of a single row, a SELECT @maxRowNbr = MAX(rowNbr) FROM testTable, and a SELECT * FROM testTable WHERE rowNbr <= @maxRowNbr. The INSERT and the SELECT MAX(rowNbr) completed before the other transaction committed, and the SELECT MAX(rowNbr) returned 1001. The SELECT * FROM testTable seemed to be blocked by the other transaction, but returned 1001 rows once the other transaction committed.

I haven't examined the locking/blocking that occurred or researched this further, but I'm curious about whether SQL Server will ALWAYS require a large INSERT transaction that involves an IDENTITY column to complete before executing a SELECT * WHERE [IDENTITY column] <= [some number higher than the highest IDENTITY value in the INSERT batch].


Jason Wolfkill
Blog: SQLSouth
Twitter: @SQLSouth
Post #1452284
Posted Monday, May 13, 2013 1:46 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 7:34 AM
Points: 7,098, Visits: 12,606
The issuance of the IDENTITY value is part of the transaction, so we are protected from loss of data.

In SSMS, Query Window 1, run this:

USE YourDatabaseName;

IF OBJECT_ID('dbo.StagingTable') IS NOT NULL
    DROP TABLE dbo.StagingTable;

CREATE TABLE dbo.StagingTable
(
    StagingTableID INT NOT NULL IDENTITY(1,1),
    Col1 sysname
);

In Query Window 2, populate the table with some data:

BEGIN TRAN;

INSERT INTO dbo.StagingTable (Col1)
SELECT TOP 1000 c1.name
FROM sys.columns c1
CROSS JOIN sys.columns c2;

-- note: I am leaving the transaction open on purpose

In Query Window 3, insert a single row:

INSERT INTO dbo.StagingTable (Col1)
SELECT TOP 1 c1.name
FROM sys.columns c1
CROSS JOIN sys.columns c2;

-- note that the insert succeeds despite the fact
-- that there are 1000 rows inserted, but not yet
-- committed, on another SPID

Query Window 4, along comes our SSIS Package wanting to get the MAX ID and copy the data:

SELECT MAX(StagingTableID)
FROM dbo.StagingTable;

Nope, it must wait, due to the open transaction in Window 2. Go back to Window 2 and call COMMIT or ROLLBACK (your choice), and you'll see that as soon as that happens, Window 4 completes immediately and returns 1001. This is the protection we need, i.e. no lost data.


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1452299