RE: Index on table with 70+ million rows

SSC Guru

Points: 1003861

August 23, 2016 at 7:52 pm

Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Rowan-283474 (8/23/2016)
Rowan-283474 (8/23/2016)
Sergiy (8/23/2016)
Sergiy (8/22/2016)
Do you really have the case when the source table does not have any kind of time stamp or an auto-incremental ID?
Can you answer this?
Yes unfortunately I have nothing like the above mentioned to work with
I do actually have a timestamp that I can work with but this doesn't help me when it comes to identifying rows that I need to update on my side, without some for of an ID or Key column.
If that time stamp indicates the time when a row was last time affected with an insert/update then you need to copy only those rows which have the time stamp later than the MAX(time stamp) in your reporting data set from yesterday.
You can do that, and I have previously, but then you potentially end up with two versions of the 'same' row so to speak. Because there is no way to link a row of yesterday with a row of 'today' you cannot merge the updated data from today with yesterday, you can only insert and thus the potential for different versions of the 'same' row

I'm sure that I'm preaching to the choir but this all has a really rotten smell to it. The data is coming from an SQL Server and I'm reasonably sure that they have some way to uniquely and consistently identify each row between the systems even if they have to use several columns to do it (as in some form of manufactured PK based on the data even if coming from more than 1 table). They need to provide that information and this will all become a cake walk (I suggest using EXCEPT between the new data and the old to quickly identify changes and additions and deletes will be a simple lookup).

Since you're using SSIS for this, are you saying that the company you work for owns that other server? If so, I'd bring guns to bear on the problem and have someone explain the problem they're causing and have someone fix it. If the server is owned by a 3rd party and there's a contract for this data, make them provide the data with some form of consistent PK so that you can do a decent merge without having to reload the whole snapshot.

Otherwise, you're pretty much stuck with what you have. You can seriously decrease the amount of time that data won't be available (to milliseconds) by doing the swap/rename that I previously suggested but you won't be able to decrease the amount of time the server has to work.

Again, not telling you anything you don't know but have to say it out loud... unless they provide a consistent unique key in the data, you're pretty much stuck with things like they are. You can't even bank on dates for a comparison right now.

And, sorry for the rant. I just hate it when people like "them" do things like this to people like "you". It's just not necessary and it's so very easy for them to fix it on their side.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)