October 19, 2007 at 12:38 pm
I have a 22,000,000+ row table that has a field that is a concatenation of 3 fields. I need to extract this information into 3 separate fields. What is the best way to go about this? Is it possible to read a record at a time and do this update? Any suggestions?
October 19, 2007 at 1:48 pm
I'm assuming the field is a varchar field; you didn't give an example, but there are two methods to look at below, depending on whether the values are always in the same positions or are between certain delimiters:
declare @sample varchar(30)
set @sample = 'ssn-000x0000'

--YOU KNOW the fields are always in the same fixed positions
select substring(@sample,1,3) As pt1,
       substring(@sample,5,3) As pt2,
       substring(@sample,9,4) As pt3

--the parts are delimited and not consistent? split on the delimiters instead
select substring(@sample,1,charindex('-',@sample)-1) As pt1,
       substring(@sample,charindex('-',@sample)+1, charindex('x',@sample)-charindex('-',@sample)-1) As pt2,
       substring(@sample,charindex('x',@sample)+1, len(@sample)) As pt3
pt1 pt2 pt3
---- ---- ----
ssn 000 0000
pt1 pt2 pt3
------------------------------ ------------------------------ ----
ssn 000 0000
Lowell
October 19, 2007 at 1:52 pm
The problem I am having is space and time on the SQL 2000 server. I need to know how to accomplish this without hogging the system.
October 19, 2007 at 2:16 pm
1. Reading a record at a time versus inserting in bulk will lengthen your process something awful. It will take more resources and piss the server off for longer... (in short - don't do that). Row by row, or as one of the regular posters here has named it, RBAR (row by agonizing row), will take hours/days. Let the import run as fast as it possibly can - set-based should get through that a LOT faster.
2. If you wish to minimize pain - import first, then split the fields up. Better yet - import it into a table with no indexes, constraints, etc. (see the sketch after this list). Validate and parse once you have it in a table. By setting up a "straight import" and worrying about the parsing separately, you're going to limit the impact of each activity, so neither one should "redline the server". You might be able to get away with parsing+validating WHILE you insert into the real table, but that's your call.
3. Do the validations/parsing in chunks so that your transactions don't get "crazy", which brings up another thing - don't do this whole thing as one single transaction. There are lots of examples on here of how to loop through a large table like yours after using SET ROWCOUNT to limit how many rows you affect. Using small enough chunks should keep a lid on the activity.
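For item 2, a bare-bones staging table might look something like this (the table name and column width are placeholders - the real width would come from the mainframe record layout; the column name is the one used elsewhere in this thread, and only the combined field is shown):

--heap with no keys, indexes, or constraints: the fastest possible target for the weekly import
CREATE TABLE dbo.RESULTS_Staging
(
    C_PROMO_PRC_INF0 varchar(50) NULL   --the combined field, loaded as-is and parsed later
)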
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 19, 2007 at 2:44 pm
Thanks Matt for your reply.
I am actually trying the SET ROWCOUNT option with a test database of 1 million records, with ROWCOUNT set to 1000. So far it has been running for 1 hour. Questions: 1. Does this option release space after each chunk is processed? 2. Are you suggesting several queries within the DTS package to accomplish this when you stated "not in a single transaction"?
October 19, 2007 at 8:05 pm
Went to the doctor and said "Doctor, when I hold my arm up like this (demonstrating strange angle to arm), it hurts like hell". Doctor said, "Don't do that... it'll hurt like hell".
First... if you have a table that has 3 fields combined into one, you need to find out who designed it, hunt them down, and feed them a roll of toilet paper to help them clean up their mind. 😛 That's a horrible design and they need to have their birthdate taken away from them :hehe:
If you can modify the table, add 3 calculated columns that do the split, and you're done.
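A minimal sketch of that, assuming the combined column is the C_PROMO_PRC_INF0 column on the RESULTS table mentioned elsewhere in the thread (adjust the offsets/lengths to the real layout):

--computed columns split the field on the fly; nothing to update, nothing extra to store (SQL 2000 syntax)
ALTER TABLE dbo.RESULTS ADD
    N_Cont      AS SUBSTRING(C_PROMO_PRC_INF0, 1, 5),
    N_RLTD_CUST AS SUBSTRING(C_PROMO_PRC_INF0, 10, 7),
    C_Cont      AS SUBSTRING(C_PROMO_PRC_INF0, 18, 1)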
If you can't do that, do a good ol' fashioned SELECT/INTO to create a sister table that contains the split data and join to that. Don't forget to copy the PK of the original table when you do that.
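Something along these lines (SomePKColumn is a placeholder for whatever uniquely identifies a row in the original table; the other names are the ones used elsewhere in the thread):

--sister table holding only the key plus the split-out pieces
SELECT SomePKColumn,                              --placeholder for the original table's PK
       SUBSTRING(C_PROMO_PRC_INF0, 1, 5)  AS N_Cont,
       SUBSTRING(C_PROMO_PRC_INF0, 10, 7) AS N_RLTD_CUST,
       SUBSTRING(C_PROMO_PRC_INF0, 18, 1) AS C_Cont
INTO   dbo.RESULTS_Split
FROM   dbo.RESULTS

Index RESULTS_Split on that key afterwards so the join back to the original table stays cheap.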
Also, you can update a million rows in seconds... you might even get away with updating 2 million rows in only twice that time... but at some point (different on every server), you reach a "tipping point" where adding just another hundred rows causes the server to lose its mind and take hours, if not days, to do the update. You need a control loop to limit the rows being updated to, say, a million rows, if it's an update you're doing.
And why the heck are you using DTS, anyway? All it's gonna do is slow things down. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
October 22, 2007 at 6:12 am
Hi Jeff,
I agree with you. That is a horrible design. I ran it Friday using the following code in Query Analyzer. It ran for 1 hr 14 min for only 1,000,000 rows. Not good. The reason I am running this as a DTS package is because the data needs to be used in an ASP application. This data is imported from the mainframe weekly through a DTS job. My thought was to split up these fields during that time so that the data would be available for the application that needs to use it. I tried reading the data using substrings in the application and it times out. Because of this horrible table structure, there are no indexes or keys. Do you have a better suggestion (other than murder)? :w00t:
declare @continue as bit
set @continue = 1
set rowcount 1000   --limit each update to 1000 rows

while @continue = 1
begin
    update RESULTS
    set N_Cont      = substring(C_PROMO_PRC_INF0, 1, 5),
        N_RLTD_CUST = substring(C_PROMO_PRC_INF0, 10, 7),
        C_Cont      = substring(C_PROMO_PRC_INF0, 18, 1)
    where N_Cont is null   --only rows that haven't been split yet

    if @@rowcount = 0
        set @continue = 0
end
October 22, 2007 at 12:38 pm
This might be "cheating", but - have you considered asking your mainframe folks to provide this data in a fixed-width format (i.e., not delimited)? If you did, you wouldn't need 2 steps, since you could define the format file to break up the incoming data for you (even with DTS).
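Purely as an illustration (the file name, path, widths, and filler fields below are invented - the real layout would have to come from the mainframe copybook), a bcp-style format file can carve a fixed-width record straight into separate columns at load time, e.g. via BULK INSERT:

-- results_fixed.fmt (hypothetical): terminator "" means "just read N characters",
-- server column 0 means "skip this filler field"
-- 8.0
-- 5
-- 1  SQLCHAR  0  5  ""      1  N_Cont       SQL_Latin1_General_CP1_CI_AS
-- 2  SQLCHAR  0  4  ""      0  filler1      SQL_Latin1_General_CP1_CI_AS
-- 3  SQLCHAR  0  7  ""      2  N_RLTD_CUST  SQL_Latin1_General_CP1_CI_AS
-- 4  SQLCHAR  0  1  ""      0  filler2      SQL_Latin1_General_CP1_CI_AS
-- 5  SQLCHAR  0  1  "\r\n"  3  C_Cont       SQL_Latin1_General_CP1_CI_AS

BULK INSERT dbo.RESULTS
FROM '\\fileserver\extracts\weekly_extract.dat'              --hypothetical UNC path
WITH (FORMATFILE = 'C:\formats\results_fixed.fmt', TABLOCK)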
If you CAN'T do that - consider adding an identity column as the clustered primary key for the table you're importing into. That way you could ensure that your updates are hitting "contiguous chunks" of data, and not spreading it across the entire DB.
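For instance (the constraint name is just a placeholder; note that building a clustered index on 22 million rows is itself a sizable one-time operation):

--surrogate key so the chunked updates hit contiguous ranges of the table
ALTER TABLE dbo.RESULTS ADD CTR bigint IDENTITY(1,1) NOT NULL

--make it the clustered primary key so "ctr BETWEEN @startctr AND @endctr" walks the table in order
ALTER TABLE dbo.RESULTS ADD CONSTRAINT PK_RESULTS PRIMARY KEY CLUSTERED (CTR)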
Also - in most cases, 1000 is much too small a chunk for modern servers (it will slow things down too much) - try 50-100K records at a time.
Now - assuming you had an identity field called CTR on your RESULTS table (and it's defined as the clustered index) - you could try this (should be much faster):
declare @startctr bigint
declare @endctr bigint
declare @chunksize int
declare @continue tinyint

set @startctr = 1       --where to start
set @chunksize = 50000  --how many records to update at a time
set @continue = 1       --whether to break out of the loop or not

while (@continue = 1)
begin
    set @endctr = @startctr + @chunksize - 1

    update RESULTS
    set N_Cont      = substring(C_PROMO_PRC_INF0, 1, 5),
        N_RLTD_CUST = substring(C_PROMO_PRC_INF0, 10, 7),
        C_Cont      = substring(C_PROMO_PRC_INF0, 18, 1)
    where ctr between @startctr and @endctr --should be much faster since you're using the clustered index

    if (@@rowcount = 0)
        set @continue = 0

    set @startctr = @startctr + @chunksize
end
---------------------------------------------
Still - if you can get the data into a fixed-width file, the update process becomes unnecessary; you can do the split during the insert process.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?