October 19, 2007 at 12:38 pm
I have a 22,000,000+ row table that has a field that is a concatenation of 3 fields. I need to extract this information into 3 separate fields. What is the best way to go about this? Is it possible to read a record at a time and do this update? Any suggestions?
October 19, 2007 at 1:48 pm
I'm assuming the field is a varchar field; you didn't give an example, but there are two methods to look at below, depending on whether the values are always in the same positions or are between certain delimiters:
declare @sample varchar(30)
set @sample = 'ssn-000x0000'

--YOU KNOW the fields are always in the same fixed positions
select substring(@sample,1,3) As pt1,
       substring(@sample,5,3) As pt2,
       substring(@sample,9,4) As pt3

--the parts are delimited and not consistent? split on the delimiters instead
select substring(@sample,1,charindex('-',@sample)-1) As pt1,
       substring(@sample,charindex('-',@sample)+1, charindex('x',@sample)-charindex('-',@sample)-1) As pt2,
       substring(@sample,charindex('x',@sample)+1, len(@sample)) As pt3
pt1 pt2 pt3
---- ---- ----
ssn 000 0000
pt1 pt2 pt3
------------------------------ ------------------------------ ----
ssn 000 0000
Lowell
October 19, 2007 at 1:52 pm
The problem I am having is space and time on the SQL 2000 server. I need to know how to accomplish this without hogging the system.
October 19, 2007 at 2:16 pm
1. Reading a record at a time versus inserting in bulk will lengthen your process something awful. It will take more resources and piss the server off for longer... (in short - don't do that). Row by row, or as one of the regular posters here has named it, RBAR (row by agonizing row), will take hours/days. Let the import run as fast as it possibly can - set-based should get through that a LOT faster.
2. If you wish to minimize pain - import first, then split the fields up. Better yet - import it into a table with no indexes, constraints, etc. (see the sketch after this list). Validate and parse once you have it in a table. By setting up a "straight import" and worrying about the parsing separately, you're going to limit the impact of each activity, so neither one should "redline the server". You might be able to get away with parsing+validating WHILE you insert into the real table, but that's your call.
3. Do the validations/parsing in chunks so that your transactions don't get "crazy", which brings up another thing - don't do this whole thing as one single transaction. There are lots of examples on here of how to loop through a large table like yours after using SET ROWCOUNT to limit how many rows you affect. Using small enough chunks should keep a lid on the activity.
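For item 2, a bare-bones staging table might look something like this (the table name and column width are placeholders - the real width would come from the mainframe record layout; the column name is the one used elsewhere in this thread, and only the combined field is shown):

--heap with no keys, indexes, or constraints: the fastest possible target for the weekly import
CREATE TABLE dbo.RESULTS_Staging
(
    C_PROMO_PRC_INF0 varchar(50) NULL   --the combined field, loaded as-is and parsed later
)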
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 19, 2007 at 2:44 pm
Thanks Matt for your reply.
I am actually trying the SET ROWCOUNT option with a test database of 1 million records, with ROWCOUNT set to 1000. So far it has been running for 1 hour. Questions: 1. Does this option release space after each chunk is processed? 2. Are you suggesting several queries within the DTS package to accomplish this when you stated "not in a single transaction"?
October 19, 2007 at 8:05 pm
Went to the doctor and said "Doctor, when I hold my arm up like this (demonstrating strange angle to arm), it hurts like hell". Doctor said, "Don't do that... it'll hurt like hell".
First... if you have a table that has 3 fields combined into one, you need to find out who designed it, hunt them down, and feed them a roll of toilet paper to help them clean up their mind. 😛 That's a horrible design and they need to have their birthdate taken away from them :hehe:
If you can modify the table, add 3 calculated columns that do the split, and you're done.
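A minimal sketch of that, assuming the combined column is the C_PROMO_PRC_INF0 column on the RESULTS table mentioned elsewhere in the thread (adjust the offsets/lengths to the real layout):

--computed columns split the field on the fly; nothing to update, nothing extra to store (SQL 2000 syntax)
ALTER TABLE dbo.RESULTS ADD
    N_Cont      AS SUBSTRING(C_PROMO_PRC_INF0, 1, 5),
    N_RLTD_CUST AS SUBSTRING(C_PROMO_PRC_INF0, 10, 7),
    C_Cont      AS SUBSTRING(C_PROMO_PRC_INF0, 18, 1)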
If you can't do that, do a good ol' fashioned SELECT/INTO to create a sister table that contains the split data and join to that. Don't forget to copy the PK of the original table when you do that.
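Something along these lines (SomePKColumn is a placeholder for whatever uniquely identifies a row in the original table; the other names are the ones used elsewhere in the thread):

--sister table holding only the key plus the split-out pieces
SELECT SomePKColumn,                              --placeholder for the original table's PK
       SUBSTRING(C_PROMO_PRC_INF0, 1, 5)  AS N_Cont,
       SUBSTRING(C_PROMO_PRC_INF0, 10, 7) AS N_RLTD_CUST,
       SUBSTRING(C_PROMO_PRC_INF0, 18, 1) AS C_Cont
INTO   dbo.RESULTS_Split
FROM   dbo.RESULTS

Index RESULTS_Split on that key afterwards so the join back to the original table stays cheap.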
Also, you can update a million rows in seconds... you might even get away with updating 2 million rows in only twice that time... but at some point (different on every server), you reach a "tipping point" where adding just another hundred rows causes the server to lose its mind and take hours, if not days, to do the update. You need a control loop to limit the rows being updated to, say, a million rows, if it's an update you're doing.
And why the heck are you using DTS, anyway? All it's gonna do is slow things down. 😀
--Jeff Moden
Change is inevitable... Change for the better is not.
October 22, 2007 at 6:12 am
Hi Jeff,
I agree with you. That is a horrible design. I ran it Friday using the following code in Query Analyzer. It ran for 1 hr 14 min for only 1,000,000 rows. Not good. The reason I am running this as a DTS package is because the data needs to be used in an ASP application. This data is imported from the mainframe weekly through a DTS job. My thought was to split up these fields during that time so that the data would be available for the application that needs to use it. I tried reading the data using substrings in the application and it times out. Because of this horrible table structure, there are no indexes or keys. Do you have a better suggestion (other than murder)? :w00t:
declare @continue as bit
set @continue = 1
set rowcount 1000   --limit each update to 1000 rows

while @continue = 1
begin
    update RESULTS
    set N_Cont      = substring(C_PROMO_PRC_INF0, 1, 5),
        N_RLTD_CUST = substring(C_PROMO_PRC_INF0, 10, 7),
        C_Cont      = substring(C_PROMO_PRC_INF0, 18, 1)
    where N_Cont is null   --only rows that haven't been split yet

    if @@rowcount = 0
        set @continue = 0
end
October 22, 2007 at 12:38 pm
This might be "cheating", but - have you considered asking your mainframe folks to provide this data in a fixed-width format (i.e., not delimited)? If you did, you wouldn't need 2 steps, since you could define the format file to break up the incoming data for you (even with DTS).
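Purely as an illustration (the file name, path, widths, and filler fields below are invented - the real layout would have to come from the mainframe copybook), a bcp-style format file can carve a fixed-width record straight into separate columns at load time, e.g. via BULK INSERT:

-- results_fixed.fmt (hypothetical): terminator "" means "just read N characters",
-- server column 0 means "skip this filler field"
-- 8.0
-- 5
-- 1  SQLCHAR  0  5  ""      1  N_Cont       SQL_Latin1_General_CP1_CI_AS
-- 2  SQLCHAR  0  4  ""      0  filler1      SQL_Latin1_General_CP1_CI_AS
-- 3  SQLCHAR  0  7  ""      2  N_RLTD_CUST  SQL_Latin1_General_CP1_CI_AS
-- 4  SQLCHAR  0  1  ""      0  filler2      SQL_Latin1_General_CP1_CI_AS
-- 5  SQLCHAR  0  1  "\r\n"  3  C_Cont       SQL_Latin1_General_CP1_CI_AS

BULK INSERT dbo.RESULTS
FROM '\\fileserver\extracts\weekly_extract.dat'              --hypothetical UNC path
WITH (FORMATFILE = 'C:\formats\results_fixed.fmt', TABLOCK)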
If you CAN'T do that - consider adding an identity column as the clustered primary key for the table you're importing into. That way you could ensure that your updates are hitting "contiguous chunks" of data, and not spreading it across the entire DB.
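For instance (the constraint name is just a placeholder; note that building a clustered index on 22 million rows is itself a sizable one-time operation):

--surrogate key so the chunked updates hit contiguous ranges of the table
ALTER TABLE dbo.RESULTS ADD CTR bigint IDENTITY(1,1) NOT NULL

--make it the clustered primary key so "ctr BETWEEN @startctr AND @endctr" walks the table in order
ALTER TABLE dbo.RESULTS ADD CONSTRAINT PK_RESULTS PRIMARY KEY CLUSTERED (CTR)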
Also - in most cases, 1000 is much too small a chunk for modern servers (it will slow things down too much) - try 50-100K records at a time.
Now - assuming you had an identity field called CTR on your RESULTS table (and it's defined as the clustered index) - you could try this (should be much faster):
declare @startctr bigint
declare @endctr bigint
declare @chunksize int
declare @continue tinyint

set @startctr = 1       --where to start
set @chunksize = 50000  --how many records to update at a time
set @continue = 1       --whether to break out of the loop or not

while (@continue = 1)
begin
    set @endctr = @startctr + @chunksize - 1

    update RESULTS
    set N_Cont      = substring(C_PROMO_PRC_INF0, 1, 5),
        N_RLTD_CUST = substring(C_PROMO_PRC_INF0, 10, 7),
        C_Cont      = substring(C_PROMO_PRC_INF0, 18, 1)
    where ctr between @startctr and @endctr --should be much faster since you're using the clustered index

    if (@@rowcount = 0)
        set @continue = 0

    set @startctr = @startctr + @chunksize
end
---------------------------------------------
Still - if you can get the data into a fixed-width file, the update process becomes unnecessary; you can do the split during the insert process.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?