UPDATE SET =NULL running over an hour

  • Matt Miller (12/3/2008)


    Lynn Pettis (12/3/2008)


    Never mind my post above, it won't work. I am having a brain dead day here at work and it is seeping over to here. You'd still need to test for not nulls.

    Actually - I don't think you're brain dead at all. You're onto something: the end result from this query will be - "all rows will have NULLs in the admission_date and discharge_date columns"

    Given that - I would look at doing the batching, but do it by forcing a "walk the clustered index" scenario, so that you're (a) doing just one pass through the table, and (b) touching each data page only once.

    You can always just tell it to ignore any rows where BOTH columns are already NULL (if there are a lot of those). If there wouldn't be many such rows, then just batch them up by clustered index key and skip the NULL check entirely (that way you get around the issue Jeff was mentioning). A sketch of that follows this exchange.

    Doug - what's the clustered index scenario? What's the column name/data type involved?

    Matt, the CLAIMS table has a clustered index on member_id VARCHAR and dos_to SMALLDATETIME. Also, I need to wipe out the admission and discharge dates on these medical claim records and then I'm running a long script (100 or so statements) that repopulates those dates based on many criteria and it takes about 6-7 hours to run. I'm trying to shorten it and this UPDATE statement to wipe out those two dates is taking 1:13 hours alone.

    -Doug
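
    [Editor's note: a minimal sketch of the keyset batching Matt describes, assuming the clustered index Doug names (member_id, dos_to). The 10,000-row slice size and the @LastMemberId/@NextMemberId seek variables are illustrative, not code from the thread:]

    DECLARE @LastMemberId varchar(20);
    DECLARE @NextMemberId varchar(20);
    SET @LastMemberId = '';

    WHILE 1 = 1
    BEGIN
        -- Find the upper bound of the next slice of clustered key values.
        SELECT @NextMemberId = MAX(member_id)
        FROM (SELECT TOP (10000) member_id
              FROM dbo.CLAIMS
              WHERE member_id > @LastMemberId
              ORDER BY member_id) d;

        IF @NextMemberId IS NULL BREAK; -- walked off the end of the index

        -- One range seek per batch; no IS NOT NULL test required.
        UPDATE dbo.CLAIMS
        SET admission_date = NULL, discharge_date = NULL
        WHERE member_id > @LastMemberId AND member_id <= @NextMemberId;

        SET @LastMemberId = @NextMemberId;
    END

    Each pass commits on its own, so the transaction log never holds more than one slice of work.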

  • Here is another idea. The first update will take about 49 minutes (as seen earlier), but the second should actually run faster, since the first loop also nulls discharge_date and leaves far fewer rows for the second pass!

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL,
            discharge_date = NULL
        WHERE
            admission_date IS NOT NULL;
        -- I'd put a transaction log backup here
    end

    while exists(select 1 from dbo.Claims where discharge_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL,
            discharge_date = NULL
        WHERE
            discharge_date IS NOT NULL;
        -- I'd put a transaction log backup here
    end

  • Jeff Moden (12/2/2008)


    One more thing... you're updating the same columns that you have in your criteria. Sometimes, and it's again unpredictable, that will cause a recompile for every row updated.

    You should probably capture the PK column info in a separate working table for rows that meet your criteria and then use that information as a join to do the update.

    Jeff, I tried what you wrote above with some success. I created a #tmp1 table with all the claim_id's that have a NOT NULL admission_date and then ran the UPDATE statement to NULL them in the CLAIMS table. It completed in 16 mins, which should cut my 1:13 hrs by a little more than half once the 2nd statement (which affects the same number of records) runs.

    PRINT CAST(GETDATE() AS VARCHAR) + '...START TRANSACTION 1'
    CREATE TABLE #tmp1 (claim_id VARCHAR(50) PRIMARY KEY CLUSTERED)
    INSERT INTO #tmp1 SELECT DISTINCT claim_id FROM dbo.CLAIMS WHERE admission_date IS NOT NULL;

    PRINT CAST(GETDATE() AS VARCHAR) + '...START TRANSACTION 2'
    UPDATE T1
    SET admission_date = NULL
    FROM dbo.CLAIMS T1 INNER JOIN #tmp1 T2 ON T1.claim_id = T2.claim_id;
    GO
    PRINT CAST(GETDATE() AS VARCHAR) + '...END TRANSACTION'

  • I see a bunch of stuff in that code snippet that I think is working against you. Please post the real table creation statement for the table you're updating, a complete list of the indexes, and the PK constraint. Also, post any triggers that may be present on the table. Let's see if we can get some better speed out of this...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Doug (12/3/2008)


    Lynn Pettis (12/3/2008)


    I added a BEGIN TRANSACTION in front of your script so I could ROLLBACK afterwards, which I did.

    This in itself might be why Lynn's didn't help. The whole idea of the smaller batches was to not make the log file go crazy. By putting all of the small batches in a single transaction, you're essentially forcing the DB to keep ALL of the small batches in the log, dragging it down the same way the single BIG batch would (see the sketch after this post).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user... All right - what was my emergency again?
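
    [Editor's note: a minimal sketch of the pattern Matt is describing. Each batch commits on its own, and the log is backed up between batches so it can truncate; the database name and backup path are hypothetical placeholders:]

    DECLARE @BatchSize int;
    SET @BatchSize = 10000;

    WHILE EXISTS (SELECT 1 FROM dbo.CLAIMS
                  WHERE admission_date IS NOT NULL OR discharge_date IS NOT NULL)
    BEGIN
        -- No outer BEGIN TRAN: each UPDATE is its own transaction,
        -- so the log only ever holds one batch of work.
        UPDATE TOP (@BatchSize) dbo.CLAIMS
        SET admission_date = NULL, discharge_date = NULL
        WHERE admission_date IS NOT NULL OR discharge_date IS NOT NULL;

        -- Let the log truncate between batches (FULL recovery model).
        BACKUP LOG MyDatabase TO DISK = 'X:\Backups\MyDatabase.trn'; -- hypothetical
    END

    Under the SIMPLE recovery model, a CHECKPOINT between batches serves the same purpose.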

  • Here's an example of walking the clustered index:

    if object_id('tempdb..#ranges') is not null drop table #ranges;

    ;with AlphanumericCTE as (
        select
            char(N) digit,
            char(N+1) nextdigit
        from tally
        where N between 48 and 57   --digits 0 through 9
           or N between 65 and 90)  --A through Z
    select row_number() over (order by a1.digit, a2.digit) rn,
           a1.digit + a2.digit startrange,
           a1.digit + a2.nextdigit endrange
    into #ranges
    from AlphanumericCTE a1 cross join AlphanumericCTE a2
    order by startrange;

    declare @curr int;
    declare @lastrow int;
    set @curr = 1;
    select @lastrow = max(rn) from #ranges;

    --walk through the ranges based on the first two digits of the member_id
    --presumes that the spread is fairly even - if not - then customize at will
    while (@curr <= @lastrow)
    begin
        UPDATE CLAIMS
        SET admission_date = NULL,
            discharge_date = NULL
        from claims join #ranges on member_id between startrange and endrange
        where rn = @curr and (admission_date is not null or discharge_date is not null);
        set @curr = @curr + 1;
    end

    --create table claims(member_ID varchar(20), admission_date datetime, discharge_date datetime)

    Assuming you don't make that into one giant transaction - this should go a bit faster. Note that this is only one pass.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user... All right - what was my emergency again?

  • Jeff Moden (12/3/2008)


    I see a bunch of stuff in that code snippet that I think is working against you. Please post the real table creation statement for the table you're updating, a complete list of the indexes, and the PK constraint. Also, post any triggers that may be present on the table. Let's see if we can get some better speed out of this...

    No triggers. Here is the create statement:

    CREATE TABLE [dbo].[CLAIMS](
        [claim_key] [int] IDENTITY(1,1) NOT NULL,
        [member_id] [varchar](20) NOT NULL,
        [claim_id] [varchar](50) NOT NULL,
        [service_line] [smallint] NOT NULL DEFAULT ((1)),
        [provider_id] [varchar](22) NOT NULL DEFAULT (''),
        [provider_specialty] [varchar](50) NOT NULL DEFAULT ('UNK'),
        [gender] [char](1) NOT NULL,
        [dob] [datetime] NOT NULL,
        [dos_age] AS (CONVERT([int],CONVERT([float],datediff(day,[dob],case when [discharge_date] IS NULL then [dos_to] else [discharge_date] end)/(365.25),(0)),(0))) PERSISTED,
        [pos] [char](2) NOT NULL DEFAULT (''),
        [tob] [char](3) NOT NULL DEFAULT (''),
        [facility_type] [char](1) NOT NULL DEFAULT (''),
        [acute_inpatient_flag] [tinyint] NULL DEFAULT ((0)),
        [mh_cd_inpatient_flag] [tinyint] NULL DEFAULT ((0)),
        [admission_date] [smalldatetime] NULL,
        [discharge_date] [smalldatetime] NULL,
        [dos_from] [smalldatetime] NOT NULL,
        [dos_to] [smalldatetime] NOT NULL,
        [claim_status] [char](2) NOT NULL DEFAULT ('PD'),
        [denied_days] [smallint] NOT NULL DEFAULT ((0)),
        [discharge_status] [char](2) NOT NULL DEFAULT ('01'),
        [CMS_drg] [char](3) NOT NULL DEFAULT (''),
        [MS_drg] [char](3) NOT NULL DEFAULT (''),
        [quantity] [int] NOT NULL DEFAULT ((1)),
        [rev] [char](4) NOT NULL DEFAULT (''),
        [cpt] [char](5) NOT NULL DEFAULT (''),
        [mod1] [varchar](5) NOT NULL DEFAULT (''),
        [iproc1] [char](4) NOT NULL DEFAULT (''),
        [iproc2] [char](4) NOT NULL DEFAULT (''),
        [iproc3] [char](4) NOT NULL DEFAULT (''),
        [iproc4] [char](4) NOT NULL DEFAULT (''),
        [iproc5] [char](4) NOT NULL DEFAULT (''),
        [iproc6] [char](4) NOT NULL DEFAULT (''),
        [dx1] [char](5) NOT NULL DEFAULT (''),
        [dx2] [char](5) NOT NULL DEFAULT (''),
        [dx3] [char](5) NOT NULL DEFAULT (''),
        [dx4] [char](5) NOT NULL DEFAULT (''),
        [dx5] [char](5) NOT NULL DEFAULT (''),
        [dx6] [char](5) NOT NULL DEFAULT (''),
        [dx7] [char](5) NOT NULL DEFAULT (''),
        [dx8] [char](5) NOT NULL DEFAULT (''),
        [dx9] [char](5) NOT NULL DEFAULT (''),
        [received_date] [smalldatetime] NULL,
        [paid_denied_date] [smalldatetime] NULL,
        [net_paid] [money] NULL,
        [processing_center] [tinyint] NULL DEFAULT ((0)),
        [data_source] [varchar](15) NOT NULL,
        [user1] [varchar](30) NULL,
        [DateAppended] [smalldatetime] NOT NULL CONSTRAINT [DF_CLAIMS_DateAppended] DEFAULT (getdate()),
        CONSTRAINT [PK_claimskey_in_CLAIMS] PRIMARY KEY NONCLUSTERED
        (
            [claim_key] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    Here are the indexes:

    --------------------------------------------------
    -- MEDICAL CLAIMS DETAIL TABLE
    --------------------------------------------------
    ---- INDEXES -----

    --This clustered index is necessary for the CDC measure.
    CREATE CLUSTERED INDEX IX_CLAIMS_Mbrid_DOS
    ON CLAIMS (member_id ASC, dos_to ASC); --Use ascending in order to pick up the latest service date.
    PRINT 'IX_CLAIMS_Mbrid_DOS completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_Mbrid ON dbo.CLAIMS(member_id ASC);
    PRINT 'IX_CLAIMS_Mbrid completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_cpt ON CLAIMS(cpt ASC);
    PRINT 'IX_CLAIMS_cpt completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_rev ON CLAIMS(rev ASC);
    PRINT 'IX_CLAIMS_rev completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_CMSdrg ON CLAIMS(CMS_drg ASC);
    PRINT 'IX_CLAIMS_CMSdrg completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_MSdrg ON CLAIMS(MS_drg ASC);
    PRINT 'IX_CLAIMS_MSdrg completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_claimid ON CLAIMS(claim_id ASC);
    PRINT 'IX_CLAIMS_claimid completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_pos ON CLAIMS(pos ASC);
    PRINT 'IX_CLAIMS_pos completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_tob ON CLAIMS(tob ASC);
    PRINT 'IX_CLAIMS_tob completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_datasource ON dbo.CLAIMS(data_source ASC);
    PRINT 'IX_CLAIMS_datasource completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_providerspec ON dbo.CLAIMS(provider_specialty ASC);
    PRINT 'IX_CLAIMS_providerspec completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_CptMbridDostoClmid ON [dbo].[CLAIMS]
    (
        [cpt] ASC,
        [member_id] ASC,
        [dos_to] ASC,
        [claim_id] ASC
    )
    INCLUDE ([provider_id], [dos_from]) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    PRINT 'IX_CLAIMS_CptMbridDostoClmid completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX IX_CLAIMS_PosDostoAdmdtClmstatMbridDosfrCptDosageClmid ON [dbo].[CLAIMS]
    (
        [pos] ASC,
        [dos_to] ASC,
        [admission_date] ASC,
        [claim_status] ASC,
        [member_id] ASC,
        [dos_from] ASC,
        [cpt] ASC,
        [dos_age] ASC,
        [claim_id] ASC
    ) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    PRINT 'IX_CLAIMS_PosDostoAdmdtClmstatMbridDosfrCptDosageClmid completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX [IX_CLAIMS_MbridGenderDosfrom] ON [dbo].[CLAIMS]
    (
        [member_id] ASC,
        [gender] ASC,
        [dos_from] ASC
    )
    INCLUDE ([claim_id], [provider_specialty], [dos_age], [pos], [dos_to], [claim_status], [rev], [cpt]) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    PRINT 'IX_CLAIMS_MbridGenderDosfrom completed.'
    GO
    ---------
    CREATE NONCLUSTERED INDEX [IX_CLAIMS_ClaimidMbridDostoCpt] ON [dbo].[CLAIMS]
    (
        [claim_id] ASC,
        [member_id] ASC,
        [dos_to] ASC,
        [cpt] ASC
    ) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY];
    GO
    PRINT 'IX_CLAIMS_ClaimidMbridDostoCpt completed.'
    ---------
    CREATE NONCLUSTERED INDEX [IX_CLAIMS_ClaimidMbrid] ON [dbo].[CLAIMS]
    (
        [claim_id] ASC,
        [member_id] ASC
    ) WITH (SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY];
    GO
    PRINT 'IX_CLAIMS_ClaimidMbrid completed.'
    ---------

    Thanks Jeff.

    -Doug

  • Holy moly... good thing we're not doing deletes or it would also be deleting from a bazillion indexes (some with duplicate functionality). My recommendation would be to simply drop the indexes that contain the two update columns and add them back after the update. Also, Claim_Key should be the key that you copy to the temp table, because it's the only thing that's unique for this table. That will eliminate the need for the DISTINCT in the copy code.

    Also, the code you posted doesn't do it in batches... you're still probably hitting a tipping point on 21.3 million rows. If you add an IDENTITY column to the temp table, that'll make it a lot easier to figure out which groups of rows to update, especially if you insert them in clustered index order. SELECT INTO followed by the addition of a nonclustered index on Claim_Key, and maybe a clustered index on the IDENTITY column, would help a lot on the join (see the sketch after this post).

    Last, but not least, you might also want to take the time to do a bit of an index analysis and see if you can get rid of some of them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
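
    [Editor's note: a sketch of the setup Jeff describes. It assumes IX_CLAIMS_PosDostoAdmdtClmstatMbridDosfrCptDosageClmid is the index carrying the update columns (it keys admission_date directly, plus dos_age, a persisted computed column that reads discharge_date). Index and table names come from the DDL Doug posted; the temp table name and the rest are illustrative:]

    -- Drop the index that would otherwise be maintained on every row touched.
    DROP INDEX IX_CLAIMS_PosDostoAdmdtClmstatMbridDosfrCptDosageClmid ON dbo.CLAIMS;
    GO
    -- Capture the unique key of every row to touch, numbered in clustered
    -- index order so batches can later be carved out as contiguous ranges.
    SELECT IDENTITY(int, 1, 1) AS id,
           CAST(claim_key AS int) AS claim_key -- CAST strips the inherited IDENTITY property
    INTO #work
    FROM dbo.CLAIMS
    WHERE admission_date IS NOT NULL OR discharge_date IS NOT NULL
    ORDER BY member_id, dos_to; -- the clustered index of CLAIMS

    CREATE UNIQUE CLUSTERED INDEX IX_work_id ON #work (id);
    CREATE NONCLUSTERED INDEX IX_work_claimkey ON #work (claim_key);
    GO
    -- ...run the batched update here, then recreate the dropped index.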

  • Lynn Pettis (12/3/2008)


    Here is another idea. The first update will take about 49 minutes (as seen earlier), but the second should actually run faster, since the first loop also nulls discharge_date and leaves far fewer rows for the second pass!

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL,
            discharge_date = NULL
        WHERE
            admission_date IS NOT NULL;
        -- I'd put a transaction log backup here
    end

    while exists(select 1 from dbo.Claims where discharge_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL,
            discharge_date = NULL
        WHERE
            discharge_date IS NOT NULL;
        -- I'd put a transaction log backup here
    end

    Thanks Lynn. I ran this exact statement above and did so without a BEGIN TRAN statement, since Matt pointed out that wrapping the batches in one transaction would also make the log go crazy. This ran for 1:49 hours and was on the 123rd batch when I stopped it. It never made it to the 2nd loop.

  • May I ask another dumb question? If you are going to update these fields using another procedure, why are you trying to set them to NULL first? Why not just update them directly, regardless of the current value?

    Edit: Only dumb because I didn't ask it earlier...

  • Lynn Pettis (12/3/2008)


    May I ask another dumb question? If you are going to update these fields using another procedure, why are you trying to set them to NULL first? Why not just update them directly, regardless of the current value?

    Good question. When the logic changes on how to identify the admission/discharge dates, I have to wipe them out before I run the admissions script once again. Only inpatient records get dates, since only inpatient claims have an admission and a discharge; if I don't set them all to NULL, some records may be left populated with erroneous dates from an older version of the script. The script has about 100 statements with various logic to capture admission/discharge dates. It's hard to believe, but medical claim records do not come with discharge dates, and the admission dates captured by the facility can't be trusted, so I have to determine them by various other means (100+ statements).

    Thanks Lynn for looking into it.

  • Did you ever try my idea? Just curious what you ran into.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user... All right - what was my emergency again?

  • Matt Miller (12/4/2008)


    Did you ever try my idea? Just curious what you ran into.

    Hi Matt,

    I'm still working on Jeff's solution. It takes a while to run these scenarios, then put it back the way it was and run again.

    Thanks for your input. I will try it after testing Jeff's. I removed the one index that references the two date columns and it cut my original two statements from 1:13 hrs to 24:17 mins. I'm trying it now with the batching and tmp table (after I repopulate them to where they were before).

    I'll let you know the results from Jeff's solution and then will work on yours.

    thanks again,

    Doug

  • Doug (12/4/2008)


    Matt Miller (12/4/2008)


    Did you ever try my idea? Just curious what you ran into.

    Hi Matt,

    I'm still working on Jeff's solution. It takes a while to run these scenarios, then put it back the way it was and run again.

    Thanks for your input. I will try it after testing Jeff's. I removed the one index that references the two date columns and it cut my original two statements from 1:13 hrs to 24:17 mins. I'm trying it now with the batching and tmp table (after I repopulate them to where they were before).

    I'll let you know the results from Jeff's solution and then will work on yours.

    thanks again,

    Doug

    No issue - it's just easy for solutions to get lost when people are tossing stuff at you.

    I didn't even notice the index including the columns being updated, so Jeff is definitely onto something there.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user... All right - what was my emergency again?

  • Jeff Moden (12/3/2008)


    My recommendation would be to simply drop the indexes that contain the two update columns and add them back after the update. Also, Claim_Key should be the key that you copy to the temp table, because it's the only thing that's unique for this table. That will eliminate the need for the DISTINCT in the copy code.

    Also, the code you posted doesn't do it in batches... you're still probably hitting a tipping point on 21.3 million rows. If you add an IDENTITY column to the temp table, that'll make it a lot easier to figure out which groups of rows to update, especially if you insert them in clustered index order. SELECT INTO followed by the addition of a nonclustered index on Claim_Key, and maybe a clustered index on the IDENTITY column, would help a lot on the join.

    Jeff, I attempted what you wrote above and this script ran for 1:44 hrs before I killed it. Let me know if I misunderstood or if you see how I might improve it. The two INSERT INTO statements were 16 mins combined and the WHILE loop was at 1:27 hrs at time of death. I mentioned to Matt earlier how your recommendation of dropping the one index with these date cols cut the time of the original UPDATEs by more than half. (A sketch of the identity-range batching follows the code below.)

    CREATE TABLE #tmp1 (id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, claim_key INT);
    CREATE NONCLUSTERED INDEX IX_TMP1_claimkey ON #tmp1(claim_key ASC);

    INSERT INTO #tmp1 SELECT claim_key FROM dbo.CLAIMS WHERE admission_date IS NOT NULL;

    INSERT INTO #tmp1 SELECT T1.claim_key
    FROM dbo.CLAIMS T1 LEFT OUTER JOIN #tmp1 T2 ON T1.claim_key = T2.claim_key
    WHERE T1.discharge_date IS NOT NULL AND T2.claim_key IS NULL;

    DECLARE @BatchSize INT;
    SET @BatchSize = 10000;

    WHILE EXISTS(SELECT 1 FROM dbo.CLAIMS WHERE admission_date IS NOT NULL)
    BEGIN
        UPDATE TOP (@BatchSize) T1
        SET admission_date = NULL, discharge_date = NULL
        FROM dbo.CLAIMS T1 INNER JOIN #tmp1 T2 ON T1.claim_key = T2.claim_key;
    END
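
    [Editor's note: a sketch of the identity-range batching Jeff described, which the loop above doesn't quite do. Each pass touches one contiguous slice of #tmp1 instead of re-joining the whole key list, and the loop terminates on the ranges rather than on repeated scans of CLAIMS. Variable names are illustrative:]

    DECLARE @BatchSize INT, @Lo INT, @Hi INT;
    SET @BatchSize = 10000;
    SET @Lo = 1;
    SELECT @Hi = MAX(id) FROM #tmp1;

    WHILE @Lo <= @Hi
    BEGIN
        -- Only the rows whose keys fall in the current identity slice.
        UPDATE T1
        SET admission_date = NULL, discharge_date = NULL
        FROM dbo.CLAIMS T1
        INNER JOIN #tmp1 T2 ON T1.claim_key = T2.claim_key
        WHERE T2.id >= @Lo AND T2.id < @Lo + @BatchSize;

        SET @Lo = @Lo + @BatchSize;
        -- A transaction log backup here would keep the log small.
    END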
