Reducing the hash match inner join cost

  • Hi, I have a query as follows, and I have also listed the indexes on the tables. I am unable to understand the cost that is being taken up by the index seek (21%) and the hash match inner joins. The query takes 10 minutes to execute and I am unable to make it perform better.

    I have attached the query plan too.

    select
        fca.StartDate,
        datepart(ww, fca.StartDate) as StartWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), 5), 'dd-MMM')) as StartWeekRange,
        datepart(mm, fca.StartDate) as StartMonth,
        datename(mm, fca.StartDate) as StartMonthName,
        datepart(yy, fca.StartDate) as StartYear,
        fca.EndDate,
        datepart(ww, fca.EndDate) as EndWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), 5), 'dd-MMM')) as EndWeekRange,
        datepart(mm, fca.EndDate) as EndMonth,
        datename(mm, fca.EndDate) as EndMonthName,
        datepart(yy, fca.EndDate) as EndYear,
        fca.IsApplied,
        fca.AppliedDate,
        datepart(ww, fca.AppliedDate) as AppliedWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.AppliedDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.AppliedDate), 5), 'dd-MMM')) as AppliedWeekRange,
        datepart(mm, fca.AppliedDate) as AppliedMonth,
        datename(mm, fca.AppliedDate) as AppliedMonthName,
        datepart(yy, fca.AppliedDate) as AppliedYear,
        fca.ChemicalsMixed,
        ap.ApplicationPlanId,
        ap.ApplicationPlanName,
        ap.Isfreezed,
        dac.ApplicationChemicalId,
        das.ApplicationScheduleid,
        das.ApplicationScheduleName,
        dac.AgroChemicalId,
        dac.RecommendedQuantity,
        dac.QuantityUnitId,
        das.MethodofApplicationId,
        das.JustificationForUseId,
        das.DaysAfterSowing,
        das.Interval,
        dac.sequenceno,
        dac.concentration,
        dac.concentrationunitid,
        getutcdate() as talendate
    from dbo.ApplicationPlan ap
    join dbo.ApplicationSchedule das
        on ap.applicationplanid = das.applicationplanid
        and ap.isactive = 1
        and das.isactive = 1
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1

    The index details are:

    Table ApplicationPlan
        clustered index on applicationplanid

    Table ApplicationSchedule
        clustered index on ApplicationScheduleId

    Table ApplicationChemical
        clustered index on ApplicationChemicalId
        create nonclustered index idx_ApplicationScheduleid on ApplicationChemical(ApplicationScheduleid)

    Table FarmerCropApplications
        clustered index on FarmerCropApplicationId
        create nonclustered index idx_compp on FarmerCropApplications(isactive)
            include (ApplicationScheduleid, startdate, enddate, isapplied, applieddate, chemicalsmixed)
        nonclustered index on FarmerCropApplications(FarmerCropId)
        nonclustered index on FarmerCropApplications(ApplicationScheduleId)

    Any thoughts on the same would be greatly appreciated

    Thanks

  • The scans and the hash matches make sense. You don't have any kind of a WHERE clause to filter the data, so the query is returning 300,000 rows; scans and hash matches are the way to go. What's a bit of a surprise is the seek, when it's returning 123,067 out of 123,581 rows. However, since it's a seek in conjunction with a hash match, it's not that radically out of line. The cost distributions are pretty logical: the seek of 123k rows costs a lot more than the scans against 1.6k, 400 and 5k rows respectively. Then the job of creating a hash table for each of the joins in order to put the data together is, again, costly, but in line with the processing being done.

    You're returning everything from the table. There's no real way to speed that up except to bring back less data, or get faster disks.

    You're getting a residual probe on one of the hash joins:

    [qaCropin1].[dbo].[FarmerCropApplications].[ApplicationScheduleId] as [fca].[ApplicationScheduleId]=[qaCropin1].[dbo].[ApplicationSchedule].[ApplicationScheduleId] as [das].[ApplicationScheduleId]

    It's using this index:

    create nonclustered index idx_compp on FarmerCropApplications(isactive)
        include (ApplicationScheduleid, startdate, enddate, isapplied, applieddate, chemicalsmixed)

    I'm guessing here, but I'll bet the IsActive column isn't very selective (probably just a 1 or 0), so it's not eliminating very many rows at all. You have an index on ApplicationScheduleID, but it's not used because it doesn't have the necessary columns to satisfy the rest of the query. Adding an INCLUDE for IsActive and the rest of the columns to that other index would possibly work better than this index. Right now the seek retrieves most of the data (probably almost all of it satisfies IsActive), then probes against the INCLUDE column. That's not going to be as efficient as it would be if that probe column were the index key. In fact, you might see a merge join there, since the index would be able to bring the data back in the correct order.
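    Something along these lines (just a sketch; the index name is made up):

    create nonclustered index idx_ApplicationScheduleid_covering
    on dbo.FarmerCropApplications (ApplicationScheduleid)
    include (isactive, startdate, enddate, isapplied, applieddate, chemicalsmixed);

    That puts the probe column into the index key itself, which is also what would open the door to the merge join.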

    That's the best I have at the moment.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.
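    For illustration only (the index name is made up, and this assumes the existing clustered index isn't backing a PRIMARY KEY constraint):

    -- Re-cluster with the flag first, keeping the old key second so rows
    -- within each flag value keep their current relative order.
    -- drop_existing = on requires reusing the existing index name (assumed
    -- here to be CIX_FarmerCropApplications); a PK-backed clustered index
    -- would need the constraint dropped and recreated instead.
    create unique clustered index CIX_FarmerCropApplications
    on dbo.FarmerCropApplications (isactive, FarmerCropApplicationId)
    with (drop_existing = on);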

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Who is that poor guy who's meant to read through those 300k lines of this report?

    Considering 60 lines per screen, that's 5 thousand screens to look through.

    And if it's not for a human eye, why all that extensive formatting?

    Try to do SELECT .... INTO #Test to see how much time it actually takes to execute the query and how long it takes to pass that horrible recordset from the server to your screen.
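    That is, something like this (with the full select list from the original query substituted in):

    select fca.StartDate, fca.EndDate, fca.AppliedDate  -- ...rest of the original select list
    into #Test
    from dbo.ApplicationPlan ap
    join dbo.ApplicationSchedule das
        on ap.applicationplanid = das.applicationplanid
        and ap.isactive = 1
        and das.isactive = 1
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1;

    If the SELECT ... INTO finishes quickly, the 10 minutes is mostly network transfer and rendering, not the query itself.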

    _____________
    Code for TallyGenerator

  • ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Also, if you look at the last hash match on the left of the execution plan, you've got some pretty serious relational multiplication going on there. Another name for that is a "many-to-many" join, and it might be classified as an "accidental cross join". Of course, that may be what was actually intended, since this is all about the scheduling of the application of chemicals, which could certainly happen over more than one schedule period. Just thought I'd mention it, though.
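    A quick way to check (just a sketch): count the rows each side of that join contributes per ApplicationScheduleid; any id with more than one row on both sides multiplies out in the result.

    select das.ApplicationScheduleid,
           count(distinct dac.ApplicationChemicalId)   as chemical_rows,
           count(distinct fca.FarmerCropApplicationId) as farmer_app_rows
    from dbo.ApplicationSchedule das
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1
    group by das.ApplicationScheduleid
    having count(distinct dac.ApplicationChemicalId) > 1
       and count(distinct fca.FarmerCropApplicationId) > 1;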

    And I agree with Sergiy... it's not likely (and shouldn't be allowed if it is likely) that someone wants to see 330K rows of data on the screen. I wouldn't use that as a litmus strip for performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    Thanks again for the great response.

    I am using this query to build a small data warehouse, and this query is supposed to run 3 times a day.

    The active flag has 1 or 0, and I am supposed to pick only the active = 1 rows.

    Hence the reason for so many rows.

    Are there any other ideas I could try on this query?

    Thanks in advance

    Rathi

  • rathimittha.mb (7/20/2016)


    I am using this query to build a small data warehouse, and this query is supposed to run 3 times a day.

    ...

    Hence the reason for so many rows.

    Then, there is no reason at all for this:

    datepart(ww, fca.StartDate) as StartWeek,
    convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), 5), 'dd-MMM')) as StartWeekRange,
    datepart(mm, fca.StartDate) as StartMonth,
    datename(mm, fca.StartDate) as StartMonthName,
    datepart(yy, fca.StartDate) as StartYear,
    datepart(ww, fca.EndDate) as EndWeek,
    convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), 5), 'dd-MMM')) as EndWeekRange,
    datepart(mm, fca.EndDate) as EndMonth,
    datename(mm, fca.EndDate) as EndMonthName,
    datepart(yy, fca.EndDate) as EndYear,

    You should take these out of the query.
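    If the warehouse still needs those attributes, one option (a sketch only; the dbo.DimDate table and its column names here are hypothetical) is to compute them once per calendar date in a date dimension and join to it, instead of running the comparatively expensive FORMAT() against every row on every load:

    select fca.StartDate,
           dd.WeekOfYear   as StartWeek,
           dd.WeekRange    as StartWeekRange,
           dd.MonthNumber  as StartMonth,
           dd.MonthName    as StartMonthName,
           dd.CalendarYear as StartYear
    from dbo.FarmerCropApplications fca
    join dbo.DimDate dd
        on dd.DateValue = cast(fca.StartDate as date);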

    _____________
    Code for TallyGenerator

  • Hi Sergiy,

    I did remove those computations, and added OPTION (RECOMPILE) at the end, because every time I run the query the data and the indexes will for sure have gone through a change.

    The query now runs within 2 minutes and retrieves around 3 million rows. Is this time good, considering that this is used for data warehouse purposes?

    Thanks

    Rathi

  • Even for a data warehouse you don't want to move all the rows every time. You have limited to zero tuning opportunities for a query without any filtering.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant,

    I agree with you. But as per the requirement I have, I am supposed to take all the active = 1 rows and migrate them to the other database.

    In this case, my only filter is active = 1.

    Unfortunately, I have no other filter criteria.

    Is there any way I can use active = 1 so that the optimizer just looks at the active = 1 rows? Like in the syntax below, using it in the WHERE <filter_predicate>?

    CREATE NONCLUSTERED INDEX index_name
    ON <object> ( column [ ASC | DESC ] [ ,...n ] )
    [ INCLUDE ( column_name [ ,...n ] ) ]
    [ WHERE <filter_predicate> ]
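    i.e., something like this (just a sketch; the index name is made up):

    create nonclustered index idx_fca_isactive_filtered
    on dbo.FarmerCropApplications (ApplicationScheduleid)
    include (startdate, enddate, isapplied, applieddate, chemicalsmixed)
    where isactive = 1;
    -- note: the query must reference isactive = 1 as a literal (not a
    -- variable or parameter) for the optimizer to reliably match a
    -- filtered index.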

    Thanks

    Rathi

  • That filter only eliminates a very small subset of the data. Because of that, it really is cheaper for the optimizer to scan rather than seek. Go back to my original suggestions; they're the best I have without any other kind of filter. Again, every data warehouse I've worked with and helped support moves only a subset of the data instead of all data every time.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thank you Grant. I guess I'll need to figure out a way to get the delta (changed) data rather than the entire set.
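    e.g., one possible sketch, assuming a rowversion column can be added to the source table (here called RowVer) and a watermark is kept in a hypothetical etl.ExtractControl table:

    -- Pull only rows changed since the last extract.
    declare @LastVer binary(8), @CurrentVer binary(8);

    select @LastVer = LastRowVer
    from etl.ExtractControl
    where TableName = 'FarmerCropApplications';

    set @CurrentVer = @@DBTS;  -- highest rowversion used in this database so far

    select fca.*
    from dbo.FarmerCropApplications fca
    where fca.isactive = 1
      and fca.RowVer > @LastVer
      and fca.RowVer <= @CurrentVer;

    -- After a successful load, advance the watermark.
    update etl.ExtractControl
    set LastRowVer = @CurrentVer
    where TableName = 'FarmerCropApplications';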

    Thanks Again !!

  • Jeff Moden (7/20/2016)


    ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    Actually, I think you did misunderstand somewhat what I said. I never mentioned an NCI. If, and only if (as I stated), a large number of rows are inactive, I would cluster first on the inactive flag, to immediately seek to only the rows SQL has to process. Yes, rows would move, but only once, and presumably roughly in ascending order of cluster key column #2 and beyond.

    Really my preference is to have a completely separate table for inactive rows, but many people love the idea of a "soft delete" flag in the main table; which really does seem to me to be a holdover from pre-relational days.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/21/2016)


    Jeff Moden (7/20/2016)


    ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    Actually, I think you did misunderstand somewhat what I said. I never mentioned an NCI. If, and only if (as I stated), a large number of rows are inactive, I would cluster first on the inactive flag, to immediately seek to only the rows SQL has to process. Yes, rows would move, but only once, and presumably roughly in ascending order of cluster key column #2 and beyond.

    Really my preference is to have a completely separate table for inactive rows, but many people love the idea of a "soft delete" flag in the main table; which really does seem to me to be a holdover from pre-relational days.

    Thanks for confirming that I didn't misunderstand.

    Clustering on a flag (or making a flag, or any other extremely low cardinality column, the leading column of any index) is a really bad idea for anything where the flag can change even once. Like I said, I paralyzed Expedia.com for about two minutes by making such an index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
