Reducing the hash match inner join cost

  • Hi, I have a query as follows, and I have also listed the indexes on the tables. I am unable to understand the cost that is being taken up by the index seek (21%) and the hash match inner joins. The query takes 10 minutes to execute and I am unable to make it perform better.

    I have attached the query plan too.

    select
        fca.StartDate,
        datepart(ww, fca.StartDate) as StartWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), 5), 'dd-MMM')) as StartWeekRange,
        datepart(mm, fca.StartDate) as StartMonth,
        datename(mm, fca.StartDate) as StartMonthName,
        datepart(yy, fca.StartDate) as StartYear,
        fca.EndDate,
        datepart(ww, fca.EndDate) as EndWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), 5), 'dd-MMM')) as EndWeekRange,
        datepart(mm, fca.EndDate) as EndMonth,
        datename(mm, fca.EndDate) as EndMonthName,
        datepart(yy, fca.EndDate) as EndYear,
        fca.IsApplied,
        fca.AppliedDate,
        datepart(ww, fca.AppliedDate) as AppliedWeek,
        convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.AppliedDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.AppliedDate), 5), 'dd-MMM')) as AppliedWeekRange,
        datepart(mm, fca.AppliedDate) as AppliedMonth,
        datename(mm, fca.AppliedDate) as AppliedMonthName,
        datepart(yy, fca.AppliedDate) as AppliedYear,
        fca.ChemicalsMixed,
        ap.ApplicationPlanId,
        ap.ApplicationPlanName,
        ap.Isfreezed,
        dac.ApplicationChemicalId,
        das.ApplicationScheduleid,
        das.ApplicationScheduleName,
        dac.AgroChemicalId,
        dac.RecommendedQuantity,
        dac.QuantityUnitId,
        das.MethodofApplicationId,
        das.JustificationForUseId,
        das.DaysAfterSowing,
        das.Interval,
        dac.sequenceno,
        dac.concentration,
        dac.concentrationunitid,
        getutcdate() as talendate
    from dbo.ApplicationPlan ap
    join dbo.ApplicationSchedule das
        on ap.applicationplanid = das.applicationplanid
        and ap.isactive = 1
        and das.isactive = 1
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1

    The index details are:

    Table ApplicationPlan
        clustered index on applicationplanid

    Table ApplicationSchedule
        clustered index on ApplicationScheduleId

    Table ApplicationChemical
        clustered index on ApplicationChemicalId
        create nonclustered index idx_ApplicationScheduleid on ApplicationChemical(ApplicationScheduleid)

    Table FarmerCropApplications
        clustered index on FarmerCropApplicationId
        create nonclustered index idx_compp on FarmerCropApplications(isactive)
            include (ApplicationScheduleid, startdate, enddate, isapplied, applieddate, chemicalsmixed)
        nonclustered index on FarmerCropApplications(FarmerCropId)
        nonclustered index on FarmerCropApplications(ApplicationScheduleId)

    Any thoughts on the same would be greatly appreciated

    Thanks

  • The scans and the hash matches make sense. You don't have any kind of a WHERE clause to filter the data, so the query is returning 300,000 rows; scans and hash matches are the way to go. What's a bit of a surprise is the seek, when it's returning 123,067 out of 123,581 rows. However, since it's a seek in conjunction with a hash match, it's not that radically out of line. The cost distributions are pretty logical: the seek of 123k rows costs a lot more than the scans against 1.6k, 400 and 5k rows respectively. Then the job of creating a hash table for each of the joins in order to put the data together is, again, costly, but in line with the processing being done.

    You're returning everything from the table. There's no real way to speed that up except to bring back less data, or get faster disks.

    You're getting a residual probe on one of the hash joins:

    [qaCropin1].[dbo].[FarmerCropApplications].[ApplicationScheduleId] as [fca].[ApplicationScheduleId]=[qaCropin1].[dbo].[ApplicationSchedule].[ApplicationScheduleId] as [das].[ApplicationScheduleId]

    It's using this index:

    create nonclustered index idx_compp on FarmerCropApplications(isactive)
        include (ApplicationScheduleid, startdate, enddate, isapplied, applieddate, chemicalsmixed)

    I'm guessing here, but I'll bet the IsActive column isn't very selective (probably just a 1 or 0), so it's not eliminating very many rows at all. You have an index on ApplicationScheduleID, but it's not used because it doesn't have the necessary columns to satisfy the rest of the query. Adding an INCLUDE for IsActive and the rest of the columns to that other index would possibly work better than this index. Right now the seek retrieves most of the data (probably almost all of it satisfies IsActive), then probes against the INCLUDE column. That's not going to be as efficient as it would be if that probe column were the index key. In fact, you might see a merge join there, since the index would be able to bring the data back in the correct order.
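    Something along these lines (just a sketch; the index name is made up):

    create nonclustered index idx_ApplicationScheduleid_covering
    on dbo.FarmerCropApplications (ApplicationScheduleid)
    include (isactive, startdate, enddate, isapplied, applieddate, chemicalsmixed);

    That puts the probe column into the index key itself, which is also what would open the door to the merge join.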

    That's the best I have at the moment.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.
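    For illustration only (the index name is made up, and this assumes the existing clustered index isn't backing a PRIMARY KEY constraint):

    -- Re-cluster with the flag first, keeping the old key second so rows
    -- within each flag value keep their current relative order.
    -- drop_existing = on requires reusing the existing index name (assumed
    -- here to be CIX_FarmerCropApplications); a PK-backed clustered index
    -- would need the constraint dropped and recreated instead.
    create unique clustered index CIX_FarmerCropApplications
    on dbo.FarmerCropApplications (isactive, FarmerCropApplicationId)
    with (drop_existing = on);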

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Who is that poor guy who's meant to read through those 300k lines of this report?

    Considering 60 lines per screen, that's 5 thousand screens to look through.

    And if it's not for a human eye, why all that extensive formatting?

    Try to do SELECT .... INTO #Test to see how much time it actually takes to execute the query and how long it takes to pass that horrible recordset from the server to your screen.
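    That is, something like this (with the full select list from the original query substituted in):

    select fca.StartDate, fca.EndDate, fca.AppliedDate  -- ...rest of the original select list
    into #Test
    from dbo.ApplicationPlan ap
    join dbo.ApplicationSchedule das
        on ap.applicationplanid = das.applicationplanid
        and ap.isactive = 1
        and das.isactive = 1
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1;

    If the SELECT ... INTO finishes quickly, the 10 minutes is mostly network transfer and rendering, not the query itself.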

    _____________
    Code for TallyGenerator

  • ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Also, if you look at the last hash match on the left of the execution plan, you've got some pretty serious relational multiplication going on there. Another name for that is a "many-to-many" join, and it might be classified as an "accidental cross join". Of course, that may be what was actually intended, since this is all about the scheduling of the application of chemicals, which could certainly happen over more than one schedule period. Just thought I'd mention it, though.
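    A quick way to check (just a sketch): count the rows each side of that join contributes per ApplicationScheduleid; any id with more than one row on both sides multiplies out in the result.

    select das.ApplicationScheduleid,
           count(distinct dac.ApplicationChemicalId)   as chemical_rows,
           count(distinct fca.FarmerCropApplicationId) as farmer_app_rows
    from dbo.ApplicationSchedule das
    join dbo.ApplicationChemical dac
        on dac.ApplicationScheduleid = das.ApplicationScheduleid
        and dac.isactive = 1
    join dbo.FarmerCropApplications fca
        on fca.ApplicationScheduleid = das.ApplicationScheduleid
        and fca.isactive = 1
    group by das.ApplicationScheduleid
    having count(distinct dac.ApplicationChemicalId) > 1
       and count(distinct fca.FarmerCropApplicationId) > 1;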

    And I agree with Sergiy... it's not likely (and shouldn't be allowed if it is likely) that someone wants to see 330K rows of data on the screen. I wouldn't use that as a litmus strip for performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi,

    Thanks again for the great response.

    I am using this query to build a small data warehouse, and this query is supposed to run 3 times a day.

    The active flag has 1 or 0, and I am supposed to pick only the active = 1 rows.

    Hence the reason for so many rows.

    Are there any other ideas I could try on this query?

    Thanks in advance

    Rathi

  • rathimittha.mb (7/20/2016)


    I am using this query to build a small data warehouse, and this query is supposed to run 3 times a day.

    ...

    Hence the reason for so many rows.

    Then, there is no reason at all for this:

    datepart(ww, fca.StartDate) as StartWeek,
    convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.StartDate), 5), 'dd-MMM')) as StartWeekRange,
    datepart(mm, fca.StartDate) as StartMonth,
    datename(mm, fca.StartDate) as StartMonthName,
    datepart(yy, fca.StartDate) as StartYear,
    datepart(ww, fca.EndDate) as EndWeek,
    convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), -1), 'dd-MMM')) + ' - ' + convert(NVARCHAR(50), Format(dateadd(ww, datediff(ww, 0, fca.EndDate), 5), 'dd-MMM')) as EndWeekRange,
    datepart(mm, fca.EndDate) as EndMonth,
    datename(mm, fca.EndDate) as EndMonthName,
    datepart(yy, fca.EndDate) as EndYear,

    You should take these out of the query.
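    If the warehouse still needs those attributes, one option (a sketch only; the dbo.DimDate table and its column names here are hypothetical) is to compute them once per calendar date in a date dimension and join to it, instead of running the comparatively expensive FORMAT() against every row on every load:

    select fca.StartDate,
           dd.WeekOfYear   as StartWeek,
           dd.WeekRange    as StartWeekRange,
           dd.MonthNumber  as StartMonth,
           dd.MonthName    as StartMonthName,
           dd.CalendarYear as StartYear
    from dbo.FarmerCropApplications fca
    join dbo.DimDate dd
        on dd.DateValue = cast(fca.StartDate as date);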

    _____________
    Code for TallyGenerator

  • Hi Sergiy,

    I did remove those computations, and added OPTION (RECOMPILE) at the end, because every time I run the query the data and the indexes will for sure have gone through a change.

    The query now runs within 2 minutes and retrieves around 3 million rows. Is this time good, considering that this is used for data warehouse purposes?

    Thanks

    Rathi

  • Even for a data warehouse you don't want to move all the rows every time. You have limited to zero tuning opportunities for a query without any filtering.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant,

    I agree with you. But as per the requirement I have, I am supposed to take all the active = 1 rows and migrate them to the other database.

    In this case, my only filter is active = 1.

    Unfortunately, I have no other filter criteria.

    Is there any way I can use active = 1 so that the optimizer just looks at the active = 1 rows? Like in the syntax below, using it in the WHERE <filter_predicate>?

    CREATE NONCLUSTERED INDEX index_name
    ON <object> ( column [ ASC | DESC ] [ ,...n ] )
    [ INCLUDE ( column_name [ ,...n ] ) ]
    [ WHERE <filter_predicate> ]
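    i.e., something like this (just a sketch; the index name is made up):

    create nonclustered index idx_fca_isactive_filtered
    on dbo.FarmerCropApplications (ApplicationScheduleid)
    include (startdate, enddate, isapplied, applieddate, chemicalsmixed)
    where isactive = 1;
    -- note: the query must reference isactive = 1 as a literal (not a
    -- variable or parameter) for the optimizer to reliably match a
    -- filtered index.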

    Thanks

    Rathi

  • That filter only eliminates a very small subset of the data. Because of that, it really is cheaper for the optimizer to scan rather than seek. Go back to my original suggestions; they're the best I have without any other kind of filter. Again, every data warehouse I've worked with and helped support moves only a subset of the data instead of all data every time.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Thank you Grant. I guess I'll need to figure out a way to get the delta (changed) data rather than the entire set.
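    e.g., one possible sketch, assuming a rowversion column can be added to the source table (here called RowVer) and a watermark is kept in a hypothetical etl.ExtractControl table:

    -- Pull only rows changed since the last extract.
    declare @LastVer binary(8), @CurrentVer binary(8);

    select @LastVer = LastRowVer
    from etl.ExtractControl
    where TableName = 'FarmerCropApplications';

    set @CurrentVer = @@DBTS;  -- highest rowversion used in this database so far

    select fca.*
    from dbo.FarmerCropApplications fca
    where fca.isactive = 1
      and fca.RowVer > @LastVer
      and fca.RowVer <= @CurrentVer;

    -- After a successful load, advance the watermark.
    update etl.ExtractControl
    set LastRowVer = @CurrentVer
    where TableName = 'FarmerCropApplications';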

    Thanks Again !!

  • Jeff Moden (7/20/2016)


    ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    Actually, I think you did misunderstand somewhat what I said. I never mentioned an NCI. If, and only if (as I stated), a large number of rows are inactive, I would cluster first on the inactive flag, to immediately seek to only the rows SQL has to process. Yes, rows would move, but only once, and presumably roughly in ascending order of cluster key column #2 and beyond.

    Really my preference is to have a completely separate table for inactive rows, but many people love the idea of a "soft delete" flag in the main table; which really does seem to me to be a holdover from pre-relational days.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (7/21/2016)


    Jeff Moden (7/20/2016)


    ScottPletcher (7/20/2016)


    If the "isactive" flag is significant, i.e. a significant number of rows are actually inactive, then cluster first on it. If not, get rid of it. At the risk of sounding like Celko, they often really are a hold-over from the '60s and '70s when files were always fully read anyway, and thus such flags were "free" to use.

    Further advice would require seeing table definitions. For best performance overall, concentrate first and by far foremost on getting the best clustered index on every table.

    Gosh... If I understand what you said, I'd never cluster on a flag column first. I won't even build an NCI with a leading flag column. I learned my lesson on that several years ago, when I paralyzed Expedia.com for a good 2 minutes before I realized my mistake and flipped the first two columns of the index around. The page splits were rampant and the whole world was timing out because of it.

    Actually, I think you did misunderstand somewhat what I said. I never mentioned an NCI. If, and only if (as I stated), a large number of rows are inactive, I would cluster first on the inactive flag, to immediately seek to only the rows SQL has to process. Yes, rows would move, but only once, and presumably roughly in ascending order of cluster key column #2 and beyond.

    Really my preference is to have a completely separate table for inactive rows, but many people love the idea of a "soft delete" flag in the main table; which really does seem to me to be a holdover from pre-relational days.

    Thanks for confirming that I didn't misunderstand.

    Clustering on a flag (or making a flag, or any other extremely low cardinality column, the leading column of any index) is a really bad idea for anything where the flag can change even once. Like I said, I paralyzed Expedia.com for about two minutes by making such an index.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
