indexed views, aggregations and poor cardinality estimations

  • I've got a large sales table (66M rows) that I need to query in an aggregated way to return a bestseller list for specific dates and stores.

    I think I'm running into an issue relating to statistics and the distribution of values, but I'm not sure.

    Here is the structure of the tables involved in the query:

    tblSales table structure:

    [date_id] [int] NOT NULL,
    [store_id] [int] NOT NULL,
    [product_id] [bigint] NOT NULL,
    [volume_sold] [int] NOT NULL

    date_id is an int representing a specific date, store_id is the id of the store that made the sale, product_id is the id of the product that was sold, and volume_sold is the volume sold of a specific product on a specific date in a specific store. Pretty straightforward.

    tblDates:

    date_id int,
    calendar_date datetime,
    period int

    calendar_date is the actual date the date_id represents; period is an int used to group the date_ids into weeks (week 1 = period 1, week 2 = period 2, etc.).

    [dbo].[tblStores]
    [dbo].[tblStores_To_Aggregates]
    [dbo].[tblAggregates]

    These tables are used to assign groups of stores into aggregates that can then be queried on (e.g. a west coast aggregate, a small stores aggregate, an ALL aggregate, etc.).
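
    The DDL for these three tables isn't posted; a minimal sketch of the mapping, with column names inferred from the join conditions in the queries below (the real tables likely have more columns), might be:

    ```sql
    -- Sketch only: column names inferred from the join conditions in this
    -- thread; the real tables may carry more columns (names, regions, etc.).
    CREATE TABLE [dbo].[tblStores] (
        store_id int NOT NULL PRIMARY KEY
    );

    CREATE TABLE [dbo].[tblAggregates] (
        aggregate_id int NOT NULL PRIMARY KEY
    );

    -- Many-to-many mapping: one store can belong to several aggregates
    -- (e.g. both "west coast" and "ALL").
    CREATE TABLE [dbo].[tblStores_To_Aggregates] (
        store_id     int NOT NULL REFERENCES [dbo].[tblStores] (store_id),
        aggregate_id int NOT NULL REFERENCES [dbo].[tblAggregates] (aggregate_id),
        PRIMARY KEY (store_id, aggregate_id)
    );
    ```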

    To speed up the query I've created an indexed view on the data like so:

    select
        tblDates.period,
        product_id,
        [tblAggregates].aggregate_id,
        SUM(volume_sold) as vol_sold,
        COUNT_BIG(*) AS FREQUENCY
    from
        [dbo].[tblSales] sales
        INNER JOIN [dbo].[tblStores]
            ON sales.store_id = tblStores.store_id
        INNER JOIN [dbo].[tblStores_To_Aggregates]
            ON tblStores.store_id = [tblStores_To_Aggregates].store_id
        INNER JOIN [dbo].[tblAggregates]
            ON [tblStores_To_Aggregates].aggregate_id = [tblAggregates].aggregate_id
        INNER JOIN [dbo].[tblDates]
            ON tblDates.date_id = sales.date_id
    group by
        [tblAggregates].aggregate_id,
        tblDates.period,
        product_id

    And then a clustered index on the view in the form:

    [aggregate_id] ASC,
    [period] ASC,
    [product_id] ASC
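
    For reference, the full DDL for such an indexed view might look roughly like the sketch below. The view name is taken from the later post in this thread; SQL Server requires WITH SCHEMABINDING, two-part table names, and COUNT_BIG(*) before a unique clustered index can be created on a view with GROUP BY.

    ```sql
    -- Sketch of the indexed-view DDL implied by the posts in this thread.
    CREATE VIEW [dbo].[view_SalesData_Weekly]
    WITH SCHEMABINDING   -- mandatory for an indexed view
    AS
    SELECT
        d.period,
        s.product_id,
        a.aggregate_id,
        SUM(s.volume_sold) AS vol_sold,
        COUNT_BIG(*) AS FREQUENCY   -- required when the view uses GROUP BY
    FROM [dbo].[tblSales] s
        INNER JOIN [dbo].[tblStores] st
            ON s.store_id = st.store_id
        INNER JOIN [dbo].[tblStores_To_Aggregates] sta
            ON st.store_id = sta.store_id
        INNER JOIN [dbo].[tblAggregates] a
            ON sta.aggregate_id = a.aggregate_id
        INNER JOIN [dbo].[tblDates] d
            ON d.date_id = s.date_id
    GROUP BY a.aggregate_id, d.period, s.product_id;
    GO

    -- The unique clustered index materialises the view; the key matches the
    -- GROUP BY columns, which are unique per row of the view.
    CREATE UNIQUE CLUSTERED INDEX IX_view_SalesData_Weekly
    ON [dbo].[view_SalesData_Weekly] (aggregate_id, period, product_id);
    ```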

    And here is the query:

    select top 1000
        sales.product_id,
        SUM(volume_sold) as vol_sold,
        RANK() OVER (ORDER BY SUM(volume_sold) DESC) as product_rank
    FROM
        [dbo].[tblSales] sales
        INNER JOIN [dbo].[tblStores]
            ON sales.store_id = tblStores.store_id
        INNER JOIN [dbo].[tblStores_To_Aggregates]
            ON tblStores.store_id = [tblStores_To_Aggregates].store_id
        INNER JOIN [dbo].[tblAggregates]
            ON [tblStores_To_Aggregates].aggregate_id = [tblAggregates].aggregate_id
        INNER JOIN [dbo].[tblDates]
            ON tblDates.date_id = sales.date_id
    WHERE
        period >= 460
        and period <= 470
        and [tblAggregates].aggregate_id = 1
    group by
        product_id
    order by
        product_rank asc

    The issue comes when varying the aggregate_id. An aggregate can contain many stores or only a few: aggregate_id 1 has 1878 stores, aggregate_id 2 has 93.

    When querying on id 2, the query returns in under a second; on id 1 it takes 30+ seconds.

    Looking at the query plan, the issue seems to be caused by a poor estimate of the number of rows that will come back. The plan for 2 has an estimate that is close, but for 1 it's way off. This causes a spill to tempdb, which slows down the query.

    My guess is that the problem is related to the statistics (which are up to date, btw) giving poor cardinality estimates.

    I'm kind of at a loss as to what to do next. Any advice or suggestions would be appreciated.

  • Have you tried turning this into a stored procedure and including the WITH RECOMPILE option when you create it? This forces the optimiser to recalculate the execution plan every time the procedure runs. It may be that the optimiser is currently reusing the same plan (optimised for the smaller aggregate), hence the performance hit.
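
    A minimal sketch of that suggestion, with illustrative procedure and parameter names (the tblStores / tblAggregates joins from the original query are dropped here for brevity, since they only validate foreign keys):

    ```sql
    CREATE PROCEDURE dbo.usp_GetBestsellers   -- name is illustrative
        @aggregate_id int,
        @period_from  int,
        @period_to    int
    WITH RECOMPILE   -- rebuild the plan on every execution, so it is
                     -- optimised for the aggregate_id actually passed in
    AS
    BEGIN
        SELECT TOP 1000
            sales.product_id,
            SUM(volume_sold) AS vol_sold,
            RANK() OVER (ORDER BY SUM(volume_sold) DESC) AS product_rank
        FROM [dbo].[tblSales] sales
            INNER JOIN [dbo].[tblStores_To_Aggregates] sta
                ON sales.store_id = sta.store_id
            INNER JOIN [dbo].[tblDates] d
                ON d.date_id = sales.date_id
        WHERE d.period BETWEEN @period_from AND @period_to
          AND sta.aggregate_id = @aggregate_id
        GROUP BY sales.product_id
        ORDER BY product_rank;
    END
    ```

    On SQL Server 2005 and later, a statement-level OPTION (RECOMPILE) hint achieves the same per-execution compile with finer granularity.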

    One other option would be to break this one query into separate, smaller steps (if that's an option for you), e.g.:

    1. Select the list of stores into a @tablevariable
    2. Select the list of products by store
    3. etc...
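
    A rough sketch of that decomposition, with illustrative names (table variables work on SQL Server 2000 too, and the ranking can be replaced by the ordering itself):

    ```sql
    -- Step 1: resolve the aggregate to its list of stores.
    DECLARE @stores TABLE (store_id int PRIMARY KEY);

    INSERT INTO @stores (store_id)
    SELECT store_id
    FROM [dbo].[tblStores_To_Aggregates]
    WHERE aggregate_id = 1;

    -- Step 2: aggregate sales for just those stores over the period range.
    -- Joining a small, already-filtered store list gives the optimiser a
    -- much simpler cardinality problem than the five-table join.
    SELECT TOP 1000
        s.product_id,
        SUM(s.volume_sold) AS vol_sold
    FROM [dbo].[tblSales] s
        INNER JOIN @stores st
            ON s.store_id = st.store_id
        INNER JOIN [dbo].[tblDates] d
            ON d.date_id = s.date_id
    WHERE d.period BETWEEN 460 AND 470
    GROUP BY s.product_id
    ORDER BY vol_sold DESC;   -- rank is implicit in the ordering
    ```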

    For my sins we still use a 2000 server so run into performance issues all the time. RANK is a luxury we don't have 😉

  • Thanks for the reply. I tried SPs before to no avail.

    Right now I've seen some performance gains by referring to the indexed view directly in the query, like so:

    select
        product_id,
        SUM(vol_sold) as vol_sold,
        RANK() OVER (ORDER BY SUM(vol_sold) DESC) as product_rank
    from
        [dbo].[view_SalesData_Weekly] WITH (NOEXPAND)
    WHERE
        period >= 460
        and period <= 480
        and aggregate_id = 1
    group by
        product_id

    Even though SQL Server was already using the indexed view in the query I posted before, it appears that by referencing the view directly, more accurate statistics are found. I suspect the joins were causing the issue.

  • Ah, didn't realise you weren't already referencing the view directly. The optimiser is smart, but sometimes it needs a helping hand. It usually finds reasons not to use a materialised view (hence the NOEXPAND option) rather than always seeking one out.
