Index design on large and "wide" tables

  • We have a database of some statistical measures for some entities per day.

    I can't share the full schema, but I'll try to simplify it so the general idea can still be followed.

    The basic design is as follows:

    CREATE TABLE dbo.StatisticsTable (
        statisticsType INT, timePeriod INT, entityID INT
        , statisticalMeasure1 NUMERIC(19,12)
        ...
        , statisticalMeasureN NUMERIC(19,12)
        , PRIMARY KEY (statisticsType, entityID, timePeriod))

    timePeriod contains a date value like 20210929. There are perhaps 100,000 different entities, 3 different statistics types and 100 million rows in total per type. So, in total, something like 300 million rows.

    Each row contains something like 400 different columns / measures. Some of the measures are strings, so each row is quite large.

    Every day, we want to generate the latest data as well as recalculate the previous day in case some rows have changed.

    So, we do the following:

    create table maxPeriodTable (entityid int, maxPeriod int, primary key (entityid, maxPeriod))

    insert into maxPeriodTable
    select entityid, max(timeperiod)
    from StatisticsTable
    where statisticstype = 1
    group by entityid
    -- ... perform stats calculation ...

    delete t
    from StatisticsTable t
    inner join maxPeriodTable pm
    ON pm.entityid = t.entityid
    and t.timeperiod >= pm.maxperiod
    where t.statisticstype = 1

    The problem is that the DELETE performs horribly, even though we only recalculate the last day for each entity (like 100k rows max).

    Usually SQL Server decides it wants to do a full scan of StatisticsTable and then a hash match against the "last period" table (see the attached query plan image). This goes against my understanding, at least, which is that it should just perform a nested loop join.

    But even with a forced "loop join" hint, it doesn't go fast. In fact, sometimes it's even slower than the "full scan mode".

    For some reason, it helps to create an index on entityID alone. In that case it can do an index seek, and since the clustered PK columns are included in the nonclustered index, it doesn't need any lookups.
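
    For reference, the index in question is roughly this (the name is just for illustration):

    -- Nonclustered index on entityID alone; the clustered key columns
    -- (statisticsType, entityID, timePeriod) are carried in it automatically.
    CREATE NONCLUSTERED INDEX IX_StatisticsTable_entityID
        ON dbo.StatisticsTable (entityID);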

    My question is: how come the PK isn't used for the delete? Is it because of the excessive IO it needs to perform by trudging through the big data rows? And why does it help to create the extra index, even though its column is already part of the PK?

  • OK, lots to unpack. Honestly, the first thing I have to say is that it really sounds like this data needs to be properly normalized. I would absolutely focus a lot of effort on making that happen; I think you'd end up with radically less data if it were normalized. However, let's talk about the immediate problem.

    An index on EntityID is not the same as a compound index on StatisticsType, EntityID, Period. Yes, EntityID is part of the compound key, but it's not the leading edge, the first column. That first column is what is used for statistics. Statistics are how SQL Server determines row count estimates, and row count estimates drive the choices the optimizer makes. By having an index on EntityID, you get statistics about EntityID. Evidently that gives SQL Server much better row estimates than StatisticsType does, and you get a superior execution plan because of it. From there, as you say, the clustered key is in the nonclustered index, so it finds what it needs to run the DELETE. I'd go with that.
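
    If you want to see what the optimizer is working from, you can inspect the statistics behind that index directly; a quick sketch (substitute whatever your index or statistics object is actually named):

    -- Header, density vector and histogram SQL Server uses for row estimates
    DBCC SHOW_STATISTICS ('dbo.StatisticsTable', 'IX_StatisticsTable_entityID');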

    But, good gosh, look at the normalization.

     

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • And one other point I just noticed: you say statisticstype has three values, yet you've got it as the first column in your compound key. The statistics probably have one or two rows with row count estimates in the hundreds of millions. Changing the ordering of the index could be extremely valuable to your organization. For example, I'd suggest the most restrictive data as the leading edge. The data that's going to give you the narrowest, and most accurate, row counts is going to be the best choice. Also, when dealing with indexes, a good habit is to always think: "The Key, The Whole Key and Nothing But the Key, So Help Me Codd". It's a design maxim, but it very much applies to performance too.
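
    Purely as a sketch of the mechanics, not a recommendation of a specific order (the constraint name is made up, and rebuilding a clustered primary key on 300 million rows is a big operation you'd plan carefully):

    -- Drop and recreate the clustered PK with a different column order
    ALTER TABLE dbo.StatisticsTable DROP CONSTRAINT PK_StatisticsTable;

    ALTER TABLE dbo.StatisticsTable
        ADD CONSTRAINT PK_StatisticsTable
        PRIMARY KEY CLUSTERED (entityID, timePeriod, statisticsType);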

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To determine proper clustering (PKs) for the tables, we'd need to know more about how the data is most often processed, by statistics type or entityid or timeperiod range.

    However, as to the DELETE itself, you have one potential performance-hurter that you don't need.  You should specify "=" rather than ">=" on the JOIN regarding the timeperiods.  At least, the way I read it, it's not possible for the value in Stats to be ">" the value in max.

    delete t
    from StatisticsTable t
    inner join maxPeriodTable pm
    ON pm.entityid = t.entityid
    and t.timeperiod = pm.maxperiod --rather than ">="
    where t.statisticstype = 1

    What you really want to see is a MERGE join rather than a LOOP or HASH join.
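
    If the optimizer won't pick one on its own, you can nudge it with a hint and compare; just a sketch, and test it first, since a hint overrides the optimizer's judgment:

    delete t
    from StatisticsTable t
    inner join maxPeriodTable pm
    ON pm.entityid = t.entityid
    and t.timeperiod = pm.maxperiod
    where t.statisticstype = 1
    option (merge join)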

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hello, and thanks for the answers! I realized that I never specified exactly how the data is "read".

    Statisticstype is always part of the WHERE clause, because it makes no sense to mix and match the types.

    In most other cases, the WHERE clause will contain one or several entities.

    Finally, there's usually, but not always, a time period BETWEEN selection.

    So a standard select query could be:

    select timePeriod, statisticalMeasure1, statisticalMeasure2
    from StatisticsTable s
    where s.statisticstype = 1
    and s.entityid = 123
    and s.timePeriod between 20210101 and 20211231

    These measures can also be aggregated, but this is the main version.

    I just don't quite see how I would go with the index, or what kind of query plan I should be working towards to handle this kind of "delete" the best way. Perhaps there's no quick way, since a loop join still needs to do at least 100k lookups (one per entity).

    As you write, there are three or four distinct statisticstype values, each with 100 million rows or so. The smallest number of rows for the delete statement can be found via the "timePeriod" column, since the delete mostly touches the last couple of days. So a suggested index could be something like timePeriod, entityID, statisticstype.
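
    In other words, something like this (the index name is just a placeholder):

    CREATE NONCLUSTERED INDEX IX_StatisticsTable_timePeriod
        ON dbo.StatisticsTable (timePeriod, entityID, statisticsType);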

    It would likely be good for inserts, since arriving data will already be sorted that way naturally, but I'm pretty sure it would perform badly when a timeperiod selection isn't part of the SELECT queries.

    What bothers me about the whole thing is that the index on entityID seems to help for no logical reason, because the "maxPeriodTable" still contains all available entities, so it doesn't narrow anything down. It's almost like scanning the non-clustered index is much, much cheaper. Non-clustered indexes lie on the same disks as clustered ones, though, so there shouldn't be any performance advantage to using them.

    Would a non-clustered index on timePeriod then work wonders? Or partitioning on statisticstype and indexing on entityID?

    The guy who built the system before me says that what helped him get it up to speed was to create separate non-clustered indexes on each of the PK columns, giving SQL Server full flexibility to choose from. It doesn't sound right to me, but this is a REALLY large data set, so perhaps other rules apply at this scale.

    As for normalization, I'm not sure there is a way of doing it other than splitting the table into several narrower tables by a column split? Sure, this could probably be done and might be done in the future.
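
    The kind of split I picture would be roughly this (table names and the grouping of measures are just made up for illustration):

    -- Frequently read measures stay in a narrower "core" table
    CREATE TABLE dbo.StatisticsCore (
        statisticsType INT,
        timePeriod INT,
        entityID INT,
        statisticalMeasure1 NUMERIC(19,12),   -- plus the other hot measures
        PRIMARY KEY (statisticsType, entityID, timePeriod)
    );

    -- Rarely read and string measures move to a second table with the same key
    CREATE TABLE dbo.StatisticsExtended (
        statisticsType INT,
        timePeriod INT,
        entityID INT,
        statisticalMeasureN NUMERIC(19,12),   -- plus the other cold measures
        PRIMARY KEY (statisticsType, entityID, timePeriod)
    );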

    About the >= vs =, yes, you're correct, it can be = in this example. I didn't mention it, but often we rerun a few days back, so we still need to fetch more than just the last exact date.

    Once again, I really appreciate your answers.

  • siggemannen wrote:

    About the >= vs =, yes, you're correct, it can be = in this example. I didn't mention it, but often we rerun a few days back, so we still need to fetch more than just the last exact date.

    But that would be <= the (max) date, not >=.

    Did you re-try it with an "=" join rather than ">="? You should get a better plan.

    If that's your typical query, then your clustering (PK as you prefer) is fine as it is.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Hi. Yesterday I added a non-clustered index on entityID and it seems to do much better; see the attached plan.

    So it seems we need to cover the bases with a second index when a lot of rows remain after filtering on the left-most column (statisticstype).

    Actually, I now recall that at my previous job we had similar issues. Every table there had something called "companyID", which indicated which "sub"-business the row belonged to. So every selection had:

    select ... 
    from orders o
    inner join orderrows r
    on r.companyid = o.companyid
    and r.orderid = o.orderid
    where o.companyid = 1
    and ...

    Table PKs were either (companyID, tableID) or (tableID, companyID), depending on the taste of the developer.

    Most of our clients had only one company, so the selection didn't narrow anything down. And sometimes the server decided to get weird and did full scans.

    Perhaps both indexes are needed here to cover the bases correctly.

  • A picture of a plan isn't a plan. Also, plans are not a measure of performance. They're a description of behavior. Be cautious about looking at them as a measurement stick.

    I'm still where I was. Any column would be better as the leading edge of the index than a column with only 4 values across hundreds of millions of rows. The statistics on an index like that would be useless. And I'm not saying drop the column from your compound key. I am saying that rearranging the order of the key will help. I can't tell you the specific order without a lot more details.

    As to a scan of the non-clustered index being faster than a scan of the clustered one, well, yeah. The optimizer can see that too. If a non-clustered index can satisfy the needs of the query, even if it's just a scan, and it's radically smaller than the clustered index (and I'm positive it is in this case), then it will use that smaller index to do a faster scan. Yes, maybe the same number of rows, but more rows per page, so fewer pages being scanned. Remember, data is stored with the clustered index, but not with the non-clustered index unless you specifically add it as an INCLUDE.
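
    You can see the size difference for yourself with a quick query against the system views; nothing is assumed here beyond your table name:

    -- Compare page counts per index on the table
    SELECT i.name AS index_name, i.index_id, SUM(ps.used_page_count) AS used_pages
    FROM sys.indexes AS i
    JOIN sys.dm_db_partition_stats AS ps
        ON ps.object_id = i.object_id
       AND ps.index_id = i.index_id
    WHERE i.object_id = OBJECT_ID('dbo.StatisticsTable')
    GROUP BY i.name, i.index_id;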

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey wrote:

    Also, plans are not a measure of performance. They're a description of behavior. Be cautious about looking at them as a measurement stick.

    +1000000!  I will probably not hear more accurate or important words today.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • siggemannen wrote:

    Each row contains something like 400 different columns / measures. Some of the measures are strings, so each row is quite large.

    The problem is that the DELETE performs horribly, even though we only recalculate the last day for each entity (like 100k rows max).

    Usually SQL Server decides it wants to do a full scan of StatisticsTable and then a hash match against the "last period" table (see the attached query plan image). This goes against my understanding, at least, which is that it should just perform a nested loop join.

    But even with a forced "loop join" hint, it doesn't go fast. In fact, sometimes it's even slower than the "full scan mode".

    Are you sure the performance problems aren't mainly down to deleting rows from a table with wide rows? I've found that deleting wide rows can be very slow no matter how good the query's plan is.
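
    One thing that sometimes softens that, for what it's worth, is deleting in batches so each transaction touches fewer of those wide rows. Just a sketch of the idea (10000 is an arbitrary batch size):

    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        -- Delete a slice of the recalculated rows at a time
        DELETE TOP (10000) t
        FROM StatisticsTable t
        INNER JOIN maxPeriodTable pm
            ON pm.entityid = t.entityid
           AND t.timeperiod >= pm.maxperiod
        WHERE t.statisticstype = 1;

        SET @rows = @@ROWCOUNT;
    END;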
