Managing a large table in SQL Server 2014

  • If the app is going to be reading the entire table anyway, if possible, use a clustered columnstore index.

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • Sergiy - Tuesday, March 20, 2018 8:00 AM

    Well, I stand by my suggestion - the best long-term solution to the performance problems you have is to move away from using PowerBI.
    Because query tools which build queries like these can be described with only one word: "Rubbish".

    Query 1.

    a) The long list of columns selected for [t1], especially

    convert(date, [_].[Date]) as [Date],

    does not make any sense.
    The only columns actually used are [Week Ending] and [TimeId].
    b) The LEFT OUTER JOIN is overridden by the WHERE clause and effectively turned into an INNER JOIN, but not as efficient as an actual INNER JOIN.
    c) TimeID has no logical correspondence to the clustered index on table [t0], regardless of which column is used for the clustered index - [Id], as now, or [StartTime], as suggested.
    Therefore a JOIN on that field will cause a clustered index scan (aka table scan). Indexing this field would not help much, because the optimiser cannot predict the range of TimeID values coming out of [t1], so it would have to choose
    between {index scan + bookmark lookup} and {CI scan}. The CI scan usually wins.
    d) Using TOP (n) without ORDER BY does not make much sense, as it would return arbitrarily selected rows.
    It looks like its only purpose is to prevent memory overload when the returned recordset is too big. But that can be, and must be, achieved in a different way. At the very least, a user must get an indication that rows were omitted from the returned recordset.

    Assuming you have already changed the clustered index on [dbo].[QueryStats], this is how the query should look:

    SELECT
        [t1].[Week Ending],
        COUNT_BIG([t0].[EventClass]) AS [a0]
    FROM [dbo].[QueryStats] AS [t0]
    INNER JOIN (
        SELECT MIN([Date]) AS [FromDate],
               DATEADD(dd, DATEDIFF(dd, 0, MAX([Date])) + 1, 0) AS [ToDate],
               [Week Ending]
        FROM [dbo].[Time]
        WHERE [Week Ending] <  CAST('20180321 00:00:00' AS datetime)
          AND [Week Ending] >= CAST('20180124 00:00:00' AS datetime)
        GROUP BY [Week Ending]
    ) AS [t1] ON [t0].[StartTime] >= [t1].[FromDate] AND [t0].[StartTime] < [t1].[ToDate]
    WHERE
        ([t0].[EventClass] IN (N'RPC:Completed', N'SQL:BatchCompleted', N'Query End'))
    GROUP BY [t1].[Week Ending]

    But since you cannot rewrite the query, you may fool the instrument and make it do what you need.
    For that you'll need to figure out the full range of dates covered by the selected Week Endings, and add a filter on StartTime >= FirstDateFromTheRange and StartTime < DATEADD(dd, 1, LastDateFromTheRange).
    With a clustered index on StartTime you should not have any performance issues even on much bigger data sets.
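
    A minimal T-SQL sketch of that workaround, assuming the column names used in the queries above ([Date] and [Week Ending] in dbo.Time, [StartTime] in dbo.QueryStats); the variable names are just placeholders:

    DECLARE @FromDate datetime, @ToDate datetime;

    -- full range of dates covered by the selected Week Endings
    SELECT @FromDate = MIN([Date]),
           @ToDate   = DATEADD(dd, 1, MAX([Date]))
    FROM [dbo].[Time]
    WHERE [Week Ending] >= CAST('20180124 00:00:00' AS datetime)
      AND [Week Ending] <  CAST('20180321 00:00:00' AS datetime);

    -- the extra predicate to feed back into the report filter on [dbo].[QueryStats]:
    -- [StartTime] >= @FromDate AND [StartTime] < @ToDate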

    Query 2.
    Pretty much the same issue.
    But because the filter is against [t1].[Date] instead of [t1].[Week Ending], the fix is even easier.
    Just add a filter on [StartTime] within the same range as [Date].

    Query 3
    Same as Query 2.

    Query 4.
    It seems like it returns the aggregations against the whole history, as far back as the data goes.
    It's meant to scan the whole data set by definition, so there is not much to improve.
    I can only suggest applying a filter on StartTime to limit the period covered by the report.
    To me, it's pretty stupid to recalculate aggregations over read-only data sets for previous months, which have already been calculated many times before, but the tool is as dumb as it is.

    Query 5
    The same as Query 4, just with one item fewer in the grouping.

    Hi Sergiy,

    I appreciate the time you spent looking at my issue, thank you.

    Moving away from Power BI isn't something the business will be doing any time soon, it was brought in to solve several business requirements and our end users are happy with it. They won't accept us taking it away because we have some technical challenges.

  • ScottPletcher - Tuesday, March 20, 2018 8:19 AM

    If the app is going to be reading the entire table anyway, if possible, use a clustered columnstore index.

    Yes, I agree. I will move the nvarchar(max) column away from the main table and that will allow me to convert the current clustered index to a columnstore.
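
    As an illustration only - a rough T-SQL sketch of that change, with an assumed column name (TextData) and assumed table/index names, since the real ones aren't shown in the thread. On SQL Server 2014 a columnstore index cannot include an nvarchar(max) column, hence moving it out first:

    -- side table for the wide text column
    CREATE TABLE dbo.QueryStatsText
    (
        Id       int           NOT NULL PRIMARY KEY,  -- same key as dbo.QueryStats
        TextData nvarchar(max) NULL
    );

    INSERT INTO dbo.QueryStatsText (Id, TextData)
    SELECT Id, TextData
    FROM dbo.QueryStats;

    ALTER TABLE dbo.QueryStats DROP COLUMN TextData;

    -- drop the existing rowstore clustered index (or PK constraint) on dbo.QueryStats first, then:
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_QueryStats ON dbo.QueryStats;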

  • Jim-S - Tuesday, March 20, 2018 8:25 AM

    ScottPletcher - Tuesday, March 20, 2018 8:19 AM

    If the app is going to be reading the entire table anyway, if possible, use a clustered columnstore index.

    Yes, I agree. I will move the nvarchar(max) column away from the main table and that will allow me to convert the current clustered index to a columnstore.

    Exactly. You don't need to sit here trying to find out how you are querying the database. You know how PowerBI is going to do it, so try to optimize for that as best you can. Reduce the number of bytes and records being read as much as possible.

  • Sergiy - Monday, March 19, 2018 7:55 PM

    frederico_fonseca - Monday, March 19, 2018 7:07 PM

    When loading the data into Power BI, split the datetime fields into a date portion and a time portion as separate columns.

    The bigint fields should also, where possible, be split,
    e.g. a number such as

    123,123,123,122 could be split into 3 smaller columns.

    Never a good idea.
    If a package assumes that to be good practice, the best approach would be not to use it.

    Can you write a query to select all queries started during off-hours last night?
    And what about a query selecting all events with writes greater than a certain amount?

    With split columns both queries would require table scans, which means reading the whole table into memory and applying the filtering there.
    Which is exactly the problem with the current state of affairs in the OP's database.

    With Vertipaq it is almost always a good idea - it is not SQL processing it.
    Once the data is loaded into Vertipaq, querying it along the lines of your 2 examples above would be quite fast and easy enough to do with DAX. And Vertipaq is an in-memory engine, so the limitation will be how much memory is required and available to Power BI.

    Now that I have seen what queries are being issued by Power BI, I am quite inclined to say that some changes to how it is done there could most likely improve the whole process, at least for parts of it.

  • frederico_fonseca - Tuesday, March 20, 2018 12:47 PM

    Sergiy - Monday, March 19, 2018 7:55 PM

    frederico_fonseca - Monday, March 19, 2018 7:07 PM

    When loading the data into Power BI, split the datetime fields into a date portion and a time portion as separate columns.

    The bigint fields should also, where possible, be split,
    e.g. a number such as

    123,123,123,122 could be split into 3 smaller columns.

    Never a good idea.
    If a package assumes that to be good practice, the best approach would be not to use it.

    Can you write a query to select all queries started during off-hours last night?
    And what about a query selecting all events with writes greater than a certain amount?

    With split columns both queries would require table scans, which means reading the whole table into memory and applying the filtering there.
    Which is exactly the problem with the current state of affairs in the OP's database.

    With Vertipaq it is almost always a good idea - it is not SQL processing it.
    Once the data is loaded into Vertipaq, querying it along the lines of your 2 examples above would be quite fast and easy enough to do with DAX. And Vertipaq is an in-memory engine, so the limitation will be how much memory is required and available to Power BI.

    Now that I have seen what queries are being issued by Power BI, I am quite inclined to say that some changes to how it is done there could most likely improve the whole process, at least for parts of it.

    Yeah, for sure. The only thing to be aware of here, though, is that while it may be fast once it gets into Vertipaq, getting it there can still time out. There is a 10 minute timeout cap by default that you can override, but taking that long to query something can still cause the connection to fail. So it's still best practice to weigh your options to ensure success.

  • frederico_fonseca - Tuesday, March 20, 2018 12:47 PM

    With Vertipaq it is almost always a good idea - it is not SQL processing it.
    Once the data is loaded into Vertipaq, querying it along the lines of your 2 examples above would be quite fast and easy enough to do with DAX. And Vertipaq is an in-memory engine, so the limitation will be how much memory is required and available to Power BI.

    "Once the data loaded" is exactly the problem.
    The OP is complaining about the latency caused by exactly that - uploading data from storage into memory.

    And SQL processing still has an advantage - SQL processes data while loading it; the engine may in some cases even skip unnecessary columns which are not required for the result set, reducing the amount of memory required.
    Vertipaq has to wait for the whole data set to be loaded into memory, including irrelevant fields, before it can start in-memory processing.

    _____________
    Code for TallyGenerator

  • Jim-S - Tuesday, March 20, 2018 4:03 AM

    Sergiy - Tuesday, March 20, 2018 3:54 AM

    Before starting on these queries I have to ask: Where do they come from? What/who wrote/generated these queries? Is there a possibility to change the way they are built?

    They are generated by Power BI based on my report criteria such as filters etc. I am not aware of any way to rewrite them.

    You can partially influence how they are written - but it all depends on how the different tables relate and on how much freedom to change the report you are going to give the users of each pbi.
    Left outer joins can be changed to inner joins if "Assume Referential Integrity" is ticked when creating the relationships - obviously only where this applies between the 2 tables.

    The columns that go into the SQL can also be changed by editing the source query and specifying only the columns that will be used by each pbi.
    For example, for your queries I think the following would be enough from dbo.Time:

    select TimeId
         , CalendarYear
         , CalendarPeriod
         , WeekEnding
    from dbo.Time


    Also, in some cases it's better to flatten the tables - no hard rule here!!
    So the 2 queries you have for QueryStats and Time could eventually be combined into a single query source, which may perform better:

    select qs.TraceType
         , qs.LoginName
         , qs.EventClass
         , ti.CalendarYear
         , ti.CalendarPeriod
         , ti.WeekEnding
    from dbo.QueryStats qs
    inner join dbo.Time ti
       on ti.TimeId = qs.TimeId

  • Adding to what's said by Frederico - by clustering on StartTime and applying the filter against this column you'll most certainly make the query run much faster.
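
    A minimal sketch of that change, assuming the existing clustered index on [Id] can be replaced (index names here are placeholders):

    -- drop the current clustered index / PK on Id first, then:
    CREATE CLUSTERED INDEX CIX_QueryStats_StartTime
        ON dbo.QueryStats (StartTime);

    -- with this in place, a range filter such as
    -- [StartTime] >= @FromDate AND [StartTime] < @ToDate
    -- becomes a clustered index seek instead of a scan.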

    _____________
    Code for TallyGenerator

  • One of the biggest issues here is the join by TimeId.

    If you can change the table design, you'd better drop the column TimeId from dbo.Time, make Date the PK, replace TimeId with Date in QueryStats, and create a clustered index on (Date, StartTime) on that table.
    In this case you could join on [Date], and it would put the clustering to very good use.
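
    A rough sketch of that redesign, with placeholder constraint/index names; it assumes the existing clustered index on QueryStats and the old PK on dbo.Time can be dropped first:

    ALTER TABLE dbo.QueryStats ADD [Date] date NULL;

    UPDATE qs
    SET    qs.[Date] = t.[Date]
    FROM   dbo.QueryStats AS qs
    JOIN   dbo.[Time]     AS t ON t.TimeId = qs.TimeId;

    ALTER TABLE dbo.QueryStats DROP COLUMN TimeId;

    CREATE CLUSTERED INDEX CIX_QueryStats_Date_StartTime
        ON dbo.QueryStats ([Date], StartTime);

    -- dbo.Time keyed by Date instead of TimeId (assumes [Date] is already NOT NULL)
    ALTER TABLE dbo.[Time] DROP COLUMN TimeId;
    ALTER TABLE dbo.[Time] ADD CONSTRAINT PK_Time PRIMARY KEY ([Date]);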

    _____________
    Code for TallyGenerator

  • I appreciate everyone's contribution; I have enough information here to get started on something. Will post back the results once available.
