Query on a large database

  • Hi All.

    I was asked to create a report on very large data. Below are the requirements. Any help would be greatly appreciated. Thanks.

    We have a very large fact table with 100 million records. I will have to create a report on that table. I just created a query and it is taking more than an hour to retrieve a month's worth of data. As this report will be used by our customers, it should not take longer than 15 seconds. Would creating tables with precalculated values help improve the performance?

    Thanks.

  • Make sure you have a clustered index on "record date time" column.

    Then selecting a monthly range of records will work really fast, and the rest of the conditions will apply only to that small subset, so whatever they are, they should barely affect the performance of the query.
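
    Something along these lines (table and column names are placeholders here, since we haven't seen the actual definitions yet):

    -- Placeholder names: adjust to the real fact table and its datetime column.
    CREATE CLUSTERED INDEX CIX_Facts_RecordDateTime
        ON dbo.Facts (RecordDateTime);

    -- A month of data then becomes a single clustered index range seek:
    DECLARE @MonthStart datetime = '20150901';

    SELECT CustomerID, ItemID, RecordDateTime   -- whatever columns the report needs
    FROM dbo.Facts
    WHERE RecordDateTime >= @MonthStart
      AND RecordDateTime <  DATEADD(MONTH, 1, @MonthStart);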

    _____________
    Code for TallyGenerator

  • Sergiy (10/19/2015)


    Make sure you have a clustered index on "record date time" column.

    I would only create a clustered index if there isn't one already - any existing clustered index may well be critical to something else.

    If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.
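
    For example (placeholder names again, since we don't have the actual schema):

    -- Placeholder names; INCLUDE whichever columns the report actually needs.
    CREATE NONCLUSTERED INDEX IX_Facts_RecordDateTime
        ON dbo.Facts (RecordDateTime)
        INCLUDE (CustomerID, ItemID);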

  • Table definitions, index definitions, query and actual execution plan please.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning the table by month could help improve performance. What you need to understand in this case is that most of the performance issue you're facing is likely due to the massive IO of fetching data from disk, or to aggregating large data sets (a CPU issue). Indexes help search for and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology to be used.

    Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over a period of time.
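
    As a rough sketch of the columnstore option (placeholder names, and this assumes SQL Server 2012 or later):

    -- Placeholder table/column names; note that on SQL Server 2012 and 2014
    -- a nonclustered columnstore index makes the table read-only while it exists.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Facts
        ON dbo.Facts ([Date], ItemID, CustomerID);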

    Jayanth Kurup

  • A fact table with 100 million rows is not that large compared to the tables I had at my previous company. However, good indexing is important. We can't give any advice without the information Gail asked for. For a more detailed explanation of what we ask for, read the following article: http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks all for your suggestions. Here is more information on the table design etc:

    I will need to create a report on the tables below.

    The Order Facts table has the following columns:

    ID bigint (clustered), Date datetime, StoreItemID int, CustomerID bigint

    We also have a non-clustered index on Date that includes (StoreItemID, CustomerID).

    The Store Items table:

    ItemID int (clustered), ItemName nvarchar(100), Item Category nvarchar(100)

    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    The report should show customers by item purchased in a month, and customers by item category in a month.
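
    In CREATE-statement form that is roughly the following (table and index names here are made up; the real definitions may have more columns):

    -- Reconstruction of the description above; object names are assumed.
    CREATE TABLE dbo.OrderFacts
    (
        ID          bigint   NOT NULL,
        [Date]      datetime NOT NULL,
        StoreItemID int      NOT NULL,
        CustomerID  bigint   NOT NULL
    );

    CREATE CLUSTERED INDEX CIX_OrderFacts_ID
        ON dbo.OrderFacts (ID);

    CREATE NONCLUSTERED INDEX IX_OrderFacts_Date
        ON dbo.OrderFacts ([Date])
        INCLUDE (StoreItemID, CustomerID);

    CREATE TABLE dbo.StoreItems
    (
        ItemID       int           NOT NULL,
        ItemName     nvarchar(100) NULL,
        ItemCategory nvarchar(100) NULL   -- "Item Category" in the description
    );

    CREATE CLUSTERED INDEX CIX_StoreItems_ItemID
        ON dbo.StoreItems (ItemID);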

  • GilaMonster (10/20/2015)


    Table definitions, index definitions, query and actual execution plan please.

    Table and indexes as their CREATE statements please; it makes things easier to read.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • sreeya (10/20/2015)


    Thanks all for your suggestions. Here is more information on the table design etc:

    I will need to create a report on the tables below.

    The Order Facts table has the following columns:

    ID bigint (clustered), Date datetime, StoreItemID int, CustomerID bigint

    We also have a non-clustered index on Date that includes (StoreItemID, CustomerID).

    The Store Items table:

    ItemID int (clustered), ItemName nvarchar(100), Item Category nvarchar(100)

    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    The report should show customers by item purchased in a month, and customers by item category in a month.

    GilaMonster (10/20/2015)


    Table definitions, index definitions, query and actual execution plan please.

    The details you have posted are insufficient. Gail's requests represent the minimum requirement for a decent chance of offering you any help.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi,

    Creating an index on the date column and partitioning the table would be better options for creating reports on a big table.
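
    For example, a monthly partition function and scheme might look like this (boundary dates and names are just placeholders):

    -- Placeholder names and boundaries; one boundary value per month.
    CREATE PARTITION FUNCTION pfMonthly (datetime)
        AS RANGE RIGHT FOR VALUES ('2015-08-01', '2015-09-01', '2015-10-01');

    CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- The fact table (or an index on it) would then be built ON psMonthly([Date]).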

  • sreeya (10/20/2015)


    The same customer may purchase multiple times on the same day, plus we have about 100 million active users. As I said, the fact table is pretty big, with several million transactions each day.

    That's a lot of sales a day... out of curiosity, what industry are you involved in?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • Kristen-173977 (10/20/2015)


    Sergiy (10/19/2015)


    Make sure you have a clustered index on "record date time" column.

    I would only create a clustered index if there isn't one already - any existing clustered index may well be critical to something else.

    If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.

    As we can see from the following post, there is nothing important about the existing clustered index. Just the common junior mistake of putting a clustered index on an identity column.

    But I would not easily suggest recreating/changing a clustered index on a table with 100 million rows. It might be quite a long exercise.

    There is still the option of using the existing clustered index:

    SELECT MaxID = MAX(ID), MinID = MIN(ID)

    FROM ... WHERE [Date] BETWEEN @MinDate AND @MaxDate

    And then use those MaxID and MinID to limit the number of rows processed by all other conditions in the WHERE clause.

    You still need to apply the filter against the transaction date.
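
    Put together, the idea is something like this (table name and the report's other filters are placeholders; it assumes the identity-based clustered key is roughly in date order):

    -- Placeholder object names; ID is the existing clustered identity key.
    DECLARE @MinDate datetime = '20150901',
            @MaxDate datetime = '20150930';
    DECLARE @MinID bigint, @MaxID bigint;

    -- Step 1: use the nonclustered date index to find the ID range for the month.
    SELECT @MinID = MIN(ID), @MaxID = MAX(ID)
    FROM dbo.OrderFacts
    WHERE [Date] BETWEEN @MinDate AND @MaxDate;

    -- Step 2: range-seek the clustered index, then apply the remaining filters.
    SELECT CustomerID, StoreItemID, [Date]
    FROM dbo.OrderFacts
    WHERE ID BETWEEN @MinID AND @MaxID
      AND [Date] BETWEEN @MinDate AND @MaxDate;   -- still filter on the actual date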

    _____________
    Code for TallyGenerator

  • Sergiy (10/20/2015)


    Kristen-173977 (10/20/2015)


    Sergiy (10/19/2015)


    Make sure you have a clustered index on "record date time" column.

    I would only create a clustered index if there isn't one already - any existing clustered index may well be critical to something else.

    If a clustered index already exists then I would try creating a non-clustered index on "record date time" to improve the report.

    As we can see from the following post, there is nothing important about the existing clustered index. Just the common junior mistake of putting a clustered index on an identity column.

    But I would not easily suggest recreating/changing a clustered index on a table with 100 million rows. It might be quite a long exercise.

    There is still the option of using the existing clustered index:

    SELECT MaxID = MAX(ID), MinID = MIN(ID)

    FROM ... WHERE [Date] BETWEEN @MinDate AND @MaxDate

    And then use those MaxID and MinID to limit the number of rows processed by all other conditions in the WHERE clause.

    You still need to apply the filter against the transaction date.

    Why not index Date in one table, Item in the other?

    Assuming that anyone can see all Customers and Stores.

    It would be interesting to see the requested information for the table definitions.

    And the query plan from Profiler.

    That would be very helpful for getting better feedback.

  • Jayanth_Kurup (10/20/2015)


    With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning the table by month could help improve performance. What you need to understand in this case is that most of the performance issue you're facing is likely due to the massive IO of fetching data from disk, or to aggregating large data sets (a CPU issue). Indexes help search for and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology to be used.

    Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over a period of time.

    Careful now... Partitioning it by month won't help any more than having the proper index on a monolithic table and a proper query to take advantage of it. All else being equal, partitioning can actually slow down queries thanks to the multiple B-Trees involved behind the scenes.

    Not sure how you think moving a table to SSAS will actually improve performance.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/21/2015)


    Jayanth_Kurup (10/20/2015)


    With 100 million rows you're better off using a columnstore index or moving the table to SSAS. If the report requires data on a monthly basis, partitioning the table by month could help improve performance. What you need to understand in this case is that most of the performance issue you're facing is likely due to the massive IO of fetching data from disk, or to aggregating large data sets (a CPU issue). Indexes help search for and eliminate unwanted rows, but if you still have millions of rows after that, you will still face performance issues that require a suitable technology to be used.

    Summary tables are a good solution, but you will lose granularity and might need to create many different summaries at different groupings for different reports, and this can become unmanageable over a period of time.

    Careful now... Partitioning it by month won't help any more than having the proper index on a monolithic table and a proper query to take advantage of it. All else being equal, partitioning can actually slow down queries thanks to the multiple B-Trees involved behind the scenes.

    Not sure how you think moving a table to SSAS will actually improve performance.

    Partitioning is more for having sliding windows of data for archiving. It's surprising how many think it will improve performance.

    If users are dumping details for all customers and all order lines anyway, you would be missing the whole point of SSAS.

    Although just to get a cube set up and running, they would have to fix the core performance issue.

    100 million rows in the order table.

    100 million users who can have multiple orders per day.

    Millions of rows per day.

    Without a cross join between the tables, this just doesn't make sense to me.

    Hope the poster comes back with some details.
