Historical Data Reporting

  • Hi

    Ok here is my scenario.

    I need to generate reports for my client. The application is a bidding portal where buyers post their projects and service providers bid on those projects. Now my client needs certain reports from both a current and a historical perspective. We are using SQL Server 2008 Express for the application, and I am using Reporting Services and a Business Intelligence Development Studio report project to design the reports. Report design is no issue; it's getting the data for my reports that is causing confusion.

    My client is interested in knowing statistics such as: how many projects were posted in a date interval, how many projects had active status during a time interval, a list of bidders with the most completed jobs during an interval, and a count of users submitting their gender or age in their profile during an interval. For this I think I need to keep some sort of historical table. But for now I am looking for best practices and strategies that should be followed. I must tell you that there is currently no logging or auditing in the application.

    I am currently going with this approach:

    I would first identify the statistics that my client needs for the reports, such as the number of open missions in a date interval. I would build some historical tables with these statistics as columns, and I would create a job to fetch data from the live db on a nightly basis. My scheduled job would be responsible for building these statistics and then populating the historical tables. OK, so far so good: I can show how many projects were open or who was the top service provider on a given date. But what if in the future my client wants to go back to a date and see the details of those active projects [such as what the title of each active project was, or who created them, etc.]? The historical table does not contain that information -- stuck 🙁
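
    To make the approach concrete, here is roughly the kind of historical table and nightly insert I have in mind (all table and column names here are just placeholders, not my real schema):

        DECLARE @Yesterday DATE = DATEADD(DAY, -1, CAST(GETDATE() AS DATE));

        CREATE TABLE dbo.DailyStats (
            StatDate        DATE NOT NULL PRIMARY KEY,
            ProjectsPosted  INT  NOT NULL,
            ProjectsActive  INT  NOT NULL
        );

        -- The nightly job computes one row of statistics from the live tables.
        INSERT INTO dbo.DailyStats (StatDate, ProjectsPosted, ProjectsActive)
        SELECT @Yesterday,
               (SELECT COUNT(*) FROM dbo.Project
                WHERE CreatedDate >= @Yesterday
                  AND CreatedDate <  DATEADD(DAY, 1, @Yesterday)),
               (SELECT COUNT(*) FROM dbo.Project WHERE Status = 'Active');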

    First of all, tell me if this is the right approach, and if it is, should I create similar tables based on the tables used in the live db and then, with my nightly service, fetch data from those tables and simply dump it into the historical tables?

    I will be very thankful if anyone can help me in this regard.

    That's the right approach if your client and their business model just want the historical aggregation information. If they want more, as you say, you're stuck. I can't tell you that you're on the right track; you need to talk to your client and clarify exactly what they want. If they're anything like people I've worked with, they want everything. So then you have to start telling them the trade-offs. For example, you can give them historical aggregates, but they can't get historical detail data. If they want historical data, then you have to determine, with the client, how often the data changes and how much additional storage will be needed to support keeping it. If they're willing to pay for the storage, then you design the system that keeps the information.

    Since you're using Express, you're somewhat limited in how you can implement that (I don't think Change Data Capture is available in Express, but I could be wrong). This means you'll have to build some method of historical tracking. You can do this a number of ways. I've been using a system where you create a version ID and then do nothing but inserts into all tables in the system; each set of changes gets one version ID across all tables. Then, to get the latest row, you select the top version ID that is equal to or less than the top version for the widget you're retrieving. But there are lots of ways you can go. The key is: work with the client, explain the trade-offs to them, and get their buy-in on the design.
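
    To illustrate the version-ID idea, here's a minimal sketch with made-up table and column names; don't take it as a finished design:

        CREATE TABLE dbo.ProjectVersion (
            ProjectId  INT NOT NULL,
            VersionId  INT NOT NULL,
            Title      NVARCHAR(200) NOT NULL,
            Status     VARCHAR(20) NOT NULL,
            ChangeDate DATETIME NOT NULL DEFAULT GETDATE(),
            CONSTRAINT PK_ProjectVersion PRIMARY KEY (ProjectId, VersionId)
        );

        -- Every change is an INSERT with the next version ID; nothing is
        -- ever updated, so the full history is preserved. To see each
        -- project as of a given version:
        DECLARE @AsOfVersion INT = 42;

        SELECT pv.ProjectId, pv.Title, pv.Status
        FROM dbo.ProjectVersion AS pv
        WHERE pv.VersionId = (SELECT MAX(pv2.VersionId)
                              FROM dbo.ProjectVersion AS pv2
                              WHERE pv2.ProjectId = pv.ProjectId
                                AND pv2.VersionId <= @AsOfVersion);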

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi

    Thanks for your response. For now I know that they are looking for historical aggregates. If they want to see more data, I will probably convince them to have logging implemented in the application, using either a custom solution or Change Data Capture. So I think I will go with the approach of using historical tables for aggregates, and then my reports will just be a static view of the data fetched from these tables. One more thing: will it benefit me if I keep these historical tables in a separate database? Presumably my primary database would then be free from the load of fetching data for reports?

    Thanks & regards

    Hard to say; I don't know your systems or the volumes involved. I would only split to another database if I thought the load required it.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey

    You need to talk to your client and clarify exactly what they want. If they're anything like people I've worked with, they want everything. So then you have to start telling them the trade-offs.

    In addition to what Grant has said, I would think that you should reduce everything to writing: not tech speak, but clear, precise English (or whatever is your and your client's native language). Otherwise you can run into the problem of "I think I told you ..." and, in turn, "I thought I heard you say ...". I have found it amazing how much it helps to reduce to writing what is to be accomplished and, equally important, what is NOT going to be accomplished, and then have the client read it and, if they approve, sign it.

    In my experience it takes a minimum of one rewrite, and most times two or three, before each party fully understands what product is to be delivered.

    If everything seems to be going well, you have obviously overlooked something.

    Ron


    Thanks to both of you for your responses. I just want to know your opinion regarding having denormalized tables in such a scenario; what is the standard practice here? I think I would prefer going with a denormalized design for my historical tables. For instance, in my case I need to show the user with the most completed jobs, the user with the most fans, and the user with the most reviews. So I would probably have a flat table with a UserId column repeated for each of my aggregate indicators.

    What do you guys think? I'd appreciate your input on this.

    Regards

    It really depends on the needs of the project. I tend to shy away from denormalization as a plan. I usually plan on normalizing with a standard design, or using a star schema or something like that for reporting systems. I wouldn't advocate planning for a denormalized approach.

    That said, depending on the data, yes, that might be viable. Since it's an aggregate, I wouldn't think of it so much as "denormalized" as a different structure. For example, let's say you have an ID, a Description, and then you're going to put in aggregations about those fields. You don't need to put the description into the table with the aggregates. You can still just link to the ID and pull the description out with a join, as in the sketch below.
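
    A rough sketch of what I mean (the names are made up): the aggregate table carries only the ID and the numbers, and the descriptive columns come from the source table at report time.

        CREATE TABLE dbo.UserDailyStats (
            UserId        INT  NOT NULL,
            StatDate      DATE NOT NULL,
            CompletedJobs INT  NOT NULL,
            FanCount      INT  NOT NULL,
            ReviewCount   INT  NOT NULL,
            CONSTRAINT PK_UserDailyStats PRIMARY KEY (UserId, StatDate)
        );

        -- No user name or other descriptive columns are stored here;
        -- the join supplies them.
        SELECT u.UserName, s.CompletedJobs, s.FanCount, s.ReviewCount
        FROM dbo.UserDailyStats AS s
        JOIN dbo.Users AS u ON u.UserId = s.UserId
        WHERE s.StatDate = '20110225';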

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    Sorry to bother you again, but can you point me to ways I could ensure that my scheduled job actually builds the statistics reliably, without raising an exception, and tell me what to do in the cases where it fails? Also, my application will be live all the time; will my scheduled job create any problems for the users online, by locking the tables or anything like that? Can you provide me with links on these topics?

    Thanks & regards

    Whoa, that's a lot of information. To reliably gather the statistics, you need to write the queries that put those statistics together carefully, so that they use indexes and perform well. Keep the transactions small and tight and they shouldn't affect other users. In general, these should be small reads and writes, so you shouldn't run into problems. As for failures, the way to get around those is to have a run table that records the last time a successful run completed, so that you can retrieve data based on the last successful run.
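
    A minimal sketch of that run table idea, with made-up names:

        CREATE TABLE dbo.StatsRunLog (
            RunId   INT IDENTITY(1,1) PRIMARY KEY,
            RunEnd  DATETIME NOT NULL,
            Success BIT NOT NULL
        );

        -- The job aggregates only what changed since the last good run,
        -- so a failed night is covered by the next successful one.
        DECLARE @LastGoodRun DATETIME;
        SELECT @LastGoodRun = MAX(RunEnd) FROM dbo.StatsRunLog WHERE Success = 1;
        SET @LastGoodRun = ISNULL(@LastGoodRun, '19000101');  -- first run ever

        SELECT p.CreatedBy, COUNT(*) AS NewProjects
        FROM dbo.Project AS p
        WHERE p.CreatedDate > @LastGoodRun
        GROUP BY p.CreatedBy;

        -- Record the successful run only after the statistics are written.
        INSERT INTO dbo.StatsRunLog (RunEnd, Success) VALUES (GETDATE(), 1);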

    Like I said, that's a lot of stuff compressed into a few lines. Links? No, nothing particular comes to mind. We're talking queries, indexes, tuning, scheduling, pretty much a lot of what you have to do managing SQL Server. I'd say as you hit problems, post questions.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    OK, so I am faced with a situation now. Users can create missions and others can bid on them. Missions and bids can both be suspended by the administrators. My client needs to be able to see who the top bidder is, or who has created the most missions, during a date range. Now say A and B are two users. A posts 3 missions and B posts 2 missions on a particular date, say 25th Feb 2011. My service runs at night; it counts the missions created on the 25th by each user, and hence A comes out the winner. Now on the 26th neither A nor B posts any new missions, but an admin suspends 2 of A's missions. So if the report is run for the interval from the 25th to the 26th, then clearly B should be the one with the most missions created [suspended missions are not to be taken into account].

    To produce this result, my approach is to count all created missions, as well as all suspended and unsuspended missions, on a given date, and keep these calculated values along with the users' IDs in my historical table. Then, when the report is run for a duration, I will calculate [Created Missions minus Suspended Missions plus Unsuspended Missions] from the historical table for that duration, for any user who has data in that interval. Something like the sketch below is what I have in mind.
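
    (The table and column names here are just placeholders.)

        CREATE TABLE dbo.UserMissionDaily (
            UserId      INT  NOT NULL,
            StatDate    DATE NOT NULL,
            Created     INT  NOT NULL DEFAULT 0,
            Suspended   INT  NOT NULL DEFAULT 0,
            Unsuspended INT  NOT NULL DEFAULT 0,
            CONSTRAINT PK_UserMissionDaily PRIMARY KEY (UserId, StatDate)
        );

        -- Net missions per user over the reporting interval; in the A/B
        -- example this gives A = 3 - 2 = 1 and B = 2, so B wins.
        SELECT d.UserId,
               SUM(d.Created - d.Suspended + d.Unsuspended) AS NetMissions
        FROM dbo.UserMissionDaily AS d
        WHERE d.StatDate BETWEEN '20110225' AND '20110226'
        GROUP BY d.UserId
        ORDER BY NetMissions DESC;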

    Now my first concern is that with a growing number of missions and users, this job will take long to execute, and my second concern is whether I am doing this right or wrong? :cool:

    Or maybe I should calculate this data at run time, because I have the created date in the mission table and the date column in the action table, which keeps track of the actions happening on a mission, such as the mission getting suspended and unsuspended, etc.

    I'd really appreciate it if you can help me out here.

    Thanks & regards

    wildfairy1 (2/24/2011)

    I'll try to help, but understand, not sitting at your desk, I can't see all that you see or know all that you know.

    In general, when I see a nightly process that does "all" of something, it's a red flag. As you say, as the amount of data increases, the processing time to recover "all" will go up. Instead, I'd strongly recommend you have a mechanism for identifying which bidders have had a change, and only update their counts. You should make sure that you're doing that as a batch, not as a cursor or a loop; see the sketch below. That should minimize the amount of data that you have to process on any given night, since, I'm assuming, it's unlikely that every client in the system performs updates every day. If that's a bad assumption, find some other approach to limit the data set that you have to update.
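
    Something along these lines (the names are placeholders, and in a real system you'd also handle users who don't have a stats row yet, perhaps with a MERGE):

        DECLARE @LastGoodRun DATETIME = '20110225';  -- from your run log

        -- One set-based statement updates only the users whose missions
        -- changed since the last run; no cursor, no loop.
        UPDATE s
        SET s.CreatedCount = s.CreatedCount + c.NewMissions
        FROM dbo.UserMissionStats AS s
        JOIN (SELECT m.CreatedBy, COUNT(*) AS NewMissions
              FROM dbo.Mission AS m
              WHERE m.CreatedDate > @LastGoodRun
              GROUP BY m.CreatedBy) AS c
            ON c.CreatedBy = s.UserId;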

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi

    I am not sure I have understood your statement fully here, where you mention

    "You should make sure that you're doing that as a batch, not as a cursor or a loop."

    If we take the bid example here, do you actually mean that I should update the count of created bids as a batch? That would probably mean that I identify the bids added each day and then update the counts daily through a single query, instead of updating the count for each bid individually. But I believe the batch processing will still ultimately require a nightly service.

    Thanks & Regards

    wildfairy1 (2/26/2011)

    Yeah, you want to avoid any type of looping structure in SQL Server. Loops just destroy performance, and do it in a way that drags down other processes with them.
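
    For your bid example, that means one statement for the whole day's bids instead of a fetch-and-update per bid. A sketch with made-up names:

        DECLARE @RunDate DATE = '20110226';

        -- All of the day's bid counts in a single set-based update.
        UPDATE s
        SET s.BidCount = s.BidCount + b.NewBids
        FROM dbo.UserBidStats AS s
        JOIN (SELECT BidderId, COUNT(*) AS NewBids
              FROM dbo.Bid
              WHERE CreatedDate >= @RunDate
                AND CreatedDate <  DATEADD(DAY, 1, @RunDate)
              GROUP BY BidderId) AS b
            ON b.BidderId = s.UserId;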

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    OK, I will definitely avoid looping; thanks for your suggestion. One more thing I wanted to know about is the COUNT function in T-SQL. I did some searching and found that COUNT performs slowly as the number of rows increases, such as over 200,000+ rows. The way I see my queries, most of them are about counting rows based on some criteria. Have you come across any such issues with the COUNT function? Are there any alternatives? Someone suggested querying sys.partitions to get the count, but that gives the count of the total rows rather than the count for specific criteria.

    wildfairy1 (2/27/2011)

    I'm not sure where you're getting that information. COUNT is an aggregate function, the same as SUM, AVG and others. Its performance really depends on what is being aggregated. If there is an index available for it to use, the performance can be very fast. Or, if you're filtering with a WHERE clause against an index and aggregating the filtered results, again, it's very fast. If you're just performing aggregates against a heap table, without indexing or filtering, then yeah, like in any other database engine, it's likely to be slow. But, as you describe your situation, you should be able to filter the information that needs to be aggregated each night based on what was modified during the day. That's why I suggested not rebuilding the entire data set every single night.
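
    For example (the index and table names are made up), a filtered count against an index becomes a narrow range seek rather than a scan of the whole table:

        CREATE INDEX IX_Mission_CreatedDate ON dbo.Mission (CreatedDate);

        DECLARE @Day DATE = '20110227';

        SELECT COUNT(*) AS MissionsCreated
        FROM dbo.Mission
        WHERE CreatedDate >= @Day
          AND CreatedDate <  DATEADD(DAY, 1, @Day);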

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning
