How to perform Incremental loads over a set of aggregated tables

  • Hi all,

    My doubts may be silly, but I want to discuss my current task with you.

    To explain the situation I've created the following diagram:

    The system consists of a Data Hub, basically a set of tables with numerical information and contexts. The Data Hub has already received its initial load.

    The data flow is as follows:

    1. Delta data arrives: apart from the initial load, new data comes in every day.

    2. A query with some aggregations is performed.

    3. A second query with aggregations is performed.

    4. The customer table is populated with aggregated information from Table B and contextualized with information from the Data Hub (5).

    For this, I've created several dataflows. The first dataflow executes the query to fill Table A, the second fills Table B, and so on.

    My questions are the following:

    - My approach is OK for the initial load, but what happens to the aggregations when new data comes in?

    - Do I have to erase the contents of Tables A and B and run the aggregations again?

    For instance, if I want to record the average of the purchases that a particular customer makes during a month, I have to keep this value up to date. Maybe, since I receive the data on a daily basis, I can group the data by day and then the reporting guy could compute the averages in his queries.
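    Something like the following rough sketch is what I have in mind (the table and column names here are only placeholders, not my real schema):

    -- Keep a daily grain: one row per customer per day, appended incrementally
    INSERT INTO Daily_Purchases (CustomerId, PurchaseDate, TotalAmount, PurchaseCount)
    SELECT CustomerId, CAST(PurchaseDate AS date), SUM(Amount), COUNT(*)
    FROM Fact_Details
    WHERE PurchaseDate >= @LastLoadDate   -- only the new delta; @LastLoadDate kept from the previous run
    GROUP BY CustomerId, CAST(PurchaseDate AS date);

    -- The reporting queries can then derive the monthly figures on the fly
    SELECT CustomerId,
           YEAR(PurchaseDate) AS [Year],
           MONTH(PurchaseDate) AS [Month],
           AVG(TotalAmount) AS AvgDailyAmount
    FROM Daily_Purchases
    GROUP BY CustomerId, YEAR(PurchaseDate), MONTH(PurchaseDate);

    That way only the new delta is aggregated each day, and the monthly averages never have to be stored.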

    Any comment or question will be appreciated

    Kind Regards,

    Paul Hernández
  • OK, I think my initial explanation was not so good. I will try to clarify the scenario.

    Suppose that in the Data Hub I have the following table with information about customer purchases:

    Customer Id | Article | Department | Amount | Date of Purchase
    545         | Shoes   | Fashion    | 30$    | 01.07.2012
    545         | Book    | Library    | 20$    | 01.07.2012
    545         | CD      | Library    | 20$    | 01.07.2012

    From this table I load my first table with aggregated data, called Department Aggregations:

    Customer Id | Department | Amount | Date of Purchase
    545         | Fashion    | 30$    | 01.07.2012
    545         | Library    | 40$    | 01.07.2012

    I query the table above and fill a second table with aggregated data:

    Customer Id | Total Articles | Total Departments | Total Amount | Average Shopping Days
    545         | 3              | 2                  | 70$          | 1

    Finally, I load the Customer table:

    Customer Id | Telephone | Email        | City | Total Expenses
    545         | 654654    | cust@exp.com | ABC  | 70$
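
    Just to make it concrete, the initial load of the Department Aggregations table is basically a GROUP BY query like the following (the real table and column names differ, of course):

    -- Initial load: one row per customer, department and purchase date
    INSERT INTO Department_Aggregations (CustomerId, Department, Amount, PurchaseDate)
    SELECT CustomerId, Department, SUM(Amount), PurchaseDate
    FROM Fact_Details
    GROUP BY CustomerId, Department, PurchaseDate;

    The second table and the Customer table are loaded the same way, just at the customer grain (COUNT(*), COUNT(DISTINCT Department), SUM(Amount) and so on).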

    My question is: when our dear customer, Mr. 545, makes a new purchase, what is the least expensive way to update all the tables? I think an incremental approach would be good, because the current Fact_Details table contains around 165 million records and truncating and reloading it is too expensive, but I don't know how to update the aggregations.

    Any comment will be appreciated

    Kind Regards

    Paul Hernández
  • Hello Paul, I am attempting to develop a similar solution. How did you resolve this?

    Regards,

  • Hi SQLSponge,

    I recently read an article that proposes one of the best solutions I've ever seen. It is a little bit complex (technically speaking), but I suggest you read it and take your time to understand it:

    Incremental Data Loading Using CDC

    I also suggest you read the two previous articles in this series; you can find the links in the article above.

    If this solution does not persuade you, please send me the design of the tables, especially the granularity and the types of aggregations (AVG, COUNT, SUM, etc.).
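
    To give you a rough idea of the pattern described there (only a simplified sketch, with placeholder table and capture instance names; the article explains it much better):

    -- Enable CDC on the database and on the detail table (done once)
    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Fact_Details',
         @role_name     = NULL;

    -- On each incremental run, read only the changed rows and merge them into the aggregate.
    -- In a real package you would persist the last processed LSN between runs instead of using the minimum.
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Fact_Details');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    MERGE Department_Aggregations AS tgt
    USING (
        SELECT CustomerId, Department, SUM(Amount) AS Amount, PurchaseDate
        FROM cdc.fn_cdc_get_all_changes_dbo_Fact_Details(@from_lsn, @to_lsn, 'all')
        GROUP BY CustomerId, Department, PurchaseDate
    ) AS src
    ON  tgt.CustomerId = src.CustomerId
    AND tgt.Department = src.Department
    AND tgt.PurchaseDate = src.PurchaseDate
    WHEN MATCHED THEN
        UPDATE SET tgt.Amount = tgt.Amount + src.Amount   -- add the delta to the stored total
    WHEN NOT MATCHED THEN
        INSERT (CustomerId, Department, Amount, PurchaseDate)
        VALUES (src.CustomerId, src.Department, src.Amount, src.PurchaseDate);

    Keep in mind that adding the delta only works for additive aggregations (SUM, COUNT); for an AVG you would store the sum and the count separately and let the reports divide them.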

    Kind Regards,

    Paul Hernández
