better solution than using the full outer join index

  • I have three tables:

    CREATE TABLE [dbo].[Item] (

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](255) NOT NULL,

    [IDParent] [int] NOT NULL

    )

    CREATE TABLE [dbo].[Statistics](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [IDItem] [int] NOT NULL,

    [Date] [smalldatetime] NOT NULL,

    [stat1] [decimal](19, 2) NULL,

    [stat2] [decimal](19, 2) NULL,

    [stat3] [decimal](19, 2) NULL,

    )

    CREATE TABLE [dbo].[KPIs](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [IDItem] [int] NOT NULL,

    [Date] [smalldatetime] NOT NULL,

    [KPI1] [decimal](19, 2) NULL,

    [KPI2] [decimal](19, 2) NULL,

    [KPI3] [decimal](19, 2) NULL,

    )

    With a certain Item belong 0 or 1 Statistic and 0 or 1 KPI

    What i want is a query that gives me summed results of the statistics & Kpis for each Item with a certain parent (grouped by Item).

    This would look something like:

    SELECT Item.Name

    , SUM(Statistics.stat1)

    , SUM(Statistics.stat2)

    , SUM(Statistics.stat3)

    , SUM(KPIs.kpi1)

    , SUM(KPIs.kpi2)

    , SUM(KPIs.kpi3)

    FROM Items

    LEFT JOIN Statistics ON statistics.IDItem = Items.ID AND statistics.Date BETWEEN '2009-01-01' AND '2009-01-31'

    LEFT JOIN KPIs ON KPIs.IDItem = Items.ID AND KPIs.Date BETWEEN '2009-01-01' AND '2009-01-31'

    WHERE Items.IDParent = 123

    GROUP BY Item.Name

    The problem with this query is that the Statistic and KPI are not "joined", so 1 kpi record can occur 31 times if there are also 31 days of statistics for the same item.

    My next try was creating a full outer join index:

    ALTER VIEW [dbo].[V_StatsKpis]

    AS

    ISNULL(statistics.IDItem, kpis.IDItem) IDItem,

    , ISNULL(statistics.Date, kpis.Date) Date

    , Statistics.stat1

    , Statistics.stat2

    , Statistics.stat3

    , KPIs.kpi1

    , KPIs.kpi2

    , KPIs.kpi3

    FROM dbo.Statistics

    FULL OUTER JOIN KPIs ON Statistics.Date = KPIs.Date AND Statistics.IDItem = KPIs.IDItem

    And than the query would look like :

    SELECT Item.Name

    , SUM(v.stat1)

    , SUM(v.stat2)

    , SUM(v.stat3)

    , SUM(v.kpi1)

    , SUM(v.kpi2)

    , SUM(v.kpi3)

    FROM Items

    LEFT JOIN V_StatsKpis v ON v.IDItem = Items.ID AND v.Date BETWEEN '2009-01-01' AND '2009-01-31'

    WHERE Items.IDParent = 123

    GROUP BY Item.Name

    But...... this one is EXTREMELY slow. And you are not allowed to put an index on a view with full outer joins.

    So: How should i solve this problem, and the performance is EXTREMELY important. It is even so important that modifying the tablestructure is possible. But putting the statistics and kpi data in 1 table isn't possible because this data is imported from different sources.

    Some extra info:

    -in real world there are not just 3 statistics / kpis but 10 statistic columns and 15 kpi columns.

    -The Item table contains 60.000.000 rows

    -The Statistic table contains 200.000.000 rows and grows with 1.500.000 rows a day

    -The KPI table contains 50.000.000 rows and grows with 100.000 rows a day

  • I think i have found the best solution:

    SELECT Item.Name

    , stat.Stat1

    , stat.Stat2

    , stat.Stat3

    , kpi.kpi1

    , kpi.kpi2

    , kpi.kpi3

    FROM Items

    LEFT JOIN (

    SELECT

    IDItem

    , SUM(Statistics.stat1) stat1

    , SUM(Statistics.stat2) stat2

    , SUM(Statistics.stat3) stat3

    FROM Statistics

    WHERE statistics.Date BETWEEN '2009-01-01' AND '2009-01-31'

    GROUP BY IDItem

    ) stat ON stat.IDItem = Item.ID

    LEFT JOIN (

    SELECT

    IDItem

    , SUM(KPIs.kpi1) kpi1

    , SUM(KPIs.kpi2) kpi2

    , SUM(KPIs.kpi3) kpi3

    FROM KPIs

    WHERE KPIs.Date BETWEEN '2009-01-01' AND '2009-01-31'

    GROUP BY IDItem

    ) kpi ON kpi.IDItem = Item.ID

    WHERE Items.IDParent = 123

  • The solution provided here is somewhat similar to what Sander has provided, and I am not sure on whether this is going to be faster, but may be you can do a test & let us know, how these solutions performed.

    SELECTI.Name,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI1 ELSE 0 END ) AS Stat1,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI2 ELSE 0 END ) AS Stat2,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI3 ELSE 0 END ) AS Stat3,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI1 ELSE 0 END ) AS KPI1,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI2 ELSE 0 END ) AS KPI2,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI3 ELSE 0 END ) AS KPI3

    FROMItem I

    LEFT JOIN

    (

    SELECT'S' AS RecordType, IDItem,

    SUM( stat1 ) StatKPI1,

    SUM( stat2 ) StatKPI2,

    SUM( stat3 ) StatKPI3

    FROM[Statistics]

    WHERE[Date] >= '2009-01-01'

    AND [Date] <= '2009-01-31'

    GROUP BY IDItem

    UNION all

    SELECT'K' AS RecordType, IDItem,

    SUM( kpi1 ) StatKPI1,

    SUM( kpi2 ) StatKPI2,

    SUM( kpi3 ) StatKPI3

    FROMKPIs

    WHERE[Date] >= '2009-01-01'

    AND [Date] <= '2009-01-31'

    GROUP BY IDItem

    ) SK ON I.ID = SK.IDItem

    WHEREI.IDParent = 123

    GROUP BY I.NAME

    --Ramesh


  • Ramesh (2/9/2009)


    The solution provided here is somewhat similar to what Sander has provided, and I am not sure on whether this is going to be faster, but may be you can do a test & let us know, how these solutions performed.

    SELECTI.Name,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI1 ELSE 0 END ) AS Stat1,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI2 ELSE 0 END ) AS Stat2,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI3 ELSE 0 END ) AS Stat3,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI1 ELSE 0 END ) AS KPI1,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI2 ELSE 0 END ) AS KPI2,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI3 ELSE 0 END ) AS KPI3

    FROMItem I

    LEFT JOIN

    (

    SELECT'S' AS RecordType, IDItem,

    SUM( stat1 ) StatKPI1,

    SUM( stat2 ) StatKPI2,

    SUM( stat3 ) StatKPI3

    FROM[Statistics]

    WHERE[Date] >= '2009-01-01'

    AND [Date] <= '2009-01-31'

    GROUP BY IDItem

    UNION all

    SELECT'K' AS RecordType, IDItem,

    SUM( kpi1 ) StatKPI1,

    SUM( kpi2 ) StatKPI2,

    SUM( kpi3 ) StatKPI3

    FROMKPIs

    WHERE[Date] >= '2009-01-01'

    AND [Date] <= '2009-01-31'

    GROUP BY IDItem

    ) SK ON I.ID = SK.IDItem

    WHEREI.IDParent = 123

    GROUP BY I.NAME

    Interesting solution. But i don't see this working in my situation. In real world the columns of the statistics table and the kpi table are very different. So union them to 1 table won't work, or i have to use lots of NULL columns.

    I don't think the performance will be very different than the solution i came up with. Since the affected rows will be the same in both situations, the only difference is that i don't have to do 6 cases for every row (and there can be up to 1.000.000 rows)

    But i thank you very much for thinking of a solution!!!

  • Is the Date column in the Stats table relevant to the sum you want to do? And does it contain time values, or just dates?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The date is relevant because you have to select the sum of the statistics / kpis in a certain timerange.

    Date is of type smalldatetime and contains just the datepart

  • Assuming (from the forum you posted in) that you're using SQL 2005, this should get you what you want:

    ;with

    [Stats] (ItemID, Stat1, Stat2, Stat3) as

    (select ItemID, sum(Stat1), sum(Stat2), sum(Stat3)

    from dbo.Statistics

    where ItemID in

    (select ItemID

    from dbo.Items

    where IDParent = 123)

    and Date between '2009-01-01' and '2009-01-31'

    group by ItemID),

    Ks (ItemID, KPI1, KPI2, KPI3) as

    (select ItemID, sum(KPI1), sum(KPI2), sum(KPI3)

    from dbo.KPIs

    where ItemID in

    (select ItemID

    from dbo.Items

    where IDParent = 123)

    and Date between '2009-01-01' and '2009-01-31'

    group by ItemID)

    select ItemName, Stat1, Stat2, Stat3, KPI1, KPI2, KPI3

    from dbo.Items

    left outer join [Stats]

    on Items.ItemID = [Stats].ItemID

    left outer join Ks

    on Items.ItemsID = Ks.ItemID

    where IDParent = 123;

    If it's actually SQL 2000, you'll need to modify the CTEs into derived tables in the From clause.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Sander A. (2/9/2009)


    Interesting solution. But i don't see this working in my situation. In real world the columns of the statistics table and the kpi table are very different. So union them to 1 table won't work, or i have to use lots of NULL columns.

    Ramesh has selected a key and aggregate columns which match in the two UNIONed queries. It doesn't matter how different the two (or more) source tables are - provided that you find some common ground between them, you can UNION them.

    Sander A. (2/9/2009)


    I don't think the performance will be very different than the solution i came up with. Since the affected rows will be the same in both situations, the only difference is that i don't have to do 6 cases for every row (and there can be up to 1.000.000 rows)

    He isn't - he's performing six CASEs on a filtered, preaggregated subset - exactly where it should be for top performance πŸ˜‰

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Sander A. (2/9/2009)


    Interesting solution. But i don't see this working in my situation. In real world the columns of the statistics table and the kpi table are very different. So union them to 1 table won't work, or i have to use lots of NULL columns.

    ...I agree with the real world situation, but I am also sure (somewhat) that the no. of. statistics/KPIs are limited (say 50-70 max, that I am thinking..), which can be converted to UNION ALL query. And also you are working with large amount data, So its worth doing a test.

    I don't think the performance will be very different than the solution i came up with. Since the affected rows will be the same in both situations, the only difference is that i don't have to do 6 cases for every row (and there can be up to 1.000.000 rows)

    I'd better like to see a performance test:)..., And I know if it fares well, you don't mind typing too many case statements...:D

    BTW, I was thinking you & Sander are two different people:hehe:, don't know why??

    Edit:

    I was doing some other stuff while replying to it, so I didn't see the earlier replies....

    Thanks Chris for the explanation...

    --Ramesh


  • Ramesh (2/9/2009)I'd better like to see a performance test:)..., And I know if it fares well, you don't mind typing too many case statements...:D

    BTW, I was thinking you & Sander are two different people:hehe:, don't know why??

    I'm doing a performance test right now. But first i have to find a good subset of data and rewrite the query 3 times to test all solutions that where given in this topic. When i'm done (probably tomorrow, since my workingday is done in a half hour πŸ˜› ) i will post the results here.

    Me & Sander being the same person : well, it was indeed a bit confusing because i posted the solution to my own question. So it isn't that strange πŸ˜‰

  • Oke, the results of the performance test. I tested this on 40.000 items with a difference of 30 days between the start and enddate (so that will be around 1.000.000 statistics and 100.000 kpis)

    During the test i found a possible improvement of my own query, in stead of:

    LEFT JOIN (

    SELECT

    IDItem

    , SUM(Statistics.stat1) stat1

    FROM Statistics

    WHERE statistics.Date BETWEEN '2009-01-01' AND '2009-01-31'

    GROUP BY IDItem

    ) stat ON stat.IDItem = Item.ID

    it's faster to use:

    LEFT JOIN (

    SELECT

    IDItem

    , SUM(Statistics.stat1) stat1

    FROM Statistics

    WHERE statistics.Date BETWEEN '2009-01-01' AND '2009-01-31'

    AND ItemID in (select ItemID from dbo.Items where IDParent = 123)

    GROUP BY IDItem

    ) stat ON stat.IDItem = Item.ID

    I ran the query several times and used the average execution time:

    My own solution : 2330 ms

    My improved solution : 2120 ms

    Ramesh solution : 2090 ms

    GSquared solution : 1890 ms

    So GSquare is the winner !! Now i "only" have to refactor my complete java-Querybuilder .....:w00t:

    thnx for your solutions

  • Can you re-do the test with the same update applying on my version of the query & post back the results? 'Cause I've a strong feelin' that it should work far better than you have mentioned.

    And I know, Gus, you can prove me wrong:), if you wanted to....

    --Ramesh


  • Ramesh, it's easy enough to test performance on a simplified version of this.

    create table #Stats (

    ID int identity primary key,

    ItemID int,

    Date smalldatetime,

    Stat1 float,

    Stat2 float,

    Stat3 float);

    create table #KPIs (

    ID int identity primary key,

    ItemID int,

    Date smalldatetime,

    KPI1 float,

    KPI2 float,

    KPI3 float);

    create table #Items (

    ID int identity primary key,

    Name char(10));

    insert into #Items (Name)

    select number

    from dbo.Numbers

    where number between 1 and 10;

    ;with Dates (Date) as

    (select dateadd(day, number-1, '1/1/09')

    from dbo.numbers

    where number between 0 and 31)

    insert into #Stats (ItemID, Date, Stat1, Stat2, Stat3)

    select number, date, checksum(newid()), checksum(newid()), checksum(newid())

    from dbo.numbers

    cross join dates

    where number between 1 and 10;

    ;with Dates (Date) as

    (select dateadd(day, number-1, '1/1/09')

    from dbo.numbers

    where number between 0 and 31)

    insert into #KPIs (ItemID, Date, KPI1, KPI2, KPI3)

    select number, date, checksum(newid()), checksum(newid()), checksum(newid())

    from dbo.numbers

    cross join dates

    where number between 1 and 10;

    set nocount on;

    set statistics io on;

    set statistics time on;

    SELECT I.Name,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI1 ELSE 0 END ) AS Stat1,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI2 ELSE 0 END ) AS Stat2,

    SUM( CASE WHEN SK.RecordType = 'S' THEN SK.StatKPI3 ELSE 0 END ) AS Stat3,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI1 ELSE 0 END ) AS KPI1,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI2 ELSE 0 END ) AS KPI2,

    SUM( CASE WHEN SK.RecordType = 'K' THEN SK.StatKPI3 ELSE 0 END ) AS KPI3

    FROM #Items I

    LEFT JOIN

    (

    SELECT 'S' AS RecordType, ItemID,

    SUM( stat1 ) StatKPI1,

    SUM( stat2 ) StatKPI2,

    SUM( stat3 ) StatKPI3

    FROM #Stats [Statistics]

    GROUP BY ItemID

    UNION all

    SELECT 'K' AS RecordType, ItemID,

    SUM( kpi1 ) StatKPI1,

    SUM( kpi2 ) StatKPI2,

    SUM( kpi3 ) StatKPI3

    FROM #KPIs KPIs

    GROUP BY ItemID

    ) SK ON I.ID = SK.ItemID

    GROUP BY I.NAME;

    ;with

    [Stats] (ItemID, Stat1, Stat2, Stat3) as

    (select ItemID, sum(Stat1), sum(Stat2), sum(Stat3)

    from #Stats

    group by ItemID),

    Ks (ItemID, KPI1, KPI2, KPI3) as

    (select ItemID, sum(KPI1), sum(KPI2), sum(KPI3)

    from #KPIs

    group by ItemID)

    select Name, Stat1, Stat2, Stat3, KPI1, KPI2, KPI3

    from #Items

    left outer join [Stats]

    on #Items.ID = [Stats].ItemID

    left outer join Ks

    on #Items.ID = Ks.ItemID;

    If you run the above and look at the actual execution plans for both of the final query versions, you'll find that the CTE version has less steps and less work.

    On my computer, the first query took 152 milliseconds total, 140 milliseconds of CPU time, while the second took 13 milliseconds total and 16 milliseconds of CPU time (better parallelism). They had identical IO stats.

    Of course, with more rows, the possibilities of various indexes being involved, etc., the execution plans will change, but the complexity of them will still have about the same ratio.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The total calculation works now, but now i'm comming to the history screens and again i'm running into a problem (i'm feeling quite dumb that i can't figure this out but after trying several thins i'm runnig out of well performing options):

    First of all, the tables again

    CREATE TABLE [dbo].[Item] (

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [Name] [nvarchar](255) NOT NULL,

    [IDParent] [int] NOT NULL

    )

    CREATE TABLE [dbo].[Statistics](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [IDItem] [int] NOT NULL,

    [Date] [smalldatetime] NOT NULL,

    [stat1] [decimal](19, 2) NULL,

    [stat2] [decimal](19, 2) NULL,

    [stat3] [decimal](19, 2) NULL,

    )

    CREATE TABLE [dbo].[KPIs](

    [ID] [int] IDENTITY(1,1) NOT NULL,

    [IDItem] [int] NOT NULL,

    [Date] [smalldatetime] NOT NULL,

    [KPI1] [decimal](19, 2) NULL,

    [KPI2] [decimal](19, 2) NULL,

    [KPI3] [decimal](19, 2) NULL,

    )

    Now i don't want a total query, but i want a history for a certain item. The query i wrote looks something like this:

    SELECT i.Name

    , s.stat1

    , s.stat2

    , s.stat3

    , k.kpi1

    , k.kpi2

    , k.kpi3

    FROM Items i

    LEFT JOIN Statistics s ON s.IDItem = i.ID

    LEFT JOIN KPIs k ON k.IDItem = i.ID

    WHERE i.ID = 100

    AND s.Date = k.Date

    The problem in this case is that there hasn't to be a statistic or kpi record for every day. But because of the s.Date = k.Date only the days which have both statistics and kpis are shown in the result.

    How can i fix this problem??

  • SELECTi.Name, s.stat1, s.stat2, s.stat3, k.kpi1, k.kpi2, k.kpi3

    FROMItems i

    INNER JOIN

    (

    SELECTISNULL( s.IDItem, k.IDItem ) AS IDItem,

    s.stat1, s.stat2, s.stat3, k.kpi1, k.kpi2, k.kpi3

    FROM[STATISTICS] s

    FULL OUTER JOIN KPIs k ON s.IDItem = k.IDItem AND s.Date = k.Date

    WHEREs.IDItem = 100 OR k.IDItem = 100

    ) ks ON i.ID = ks.IDItem

    WHERE i.ID = 100

    --Ramesh


Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply