Multi-column, single result aggregation

  • Hey folks, hoping I can shortcut the full test harness thing and see if anyone's done some multi-column (to single result) aggregation recently. A quickie example:

    CREATE TABLE #tmp
    (tID INT IDENTITY(1, 1),
     Acct_ID INT,
     Category VARCHAR(20),
     Value1 MONEY,
     Value2 MONEY
    )
    GO

    SELECT
        Acct_ID,
        SUM( SumMe) AS Value
    FROM
        (SELECT
            Acct_ID,
            CASE WHEN Category IN ( 'a', 'bdq', 'rr') THEN Value1
                 WHEN Category IN ( 'zz', 'xx', 'qrq') THEN Value2
                 ELSE 0
            END AS SumMe
         FROM
            #tmp
        ) AS drv
    GROUP BY
        Acct_ID

    What I'm dealing with is variable entry cycles where a 'positive increase to an account' could land in either Value1 or Value2 depending on the category, because the category may apply to another account (either taking money from, or sending money to, the account I'm looking at).

    The real query is wrapped around dates and a few other conditions, but my primary concern is that this is a 60-million-plus row recordset that's wide. While I can control the indexing, I'm hoping someone will share their research/results/links to a well-defined, already-done, generic optimization in this style. I just can't be the first person to bang my head on this wall. 🙂
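
    Just so the "I can control the indexing" part isn't hand-waving, the rough shape I'd expect to try first against the toy table is below. Column names come from the sample above; it's purely illustrative, not the production index:

    -- Covering index for the per-account aggregation: Category feeds the CASE,
    -- Value1/Value2 are only read, so they ride along in INCLUDE.
    CREATE NONCLUSTERED INDEX IX_tmp_Acct_Category
        ON #tmp (Acct_ID, Category)
        INCLUDE (Value1, Value2);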


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Generic optimizations are difficult to give since how long something takes will vary depending on the schema. One thing that may come in handy is using a CTE to handle some of this. You may be able to get the sums of the values separately that way and then sum them after that.

    declare @Tmp table (tID INT IDENTITY(1, 1),
        Acct_ID INT,
        Category VARCHAR(20),
        Value1 MONEY,
        Value2 MONEY
    );

    with sum_stage as
        (select acct_id, 'value1' [col], sum(value1) [value] from @Tmp where category in ('a', 'bdq', 'rr')
         group by acct_id
         union all
         select acct_id, 'value2' [col], sum(value2) [value] from @Tmp where category in ('zz', 'xx', 'qrq')
         group by acct_id)
    select acct_id, sum(value) [value] from sum_stage
    group by acct_id;

    If you're taking in an account ID as a parameter then that can be included in the CTE. Or, since you were a little vague on the description, this may end up being too difficult to do in your environment depending on the number of columns.
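
    Something like this is what I mean, purely as an untested sketch (the @Acct_ID parameter name is made up):

    -- Same staging CTE as above, but filtered to a single account up front
    declare @Acct_ID int;
    set @Acct_ID = 42;  -- hypothetical value

    with sum_stage as
        (select acct_id, sum(value1) [value] from @Tmp
         where acct_id = @Acct_ID and category in ('a', 'bdq', 'rr')
         group by acct_id
         union all
         select acct_id, sum(value2) [value] from @Tmp
         where acct_id = @Acct_ID and category in ('zz', 'xx', 'qrq')
         group by acct_id)
    select acct_id, sum(value) [value] from sum_stage
    group by acct_id;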

  • cfradenburg (9/16/2010)


    Generic optimizations are difficult to give since how long something takes will vary depending on the schema.

    Yeah, I know, it's painful, but I was figuring this had been done a billion times and I could just apply it.

    One thing that may come in handy is using a CTE to handle some of this. You may be able to get the sums of the values separately that way and then sum them after that.

    declare @Tmp table (tID INT IDENTITY(1, 1),
        Acct_ID INT,
        Category VARCHAR(20),
        Value1 MONEY,
        Value2 MONEY
    );

    with sum_stage as
        (select acct_id, 'value1' [col], sum(value1) [value] from @Tmp where category in ('a', 'bdq', 'rr')
         group by acct_id
         union all
         select acct_id, 'value2' [col], sum(value2) [value] from @Tmp where category in ('zz', 'xx', 'qrq')
         group by acct_id)
    select acct_id, sum(value) [value] from sum_stage
    group by acct_id;

    If you're taking in an account ID as a parameter then that can be included in the CTE. Or, since you were a little vague on the description, this may end up being too difficult to do in your environment depending on the number of columns.

    Yeah, the problem is I already know I'm scanning the index/table I'll be working from because, as your last comment indicates, I'm not passing any form of filtering other than some date work. This is for an ETL process and I'm working off the original schema, which is vendor controlled. I'm trying to keep it simple because, instead of ending up lost in the details, I was hoping to find a good baseline. The inside of that WITH statement is equivalent to what I'm looking at at the moment, with a wrapper SUM on the outside that's pretty cheap.

    I'm still building out the general first pass against the structure. Once I do I'll post up my code/structures/etc. I do appreciate the initial attempt, however. Thank you.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Would this work as an easy baseline?

    SELECT
        Acct_ID,
        SUM( CASE WHEN Category IN ( 'a', 'bdq', 'rr') THEN Value1
                  WHEN Category IN ( 'zz', 'xx', 'qrq') THEN Value2
                  ELSE 0
             END) AS Value
    FROM #tmp
    GROUP BY Acct_ID

    It still performs a table scan and the execution plan is identical to your original query.

    All that's changed is syntax (maybe readability...).

    So I'm not sure if it will be of any value for you...:unsure:
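
    If you want to see it on your side rather than take my word for it, running both versions with the IO and time counters switched on (on a test copy) will show whether anything actually differs:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- run the original derived-table version, then the SUM(CASE...) version,
    -- and compare logical reads / CPU time in the Messages tab

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;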



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • Yeah, Lutz, pretty much the same result, but I see what you're driving at. I had originally been tempted not to put in any code at all and just ask for resources, but I knew some folks would get a better idea from a sample. 🙂

    Here's a sample of what I'm trying to optimize (realize it's incomplete and this isn't the final thing I'd want to optimize, but it shows the beginning of the complexity). I believe the problem is I was hoping to find a few ideas towards the principles of multi-field aggregation optimization.

    CREATE TABLE [dbo].[vPortfolioTransaction](
        [PortfolioID] [int] NOT NULL,
        [PortfolioTransactionID] [int] NOT NULL,
        [TradeDate] [datetime] NULL,
        [SequenceNo] [int] NOT NULL,
        [RecID] [tinyint] NOT NULL,
        [TransactionCode] [char](2) NOT NULL,
        [TranCodeLabel] [varchar](50) NULL,
        [Comment] [varchar](70) NULL,
        [SecTypeCode1] [char](2) NULL,
        [SecurityID1] [int] NULL,
        [SettleDate] [datetime] NULL,
        [OriginalCostDate] [datetime] NULL,
        [Quantity] [float] NULL,
        [ClosingMethodCode] [char](1) NULL,
        [SecTypeCode2] [char](2) NULL,
        [SecurityID2] [int] NULL,
        [TradeDateFX] [float] NULL,
        [SettleDateFX] [float] NULL,
        [OriginalFX] [float] NULL,
        [MarkToMarket] [bit] NULL,
        [TradeAmount] [float] NULL,
        [OriginalCost] [float] NULL,
        [WithholdingTax] [float] NULL,
        [ExchangeID] [tinyint] NULL,
        [ExchangeFee] [float] NULL,
        [Commission] [float] NULL,
        [ImpliedCommission] [bit] NULL,
        [OtherFees] [float] NULL,
        [CommissionPurposeID] [tinyint] NOT NULL,
        [IsPledge] [bit] NULL,
        [CustodianID] [int] NULL,
        [IsDestPledge] [bit] NULL,
        [DestCustodianID] [int] NULL,
        [OriginalFace] [float] NULL,
        [YieldOnCost] [float] NULL,
        [DurationOnCost] [float] NULL,
        [TransUserDef1ID] [tinyint] NOT NULL,
        [TransUserDef2ID] [tinyint] NOT NULL,
        [TransUserDef3ID] [tinyint] NOT NULL,
        [TranID] [int] NULL,
        [IPCounter] [varchar](16) NULL,
        [SourceID] [int] NULL,
        [PostDate] [datetime] NULL,
        [LotNumber] [smallint] NULL,
        [ReclaimAmount] [float] NULL,
        [StrategyID] [int] NULL,
        [RecordDate] [datetime] NULL,
        [DivTradeDate] [datetime] NULL,
        [PerfContributionOrWithdrawal] [bit] NULL,
        [VersusDate] [datetime] NULL,
        [FeePeriodDate] [datetime] NULL,
        [BrokerFirmID] [int] NULL,
        [BrokerRepSecurityID] [int] NULL,
        [TradeBlotterLineID] [int] NULL
    ) ON [PRIMARY]

    The initial aggregation code I'm currently working off of (I'm still fleshing out the logic, trying to reduce the number of passes, etc.):

    (Note to self: test even half-done code before posting. Sorry, it's better now...)

    DECLARE @DataUntilDate DATETIME
    SET @DataUntilDate = GETDATE()

    SELECT
        ISNULL( drvPos.pID, drvNeg.pID) AS pID,
        ISNULL( drvPos.sID, drvNeg.sID) AS sID,
        ISNULL( drvPos.qty, 0) - ISNULL( drvNeg.qty, 0) AS qty,
        ISNULL( drvPos.amt, 0) - ISNULL( drvNeg.amt, 0) AS amt
    FROM
        (SELECT
            pID,
            sID,
            ISNULL( SUM( qty), 0) AS qty,
            SUM( amt) AS amt
         FROM
            (SELECT
                portfolioID AS pID,
                securityID1 AS sID,
                SUM( CASE WHEN sectypecode1 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END) AS qty,
                SUM( TradeAmount) AS amt
             FROM
                vPortfolioTransaction
             WHERE
                transactionCode IN ( 'ti', 'li', 'dp', 'by', 'cs')
                AND tradeDate <= @DataUntilDate
             GROUP BY
                portfolioID,
                securityID1
             UNION ALL
             SELECT
                PortfolioID,
                securityID2,
                SUM( CASE WHEN sectypecode2 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END),
                SUM( TradeAmount)
             FROM
                vPortfolioTransaction
             WHERE
                transactionCode IN ( 'sl', 'wd', 'in', 'dv')
                AND tradeDate <= @DataUntilDate
             GROUP BY
                portfolioID,
                securityID2
            ) AS drv
         GROUP BY
            pID,
            sID
        ) AS drvPos
    FULL OUTER JOIN
        (SELECT
            pID,
            sID,
            ISNULL( SUM( qty), 0) AS qty,
            SUM( amt) AS amt
         FROM
            (SELECT
                portfolioID AS pID,
                securityID1 AS sID,
                SUM( CASE WHEN sectypecode1 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END) AS qty,
                SUM( TradeAmount) AS amt
             FROM
                vPortfolioTransaction
             WHERE
                transactionCode IN ( 'to', 'lo', 'wd', 'sl', 'ss')
                AND tradeDate <= @DataUntilDate
             GROUP BY
                portfolioID,
                securityID1
             UNION ALL
             SELECT
                PortfolioID,
                securityID2,
                SUM( CASE WHEN sectypecode2 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END),
                SUM( TradeAmount)
             FROM
                vPortfolioTransaction
             WHERE
                transactionCode IN ( 'by', 'dp')
                AND tradeDate <= @DataUntilDate
             GROUP BY
                portfolioID,
                securityID2
            ) AS drv
         GROUP BY
            pID,
            sID
        ) AS drvNeg
        ON drvPos.pID = drvNeg.pID
        AND drvPos.sID = drvNeg.sID
    ORDER BY
        ISNULL( drvPos.pID, drvNeg.pID),
        ISNULL( drvPos.sID, drvNeg.sID)

    No, it's not pretty, and I'm sure I'll come up with ways to reduce the number of passes, but this gives you an idea of where I'm going with it. It will end up a lot more complex, which is part of why I was hoping for help with generic principles. :blink:
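
    For what it's worth, the kind of index I'm planning to experiment with under this looks something like the following. It's only a sketch against the vendor schema above, nothing I've settled on or tested yet:

    -- TransactionCode and TradeDate carry the only filters; everything else the
    -- aggregation touches rides along in INCLUDE so the wide base table isn't hit.
    CREATE NONCLUSTERED INDEX IX_vPortfolioTransaction_TranCode_TradeDate
        ON dbo.vPortfolioTransaction (TransactionCode, TradeDate)
        INCLUDE (PortfolioID, SecurityID1, SecurityID2,
                 SecTypeCode1, SecTypeCode2, Quantity, TradeAmount);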


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • What I'd basically do is to move the decision whether to use securityID1 or securityID2 into a CASE statement instead of a UNION to avoid running a table scan/seek more often than needed.

    I'm not sure if the FULL JOIN can be replaced with a CASE statement as well. At the moment it looks like it's possible. It would be great to have some sample data and expected result set to play with.

    DECLARE @DataUntilDate DATETIME
    SET @DataUntilDate = GETDATE()

    SELECT
        ISNULL( drvPos.pID, drvNeg.pID) AS pID,
        ISNULL( drvPos.sID, drvNeg.sID) AS sID,
        ISNULL( drvPos.qty, 0) - ISNULL( drvNeg.qty, 0) AS qty,
        ISNULL( drvPos.amt, 0) - ISNULL( drvNeg.amt, 0) AS amt
    FROM
        (SELECT
            portfolioID AS pID,
            CASE WHEN transactionCode IN ( 'sl', 'wd', 'in', 'dv') THEN securityID2 ELSE securityID1 END AS sID,
            SUM( CASE WHEN sectypecode1 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END) AS qty,
            SUM( TradeAmount) AS amt
         FROM
            vPortfolioTransaction
         WHERE
            transactionCode IN ( 'ti', 'li', 'dp', 'by', 'sl', 'wd', 'in', 'dv')
            AND tradeDate <= @DataUntilDate
         GROUP BY
            portfolioID,
            CASE WHEN transactionCode IN ( 'sl', 'wd', 'in', 'dv') THEN securityID2 ELSE securityID1 END
        ) AS drvPos
    FULL OUTER JOIN
        (SELECT
            portfolioID AS pID,
            CASE WHEN transactionCode IN ( 'by', 'dp') THEN securityID2 ELSE securityID1 END AS sID,
            SUM( CASE WHEN sectypecode1 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END) AS qty,
            SUM( TradeAmount) AS amt
         FROM
            vPortfolioTransaction
         WHERE
            transactionCode IN ( 'to', 'lo', 'wd', 'sl', 'by', 'dp')
            AND tradeDate <= @DataUntilDate
         GROUP BY
            portfolioID,
            CASE WHEN transactionCode IN ( 'by', 'dp') THEN securityID2 ELSE securityID1 END
        ) AS drvNeg
        ON drvPos.pID = drvNeg.pID
        AND drvPos.sID = drvNeg.sID
    ORDER BY
        ISNULL( drvPos.pID, drvNeg.pID),
        ISNULL( drvPos.sID, drvNeg.sID)
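
    And to show what I mean about possibly replacing the FULL JOIN as well: something along these lines, one pass per security column with the sign folded into a CASE so the join (and the second pair of aggregates) disappears. Completely untested without sample data, and I've taken the transactionCode lists from your original version, so treat it strictly as a rough sketch:

    SELECT
        pID,
        sID,
        SUM( qty) AS qty,
        SUM( amt) AS amt
    FROM
        (SELECT
            portfolioID AS pID,
            securityID1 AS sID,
            CASE WHEN transactionCode IN ( 'ti', 'li', 'dp', 'by', 'cs') THEN 1 ELSE -1 END
                * CASE WHEN sectypecode1 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END AS qty,
            CASE WHEN transactionCode IN ( 'ti', 'li', 'dp', 'by', 'cs') THEN 1 ELSE -1 END
                * TradeAmount AS amt
         FROM vPortfolioTransaction
         WHERE transactionCode IN ( 'ti', 'li', 'dp', 'by', 'cs', 'to', 'lo', 'wd', 'sl', 'ss')
            AND tradeDate <= @DataUntilDate
         UNION ALL
         SELECT
            portfolioID,
            securityID2,
            CASE WHEN transactionCode IN ( 'sl', 'wd', 'in', 'dv') THEN 1 ELSE -1 END
                * CASE WHEN sectypecode2 IN ( 'ca', 'ex', 'ep') THEN TradeAmount ELSE quantity END,
            CASE WHEN transactionCode IN ( 'sl', 'wd', 'in', 'dv') THEN 1 ELSE -1 END
                * TradeAmount
         FROM vPortfolioTransaction
         WHERE transactionCode IN ( 'sl', 'wd', 'in', 'dv', 'by', 'dp')
            AND tradeDate <= @DataUntilDate
        ) AS drv
    GROUP BY
        pID,
        sID
    ORDER BY
        pID,
        sID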



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function

  • LutzM (9/16/2010)


    I'm not sure if the FULL JOIN can be replaced with a CASE statement as well. At the moment it looks like it's possible. It would be great to have some sample data and expected result set to play with.

    Yeah, that's part of the whole NDA problem at the moment. The results I'm looking for and the sample data are proprietary, and I'd need to set up a whole separate build to create the test data. I'll see what I can come up with on that front, but it's not simple.

    Still going through your code, and I believe you're right... I'll get back to you on that after a few rounds with the data. 🙂

    EDIT: I should have said manually create the test data. I have no way of automating test data that will also give intelligent results.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • I think all that's needed so far are 14 sample rows (one for each transactionCode value). If there are more values than the 14 mentioned so far, you should include a sample of the ones not mentioned yet.

    From my point of view, the easy part is setting up the sample data (a bare-bones sketch of what I mean is below). Coming up with values for all the remaining columns that lead to a result set which can be verified as correct or not is a totally different story... Absolutely understood.

    In any case, you'll have plenty of time: I'm going to take a nap (it's 1am over here) followed by a few hours at the place where I make the money for a living. So it's going to be at least 15 hours before I'll have a chance to get back here.
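
    To illustrate what I mean by "the easy part": a couple of rows along these lines would be enough to get started. Only the columns the query touches get meaningful values, the rest just have to satisfy the NOT NULL constraints, and every value here is made up:

    INSERT INTO dbo.vPortfolioTransaction
        (PortfolioID, PortfolioTransactionID, SequenceNo, RecID, TransactionCode,
         CommissionPurposeID, TransUserDef1ID, TransUserDef2ID, TransUserDef3ID,
         TradeDate, SecTypeCode1, SecurityID1, SecTypeCode2, SecurityID2, Quantity, TradeAmount)
    VALUES (1, 1, 1, 0, 'by', 0, 0, 0, 0, '20100901', 'cs', 101, 'ca', 900, 100, 1500.00);

    INSERT INTO dbo.vPortfolioTransaction
        (PortfolioID, PortfolioTransactionID, SequenceNo, RecID, TransactionCode,
         CommissionPurposeID, TransUserDef1ID, TransUserDef2ID, TransUserDef3ID,
         TradeDate, SecTypeCode1, SecurityID1, SecTypeCode2, SecurityID2, Quantity, TradeAmount)
    VALUES (1, 2, 1, 0, 'sl', 0, 0, 0, 0, '20100902', 'cs', 101, 'ca', 900, 40, 650.00);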



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs and Dynamic Cross Tabs, Delimited Split Function
