November 1, 2010 at 11:01 am
Hi all,
I'm not exactly new to MS SQL, but am facing an optimisation issue that I could use some help on - and the best answers might, I hope, be helpful to complete SQL Newbies as well.
My question is...
I have an "auction-style" application which contains multiple products being sold (with many duplicate products), and with many possible prices for these items in it, from different people selling the item.
I want to write the most efficient "single pass" query for getting the cheapest price for each item, but also identifying the rows which have identically matching cheapest prices, and returning them as well.
My old SQL "Obi-Wan" guru (sadly not available to ask questions of nowadays) always told me to run away from HAVING and IN clauses, and it's proved good advice so far.
I know that answers to these "what works best" questions depend on a wide variety of factors, but assume that the source Table Variable (@ExampleTable) below stands in for a regular table containing many hundreds of thousands (if not millions) of rows; that I have full control over indexes etc.; that the data will be changing *extremely* regularly; that it will be queried often; that there are many other tables hanging off the ExampleTableID field containing various other variables to be filtered on during the select; and that I'm on SQL 2008.
Here are 3 sample queries, illustrating three different ways of getting the same result set from the same base data - and I can think of many more ways than these three (involving CTEs etc).
Can anyone shed some light (given the above) on what the best method would be - and please don't limit yourself to the 3 versions below; I'm absolutely sure that none of them is the most efficient method, but I am really not sure what would be!
Many thanks in advance.
J
declare @ExampleTable table
(
ExampleTableID int identity,
ExampleProduct varchar(50),
ExampleValue int
)
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 1', 50
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 1', 120
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 2', 25
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 2', 36
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 3', 14
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 3', 129
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 3', 13
insert into @ExampleTable (ExampleProduct,ExampleValue)
select 'Product 2', 25
-- Method 1 of getting the rowset containing the Smallest Value of each Product, using HAVING
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from @ExampleTable et
group by
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue
having et.ExampleValue in
(
select MIN(ExampleValue) from @ExampleTable et2
where et.ExampleProduct = et2.ExampleProduct
)
order by et.ExampleProduct asc
-- Method 2 of getting the rowset containing the Smallest Value of each Product, using IN
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from @ExampleTable et
where et.ExampleValue in (select MIN(et2.ExampleValue) from @ExampleTable et2
where et.ExampleProduct = et2.ExampleProduct)
group by
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue
order by et.ExampleProduct asc
-- Method 3, separation of smallest values into smaller table and then run from there
declare @SmallestValues table
(
ExampleProduct varchar(50),
ExampleValue int
)
insert into @SmallestValues (ExampleProduct, ExampleValue)
select et.ExampleProduct, MIN(et.ExampleValue)
from @ExampleTable et
group by et.ExampleProduct
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from @ExampleTable et
where exists (select * from @SmallestValues sv
where sv.ExampleProduct = et.ExampleProduct
and sv.ExampleValue = et.ExampleValue)
order by et.ExampleProduct asc
November 1, 2010 at 12:03 pm
;WITH CTE AS
(
SELECT t.ExampleTableID,
t.ExampleProduct,
t.ExampleValue,
RN = RANK() OVER (PARTITION BY ExampleProduct ORDER BY ExampleValue)
FROM @ExampleTable t
)
SELECT *
FROM CTE
WHERE RN = 1;
This article discusses the RANK() and other Windowing Functions: SQL Server Ranking Functions
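One thing worth pointing out: RANK() gives tied values the same rank, so if two sellers share the same cheapest price for a product (like the two 'Product 2' rows at 25 in your sample), both rows come back; ROW_NUMBER() would only return one of them. A quick throwaway illustration (sample data made up for this, and the row-constructor INSERT syntax needs 2008+):
-- Two rows tie for the cheapest "Product 2" price
DECLARE @t TABLE (ID int IDENTITY, Product varchar(50), Value int);
INSERT INTO @t (Product, Value)
VALUES ('Product 2', 25), ('Product 2', 36), ('Product 2', 25);
SELECT ID, Product, Value,
       RN  = ROW_NUMBER() OVER (PARTITION BY Product ORDER BY Value), -- 1, 2, 3: only one of the tied rows gets 1
       RNK = RANK()       OVER (PARTITION BY Product ORDER BY Value)  -- 1, 1, 3: both 25s get rank 1
FROM @t;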
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 12:13 pm
WayneS (11/1/2010)
;WITH CTE AS
(
SELECT t.ExampleTableID,
t.ExampleProduct,
t.ExampleValue,
RN = RANK() OVER (PARTITION BY ExampleProduct ORDER BY ExampleValue)
FROM @ExampleTable t
)
SELECT *
FROM CTE
WHERE RN = 1;
This article discusses the RANK() and other Windowing Functions: SQL Server Ranking Functions
Hi Wayne,
That's exactly what I suspected I was missing - and thank you muchly for your reply.
Mods, please feel free to close this thread now.
Rgds,
James N
November 1, 2010 at 12:13 pm
James,
First, nice work on the sample set and work shown. Helped a lot.
I usually use a variant on your Method 2 and 3, along these lines:
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from
@ExampleTable et
JOIN
(select
ExampleProduct,
MIN(ExampleValue) AS MinVal
from
@ExampleTable
GROUP BY
ExampleProduct
) AS drv
ON et.ExampleProduct = drv.ExampleProduct
AND et.ExampleValue = drv.MinVal
order by
et.ExampleProduct
This will depend on your indexing and whatnot, of course, but with no indexes involved and examining the execution plans, my version does less work than the rest of the options. Combine this with method 3 for a temp table (or tablevar) to relink into a more complex query to get better optimization out of it if needed. I would only do that for roughly 2000+ rows in the grouped min() with 10-20k in the external after a bit of testing if I wasn't happy with the speed.
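Roughly, that combination would look something like the sketch below - the #MinPrices name, dbo.ExampleTable (standing in for your real table), and the commented-out extra join are all just placeholders for whatever your actual schema is:
-- 1) Materialize the per-product minimums once
SELECT ExampleProduct, MIN(ExampleValue) AS MinVal
INTO #MinPrices
FROM dbo.ExampleTable
GROUP BY ExampleProduct;
CREATE CLUSTERED INDEX IX_MinPrices ON #MinPrices (ExampleProduct, MinVal);
-- 2) Relink the small min-price table into the wider query with its other filters
SELECT et.ExampleTableID,
       et.ExampleProduct,
       et.ExampleValue AS SmallestValueForExampleProduct
FROM dbo.ExampleTable et
JOIN #MinPrices mp
  ON mp.ExampleProduct = et.ExampleProduct
 AND mp.MinVal = et.ExampleValue
-- JOIN dbo.SomeOtherTable ot ON ot.ExampleTableID = et.ExampleTableID -- hypothetical extra filter tables hang off here
ORDER BY et.ExampleProduct;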
EDIT: I cross posted with Wayne. 🙂 I took his code and checked the plans against mine. Unindexed, and with this tiny dataset, they're about even in work but his has a few more steps including a compute scalar. I'd try both out and see what results you get with the larger dataset.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 1, 2010 at 12:20 pm
Craig Farrell (11/1/2010)
James,
First, nice work on the sample set and work shown. Helped a lot.
I usually use a variant on your Method 2 and 3, along these lines:
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from
@ExampleTable et
JOIN
(select
ExampleProduct,
MIN(ExampleValue) AS MinVal
from
@ExampleTable
GROUP BY
ExampleProduct
) AS drv
ON et.ExampleProduct = drv.ExampleProduct
AND et.ExampleValue = drv.MinVal
order by
et.ExampleProduct
This will depend on your indexing and whatnot, of course, but with no indexes involved and examining the execution plans, my version does less work than the rest of the options. Combine this with method 3 for a temp table (or tablevar) to relink into a more complex query to get better optimization out of it if needed. I would only do that for roughly 2000+ rows in the grouped min() with 10-20k in the external after a bit of testing if I wasn't happy with the speed.
EDIT: I cross posted with Wayne. 🙂 I took his code and checked the plans against mine. Unindexed, and with this tiny dataset, they're about even in work but his has a few more steps including a compute scalar. I'd try both out and see what results you get with the larger dataset.
Hi Craig,
Heh - I crossposted with yours too 🙂
Personally I felt I was missing something with CTEs, and that "felt" like a Eureka moment for me (I've only just moved up to 2008 from 2000, and CTEs have been a bit of an "omg" moment - as have the geospatial datatypes and HierarchyIDs) - but I'll definitely check both methods against each other on the actual data and see where I get with what I've got.
Many thanks to you both.
Rgds,
JamesN
November 1, 2010 at 12:40 pm
jamesniesewand (11/1/2010)
Personally I felt I was missing something with CTEs, and that "felt" like a Eureka moment for me (I've only just moved up to 2008 from 2000, and CTEs have been a bit of an "omg" moment - as have the geospatial datatypes and HierarchyIDs) - but I'll definitely check both methods against each other on the actual data and see where I get with what I've got.
My personal omg moment with ctes was the recursive nature. Everything else is just a pretty subquery organizer to me, but those recursions... those are the true power! (Well, in my head, anyway...)
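Purely as a throwaway illustration of what I mean (nothing to do with James' problem): the anchor query seeds the CTE, and the second query keeps referencing the CTE itself until the WHERE shuts it off.
;WITH Numbers (N) AS
(
    SELECT 1                      -- anchor member: the starting row
    UNION ALL
    SELECT N + 1 FROM Numbers     -- recursive member: references the CTE itself
    WHERE N < 10
)
SELECT N FROM Numbers;            -- returns 1 through 10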
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 1, 2010 at 12:42 pm
jamesniesewand (11/1/2010)
Hi Wayne,
That's exactly what I suspected I was missing - and thank you muchly for your reply.
Great, I'm glad it's working right. Thanks for posting the DML/DDL to produce what you're trying to do.
Mods, please feel free to close this thread now.
We don't do that around here... there are frequently some awesome discussions that evolve after there's been one solution - and sometimes even better solutions come forth.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 12:51 pm
Craig Farrell (11/1/2010)
EDIT: I cross posted with Wayne. 🙂 I took his code and checked the plans against mine. Unindexed, and with this tiny dataset, they're about even in work but his has a few more steps including a compute scalar. I'd try both out and see what results you get with the larger dataset.
The compute scalar is the "1" being used in my plan.
The filter is the "RN = 1".
Your solution performs very well also... I also would really like to see some performance tests on this. Guess it's time to make a million row test rig, and test them.... stay tuned.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 12:53 pm
Craig Farrell (11/1/2010)
jamesniesewand (11/1/2010)
Personally I felt I was missing something with CTEs, and that "felt" like a Eureka moment for me (I've only just moved up to 2008 from 2000, and CTEs have been a bit of an "omg" moment - as have the geospatial datatypes and HierarchyIDs) - but I'll definitely check both methods against each other on the actual data and see where I get with what I've got.
My personal omg moment with ctes was the recursive nature. Everything else is just a pretty subquery organizer to me, but those recursions... those are the true power! (Well, in my head, anyway...)
Agreed Craig. With the exception of the recursive CTEs, I tend to think of a CTE as a "pre-defined sub-query". As such, when using them, it makes it obvious how to change your query when working with someone on Sql2000. I just find a CTE a lot easier to test than sub-queries.
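To show what I mean with James' sample data (assuming the @ExampleTable from the first post is still in scope), here's the same per-product minimum written both ways - the derived-table version is roughly what you'd hand back to a 2000 shop:
-- CTE form (2005+): the inner query can be highlighted and run on its own for testing
;WITH MinPrices AS
(
    SELECT ExampleProduct, MIN(ExampleValue) AS MinVal
    FROM @ExampleTable
    GROUP BY ExampleProduct
)
SELECT et.ExampleTableID, et.ExampleProduct, et.ExampleValue
FROM @ExampleTable et
JOIN MinPrices mp
  ON mp.ExampleProduct = et.ExampleProduct
 AND mp.MinVal = et.ExampleValue;
-- Same thing as a derived table
SELECT et.ExampleTableID, et.ExampleProduct, et.ExampleValue
FROM @ExampleTable et
JOIN (SELECT ExampleProduct, MIN(ExampleValue) AS MinVal
      FROM @ExampleTable
      GROUP BY ExampleProduct) mp
  ON mp.ExampleProduct = et.ExampleProduct
 AND mp.MinVal = et.ExampleValue;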
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 1:26 pm
Okay, I made a dataset with ten million records (over 5000 products), then ran both sets of code:
Test harness / code is attached (trouble posting again) below:
if OBJECT_ID('tempdb..#test') IS NOT NULL DROP TABLE #test;
CREATE TABLE #test
(
ExampleTableID int identity,
ExampleProduct varchar(50),
ExampleValue int
)
-- See Jeff Moden's article
-- The "Numbers" or "Tally" Table: What it is and how it replaces a loop.
-- at http://www.sqlservercentral.com/articles/T-SQL/62867/.
-- NOTE! A permanent tally table will always be MUCH faster
-- than this inline one. See the above article to create your own!
;WITH
Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL
SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),
Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),
Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),
Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions),
Products AS (SELECT Product = 'Product ' + CONVERT(varchar(5), N) FROM Tally WHERE N <= 5000)
INSERT INTO #test
SELECT TOP (10000000) p.Product, value = ABS(CHECKSUM(NEWID()))%100+1
FROM Tally t CROSS JOIN Products p;
-- Drop the index first if re-running the harness
IF EXISTS (SELECT * FROM tempdb.sys.indexes
           WHERE [object_id] = OBJECT_ID('tempdb..#test') AND name = 'ProductValue')
    DROP INDEX #test.ProductValue;
CREATE INDEX ProductValue ON #test (ExampleProduct, ExampleValue) INCLUDE (ExampleTableID);
declare @i int = 1;
SET STATISTICS IO,TIME ON;
;WITH CTE AS
(
SELECT t.ExampleTableID,
t.ExampleProduct,
t.ExampleValue,
RN = RANK() OVER (PARTITION BY ExampleProduct ORDER BY ExampleValue)
FROM #test t
)
SELECT *
FROM CTE
WHERE RN = @i;
select
et.ExampleTableID,
et.ExampleProduct,
et.ExampleValue as SmallestValueForExampleProduct
from
#test et
JOIN
(select
ExampleProduct,
MIN(ExampleValue) AS MinVal
from
#test
GROUP BY
ExampleProduct
) AS drv
ON et.ExampleProduct = drv.ExampleProduct
AND et.ExampleValue = drv.MinVal
order by
et.ExampleProduct ;
SET STATISTICS IO,TIME OFF;
Results:
Wayne's Code:
Table '#test_00000000000C'. Scan count 1, logical reads 44024, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 12079 ms, elapsed time = 15344 ms.
Craig's Code:
Table '#test_00000000000C'. Scan count 14, logical reads 44795, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 6594 ms, elapsed time = 6702 ms.
Craig, I guess you whopped up on me! Only 40% of the time of mine... very good.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 1:46 pm
WayneS (11/1/2010)
Great, I'm glad it's working right. Thanks for posting the DML/DDL to produce what you're trying to do.
Mods, please feel free to close this thread now.
We don't do that around here... there are frequently some awesome discussions that evolve after there's been one solution - and sometimes even better solutions come forth.
Good Lord - you people are the best 😎
I mean seriously - on almost any forum I can mention, this topic would have been closed at my request (or ignored by everyone else).
The responses posted, and the speed (not just of the queries, but of the postings and testbeds as well!), have been a breathtaking first welcome to this forum and the expertise herein.
If there isn't a :bowdown: smiley there should be - can't thank you guys enough.
Best,
JamesN
November 1, 2010 at 1:51 pm
WayneS (11/1/2010)
Craig, I guess you whopped up on me! Only 40% of the time of mine... very good.
Well, every now and then I get lucky. :w00t: Honestly though, my code beating out yours makes me look closer at why. 🙂 I took your code and ran it and checked through the results.
Couple of notes.
1) DECLARE @i INT = 1; doesn't work in 2k5, had to change it to DECLARE @i INT; SET @i = 1;
2) You've got a better machine than I do. 😛
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 1, logical reads 44143, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 6688 ms, elapsed time = 39886 ms.
(99667 row(s) affected)
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 14, logical reads 48953, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 3282 ms, elapsed time = 16956 ms.
Of particular curiosity to me though is that while I had more elapsed time, I used less CPU. About half. Very odd.
3) A review of the execution plans shows that I'm only pulling 22 rows out during the aggregation portion. I'd have expected 5k rows from the product method you used:
Products AS (SELECT Product = 'Product ' + CONVERT(varchar(5), N) FROM Tally WHERE N <= 5000)
/*AND*/
Tally t CROSS JOIN Products p
However:
select count(DISTINCT ExampleProduct) FROM #test -- returns 11
select DISTINCT ExampleProduct FROM #test
Product 1
Product 11
Product 13
Product 15
Product 17
Product 19
Product 2
Product 3
Product 5
Product 7
Product 9
What am I missing in how that result set appears? Seems very strange to me.
I believe, though, that the entire speed difference is the volume of data being carried. If we figure out why the product list out of a million rows is only producing 11 distinct products and we bring this up to the 5000 products you'd expect, we will see closer runtimes, I believe.
EDIT: AHAHA! It's from the top limiter vs. the crossjoin. Need to get that to behave itself. Get back to you with slightly modified code there.
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA
November 1, 2010 at 2:00 pm
Craig Farrell (11/1/2010)
WayneS (11/1/2010)
Craig, I guess you whopped up on me! Only 40% of the time of mine... very good.
Well, every now and then I get lucky. :w00t: Honestly though, my code beating out yours makes me look closer at why. 🙂 I took your code and ran it and checked through the results.
Couple of notes.
1) DECLARE @i INT = 1; doesn't work in 2k5, had to change it to DECLARE @i INT; SET @i = 1;
Whoops... I tried that afterwards - didn't help any.
2) You've got a better machine than I do. 😛
Shhh! It's an old, sucky Dell Latitude D620. (I'd love to run this on my laptop at home!)
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 1, logical reads 44143, physical reads 0, read-ahead reads 8, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 6688 ms, elapsed time = 39886 ms.
(99667 row(s) affected)
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 14, logical reads 48953, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 3282 ms, elapsed time = 16956 ms.
Of particular curiosity to me though is that while I had more elapsed time, I used less CPU. About half. Very odd.
Yours has the worktable (and more reads???). So yours is half the elapsed time also.
3) A review of the execution plans shows that I'm only pulling 22 rows out during the aggregation portion. I'd have expected 5k rows from the product method you used:
Products AS (SELECT Product = 'Product ' + CONVERT(varchar(5), N) FROM Tally WHERE N <= 5000)
/*AND*/
Tally t CROSS JOIN Products p
However:
select count(DISTINCT ExampleProduct) FROM #test -- returns 11
select DISTINCT ExampleProduct FROM #test
Product 1
Product 11
Product 13
Product 15
Product 17
Product 19
Product 2
Product 3
Product 5
Product 7
Product 9
What am I missing in how that result set appears? Seems very strange to me.
Indeed. I've got no clue for that - maybe your system is overtaxed?
I believe, though, that the entire speed difference is the volume of data being carried. If we figure out why the product list out of a million rows is only producing 11 distinct products and we bring this up to the 5000 products you'd expect, we will see closer runtimes, I believe.
EDIT: AHAHA! It's from the top limiter vs. the crossjoin. Need to get that to behave itself. Get back to you with slightly modified code there.
Okay, let's see how it looks now.
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 2:07 pm
jamesniesewand (11/1/2010)
Good Lord - you people are the best 😎
I mean seriously - on almost any forum I can mention, this topic would have been closed at my request (or ignored by everyone else).
The responses posted, and the speed (not just of the queries, but of the postings, testbeds as well!) has been a breathtaking first welcome for me to this forum and the expertise herein.
If there isn't a :bowdown: smiley there should be - can't thank you guys enough.
Best,
JamesN
James,
:blush: Thanks for the very kind words.
But, get used to it around here. Especially when you take the time to post table DDL and DML in readily consumable format - lots of people "lurk", and will work on things if there is code ready to work with. If not, they'll just skip it.
You'll find that most of the regulars here are always interested in better ways to accomplish things. This is not an attempt to show up others, but a serious approach of learning for all.
And finally... WELCOME!
Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes
November 1, 2010 at 2:41 pm
James, thank you for the kindness. It's appreciated... especially recently.
Hahah, okay, now that the selectivity is frying the 0.3% rule (1 day in a year, as a rule of thumb, James, since you're still following...), we're still seeing similar times, but the execution plans are VERY different.
And... strangely... faster.
Modify the beginning code as so (I have a real Tally table in my tempdb. 🙂 ):
TRUNCATE TABLE #test
;WITH Products AS (SELECT Product = 'Product ' + CONVERT(varchar(5), N) FROM Tally WHERE N <= 5000)
INSERT INTO #test
SELECT TOP (10000000) p.Product, value = ABS(CHECKSUM(NEWID()))%100+1
FROM Tally t CROSS JOIN Products p
WHERE t.N < 200;
This gives us 995,000 rows (close enough), but 200/product, and a proper count on the # of products for selectivity.
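If you want to sanity-check the spread before re-running the timings, something like this should do it (I'd expect the counts in the comments, but run it yourself rather than taking my word for it):
SELECT COUNT(DISTINCT ExampleProduct) AS DistinctProducts, -- should now be 5000
       COUNT(*) AS TotalRows                               -- should be 995,000
FROM #test;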
Now my results look like this:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.
(11429 row(s) affected)
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 1, logical reads 4674, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 844 ms, elapsed time = 5467 ms.
(11429 row(s) affected)
Table '#test_______________________________________________________________________________________________________________000000000010'. Scan count 1, logical reads 4674, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
SQL Server Execution Times:
CPU time = 360 ms, elapsed time = 2563 ms.
That's some huge differences, in both our timing tests... same box. Yes. No, honestly. That's the same box.
The execution plans have changed, well, at least mine has. Wayne's code didn't change much execution plan wise, so I'm highly confused as to the results of the timing test.
I've attached the resultant sqlplans. That filter is apparently incredibly expensive, Wayne. I had figured the larger Segment would be worse. We end up with the same reads, too.
Sometimes this optimizer seriously confuses me. 🙂
Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.
For better assistance in answering your questions | Forum Netiquette
For index/tuning help, follow these directions. | Tally Tables
Twitter: @AnyWayDBA