SQLServerCentral is supported by Red Gate Software Ltd.
 

Stairway to Database Design Level 8: Cursors
Posted Wednesday, November 2, 2011 9:49 AM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, July 17, 2014 12:27 PM
Points: 1, Visits: 154
arp 24442,

Finding the median depends on the ordering of the data you're processing. SQL Server 2005's ROW_NUMBER() function allows you to easily calculate a row's position in an ordered set.

Your code might look something like:
DECLARE @MedianValue AS decimal
DECLARE @rowCount AS int

SELECT @rowCount = COUNT(*) FROM dataTable

SELECT
    @MedianValue = dataValue
FROM
(
    SELECT
        dataValue,
        ROW_NUMBER() OVER (ORDER BY dataValue) AS rowNumber
    FROM dataTable
) AS t
WHERE
    t.rowNumber = @rowCount / 2 + 1

In the code above you first get the total number of rows in your set, then you query for the (n/2 + 1)th item (rowNumber always begins counting at 1). Note that I had to use a subquery in order to easily apply a WHERE clause to the row numbering.

We're also assuming that the row count is odd. If it's even, Wikipedia says that the median for even-numbered sets is defined as the mean of the middle two items. That means you find the @rowCount/2 and @rowCount/2 + 1 items and average them.
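To make both cases concrete, here's a small self-contained sketch of the same row-numbering approach, run through Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite 3.25+ is assumed for ROW_NUMBER() support; the table and values are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataTable (dataValue REAL)")
conn.executemany("INSERT INTO dataTable VALUES (?)",
                 [(v,) for v in [3, 1, 4, 1, 5, 9]])  # even count: 6 rows

row_count = conn.execute("SELECT COUNT(*) FROM dataTable").fetchone()[0]

# For an odd count both positions collapse to the single middle row;
# for an even count they are the two middle rows, which we average.
lo = (row_count + 1) // 2
hi = row_count // 2 + 1

median = conn.execute("""
    SELECT AVG(dataValue) FROM (
        SELECT dataValue,
               ROW_NUMBER() OVER (ORDER BY dataValue) AS rowNumber
        FROM dataTable
    ) WHERE rowNumber IN (?, ?)""", (lo, hi)).fetchone()[0]
print(median)  # 3.5 -- sorted values are 1,1,3,4,5,9, middle two are 3 and 4
```

Because AVG runs over whichever middle rows match, the same query handles odd and even counts without branching.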

To find the 25th and 75th percentile items, adjust your formula to find items 1/4 and 3/4 of the way down the list.

If you're stuck using SQL Server 2000, there's a nifty way of using SET ROWCOUNT and variable assignments to get the exact rows that you need. This article explains how to do it with paging, but the method can be used for finding the median as well.

Hope this helps!

Update:
I just realized that there may be cases where the data is partitioned by some other value or values. This complicates the solution, but it can still be done with SQL Server 2005:

SELECT
    rowTable.dataValue AS Median,
    rowTable.groupingValue
FROM
(   -- median position per groupingValue
    SELECT
        COUNT(dataValue) / 2 + 1 AS medianLocation,
        groupingValue
    FROM dataTable
    GROUP BY groupingValue
) AS countTable
INNER JOIN
(   -- row numbers restarted within each groupingValue
    SELECT
        dataValue,
        ROW_NUMBER() OVER (PARTITION BY groupingValue ORDER BY dataValue) AS rowNumber,
        groupingValue
    FROM dataTable
) AS rowTable
    ON countTable.medianLocation = rowTable.rowNumber
    AND countTable.groupingValue = rowTable.groupingValue

The logic is fairly straightforward: the first subquery calculates the location of the median for each grouped value. The second calculates the row number for each set of values within its group. All we then do is join the two together so we can get the dataValue (median) for each groupingValue. A good basic introduction to ROW_NUMBER() and partitioning can be found here.

The query above will not be the fastest thing on the planet, but I'm willing to bet it would still be faster than using a cursor. Unfortunately, I don't think there's a SQL Server 2000 equivalent.
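As a hedged check that the grouped join really does return one median per group, the same shape can be run against SQLite's window functions in Python (all names and data are invented; odd group sizes are assumed, per the even-count caveat earlier in the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataTable (dataValue REAL, groupingValue TEXT)")
conn.executemany("INSERT INTO dataTable VALUES (?, ?)", [
    (1, "a"), (2, "a"), (3, "a"),                            # median 2
    (10, "b"), (20, "b"), (30, "b"), (40, "b"), (50, "b"),   # median 30
])

medians = conn.execute("""
    SELECT rowTable.groupingValue, rowTable.dataValue AS Median
    FROM (  -- median position per group
        SELECT groupingValue,
               COUNT(dataValue) / 2 + 1 AS medianLocation
        FROM dataTable
        GROUP BY groupingValue
    ) AS countTable
    INNER JOIN (  -- row numbers restarted within each group
        SELECT dataValue, groupingValue,
               ROW_NUMBER() OVER (PARTITION BY groupingValue
                                  ORDER BY dataValue) AS rowNumber
        FROM dataTable
    ) AS rowTable
      ON countTable.medianLocation = rowTable.rowNumber
     AND countTable.groupingValue = rowTable.groupingValue
    ORDER BY rowTable.groupingValue""").fetchall()
print(medians)  # [('a', 2.0), ('b', 30.0)]
```

The join condition picks out exactly the row whose within-group row number equals the group's middle position, so each group contributes one row to the result.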
Post #1199403
Posted Wednesday, November 2, 2011 10:32 AM
Forum Newbie


Group: General Forum Members
Last Login: Monday, November 28, 2011 9:57 AM
Points: 3, Visits: 12
The advice about not writing more than five cursors in one's career does not make any sense. I think that a cursor is the perfect solution for certain scenarios, and those scenarios may arise at any time in our careers.
Post #1199439
Posted Wednesday, November 2, 2011 1:30 PM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, January 28, 2014 7:07 AM
Points: 3, Visits: 42
One instance that springs to mind where cursors actually outperform WHILE key > last_key loops is once you get involved in Sync Framework and start using timestamps. Set-based processing doesn't handle timestamps all that well when you rely on your information being indexed and batched according to them. Rowversion also behaves in a similar way and certainly isn't suitable for use with Sync Framework.

The problem defines the solution - so don't be afraid to use all the tools you have at hand. Saying you should only use cursors 5 times during your career is like telling a mechanic they can't use a particular spanner more than 5 times - chances are it'll be the only spanner that'll work on that 6th engine.
Post #1199599
Posted Wednesday, November 2, 2011 5:45 PM


SSC-Addicted


Group: General Forum Members
Last Login: Wednesday, July 16, 2014 3:03 AM
Points: 450, Visits: 842
It's quite disrespectful calling Mr Celko silly.

"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."

This is a correct statement, but it probably just needs a comma after "electronics". In electronics, yes, it's about power maximisation and avoiding reflections, but in computing it is certainly about the mismatch between relational and procedural (which are themselves analogues for parallel and serial).

On that dinosaur languages comment, a quick search on Wikipedia reveals:
"Lisp is the second-oldest high-level programming language in widespread use today; only Fortran is older (by one year)." Maybe it should be LISP.

Joe Celko is a venerable statesman in the SQL community, he contributed to the SQL-89 and SQL-92 standards and has written and contributed to more than 8 books, not to mention this informative Stairway series. Attacking him on trite points and calling him silly just detracts from the point of all this.

Newbies and seasoned pros alike can benefit from Stairway installments. They suit my short attention span perfectly.

The comment about procedural languages providing an undercover cursor was something I'd never considered, but I suddenly recall all the csv / xls and other flat reports I've ever done that benefitted dramatically from not processing the dataset in the procedural language, and shifting the bulk work back to SQL Server.

Where is Jeff Moden's comment in all this? If you still don't believe cursors are slow just ask him!
Post #1199689
Posted Wednesday, November 2, 2011 9:30 PM


SSCommitted


Group: General Forum Members
Last Login: 2 days ago @ 11:55 AM
Points: 1,945, Visits: 2,860
arp 24442 (11/2/2011)
Hi all,
I am all for new knowledge. I have come to SQL from a programmer background and still find it hard NOT to think sequentially through a set. Recently, I had to find the 25th, 50th (median) and 75th percentiles (or 1st, 2nd, 3rd quartiles) for a box and whisker plot. After much digging I found a post from a guy using a cursor to find the median and tweaked it to give me the other two stats.


At one point decades ago, the Median was the "SQL Problem du Jour"; Chris Date published one answer in his column, I replied with another answer, and it bounced around the community for a few months. The last edition of SQL FOR SMARTIES has six ways.

The approach I like today is to use
SELECT x,
       ROW_NUMBER() OVER (ORDER BY x ASC) AS up_cnt,
       ROW_NUMBER() OVER (ORDER BY x DESC) AS dn_cnt
FROM Foobar;

then look for up_cnt IN (dn_cnt, dn_cnt - 1, dn_cnt + 1)


But you have to watch out for duplicate values. Play with it and see if you like it.
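A minimal sketch of that two-direction numbering, again using SQLite's window functions in Python as a stand-in (the Foobar table and its values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foobar (x REAL)")
conn.executemany("INSERT INTO Foobar VALUES (?)",
                 [(v,) for v in [10, 20, 30, 40, 50]])

# Number each row from both ends; the median row(s) are where the two
# counts meet: equal for an odd count, adjacent for an even count.
rows = conn.execute("""
    SELECT x FROM (
        SELECT x,
               ROW_NUMBER() OVER (ORDER BY x ASC)  AS up_cnt,
               ROW_NUMBER() OVER (ORDER BY x DESC) AS dn_cnt
        FROM Foobar
    ) WHERE up_cnt IN (dn_cnt, dn_cnt - 1, dn_cnt + 1)""").fetchall()

# Averaging handles the even case (two middle rows) transparently.
median = sum(r[0] for r in rows) / len(rows)
print(median)  # 30.0
```

With five rows only the row where up_cnt = dn_cnt survives; with an even count the two adjacent middle rows come back instead, and the average gives the conventional median.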


Books in Celko Series for Morgan Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #1199704
Posted Thursday, November 3, 2011 5:58 AM
SSC Veteran


Group: General Forum Members
Last Login: Yesterday @ 3:59 PM
Points: 269, Visits: 414
david.howell (11/2/2011)
It's quite disrespectful calling Mr Celko silly.

"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."

This is a correct statement, but it probably just needs a comma after "electronics". In electronics, yes, it's about power maximisation and avoiding reflections, but in computing it is certainly about the mismatch between relational and procedural (which are themselves analogues for parallel and serial).

On that Dinosaur languages comment, a quick search on wikipedia reveals
"Lisp is the second-oldest high-level programming language in widespread use today; only Fortran is older (by one year)." Maybe it should be LISP.

Joe Celko is a venerable statesman in the SQL community, he contributed to the SQL-89 and SQL-92 standards and has written and contributed to more than 8 books, not to mention this informative Stairway series. Attacking him on trite points and calling him silly just detracts from the point of all this.

Newbies and Seasoned pros alike can benefit from stairway installments. They suit my short attention span perfectly.

The comment about procedural languages providing an undercover cursor was something I'd never considered, but I suddenly recall all the csv / xls and other flat reports I've ever done that benefitted dramatically from not processing the dataset in the procedural language, and shifting the bulk work back to SQL Server.

Where is Jeff Moden's comment in all this? If you still don't believe cursors are slow just ask him!


I'm all about Jeff Moden and anti-RBAR. I just work with and have always worked with and been surrounded by people who consider OO programming to be greater than functional programming by default and it bothers me. The reason I left Lisp out of the dinosaur category is because it was basically so brilliant that it's become timeless.

I have no issue with his SQL knowledge and I have no issue with cutting down on cursors. I just didn't like the implied outdated-ness of functional programming.

But again, straying from topic.

My company has me writing alllllllll of the business logic for a program that's going to be a huge part of our business cycle in SQL. And it's driving me nuts because I've had to write cursors for it in more than one place.

Post #1199830
Posted Thursday, November 3, 2011 6:42 AM
SSC-Addicted


Group: General Forum Members
Last Login: Wednesday, January 16, 2013 4:23 PM
Points: 415, Visits: 2,333

david.howell (11/2/2011)
It's quite disrespectful calling Mr Celko silly.

"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."

This is a correct statement


If you really want to insist on that statement's correctness, I doubt I can convince you otherwise.

Anyways, statesman or not, his writing style is very silly to me. Sorry!
Post #1199846
Posted Thursday, November 3, 2011 11:05 AM
SSC-Addicted


Group: General Forum Members
Last Login: Wednesday, January 16, 2013 4:23 PM
Points: 415, Visits: 2,333
So one use of cursors I would consider would be running totals based on a date ordering of a result set. It seems to me that the set-oriented solutions I've seen would get progressively slower as the result set got larger. Are there set-oriented solutions that wouldn't increase in execution time at more than a linear rate in proportion to the number of records in the set getting the running total calculated?

I can't conceptually figure out how to accumulate and list a total for each record without constantly recalculating it against all the previous records preceding it in the ordered result set.
Post #1200074
Posted Thursday, November 3, 2011 11:12 AM


SSCrazy


Group: General Forum Members
Last Login: Yesterday @ 9:29 AM
Points: 2,422, Visits: 7,435
patrickmcginnis59 (11/3/2011)
So one use of cursors I would consider would be running totals based on a date ordering of a result set. It seems to me that the set-oriented solutions I've seen would get progressively slower as the result set got larger. Are there set-oriented solutions that wouldn't increase in execution time at more than a linear rate in proportion to the number of records in the set getting the running total calculated?

I can't conceptually figure out how to accumulate and list a total for each record without constantly recalculating it against all the previous records preceding it in the ordered result set.


Jeff's quirky update solves the running total issue, and "Denali" offers new options.
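For the "Denali" route, the new windowed SUM() OVER (ORDER BY ...) computes a running total in a single ordered pass; this sketch shows the idea using SQLite's equivalent window-function support (table and column names invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (saleDate TEXT, amount REAL)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)", [
    ("2011-01-01", 100.0),
    ("2011-01-02",  50.0),
    ("2011-01-03",  25.0),
])

# Each row's total is the sum of itself and every preceding row, in one
# ordered scan -- no cursor and no triangular self-join.
totals = conn.execute("""
    SELECT saleDate,
           SUM(amount) OVER (ORDER BY saleDate
                             ROWS UNBOUNDED PRECEDING) AS runningTotal
    FROM Sales
    ORDER BY saleDate""").fetchall()
print(totals)  # [('2011-01-01', 100.0), ('2011-01-02', 150.0), ('2011-01-03', 175.0)]
```

Because the frame grows by one row per step, the work stays linear in the number of rows, which answers the scaling worry above.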



Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1200081
Posted Thursday, November 3, 2011 11:58 AM
Old Hand


Group: General Forum Members
Last Login: Yesterday @ 4:57 PM
Points: 361, Visits: 562
A cursor does have its place because some things (e.g., DDL, procedure execution) can't be done in a set-based query. Imagine using a SELECT to execute a procedure that sends email, or executing T-SQL SELECT and UPDATE statements stored in a column as part of a set-based query in a single transaction. Needing a cursor or procedural code in this case is probably a good thing.

As to avoiding a cursor, any form of procedural workaround (e.g., WHILE loops) still misses the point. SQL Server can't optimize procedural code well, if at all. If you specify the problem as a set-based query declaration, SQL Server might still end up with a horrid execution plan; even so, I bet it does a better job of it. How many of you will use multiple threads to process something procedurally? (However, if SQL is going parallel, the query probably has not been optimized.)

With a reservation system, the problem often is that the record to be processed depends upon completing the processing of the prior record. This might seem to require a cursor or procedural code, but there are options to do it set based. Even so, I notice some will do everything with the cursor once a cursor is "required" - even if most of the problem can be solved via set-based steps without much thought.

PS I have to look up the cursor syntax every time I use it. I think that's as it should be.


Randy
"Perhaps I'm not the only one that does not know what you are doing."
Post #1200118