|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Thursday, March 07, 2013 7:21 AM
Points: 1,
Visits: 119
|
|
arp 24442,
Finding the Median depends on the ordering of of the data you're processing. SQL server 2005's ROW_NUMBER() function allows your to easily calculate a row's position in an ordered set.
Your code might look something like:
DECLARE @MedianValue as decimal DECLARE @rowCount as int
SELECT @rowCount = COUNT(*) FROM dataTable
SELECT @MedianValue = dataValue FROM ( SELECT dataValue ,ROW_NUMBER() OVER (ORDER BY dataValue) as rowNumber FROM dataTable ) as t WHERE t.rowNumber = @rowCount / 2 + 1
In the code above you first get the total number of rows in your set. Then you query for the n/2+1th Item (rowNumber always begins counting at 1). Note that I had to use a subquery in order to easily apply a WHERE clause to the row numbering.
We're also assuming that the row Count is odd. If it's even, Wikipedia says that the Median for even-numbered sets is defined as the mean of the middle two items. That means that you find the @rowCount/2 and @rowCount/2+1 items and average them out.
To find the 25th and 75th percentile items, adjust your formula to find items 1/4 and 3/4 of the way down the list.
If you're stuck using SQL server 2000, there's a nifty way of using SET RowCount and variable assignments to get the exact rows that you need. This articleexplains how to do it with Paging but the method can be used for finding the median as well.
Hope this helps!
Update: I just realized that there may be the case where the data is partitioned by some other value or values. This complicates the solution but it can still be done with SQL server 2005:
DECLARE @MedianValue as decimal DECLARE @rowCount as int
SELECT @rowCount = COUNT(*) FROM dataTable
SELECT dataValue as Median ,groupingValue FROM ( -- count of values partitioned by grouping Value SELECT count(dataValue) / 2 as medianLocation ,groupingValue FROM dataTable GROUP BY groupingValue ) as countTable INNER JOIN ( -- row counts table partitioned by grouping Value SELECT dataValue ,ROW_NUMBER() OVER (ORDER BY dataValue PARTITION BY groupingValue) as rowNumber ,groupingValue FROM dataTable ) as rowTable ON countTable.medianLocation = rowTable.rowNumber AND countTable.groupingValue = rowTable.groupingValue
The logic is fairly straightforward: the first subquery calculates the location of the median per grouped value. The second query calculates the row number for each set of values based on the grouped value. All we then do is join the two tables together so we can get the datavalue (median) for each grouping Value. A good basic introduction to Row_Number() and partitioning can be found here.
The query above will not be the fastest thing on the planet but I'm willing to bet it would still be faster than using a cursor. Unfortunately, I don't think there's an SQL server 2000 equivalent.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Monday, November 28, 2011 9:57 AM
Points: 3,
Visits: 12
|
|
| The advice about not writing more than five cursors in one's career does not make any sense. I think that a cursor is the perfect solution for certain scenarios and those scenarios may arise anytime in our careers.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 7:47 AM
Points: 3,
Visits: 41
|
|
One instance that springs to mind where cursors actually out-perform while key > last_key loops is once you get involved in Sync Framework and start using timestamps. Set-based doesn't handle timestamps all that well when you rely on your information being indexed and batched according to them. Rowversion also behaves in a similar way and certainly isn't suitable for use with Sync Framework.
The problem defines the solution - so don't be afraid to use all the tools you have at hand. Saying you should only use cursors 5 times during your career is like telling a mechanic they can't use a particular spanner more than 5 times - chances are it'll be the only spanner that'll work on that 6th engine.
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: 2 days ago @ 8:01 PM
Points: 408,
Visits: 688
|
|
It's quite disrespectful calling Mr Celko silly.
"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."
This is a correct statement, but probably just needs a comma after "electronics". In electronics yes its about power maximisation and avoiding reflections, but in computing it is certainly about the mismatch between relational and procedural ( which are themselves analogues for parallel and serial).
On that Dinosaur languages comment, a quick search on wikipedia reveals "Lisp is the second-oldest high-level programming language in widespread use today; only Fortran is older (by one year)." Maybe it should be LISP.
Joe Celko is a venerable statesman in the SQL community, he contributed to the SQL-89 and SQL-92 standards and has written and contributed to more than 8 books, not to mention this informative Stairway series. Attacking him on trite points and calling him silly just detracts from the point of all this.
Newbies and Seasoned pros alike can benefit from stairway installments. They suit my short attention span perfectly.
The comment about procedural languages providing an undercover cursor was something I'd never considered, but I suddenly recall all the csv / xls and other flat reports i've ever done that benefitted dramatically from not processing the dataset in the procedural language, and shifting the bulk work back to SQL Server.
Where is Jeff Moden's comment in all this? If you still don't believe cursors are slow just ask him!
|
|
|
|
|
SSCommitted
      
Group: General Forum Members
Last Login: Tuesday, January 15, 2013 11:11 AM
Points: 1,945,
Visits: 2,782
|
|
arp 24442 (11/2/2011) Hi all, I am all for new knowledge. I have come to SQL from a programmer background and still find it hard NOT to think sequentially through a set. Recently, I had to find the 25th, 50th (median) and 75th percentiles (or 1st, 2nd, 3rd quartiles) for a box and whisker plot. After much digging I found a post from a guy using a cursor to find the median and tweaked it to give me the other two stats. [/code]
A to ne point decdes ago, the Median was the "SQL Proble,m du Jour"; Chris date published one answer in his column, I replied with another answer, and it bounced around the community for a few months. The last edition of SQL FOR SMARTIES has six ways.
The approach I like today is to use SELECT x ROW_NUMBER() OVER (ORDER BY x ASC) AS up_cnt, ROW_NUMBER() OVER (ORDER BY x DESC) AS dn_cnt FROM Foobar;
the look for up_cnt IN ( dn_cnt, dn_cnt -1, dn_cnt +1)
But you have to wqth out for duplicate values. lay with it see if you like it.
Books in Celko Series for Morgan-Kaufmann Publishing Analytics and OLAP in SQL Data and Databases: Concepts in Practice Data, Measurements and Standards in SQL SQL for Smarties SQL Programming Style SQL Puzzles and Answers Thinking in Sets Trees and Hierarchies in SQL
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Monday, May 20, 2013 8:01 AM
Points: 254,
Visits: 373
|
|
david.howell (11/2/2011) It's quite disrespectful calling Mr Celko silly.
"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."
This is a correct statement, but probably just needs a comma after "electronics". In electronics yes its about power maximisation and avoiding reflections, but in computing it is certainly about the mismatch between relational and procedural ( which are themselves analogues for parallel and serial).
On that Dinosaur languages comment, a quick search on wikipedia reveals "Lisp is the second-oldest high-level programming language in widespread use today; only Fortran is older (by one year)." Maybe it should be LISP.
Joe Celko is a venerable statesman in the SQL community, he contributed to the SQL-89 and SQL-92 standards and has written and contributed to more than 8 books, not to mention this informative Stairway series. Attacking him on trite points and calling him silly just detracts from the point of all this.
Newbies and Seasoned pros alike can benefit from stairway installments. They suit my short attention span perfectly.
The comment about procedural languages providing an undercover cursor was something I'd never considered, but I suddenly recall all the csv / xls and other flat reports i've ever done that benefitted dramatically from not processing the dataset in the procedural language, and shifting the bulk work back to SQL Server.
Where is Jeff Moden's comment in all this? If you still don't believe cursors are slow just ask him!
I'm all about Jeff Moden and anti-RBAR. I just work with and have always worked with and been surrounded by people who consider OO programming to be greater than functional programming by default and it bothers me. The reason I left Lisp out of the dinosaur category is because it was basically so brilliant that it's become timeless.
I have no issue with his SQL knowledge and I have no issue with cutting down on cursors. I just didn't like the implied outdated-ness of functional programming.
But again, straying from topic.
My company has me writing alllllllll of the business logic for a program that's going to be a huge part of our business cycle in SQL. And it's driving me nuts because I've had to write cursors for it in more than one place.
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: Wednesday, January 16, 2013 4:23 PM
Points: 415,
Visits: 2,333
|
|
david.howell (11/2/2011) It's quite disrespectful calling Mr Celko silly.
"This is called an impedance mismatch, a borrowed term from electronics that refers to parallel versus serial data transmission."
This is a correct statement
If you really want to insist on that statement's correctness, I doubt I can convince you otherwise.
Anyways, statesman or not, his writing style is very silly to me. Sorry!
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: Wednesday, January 16, 2013 4:23 PM
Points: 415,
Visits: 2,333
|
|
So one use of cursors I would consider would be running totals based on a date ordering of a result set. It seems to me that using the set oriented solutions I've would get progressively slower as the result set got larger. Are there set oriented solutions that wouldn't increase in execution time at more than a linear rate in proportion to the number of records in the set getting the running total calculated?
I can't conceptually figure out how to accumulate and list a total for each record without constantly recalculating it against all the previous records preceding it in the ordered result set.
|
|
|
|
|
SSCrazy
      
Group: General Forum Members
Last Login: Thursday, May 16, 2013 9:16 AM
Points: 2,236,
Visits: 6,486
|
|
|
|
|
|
Old Hand
      
Group: General Forum Members
Last Login: Wednesday, May 22, 2013 3:59 PM
Points: 316,
Visits: 400
|
|
A cursor does have its place because some things (e.g., DDL, procedure execution) can't be done in a set-based query. Imagine using a SELECT to execute a procedure to send email? Execution of T-SQL select and update statements stored in a column as part of a set-based query in a single transaction? Needing a cursor in or procedural code this case is probably a good thing.
As to avoiding a cursor, any form of a procedural work around (e.g., WHILE loops) still misses the point. SQL Server can't optimize procedural code well - if at all. If you can specify the problem as a set-based query declaration, SQL server might still end up with a horrid execution plan. Even so, I bet it does a better job of it. How many of you will use multiple threads to process something procedurally? (However, if SQL is going parallel, the query probably has not been optimized.)
With a reservation system, the problem often is that the record to be processed depends upon completing the processing of the prior record. This might seem to require a cursor or procedural code, but there are options to do it set based. Even so, I notice some will do everything with the cursor once a cursor is "required" - even if most of the problem can be solved via set-based steps without much thought.
PS I have to look up the cursor syntax every time I use it. I think that's as it should be.
|
|
|
|