A Lookup Strategy Defined

  • I am not going with the author or the mainstream here.

    As I'm too tired to write a pamphlet, and since my English vocabulary is very limited compared to that of the author of the reply in the link below, read this:

    http://groups.yahoo.com/group/sql_server7_forum/message/5731 

    It's a reply to this question

    http://groups.yahoo.com/group/sql_server7_forum/message/5726

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Joe Celko is very much a purist when it comes to relational databases. If you have ever tried to adopt the code examples he provides in his books, you'll soon realize that they tend to run very slowly, if at all. In his "SQL for Smarties" book he provides several different ways of deriving median values. Some of this code doesn't work at all, while other code runs excruciatingly slowly (he doesn't approve of the use of cursors, which IMHO are the only practical way of addressing this particular problem; a sketch of what I mean follows).
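
    For illustration only, here is a minimal sketch of the kind of cursor-based median I mean (this is not Celko's code; the temp table #dist and the variable names are my own, and it assumes a single int column):

    DECLARE @n int, @target int, @i int, @prev int, @v int, @median decimal(10,2)

    SELECT @n = COUNT(*) FROM #dist
    SET @target = (@n + 1) / 2  -- position of the (lower) middle row
    SET @i = 0

    DECLARE med_cur CURSOR FAST_FORWARD FOR
       SELECT c1 FROM #dist ORDER BY c1

    OPEN med_cur
    FETCH NEXT FROM med_cur INTO @v
    WHILE @@FETCH_STATUS = 0 AND @i <= @target
    BEGIN
       SET @i = @i + 1
       IF @i = @target SET @prev = @v  -- the (lower) middle value
       IF @i = @target + 1 AND @n % 2 = 0
          SET @median = (@prev + @v) / 2.0  -- even count: average the two middle values
       FETCH NEXT FROM med_cur INTO @v
    END
    IF @n % 2 = 1 SET @median = @prev  -- odd count: the middle row itself

    CLOSE med_cur
    DEALLOCATE med_cur

    SELECT Median = @median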

  • A cursor for calculating the median (financial or statistical?)??? Here is a set-based way:

    CREATE TABLE #dist (c1 int)

    INSERT INTO #dist VALUES (2)
    INSERT INTO #dist VALUES (3)
    INSERT INTO #dist VALUES (1)
    INSERT INTO #dist VALUES (4)
    INSERT INTO #dist VALUES (8)
    INSERT INTO #dist VALUES (9)

    -- The HAVING clause keeps the row whose rank from the bottom is
    -- (COUNT(*)+1)/2, i.e. the (lower) middle value. For an even number of
    -- rows the CASE averages it with the smallest value above it.
    SELECT Median=CASE COUNT(*)%2
       WHEN 0 THEN  -- Even number
        (d.c1+MIN(CASE WHEN i.c1>d.c1 THEN i.c1 ELSE NULL END))/2.0
       ELSE d.c1 END  -- Odd number
    FROM #dist d CROSS JOIN #dist i
    GROUP BY d.c1
    HAVING COUNT(CASE WHEN i.c1 <= d.c1 THEN 1 ELSE NULL END)=(COUNT(*)+1)/2

    DROP TABLE #dist

    Median
    -------------------
    3.500000

    Or, if there are duplicate values:

    set nocount on

    CREATE TABLE #dist (c1 int)

    INSERT #dist VALUES (2)
    INSERT #dist VALUES (2)
    INSERT #dist VALUES (1)
    INSERT #dist VALUES (5)
    INSERT #dist VALUES (5)
    INSERT #dist VALUES (9)

    -- Same idea, but the HAVING conditions are relaxed (>=) so a qualifying
    -- middle row is still found when duplicates make an exact rank match
    -- impossible. The first ISNULL argument returns d.c1 itself when d.c1
    -- covers the middle position(s) on its own (odd count, or duplicates
    -- spanning the middle); otherwise d.c1 is averaged with the next higher
    -- value.
    SELECT Median=ISNULL(CASE WHEN COUNT(CASE WHEN i.c1<=d.c1 THEN 1 ELSE NULL END) > COUNT(*)/2 THEN 1.0*d.c1 ELSE NULL END,
                  (d.c1+MIN(CASE WHEN i.c1>d.c1 THEN i.c1 ELSE NULL END))/2.0)
    FROM #dist d CROSS JOIN #dist i
    GROUP BY d.c1
    HAVING (COUNT(CASE WHEN i.c1 <= d.c1 THEN 1 ELSE NULL END) >= (COUNT(*)+1)/2)
    AND (COUNT(CASE WHEN i.c1 >= d.c1 THEN 1 ELSE NULL END) >= COUNT(*)/2+1)
    GO

    drop table #dist

    Median
    ----------------
    3.5

    There are many other possible ways; none of them utilizes a cursor. One example is sketched below.
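
    For instance, here is one set-based alternative, sketched for illustration (this sketch is an addition, not part of the original post; it assumes a release that supports TOP n PERCENT with ORDER BY in derived tables, such as SQL Server 2000):

    -- Largest value of the bottom half plus smallest value of the top half,
    -- divided by two. For an odd number of rows both halves share the middle
    -- row, so the result is simply that row's value.
    SELECT Median =
       ( (SELECT MAX(c1)
          FROM (SELECT TOP 50 PERCENT c1 FROM #dist ORDER BY c1) AS BottomHalf)
       + (SELECT MIN(c1)
          FROM (SELECT TOP 50 PERCENT c1 FROM #dist ORDER BY c1 DESC) AS TopHalf)
       ) / 2.0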

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • I have a lot of respect for Joe Celko. However, I bend the rules sometimes, and look-up tables are precisely where I bend them. The resulting CODES table, in use for the past four years, has stayed at 105 rows, holds 14 types of codes, and has rarely changed. It holds things like hearing types, action types, court types, plea types, legislation types, jurisdiction types, and the like (you've guessed it: it is an application that takes care of legal processes). Each one to his taste, I suppose.
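
    For illustration, a hypothetical sketch of what such a combined CODES table might look like (the column names here are guesses, not the actual schema):

    CREATE TABLE CODES (
       code_type   varchar(20)  NOT NULL,  -- e.g. 'HEARING_TYPE', 'PLEA_TYPE'
       code        varchar(10)  NOT NULL,  -- the value stored in the data tables
       description varchar(100) NOT NULL,  -- what screens and reports display
       CONSTRAINT PK_CODES PRIMARY KEY (code_type, code)
    )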

  • I was over-simplifying when I wrote "deriving median values". I have written a stored procedure which does any of the following, depending on the parameters passed: medians, quartiles, quintiles, deciles, or percentiles. Moreover, these measures are reported grouped by any number of columns (and I have never seen this capability anywhere else). In this context, the use of a cursor is mandatory! (A sketch of the idea follows.)
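
    For illustration only, a minimal sketch of how a cursor can return an arbitrary percentile for a single group (this is not the actual procedure; #dist, the variable names, and the linear-interpolation definition of the percentile are assumptions). Quartiles, quintiles, and deciles just use different values of @p, and grouping means repeating the loop per group:

    DECLARE @p decimal(5,4), @n int, @pos decimal(9,4), @lo int,
            @i int, @v int, @prev int, @result decimal(12,4)

    SET @p = 0.25  -- 0.25 = first quartile, 0.5 = median, 0.1 = first decile, ...
    SELECT @n = COUNT(*) FROM #dist
    SET @pos = 1 + (@n - 1) * @p  -- fractional rank of the percentile
    SET @lo = FLOOR(@pos)         -- row at or just below that rank

    SET @i = 0
    DECLARE pct_cur CURSOR FAST_FORWARD FOR
       SELECT c1 FROM #dist ORDER BY c1

    OPEN pct_cur
    FETCH NEXT FROM pct_cur INTO @v
    WHILE @@FETCH_STATUS = 0 AND @i <= @lo
    BEGIN
       SET @i = @i + 1
       IF @i = @lo SET @prev = @v
       IF @i = @lo + 1
          SET @result = @prev + (@pos - @lo) * (@v - @prev)  -- interpolate between rows
       FETCH NEXT FROM pct_cur INTO @v
    END
    IF @result IS NULL SET @result = @prev  -- rank fell exactly on a row

    CLOSE pct_cur
    DEALLOCATE pct_cur

    SELECT Percentile = @result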

  • "Medians, Quartiles, Quintiles, Deciles, Percentiles. Moreover these measures are reported grouped by any number of columns (and I have never seen this capability anywhere else). In this context, the use of a cursor is mandatory!"

    The median as a measure of central tendency is okay when calculated in SQL Server. But as the median alone is not terribly useful, I generally prefer a specialized software package called Prophet, or I simply download the data to Excel for any statistical stuff I do. Because, as you have implied, the code becomes messy and might require a cursor (actually, I had never thought about calculating the things you've mentioned in SQL Server).

    As for gdefi:

    I agree, each one to his taste.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • "or simply download the data to Excel for any statistical stuff I do". Be very careful. First off, Excel can't handle more than 64,000 rows, so in many cases is not at all useful (whereas my SQL procedure is not constrained by numbers of rows or numbers of columns, and performs very fast for anything I have been able to throw at it). Second, take a look at this URL for far more learned opinions than I am qualified to give on Excel's infamous habit of giving wrong answers in many situations. Read the article, and consult the references section at the end to see what I mean. http://www.practicalstats.com/Pages/excelstats.html

  • I know Excel's limitations pretty well.

    I didn't mean that I download hundreds of thousands of rows. The accuracy doesn't improve significantly when you look at the whole population rather than at a sample of, say, 30+ observations. That is quite manageable for Excel.

    As for Excel's habit of giving wrong answers: my experience is that nobody is really interested in precise numbers as long as the desired trend is there.

    In fact, try explaining concepts like confidence intervals, the standard error of the sample mean, or finite population correction factors to a marketing-oriented, and therefore visually thinking, CEO. At best he doesn't listen to you. I know, because I've tried (once).

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Sorry, I forgot to add that when I say I use Excel, I don't mean that I also use Excel's built-in functions, which I agree are very limited. We use a third-party function library.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • The education database contains around 65,000 rows for each of 14 years. The users need the ability to compare, say, quartiles for several years by school type, and/or electorate, etc.

  • We're drifting off-topic.

    Last one from me. When we run projections on the whole portfolio of our insured persons, we deal with tens of millions of records. Yes, records, not rows, because none of the systems we've looked at uses a database; they use small flat files on the filesystem instead. These are highly specialized systems for asset-liability management. Nobody expects response times of less than one second, or even one hour. Usually we start the run before we leave, and it finishes sometime during the night, so the results are available the next morning. What I mean to say is: sometimes one is better off thinking outside the database and considering the use of a dedicated software package.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Just a small question:

    How are you handling the different data types (e.g. int, datetime, money, etc.) that your application would require?

    Either you use several columns, one for each required data type (which means a lot of NULL values within your table), or you will have to cast or convert the string value to the required data type.



    Bye
    Gabor

  • I'd have thought most money and datetime values are not relevant to lookup tables. All my lookup codes are int (or some variation thereof), char or varchar, or yes/no. Data types like money and datetime do not need to be looked up - the data value is what the user wants to see. Or am I missing something?

  • I think some people have got the wrong idea of a lookup table. It is not an entity per se. It holds discrete values and their meanings, which are generally used for populating screens and reports, i.e. expanding on codes used within the system, where the codes may be "meaningful" to a few people but mumbo-jumbo to the rest of the populace. E.g. CustomerType=1 may mean something to the cheque-clearance personnel at the bank's back office, but be utterly meaningless to the customer himself.

    Therefore, in line with Paul, all a lookup table really needs is varchar.
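
    To illustrate that expansion (the table and column names here are assumed, not taken from this thread), a report query simply joins through the lookup table:

    SELECT c.customer_id,
           l.description AS customer_type  -- expands e.g. CustomerType=1
    FROM Customers c
    JOIN CODES l
      ON l.code_type = 'CUSTOMER_TYPE'
     AND l.code = c.customer_type_code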

  • What are you doing with these lookups? Are you modelling business processes and/or enforcing referential integrity? Or are you populating drop-downs?

    You simply cannot sensibly combine all objects (in the OO sense, though not necessarily strictly) into one table. You could not combine int and char types. Would you mingle apples and cats?

    If you are populating a series of text drop-downs, but not validating real relationships, I suppose commingling would be OK. But as soon as you want to validate, you are back in the object sense, and they should be separate.

    Drop-downs and lookups are relationships-light. As soon as you want real relationships, combined lookup and code tables are a nightmare. (The sketch below shows why.)

    Even if the DRI is reasonable, which is worse: a bunch of tables and relationships tying to one table, or a bunch of tables tying to their own tables? I think people make too much of additional tables. Additional cross-the-diagram relationships are worse. A decent modelling tool will let you create diagrams that omit low-importance tables, so you won't be bothered by too many tables and their lookup tables.
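
    For illustration, a hypothetical sketch (names invented, building on the combined CODES table sketched earlier) of why validating against one combined table gets awkward: every referencing table has to carry and constrain the code type just to complete the composite foreign key.

    CREATE TABLE Customers (
       customer_id        int NOT NULL PRIMARY KEY,
       customer_type_code varchar(10) NOT NULL,
       -- extra column whose only job is to complete the composite key:
       code_type          varchar(20) NOT NULL
          CONSTRAINT DF_Customers_code_type DEFAULT 'CUSTOMER_TYPE'
          CONSTRAINT CK_Customers_code_type CHECK (code_type = 'CUSTOMER_TYPE'),
       CONSTRAINT FK_Customers_CODES
          FOREIGN KEY (code_type, customer_type_code)
          REFERENCES CODES (code_type, code)
    )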

    Bill
