A Lookup Strategy Defined

  • I am not going with the author or the mainstream here.

    As I'm too tired to write a pamphlet, and since my English vocabulary is very limited compared to that of the author of the reply in the link below, read this:

    http://groups.yahoo.com/group/sql_server7_forum/message/5731 

    It's a reply to this question

    http://groups.yahoo.com/group/sql_server7_forum/message/5726

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Joe Celko is very much a purist when it comes to relational databases. If you have ever tried to adopt the code examples he provides in his books, you'll soon realize that they tend to run very slowly, if at all. In his "SQL for Smarties" book he provides several different ways of deriving median values. Some of this code doesn't work at all, while other code runs excruciatingly slowly (he doesn't approve of the use of cursors, which IMHO are the only practical way of addressing this particular problem; a sketch of what I mean follows).
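
    For illustration only, here is a minimal sketch of the kind of cursor-based median I mean (this is not Celko's code; the temp table #dist and the variable names are my own, and it assumes a single int column):

    DECLARE @n int, @target int, @i int, @prev int, @v int, @median decimal(10,2)

    SELECT @n = COUNT(*) FROM #dist
    SET @target = (@n + 1) / 2  -- position of the (lower) middle row
    SET @i = 0

    DECLARE med_cur CURSOR FAST_FORWARD FOR
       SELECT c1 FROM #dist ORDER BY c1

    OPEN med_cur
    FETCH NEXT FROM med_cur INTO @v
    WHILE @@FETCH_STATUS = 0 AND @i <= @target
    BEGIN
       SET @i = @i + 1
       IF @i = @target SET @prev = @v  -- the (lower) middle value
       IF @i = @target + 1 AND @n % 2 = 0
          SET @median = (@prev + @v) / 2.0  -- even count: average the two middle values
       FETCH NEXT FROM med_cur INTO @v
    END
    IF @n % 2 = 1 SET @median = @prev  -- odd count: the middle row itself

    CLOSE med_cur
    DEALLOCATE med_cur

    SELECT Median = @median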

  • A cursor for calculating the median (financial or statistical?)??? Here is a set-based way:

    CREATE TABLE #dist (c1 int)

    INSERT INTO #dist VALUES (2)
    INSERT INTO #dist VALUES (3)
    INSERT INTO #dist VALUES (1)
    INSERT INTO #dist VALUES (4)
    INSERT INTO #dist VALUES (8)
    INSERT INTO #dist VALUES (9)

    -- The HAVING clause keeps the row whose rank from the bottom is
    -- (COUNT(*)+1)/2, i.e. the (lower) middle value. For an even number of
    -- rows the CASE averages it with the smallest value above it.
    SELECT Median=CASE COUNT(*)%2
       WHEN 0 THEN  -- Even number
        (d.c1+MIN(CASE WHEN i.c1>d.c1 THEN i.c1 ELSE NULL END))/2.0
       ELSE d.c1 END  -- Odd number
    FROM #dist d CROSS JOIN #dist i
    GROUP BY d.c1
    HAVING COUNT(CASE WHEN i.c1 <= d.c1 THEN 1 ELSE NULL END)=(COUNT(*)+1)/2

    DROP TABLE #dist

    Median
    -------------------
    3.500000

    Or, if there are duplicate values:

    set nocount on

    CREATE TABLE #dist (c1 int)

    INSERT #dist VALUES (2)
    INSERT #dist VALUES (2)
    INSERT #dist VALUES (1)
    INSERT #dist VALUES (5)
    INSERT #dist VALUES (5)
    INSERT #dist VALUES (9)

    -- Same idea, but the HAVING conditions are relaxed (>=) so a qualifying
    -- middle row is still found when duplicates make an exact rank match
    -- impossible. The first ISNULL argument returns d.c1 itself when d.c1
    -- covers the middle position(s) on its own (odd count, or duplicates
    -- spanning the middle); otherwise d.c1 is averaged with the next higher
    -- value.
    SELECT Median=ISNULL(CASE WHEN COUNT(CASE WHEN i.c1<=d.c1 THEN 1 ELSE NULL END) > COUNT(*)/2 THEN 1.0*d.c1 ELSE NULL END,
                  (d.c1+MIN(CASE WHEN i.c1>d.c1 THEN i.c1 ELSE NULL END))/2.0)
    FROM #dist d CROSS JOIN #dist i
    GROUP BY d.c1
    HAVING (COUNT(CASE WHEN i.c1 <= d.c1 THEN 1 ELSE NULL END) >= (COUNT(*)+1)/2)
    AND (COUNT(CASE WHEN i.c1 >= d.c1 THEN 1 ELSE NULL END) >= COUNT(*)/2+1)
    GO

    drop table #dist

    Median
    ----------------
    3.5

    There are many other possible ways; none of them utilizes a cursor. One example is sketched below.
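
    For instance, here is one set-based alternative, sketched for illustration (this sketch is an addition, not part of the original post; it assumes a release that supports TOP n PERCENT with ORDER BY in derived tables, such as SQL Server 2000):

    -- Largest value of the bottom half plus smallest value of the top half,
    -- divided by two. For an odd number of rows both halves share the middle
    -- row, so the result is simply that row's value.
    SELECT Median =
       ( (SELECT MAX(c1)
          FROM (SELECT TOP 50 PERCENT c1 FROM #dist ORDER BY c1) AS BottomHalf)
       + (SELECT MIN(c1)
          FROM (SELECT TOP 50 PERCENT c1 FROM #dist ORDER BY c1 DESC) AS TopHalf)
       ) / 2.0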

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • I have a lot of respect for Joe Celko. However, I bend the rules sometimes, and look-up tables are precisely where I bend them. The resulting CODES table, in use for the past four years, has stayed at 105 rows, holds 14 types of codes, and has rarely changed. It holds things like hearing types, action types, court types, plea types, legislation types, jurisdiction types, and the like (you've guessed it: it is an application that takes care of legal processes). Each one to his taste, I suppose.
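
    For illustration, a hypothetical sketch of what such a combined CODES table might look like (the column names here are guesses, not the actual schema):

    CREATE TABLE CODES (
       code_type   varchar(20)  NOT NULL,  -- e.g. 'HEARING_TYPE', 'PLEA_TYPE'
       code        varchar(10)  NOT NULL,  -- the value stored in the data tables
       description varchar(100) NOT NULL,  -- what screens and reports display
       CONSTRAINT PK_CODES PRIMARY KEY (code_type, code)
    )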

  • I was over-simplifying when I wrote "deriving median values". I have written a stored procedure which does any of the following, depending on the parameters passed: medians, quartiles, quintiles, deciles, or percentiles. Moreover, these measures are reported grouped by any number of columns (and I have never seen this capability anywhere else). In this context, the use of a cursor is mandatory! (A sketch of the idea follows.)
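
    For illustration only, a minimal sketch of how a cursor can return an arbitrary percentile for a single group (this is not the actual procedure; #dist, the variable names, and the linear-interpolation definition of the percentile are assumptions). Quartiles, quintiles, and deciles just use different values of @p, and grouping means repeating the loop per group:

    DECLARE @p decimal(5,4), @n int, @pos decimal(9,4), @lo int,
            @i int, @v int, @prev int, @result decimal(12,4)

    SET @p = 0.25  -- 0.25 = first quartile, 0.5 = median, 0.1 = first decile, ...
    SELECT @n = COUNT(*) FROM #dist
    SET @pos = 1 + (@n - 1) * @p  -- fractional rank of the percentile
    SET @lo = FLOOR(@pos)         -- row at or just below that rank

    SET @i = 0
    DECLARE pct_cur CURSOR FAST_FORWARD FOR
       SELECT c1 FROM #dist ORDER BY c1

    OPEN pct_cur
    FETCH NEXT FROM pct_cur INTO @v
    WHILE @@FETCH_STATUS = 0 AND @i <= @lo
    BEGIN
       SET @i = @i + 1
       IF @i = @lo SET @prev = @v
       IF @i = @lo + 1
          SET @result = @prev + (@pos - @lo) * (@v - @prev)  -- interpolate between rows
       FETCH NEXT FROM pct_cur INTO @v
    END
    IF @result IS NULL SET @result = @prev  -- rank fell exactly on a row

    CLOSE pct_cur
    DEALLOCATE pct_cur

    SELECT Percentile = @result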

  • "Medians, Quartiles, Quintiles, Deciles, Percentiles. Moreover these measures are reported grouped by any number of columns (and I have never seen this capability anywhere else). In this context, the use of a cursor is mandatory!"

    The median as a measure of central tendency is okay when calculated in SQL Server. But as the median alone is not terribly useful, I generally prefer a specialized software package called Prophet, or I simply download the data to Excel for any statistical stuff I do. Because, as you have implied, the code becomes messy and might require a cursor (actually, I had never thought about calculating the things you've mentioned in SQL Server).

    As for gdefi:

    I agree, each one to his taste.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • "or simply download the data to Excel for any statistical stuff I do". Be very careful. First off, Excel can't handle more than 64,000 rows, so in many cases is not at all useful (whereas my SQL procedure is not constrained by numbers of rows or numbers of columns, and performs very fast for anything I have been able to throw at it). Second, take a look at this URL for far more learned opinions than I am qualified to give on Excel's infamous habit of giving wrong answers in many situations. Read the article, and consult the references section at the end to see what I mean. http://www.practicalstats.com/Pages/excelstats.html

  • I know Excel's limitations pretty well.

    I didn't mean that I download hundreds of thousands of rows. The accuracy doesn't improve significantly when you look at the whole population rather than at a sample of, say, 30+ observations. That is quite manageable for Excel.

    As for Excel's habit of giving wrong answers: my experience is that nobody is really interested in precise numbers as long as the desired trend is there.

    In fact, try explaining concepts like confidence intervals, the standard error of the sample mean, or finite population correction factors to a marketing-oriented, and therefore visually thinking, CEO. At best he doesn't listen to you. I know, because I've tried (once).

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Sorry, I forgot to add that when I say I use Excel, I don't mean that I also use Excel's built-in functions, which I agree are very limited. We use a third-party function library.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • The education database contains around 65,000 rows for each of 14 years. The users need the ability to compare, say, quartiles for several years by school type, and/or electorate, etc.

  • We're drifting off-topic.

    Last one from me. When we run projections on the whole portfolio of our insured persons, we deal with tens of millions of records. Yes, records, not rows, because none of the systems we've looked at uses a database; they use small flat files on the filesystem instead. These are highly specialized systems for asset-liability management. Nobody expects response times of less than one second, or even one hour. Usually we start the run before we leave, and it finishes sometime during the night, so the results are available the next morning. What I mean to say is: sometimes one is better off thinking outside the database and considering the use of a dedicated software package.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • Just a small question:

    How are you handling the different data types (e.g. int, datetime, money, etc.) that your application would require?

    Either you use several columns, one for each required data type (which means a lot of NULL values within your table), or you will have to cast or convert the string value to the required data type.



    Bye
    Gabor

  • I'd have thought most money and datetime values are not relevant to lookup tables. All my lookup codes are int (or some variation thereof), char or varchar, or yes/no. Data types like money and datetime do not need to be looked up - the data value is what the user wants to see. Or am I missing something?

  • I think some people have got the wrong idea of a lookup table. It is not an entity per se. It holds discrete values and their meanings, which are generally used for populating screens and reports, i.e. expanding on codes used within the system, where the codes may be "meaningful" to a few people but mumbo-jumbo to the rest of the populace. E.g. CustomerType=1 may mean something to the cheque-clearance personnel at the bank's back office, but be utterly meaningless to the customer himself.

    Therefore, in line with Paul, all a lookup table really needs is varchar.
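
    To illustrate that expansion (the table and column names here are assumed, not taken from this thread), a report query simply joins through the lookup table:

    SELECT c.customer_id,
           l.description AS customer_type  -- expands e.g. CustomerType=1
    FROM Customers c
    JOIN CODES l
      ON l.code_type = 'CUSTOMER_TYPE'
     AND l.code = c.customer_type_code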

  • What are you doing with these lookups? Are you modelling business processes and/or enforcing referential integrity? Or are you populating drop-downs?

    You simply cannot sensibly combine all objects (in the OO sense, though not necessarily strictly) into one table. You could not combine int and char types. Would you mingle apples and cats?

    If you are populating a series of text drop-downs, but not validating real relationships, I suppose commingling would be OK. But as soon as you want to validate, you are back in the object sense, and they should be separate.

    Drop-downs and lookups are relationships-light. As soon as you want real relationships, combined lookup and code tables are a nightmare. (The sketch below shows why.)

    Even if the DRI is reasonable, which is worse: a bunch of tables and relationships tying to one table, or a bunch of tables tying to their own tables? I think people make too much of additional tables. Additional cross-the-diagram relationships are worse. A decent modelling tool will let you create diagrams that omit low-importance tables, so you won't be bothered by too many tables and their lookup tables.
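
    For illustration, a hypothetical sketch (names invented, building on the combined CODES table sketched earlier) of why validating against one combined table gets awkward: every referencing table has to carry and constrain the code type just to complete the composite foreign key.

    CREATE TABLE Customers (
       customer_id        int NOT NULL PRIMARY KEY,
       customer_type_code varchar(10) NOT NULL,
       -- extra column whose only job is to complete the composite key:
       code_type          varchar(20) NOT NULL
          CONSTRAINT DF_Customers_code_type DEFAULT 'CUSTOMER_TYPE'
          CONSTRAINT CK_Customers_code_type CHECK (code_type = 'CUSTOMER_TYPE'),
       CONSTRAINT FK_Customers_CODES
          FOREIGN KEY (code_type, customer_type_code)
          REFERENCES CODES (code_type, code)
    )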

    Bill
