April 27, 2005 at 1:40 pm
Hi everyone.
I'm having trouble to get a MDX-statement to work and I hope somebody around here may point me to some helpful tutorials or similar stuff.
To start with, I'm a statistican who is quite familar with SQL queries but I've not managed yet to create a MDX-statement to calculate a specific computed measure. Let me explain shortly the setup of my (very simple) data-cube. It consists of only 6 dimensions. Those are:
- region (9 parameter values, star-scheme)
- age (10 parameter values, star-scheme)
- education (4 parameter values, star-scheme)
- nationality (4 parameter values, star-scheme)
- labour-force status (4 parameter values, star-scheme)
- sex (2 parameter values, star-scheme)
The population is divided in a lot of subgroups, which means every possible combination of parameter values of the dimensions may be considered as such a subgroup. Example: Subgroup "d" is defined as all objects, that live in region '1', their labour force status is '2' and their sex is '1'.
With data coming from a sample survey, the total domain size N_d in any of the possible subgroups G_d is estimated with the help of some (known) weights w. To make a long story short, the values in the cells of the data-cube are point estimates \hat N_d for N_d, the total domain size of group "d" in the population.
My task is now, to compute a variance estimator with the following formuar (1) by using a MDX-Statement. The variance for a small domain "d" within a specific region "r" is given by:
Var(\hat N_d) = \sum_{r=1}^{9} (N_r)^2 * (1-f_r) * (pd (1-pd))/(n_r-1) (1)
where
- N_r ... total population size (estimated domain size for region r)
- n_r ... sample size for region r
- f_r ... "sample fraction", which can be calculated as: n_r / N_r
- nsd_r ... number of sample objects in the selected subgroup "d" in region r
- pd ... nsd_r / n_r
I should note, that all values (except nsd_r) could be hardcoded into formula (1). Anyway, my problem is that I have no clue how to create an MDX-dtatement that calculates those values simultanely for all the cells that are returned as a result set from the data-cube when it is processed. The biggest problem that I have is, that I don't know if I can use information stored in a table that is neither a dimension, nor a measure to compute calculated measures. If this was possible, I think implementing formula (1) isn't that easy. I could store all the constants in a separate table and most of the work would be to get the number nsd_r out of the cube...
I'd be very glad if someone has some hints for me. Really, and I'd happily pay lots of beer of stuff for the one who guides me to a successful MDX-statement
April 28, 2005 at 12:48 am
Mmmm, beer....
Can you dumb it down a little for me? One thing you haven't given is the measures within your cube.
Assuming (always dangerous) that you have:
1. Total Population size (int?)
2. Sample Size (int)
3. Count of sample objects (int?)
then I *think* it shouldn't be "too" hard to do.
a pseudo code approach could be (assuming you're testing in the MDX sample app with a view to after it's right putting it into the cube as a perm calc member):
create calc member for [Measures].[f_r] = [measures].[n_r] / [measures].[N_r]
create calc member for [Measures].[pd] = [measures].[nsd_r] / [measures].[n_r]
create calc member for variance = {I don't know how to read the math symbols } so roughly
{not sure of the \sum_{r=1} ^ 9 bit} * Power([measures].[N_r], 2) * (1 - [measures].[f_r]) * ([measures].[pd] * (1 - [measures].[pd])) / ([measures].[n_r] - 1)
The premise here is that if you can create the general calculation, as you "filter" to particular regions 9r) in the region dim this will then "input" the correct measures values to the equation for that region. Same thing happens as you filter the other dimensional attributes.
BTW, are you in Aus?
Cheers,
Steve.
April 28, 2005 at 3:42 am
Hi Steve, first off, thanks a lot for your answer.
Sorry about the measures. I currently have two measures.
- estimated domain size (FLOAT) --> could be cut off to be an integer values for sure
- nsd_r (as INTEGER). I managed to use get those values by adding a column to the fact table with default value '1' and defining the measure as sum over the current subgroup. this works quite well..
As for the assumptions.
1) The total population size may be hardcoded since it never changes. (lets put it 8000000)
2) The sample size for region_r doesn't change either and may be hardcoded. I have one general problem. When I aggregate the key and create a view by choosing specific dimensional values I do get the needed values. For example. If I use only the dimension variable region, then the measure nsd_r gives me the sample size n_r in region r. However, I don't know how I can access this value within an mdx-statement because for the variance-formula I have to check if a certain subgroup belongs to region r (if yes -> fetch the values nsd_r, N_r, n_r and calculate the variance. if not -> look to which region nsd_r belongs, then fetch the needed values for computation of the variance). I know this is a very general problem, I didn't find anything where this was explained, though
3) yep. these values are given by measure nsd_r. If I view the cube, this measure returns the number of sample objects in subgroup d.
As for your suggestion. Do you mean I should first try to create measures n_r, N_r and f_r that contain the needed values independent from the currently applied filter? that means, that N_1 always returns the total population size of region 1 regardless what filters are applied as well?
And do you mean that once this measures are created, I can calculate the members as you suggested and try to implement the variance formula?
To answer your last question, I'm currently here in Austria (damn good beer by the way... ) but have quite a lot of connections to aus and i've been there quite a few times already...
April 28, 2005 at 12:40 pm
update:
I created 3 columns in my fact table: N_r, n_r, f_r
so that the setup of my fact table is as follows:
-----------------------------------------------------------------------------------------------------------------------------------------------
| fact_id | fk_dim_region | fk_dim_age | fk_dim_lfc | fk_dim_lfc | fk_dim_nationality | fk_dim_education | weight | counter | N_r | n_r | f_r |
-----------------------------------------------------------------------------------------------------------------------------------------------
- the column counter is all 1, and when used as measure, it returns the number of objects in subgroup d (=nsd)
- the column weight contains the sample weight for each object and is used to calculate the point estimate. (= SUM over all weights for all objects in a certain group)
- the column N_r is filled with the total domain size of region r whenever fk_dim_region has value 'r',...
- the column n_r is filled with the total sample size of region r whenever the foreign key fk_dim_region has the value 'r'
- the column f_r is filled with the value n_r / N_r
Then I added these colums of the fact_table as new dimensions N_r, n_r and f_r in the cube using the wizard within the analysis-manager. Note that I had to select "MAX" as aggregate function and not the default setting "SUM".
So far so good. With these newly created measures, I was able to compute some more measures, needed for the estimator of the variance as described above. Thus, I created a measure called "pd" as follows
[Measures].[counter] / [Measures].[n_r]
Using the measure "ps" I could easily create the measure "var" as:
([Measures].[N_r]*[Measures].[N_r]) * ((1 - [Measures].[f_r])/[Measures].[n_r]) *
([Measures].[pd] * (1 - [Measures].[pd]))
The measure "var" contains almost the values I wanted to calculate. Almost, 2 problems are still existing.
1. problem: The ultimate goal is to compute the "rMSE in %" as
100 * (sqrt([Measures].[var]) - [Measures].[weight])
This would be quite easy IF a function sqrt() was available. Unfortunately it seems, that such a function is not available when you try to compute a new measure within the analysis manager. Anyone knows how to archieve this goal?
2. problem: While the values computed in measure "var" are correct whenever the dimension "region" is used as a filter, it doesn't return the correct value variance when region is not set as filter or dimension variable. The true value (already formulated in rMSE in %) for the domain total (summation over all regions r) should be:
(1) 100 * (sqrt([Measures].[var].[region1]) + ... + sqrt([Measures].[var].[regionr]) / [Measures].[weight])
My problem is, that I don't know how to formulate
[Measures].[var].[regionr]
correctly. The interpretation what this value should be becomes clear when a simple example is considered:
Example: Task is to estimate the variance of the number of all male people in the population that are unemployed.
The variance estimator would be.
var (males unemployed in region 1) + ... + var (males unemployed in region r) =
[Measures].[var].[region1] + ... + [Measures].[var].[regionr]
The estimator for the rmse (%) would be:
(2) 100 * (sqrt([Measures].[var].[region1] ) + ... + sqrt([Measures].[var].[regionr] ) / [Measures].[weight] )
My problem is, that I don't know how to combine formulas (1) and (2). formula (1) may be used, whenever a region r is used as a filter or when region is used as dimension variable.
Formula (2) should be used, when the variance is calculated for the overall domain size (summation over all regions r). I have no clue, if this is even possible....
Thanks a lot for your support. And by the way, the beer here in Austria is really, really good
April 28, 2005 at 5:04 pm
Will look to provide ore MDX later today but in te meantime, you can use the SQRT function "natively" within MDX. If you open BOL and use the index, type in "excel functions" this will gie you the list of Excel functions that are loaded with the excel function library on the AS server when the server starts.
La Biere de Belgie c'est magnifique!
Steve.
Viewing 5 posts - 1 through 5 (of 5 total)
You must be logged in to reply to this topic. Login to reply