I have source data (health data) across three tables with columns as follows:
The column lengths range from varchar(56) to varchar(400), with most columns varchar(80) or varchar(150).
The three tables contain 201M, 18M, and 2M rows respectively.
If I select distinct on these columns, the volumes drop significantly (roughly 200K to 1.8M rows, depending on the column).
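For context, those distinct counts come from simple profiling queries along these lines (table and column names here are placeholders, not my real schema):

```sql
-- Hypothetical profiling query: compare raw vs. distinct counts per name column
SELECT COUNT(*)                   AS total_rows,
       COUNT(DISTINCT LAST_NAME)  AS distinct_last_names,
       COUNT(DISTINCT FIRST_NAME) AS distinct_first_names
FROM SOURCE_TABLE_1;
```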
I'd like advice on modelling this as a normalized data store (NDS) and dimensional data store (DDS)...
Option 1: Put ALL name data into a single NAME column with a surrogate key (SK). The drawback is that the column must be as long as the longest source column (varchar(400)).
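A minimal sketch of what I mean by Option 1 (table and constraint names are illustrative):

```sql
-- One generic name table: every name value, regardless of role, lands here
CREATE TABLE NDS_NAME (
    NAME_SK INT IDENTITY(1,1) PRIMARY KEY,
    NAME    VARCHAR(400) NOT NULL,        -- must fit the longest source column
    CONSTRAINT UQ_NDS_NAME UNIQUE (NAME)  -- one row per distinct name value
);
```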
Option 2: Create NDS tables for FIRST/MIDDLE/GIVEN/LAST names. Combine data across tables as appropriate. I'd prefer to break up given names into first/middle names, but that may be prone to error - perhaps I could just treat that as another first name? Keeping these columns separate, the column length could be better matched to the source data, i.e. "short" first names and "long" last names.
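Option 2 would look something like this instead, with lengths matched to the source data (the lengths shown are illustrative):

```sql
-- One table per name role, sized to fit that role's data
CREATE TABLE NDS_FIRST_NAME (
    FIRST_NAME_SK INT IDENTITY(1,1) PRIMARY KEY,
    FIRST_NAME    VARCHAR(80) NOT NULL UNIQUE   -- "short" first names
);

CREATE TABLE NDS_LAST_NAME (
    LAST_NAME_SK INT IDENTITY(1,1) PRIMARY KEY,
    LAST_NAME    VARCHAR(150) NOT NULL UNIQUE   -- "long" last names
);
-- ...and similarly for MIDDLE and GIVEN names
```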
DDS Option 1: Using either NDS option above, create a DIM_PERSON table containing FIRST/MIDDLE/GIVEN/LAST name via the NDS SKs, plus other demographics such as GENDER, DOB, etc. However, if I use the NDS SKs, the fact table ends up in a snowflake rather than a star schema.
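In other words, something like this, where the name columns are foreign keys back into the NDS (hence the snowflake; assumes the Option 2 NDS tables above):

```sql
-- Dimension holds SKs, not the names themselves; resolving a name
-- requires a join back to the NDS tables
CREATE TABLE DIM_PERSON (
    DIM_PERSON_SK INT IDENTITY(1,1) PRIMARY KEY,
    FIRST_NAME_SK INT REFERENCES NDS_FIRST_NAME (FIRST_NAME_SK),
    LAST_NAME_SK  INT REFERENCES NDS_LAST_NAME (LAST_NAME_SK),
    GENDER        CHAR(1),
    DOB           DATE
);
```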
DDS Option 2: Using either NDS option, create a DIM_PERSON table containing the FIRST/MIDDLE/GIVEN/LAST names themselves rather than the SKs, plus the same demographics, with one row per unique combination of columns. But if I take this approach, I'm not sure what benefit the NDS provides. Perhaps I should just create a staging table whose name columns match the schema of the DIM table, then upsert into the DIM from there?
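The staging/upsert step I have in mind would be roughly this (T-SQL MERGE; STG_PERSON is a hypothetical staging table and the column list is abbreviated):

```sql
-- Insert only combinations of name + demographics not already in the dimension
MERGE DIM_PERSON AS tgt
USING STG_PERSON AS src
      ON  tgt.FIRST_NAME = src.FIRST_NAME
      AND tgt.LAST_NAME  = src.LAST_NAME
      AND tgt.GENDER     = src.GENDER
      AND tgt.DOB        = src.DOB
WHEN NOT MATCHED BY TARGET THEN
    INSERT (FIRST_NAME, LAST_NAME, GENDER, DOB)
    VALUES (src.FIRST_NAME, src.LAST_NAME, src.GENDER, src.DOB);
```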
DIM_PERSON will not be SCD2 - it is just the unique combination of all names + demographics across all the source tables.
Bind DIM_PERSON_SK into the fact table by looking it up against the existing data.
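That is, during the fact load, resolve the SK by joining on the natural columns, something like this (FACT_EVENT and STG_EVENT are hypothetical names for my fact and staging tables):

```sql
-- Replace the natural name/demographic columns with the dimension SK
INSERT INTO FACT_EVENT (DIM_PERSON_SK, EVENT_DATE)
SELECT dp.DIM_PERSON_SK, s.EVENT_DATE
FROM   STG_EVENT s
JOIN   DIM_PERSON dp
       ON  dp.FIRST_NAME = s.FIRST_NAME
       AND dp.LAST_NAME  = s.LAST_NAME
       AND dp.GENDER     = s.GENDER
       AND dp.DOB        = s.DOB;
```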
From the dim and fact tables, I'll eventually create very wide data mart table(s), either indexed view or physical table. The DM requirements are query results (data exports) rather than ad hoc analytics.
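For the indexed-view variant, I'm picturing something like the following (SQL Server syntax; SCHEMABINDING, two-part names, and a unique clustered index are required to materialize the view):

```sql
-- Wide, pre-joined mart view for data exports
CREATE VIEW dbo.VW_PERSON_MART
WITH SCHEMABINDING
AS
SELECT f.FACT_SK, d.FIRST_NAME, d.LAST_NAME, d.GENDER, d.DOB, f.EVENT_DATE
FROM   dbo.FACT_EVENT f
JOIN   dbo.DIM_PERSON d
       ON d.DIM_PERSON_SK = f.DIM_PERSON_SK;
GO
-- The unique clustered index is what physically materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_VW_PERSON_MART
    ON dbo.VW_PERSON_MART (FACT_SK);
```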
The design goals for this exercise are data storage optimization, query performance, and ease of maintenance/enhancement as new data sources come online or old ones change.
Your thoughts? The fact that Option 1 is atypical makes me wonder if it's a really bad idea ;-). But if I break it down, a name is a name is a name; in the NDS, should I really care whether it's first/middle/given/last/certifier/maiden name? Should I care whether "Alex" or "Sam" or "Smith" or "Jones" is the father's or mother's name within the NDS? I'll reconstitute the data using SKs in the DDS dim and fact tables.
P.S.: I note that the Kimball forums are closed down. I hope Ralph and family are enjoying retirement. Are there other forums you can recommend for posting data modelling questions? Or is this one as good as any?