Healthcare Data Warehouse - Starting new - small scale

  • Are there any members here in Healthcare? I work a lot with files that we get monthly from insurance companies and CMS. We get files on a monthly basis. Some files already have a column to indicate the month/date of the report file.

    I want to start loading these monthly files into a database here are my questions:

    1. Would you build separate DB's for each source (Blue Cross, CMS, Aetna, etc)?
    2. Would you build one DB and then have tables prefixed with each payor and then what the table represents? (such as BCBSM_Claims, BCBSM_Gaps, etc)
    3. For files that come monthly but do not have a column with month/date, I was going to just add one.
    4. Do I need an additional column to act as a primary key/ID?

    Right now all the source data is either in comma or tab delimited files.

    Thank you in advance.

    -Mark
    MSSQL 2019 Standard, Azure Hosted. Techie/Sysadmin by trade; Three years as a "DBA" now.

  • Barring any additional information, I would be inclined to go with a third option that you didn't mention.  Have one table with a field that indicates source.  Otherwise you are going to have essentially the same tables for each possible source.  That's a lot of duplication of effort and also makes it more troublesome to query.

    If files don't already have a month/date, I would add one.

    The primary key depends on what information you are receiving from the sources.  If they all have a key that can be used as a primary key, then go ahead and use that as part of your primary key.  You will have to include the source as the rest of the primary key if you do put them all into the same table.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    -Mark
    MSSQL 2019 Standard, Azure Hosted. Techie/Sysadmin by trade; Three years as a "DBA" now.

  • usererror - Tuesday, August 28, 2018 10:48 AM

    Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    yeah, Drew's a very smart guy, but I actually started laughing at the thought of the different insurers and CMS giving you the same format. They can't even do that when they're trying to.

    Think about what you're trying to achieve. Are you managing a population of membership? Then staging area for each file you get, same or different DBs depends on how long you keep the file archived there, how much traffic, etc. Pull all that data together into one location where it makes sense to do comparisons. For example, you can't enable someone to find the high-utilizers of some particular drug if you don't have Rx data somewhere that can be compared to membership. If your claims are separated from the pharmacy, then you can't see drug patterns for re-admission members in the hospital, or folks who shop around from ER to ER to find narcotics.

    Find out what your users are going to look for, and design for that, plus ease of import.

    And decide on a scope up front, this could easily get out of control, make sure you know what you're getting yourself into.

    -------------------------------------------------------------------------------------------------------------------------------------
    Please follow Best Practices For Posting On Forums to receive quicker and higher quality responses

  • jonathan.crawford - Tuesday, August 28, 2018 11:04 AM

    ...I actually started laughing at the thought of the different insurers and CMS giving you the same format. They can't even do that when they're trying to...

    I completely agree! One of my biggest frustration as a newbie to this realm is the lack of consistency on file formats, headers, anything. Right now I just have all of the CCLF files, for example, from CMS in their original text source and I import them when I need to. However, according to CMS you're not allowed to store the CCLF files in their original format for more than 1 year.

    I'd like to start with just one payor who is small. I get the following files:

    Payor A:
    Members
    Facility Claims
    Professional Claims
    Rx Claims
    Gaps

    The payors seem to follow that pattern, but others send significantly more files.  Would it make sense to do a database for each payor then?

    -Mark
    MSSQL 2019 Standard, Azure Hosted. Techie/Sysadmin by trade; Three years as a "DBA" now.

  • usererror - Tuesday, August 28, 2018 10:48 AM

    Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    They do not need to have the same set of columns to be stored in the same table.  You just store whatever information you have for each of the sources and leave the rest of the columns NULL.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • 1. Would you build separate DB's for each source (Blue Cross, CMS, Aetna, etc)?
    2. Would you build one DB and then have tables prefixed with each payor and then what the table represents? (such as BCBSM_Claims, BCBSM_Gaps, etc)

    Possibly start with one DB for the raw data, I would use schemas instead of table name prefixes though.  Using schemas instead of prefixes makes organization, access rights etc... much easier.  How is the data going to be used long term?  Are people just looking at the raw data, are you going to build some kind of reporting on top of it or merge it with internal data?

    3. For files that come monthly but do not have a column with month/date, I was going to just add one.

    Yeah that's fine, even a date stamp with the load date would be helpful.

    4. Do I need an additional column to act as a primary key/ID?

    That depends, does the data have natural keys in it?

  • drew.allen - Tuesday, August 28, 2018 3:03 PM

    usererror - Tuesday, August 28, 2018 10:48 AM

    Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    They do not need to have the same set of columns to be stored in the same table.  You just store whatever information you have for each of the sources and leave the rest of the columns NULL.

    Drew

    Interesting, I had not thought of that. I thought that would possibly be frowned upon because there would be "holes" in rows, but it would keep all the data together in one table.

    -Mark
    MSSQL 2019 Standard, Azure Hosted. Techie/Sysadmin by trade; Three years as a "DBA" now.

  • usererror - Wednesday, August 29, 2018 9:48 AM

    drew.allen - Tuesday, August 28, 2018 3:03 PM

    usererror - Tuesday, August 28, 2018 10:48 AM

     Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    They do not need to have the same set of columns to be stored in the same table.  You just store whatever information you have for each of the sources and leave the rest of the columns NULL.

    Drew

    Interesting, I had not thought of that. I thought that would possibly be frowned upon because there would be "holes" in rows, but it would keep all the data together in one table.

    It really depends on the data, how much over lap is there in the data?  Is it even structured the same, for example if one source has denormalized data it might make no sense what so ever?

  • usererror - Wednesday, August 29, 2018 9:48 AM

    drew.allen - Tuesday, August 28, 2018 3:03 PM

    usererror - Tuesday, August 28, 2018 10:48 AM

    Drew,

    I should have added that each source set of data has a different set of columns. They do not have any kind of primary key already.

    They do not need to have the same set of columns to be stored in the same table.  You just store whatever information you have for each of the sources and leave the rest of the columns NULL.

    Drew

    Interesting, I had not thought of that. I thought that would possibly be frowned upon because there would be "holes" in rows, but it would keep all the data together in one table.

    Yeah, depends on the DBA and use cases. In theory, it's nice to go to a structure where multiple data sources are being cleaned and conformed into one dataset where all the keys can be logged and re-assigned through the DW itself. This is ideally the goal, but often, you find that not ever dataset contains the same common dimensions. Thus, you are left with a wide table with lots of nulls depending on the dataset. 

    For example, I've seen 100 field tables where like 1 million records will have 20 out of the 50 dimensions filled, rest NULL, then another 1 million records where 10 of the 50 are filled, rest NULL. You can tell whoever designed this could not find any common dimensions and just kept stacking data sources into the table. It will grow in fields every new data source they added. This is bad.

    In another example like with my data warehouse, we have 100 fields. Majority of each field has a value. Sometimes with certain data sources, they may have 5 out of 100 that are NULL depending on the conditions of the capture. It's very sporadic, where the NULL's actually have more value to the end user versus it's 100% always going to be NULL. This is fine.

    When approaching this, look towards what that final data model would look like if you can conform it all to one table. If you have a lot of common dimensions and can conform them all to one Fact, see if that makes sense and see if you can answer the data questions you want to answer with it. In cases where you have a lot of diverse data sources, then think about separate Fact's per source or if you can, try to categorize them or bin them.

    For example, I work in advertising. We store data mostly by advertising channel. CMS is included. CMS is just another source for us. It's not uncommon to have like Paid Search (Google Search) in one table, Display (Banner Ads) in another table, and CMS in another table. Course, now that data analytics is so common and people look at things across channel, most of these channels have the same common dimensions now so they can all fit into one nice model with fewer NULL's. But that's not always the case.

  • Thanks, xsevensinzx, and others.

    I think as my first step I need to do is look at the headers of the source files and see what overlaps. Would others agree?

    -Mark
    MSSQL 2019 Standard, Azure Hosted. Techie/Sysadmin by trade; Three years as a "DBA" now.

  • usererror - Friday, August 31, 2018 8:21 AM

    Thanks, xsevensinzx, and others.

    I think as my first step I need to do is look at the headers of the source files and see what overlaps. Would others agree?

    Yes doing some basic analysis of the files would be a required first step.  As well as talking to the sources if possible.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply