Data Warehouse Design Question

  • Hello All,

    I have a design question.

    We currently host a data warehouse solution for a single EMR application. We have approximately 15 customers in total, and that number is growing quickly. Each customer has their own EMR database that we extract data from. Our current structure is two databases: one for staging and one for the warehouse ("house").

    Each fact and dimension table, along with the staging tables, separates customers by CustomerID. We secure each customer's data via a filter applied when they connect; no customer should see any other customer's data. We do not aggregate any of the data across customers for comparisons. Everything is simply stored in the same tables, the filter kicks in at connection time, and the BI tool shows each customer their own dashboards.
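    Roughly, the filtering works like the sketch below (the table, column, and login names are simplified placeholders, not our real schema): each login maps to a CustomerID, and a view applies the predicate before the BI tool ever sees a row.

      -- Simplified illustration only; the real schema and login mapping differ.
      -- Map each customer login to its CustomerID.
      CREATE TABLE dbo.CustomerLogin
      (
          LoginName  sysname NOT NULL PRIMARY KEY,
          CustomerID int     NOT NULL
      );
      GO

      -- The BI tool queries this view instead of the base fact table, so each
      -- session only ever sees rows belonging to its own CustomerID.
      CREATE VIEW dbo.vFactEncounter
      AS
      SELECT f.*
      FROM dbo.FactEncounter AS f
      JOIN dbo.CustomerLogin AS cl
          ON cl.CustomerID = f.CustomerID
      WHERE cl.LoginName = SUSER_SNAME();
      GO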

    My question is: would it be better to create a Stage and House for each customer instead of the single pair we use now? Some of our fact tables are approaching 100 million rows, and we are seeing slowdowns in the BI tool (Tableau).

    "Better" meaning performance, maintenance, and security.

    Thanks,

    Thomas

    ***SQL born on date Spring 2013:-)

  • thomashohner (8/18/2016)


    My question is: would it be better to create a Stage and House for each customer instead of the single pair we use now? Some of our fact tables are approaching 100 million rows, and we are seeing slowdowns in the BI tool (Tableau).

    My opinion:

    I usually like to have a separate staging table for each source, even if the schema is exactly the same. This lets me change load schedules for specific sources and, quite easily, process the data from the different sources in parallel.

    If you're using stored procedures to load your data, the main drawback is that your code base grows with the number of staging tables, potentially making it more cumbersome to maintain. There are of course ways around that, for instance adding a step to the process that combines all the data from the different staging tables into one (temp tables could also be used for this purpose), as in the sketch below.
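    As a rough sketch of that combining step (all table and column names here are made up for illustration): each source loads on its own schedule into its own staging table, and the combine step funnels everything into one conformed table.

      -- Illustrative only; staging table names are hypothetical.
      -- Each source loads independently into Stage.Encounter_CustNN;
      -- the combine step then feeds a single conformed staging table.
      INSERT INTO Stage.Encounter_All
          (CustomerID, EncounterID, EncounterDate, ChargeAmount)
      SELECT CustomerID, EncounterID, EncounterDate, ChargeAmount
      FROM Stage.Encounter_Cust01
      UNION ALL
      SELECT CustomerID, EncounterID, EncounterDate, ChargeAmount
      FROM Stage.Encounter_Cust02;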

    There isn't necessarily a right or wrong here. I typically find that there are different nuances in the data coming from the different sources, even when the schemas are identical, which makes the additional maintenance overhead acceptable in exchange for retaining visibility into the different lookup rules, etc.

    To answer your question about splitting the fact table: my opinion is that it would be better to partition the table by customer (and possibly by date, if that is typically used as a reporting parameter as well). If you align your indexes with the partition strategy, queries will be able to touch only the relevant partitions, which can improve performance significantly. A sketch of the idea follows.
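    A minimal sketch (the table name and the CustomerID boundary values are illustrative, not prescriptive; an existing clustered index would need to be rebuilt onto the scheme with DROP_EXISTING = ON):

      -- Illustrative partitioning sketch: one partition per CustomerID,
      -- assuming IDs 1 through 15 (RANGE RIGHT puts each boundary value
      -- into the partition to its right).
      CREATE PARTITION FUNCTION pfCustomer (int)
      AS RANGE RIGHT FOR VALUES (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

      CREATE PARTITION SCHEME psCustomer
      AS PARTITION pfCustomer ALL TO ([PRIMARY]);

      -- Building the clustered index on the scheme physically splits the
      -- data by customer; indexes built the same way are "aligned".
      CREATE CLUSTERED INDEX CIX_FactEncounter
      ON dbo.FactEncounter (CustomerID, EncounterDate)
      ON psCustomer (CustomerID);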

    Hope this helps.

    EDIT: Looking at regulatory requirements for medical data, it may be required for you to have the data completely separate for security and/or auditing purposes. This is something you would need to research as it can be different depending on exactly what data you're dealing with.

  • My question is: would it be better to create a Stage and House for each customer instead of the single pair we use now? Some of our fact tables are approaching 100 million rows, and we are seeing slowdowns in the BI tool (Tableau).

    Is the size of the fact tables and the slower performance what is motivating this idea, or are there other problems you are trying to solve by splitting up the databases? Are you monitoring your DW for slow or I/O-intensive queries?

    If performance and the number of rows are your concern, I think there are easier ways to correct the issue. Perhaps a better index strategy (e.g., filtered covering indexes for each customer, or indexed views partitioned by customer) and/or a partitioning strategy would be a more viable solution; see the sketch below.
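    For example, a filtered covering index per customer might look like this (the table, columns, and CustomerID value are hypothetical):

      -- Hypothetical example: one filtered covering index per customer, so
      -- that customer's dashboard queries scan a much smaller B-tree.
      CREATE NONCLUSTERED INDEX IX_FactEncounter_Cust42
      ON dbo.FactEncounter (EncounterDate)
      INCLUDE (PatientKey, ChargeAmount)
      WHERE CustomerID = 42;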

    Are you writing ad-hoc SQL or using the Tableau GUI to generate the queries that retrieve your report data? I know Tableau recommends using their GUI, but I have found it can be faster to write your own SQL, provided you have someone who knows how to write fast T-SQL.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

    It's mainly for performance reasons. I thought filtered indexes work best on columns with a lot of NULLs, and would a filtered index even get used by the queries the Tableau GUI generates? For some reason I was under the impression that filtered indexes had to use a literal value in the WHERE clause and would be missed when the value arrives as a parameter in the query. A quick sketch of my worry is below.
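    To illustrate what I mean (hypothetical table and column names), assuming a filtered index defined with WHERE CustomerID = 42:

      DECLARE @CustomerID int = 42;

      -- My worry, illustrated: with a parameter, the cached plan has to be
      -- valid for any @CustomerID, so the filtered index may be skipped...
      SELECT EncounterDate, ChargeAmount
      FROM dbo.FactEncounter
      WHERE CustomerID = @CustomerID;

      -- ...whereas a literal value (or a per-execution recompile) lets the
      -- optimizer match the predicate to the filtered index.
      SELECT EncounterDate, ChargeAmount
      FROM dbo.FactEncounter
      WHERE CustomerID = 42;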

    Tableau GUI queries are pretty horrid and for the most part that is what we have to use.

    I started thinking about a partitioning strategy, but as I read the forums the main point seemed to be managing data rather than faster queries, so I did not dig much further. I guess it's something I can test by making a copy of the table, partitioning it on Customer and Year as you suggested, and running some queries against it.

    I have also run the Database Engine Tuning Advisor and applied its recommendations. I checked the query plans on the Tableau-generated queries and added indexes where needed, and that helped some, but I'm still nervous about what that speed would look like if we double our customers. Going to look at the partitioning tomorrow and run some tests.

    Thanks gentlemen

    ***SQL born on date Spring 2013:-)

  • thomashohner (8/18/2016)


    It's mainly for performance reasons. I thought filtered indexes work best on columns with a lot of NULLs, and would a filtered index even get used by the queries the Tableau GUI generates? For some reason I was under the impression that filtered indexes had to use a literal value in the WHERE clause and would be missed when the value arrives as a parameter in the query.

    Tableau GUI queries are pretty horrid and for the most part that is what we have to use.

    I started thinking about a partitioning strategy, but as I read the forums the main point seemed to be managing data rather than faster queries, so I did not dig much further. I guess it's something I can test by making a copy of the table, partitioning it on Customer and Year as you suggested, and running some queries against it.

    I have also run the Database Engine Tuning Advisor and applied its recommendations. I checked the query plans on the Tableau-generated queries and added indexes where needed, and that helped some, but I'm still nervous about what that speed would look like if we double our customers. Going to look at the partitioning tomorrow and run some tests.

    Thanks gentlemen

    I think the suggestions from the previous posters are good ones. When most of the community talks about table partitioning, they are mostly emphasizing that you should not partition for performance reasons alone. There are other benefits to table partitioning besides performance, and performance gains are not guaranteed just because you partitioned; many other factors go into improving performance with table partitioning.

    I have a similar scenario to what you are describing: a few fact tables that scale as the business scales with clients. The more clients that onboard, the bigger those facts grow. Table partitioning on ClientID, combined with a good index strategy, is obviously a good option in those cases.

    It seems you are not leveraging data marts for your clients. Have you thought about taking that approach? For example, push all data to one data warehouse, then feed an individual data mart (database) per client with summary-level data that in turn feeds a reporting front end such as Tableau. This relieves pressure on the data warehouse, since each data mart is refreshed once a day, and you can focus on improving the performance of each data mart per the client's needs. It opens up a lot of different options. The same applies if you use SSAS or other third-party cubes in place of the data mart to help separate data across environments and services. A rough sketch of the nightly refresh is below.
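    As a rough sketch of that nightly refresh (all database, table, and column names are hypothetical), the warehouse stays the single source of truth and each client's mart holds only summary rows:

      -- Hypothetical nightly refresh for one client's data mart.
      TRUNCATE TABLE Cust01Mart.dbo.EncounterDailySummary;

      INSERT INTO Cust01Mart.dbo.EncounterDailySummary
          (EncounterDate, EncounterCount, TotalCharges)
      SELECT f.EncounterDate,
             COUNT(*),
             SUM(f.ChargeAmount)
      FROM Warehouse.dbo.FactEncounter AS f
      WHERE f.CustomerID = 1
      GROUP BY f.EncounterDate;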

    I took the option of physical data marts across servers: one server for the data warehouse, which populates another server hosting all the data marts. This lets me move a data mart at will. I can push it to Azure and not miss a beat with the client's reporting, and they would never know.

  • If there is no need for anyone to have access to all the records, my recommendation would be to break this out into separate cubes for each client; that is the easiest way to handle the security. If EMR means electronic medical records, this might even be a legal requirement. Security at the cube level is much easier to implement, and the dimension security isn't hard either. In a similar situation that didn't involve medical records, I created a separate data warehouse with just the information each customer would want to see.
