Looking for suggestions on improving our "data warehouse" approach

  • I've inherited a loose "data warehouse" approach of sorts.  I'm quite new to data warehousing concepts, ETL, etc., but I have a feeling some of the things we're doing are less than optimal, so I'd appreciate any feedback on what we should be doing differently.  We're a low-to-mid-size company that has started by building a single Power BI report that relies only on data from a single ERP database, but we're looking to expand our data warehousing over time.

    Here is the current setup:

    • We have a server with a SQL instance on it dedicated to providing data for a Power BI report.
    • Since we only need the data on the report to be updated on a daily basis, we currently have a process in which a SQL Agent job takes a backup of our ERP DB (about 130 GB data file size), a batch process copies the backup file over to the Power BI server, and a SQL Agent job on that server restores over the copy of the ERP DB using that backup (a rough sketch of these steps follows this list).  This all happens during off hours.
    • A couple of hours after this DB restore happens, a Power BI data refresh is scheduled for this report.
    • The Power BI report uses embedded SQL queries against the ERP DB copy on the Power BI server to create its individual dimension and fact "tables", and it defines the relationships between them through the Model screen.
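
    For reference, here's a minimal sketch of what those nightly jobs do.  Server names, paths, file names, and logical file names are placeholders, not our actual setup:

      -- Step 1 (SQL Agent job on the ERP server): take a copy-only full backup.
      BACKUP DATABASE [ERP]
          TO DISK = N'D:\Backups\ERP_Nightly.bak'
          WITH COPY_ONLY, COMPRESSION, INIT;

      -- Step 2 (batch process): copy the .bak file to the Power BI server, e.g.
      --   robocopy \\ErpServer\Backups \\PbiServer\Backups ERP_Nightly.bak

      -- Step 3 (SQL Agent job on the Power BI server): restore over the read-only copy.
      RESTORE DATABASE [ERP]
          FROM DISK = N'D:\Backups\ERP_Nightly.bak'
          WITH REPLACE,
               MOVE 'ERP_Data' TO N'E:\Data\ERP.mdf',
               MOVE 'ERP_Log'  TO N'F:\Logs\ERP_Log.ldf';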

    My biggest questions:

    • Rather than defining the dimension and fact tables within the individual Power BI report through embedded SQL queries, would we be better off defining these queries as SQL views (or stored procedures where needed, if possible) on the Power BI server, so other reports can access them and they're easier to maintain (see the sketch after these questions)?  And maybe pivot toward having a separate staging server where these views would live, then materialize their results into true dimension tables on the Power BI server down the road?
    • Is there a better approach for refreshing the data in the read-only copy of the ERP database that the Power BI report reads from than the current approach of (1) taking a SQL backup via a SQL Agent job, (2) copying the file over via a batch process, and (3) restoring the backup via a SQL Agent job?
    • Is there a reliable tool others commonly use to help automate the ETL process?  Would using such a tool and doing extensive data transformation into a true data warehouse structure be overkill for a mid-sized company with a 130 GB ERP DB and potentially a couple of other DBs around that size?
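
    For the first question, the kind of view I'm picturing would look roughly like this (database, table, and column names are made up for the example):

      -- Hypothetical dimension view built over the restored ERP copy;
      -- each report would import from the view instead of embedding its own query.
      CREATE VIEW dbo.DimCustomer
      AS
      SELECT
          c.CustomerID   AS CustomerKey,
          c.CustomerName,
          c.Region,
          c.CreatedDate
      FROM ERP.dbo.Customers AS c;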

    Thank you!

  • Just my opinion, but do you need ALL of the ERP data?

    The reason I ask is that if you aren't needing all 130 GB of data, it MAY be faster to use SSIS to pull across only the data you require.  SSIS can handle data transformations as well, so it MAY be able to handle the entire process for you.  My preferred approach: if a source table is required by more than 1 transformation, copy it 1:1 from source to destination with a few extra columns added in (the name of the package that copied the data over and the timestamp the package ran, to make troubleshooting easier).  If the table is used by ONLY 1 transformation, then keep it in memory and off disk (as disk is USUALLY the slowest operation in the chain).
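
    As a rough example of that 1:1 copy with audit columns (table and column names are made up; adjust to your own tables):

      -- Staging copy of a source table, plus audit columns for troubleshooting.
      CREATE TABLE stg.Customers
      (
          CustomerID      INT            NOT NULL,
          CustomerName    NVARCHAR(200)  NULL,
          Region          NVARCHAR(50)   NULL,
          -- populated by the SSIS package, e.g. from @[System::PackageName] and @[System::StartTime]
          LoadPackageName NVARCHAR(260)  NOT NULL,
          LoadDateTime    DATETIME2(0)   NOT NULL
      );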

    Two problems that come with SSIS are licensing and memory.  Licensing is an issue because you NEED a SQL Server license on the server that SSIS is installed on.  What I mean is that if Server A is where you host your main SQL instances and Server B is where your data warehouse is, and those are the only servers you have SQL licenses for, SSIS needs to be installed on Server A or Server B (it doesn't need to be on both) OR you need to purchase another SQL Server license to install SSIS on a Server C.

    With memory - SSIS operates OUTSIDE of SQL Server, so it needs its own memory.  What I mean is that if you have Server A configured to use 90% of the memory for SQL Server and you install SSIS on it, when you run an SSIS package it will request memory outside of that 90% and MAY end up paging to disk, which will make your SSIS package performance take a HUGE hit.  So you need to make sure SSIS has enough memory to do its magic.
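
    If SSIS does end up on the same box as SQL Server, the usual way to leave it headroom is to cap SQL Server's memory.  Something like this (the 48 GB value is just an example - size it for your own server):

      -- Cap the SQL Server instance so SSIS and the OS have memory left over.
      EXEC sp_configure 'show advanced options', 1;
      RECONFIGURE;
      EXEC sp_configure 'max server memory (MB)', 49152;  -- example value only
      RECONFIGURE;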

    One other issue that CAN come from SSIS is package sprawl.  What I mean is that a simple SSIS package becomes more and more complex as business logic increases, and eventually you get to a point where the SSIS package "just works" and is a huge mess that nobody wants to touch.  If you do go with SSIS, I encourage you to keep SSIS packages simple.  If one starts to look complex or scary, look at what can be broken out into a second SSIS package.  Keeping them simple also decreases support time.  If an SSIS package has 10 steps and runs in 1 minute, testing it takes 1 minute.  If the package has 1000 steps and runs in 1 hour, testing takes an hour.  The simpler the package is and the faster it can run, the faster and easier it is to test and debug.

    There are also a lot of SSIS tips and tricks out there.  My 2 big ones are:

    1 - avoid "SELECT *" like the plague as it will almost always bite you in the end.

    2 - For data sources, where possible, use T-SQL queries for the data pulls rather than pointing the source at a table or view.
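
    For example, in an OLE DB Source I would use a SQL command along these lines instead of the "Table or view" option (table and column names are made up; the date filter is just to show you can also limit rows at the source):

      -- Explicit column list for the SSIS data source instead of SELECT *.
      SELECT
          o.OrderID,
          o.CustomerID,
          o.OrderDate,
          o.TotalDue
      FROM dbo.Orders AS o
      WHERE o.OrderDate >= DATEADD(DAY, -1, CAST(GETDATE() AS date));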

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are any negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Mr. Brian Gale,

    Your feedback is much appreciated - this is some awesome info on SSIS, a tool I've had limited exposure to but have seen come up in some ETL-related searches.  I'll definitely look into it more.

    Restoring the entire DB does save future effort in case we want to bring in more information later, and it keeps my current backup-and-restore approach simple, but if I were to switch to SSIS, I imagine you're right that I would need to be careful to bring over only the data that is needed.
