Question about Data Warehouse

  • Hello. I just rejoined SQL Server Central. I'm here because I'm a solo reports developer for a small business. I use SSRS, Tableau, and Qlik Sense to develop reports and dashboards, and I also write complex stored procedures to produce the datasets I need. These forums are the only places where I can ask questions or find solutions to problems. Here are my questions about data warehouses:

    1. Are data warehouses an older technology?
    2. Is an ETL (SSIS) process only used with a data warehouse?
    3. Is SSIS difficult to learn?

    The small business where I work has a time entry system for contracted therapists who treat children with special needs. One of our clients has its own time entry system, and therapists who work for this client must enter their time in the client's system AND our system. I help our billing specialist compare the data from the two systems to make sure they match; the billing entries must match in order for the client to accept our invoice. I build a detailed Excel spreadsheet for the billing specialist to compare the two sets of data (I do this for her since I am the more skilled Excel user). One of the problems in comparing the data from the two systems is matching up the student names and the therapist names. I know there has to be a better way to do this; sometimes the spreadsheet has to be built many times in a month. Would a data warehouse solution help with this situation? We have at least one other client with a similar situation. Can anyone offer any suggestions?

    I would also like to improve my skills as a reports developer (I think my type of position is referred to as a Business Intelligence Developer on job sites).

    Thanks in advance for reading my post.

    Roxy

  • Hello, and welcome to the party!

    There's a lot to unpack there, but I'll try to help a little. These are the questions I spotted.

    1. Are data warehouses older technology?

    At this point in time, yeah, the concept is more than 30 years old. However, that doesn't mean you can't use a data warehouse to solve certain kinds of problems. In fact, it means that a data warehouse solves certain kinds of problems, particularly around data reporting and analysis, extremely well. It's possible that a warehouse is needed in your case, but let's talk about that a bit further down.

    2. Is ETL, specifically SSIS, only used with data warehouses?

    Nope. I'm not an ETL expert, but I've worked with SSIS and other processes over the years in support of those who are. ETL is about data movement. Yes, that might be data moving into a data warehouse, but the key is the movement. I've also seen SSIS and ETL processes used for moving data out of warehouses and into OLTP systems and data marts. It's all about the movement, not the source or the destination.

    3. Is SSIS difficult to learn?

    Again, not an SSIS expert. However, the basics, yeah, they're pretty easy to pick up. Then it gets hard. At least in my opinion. Others may disagree.

    4. Can a data warehouse help with disparate data issues?

    And here things get interesting. Spreadsheets are good for consuming data, data analysis, stuff like that. The problem comes when spreadsheets are used for data generation; they're just not great at it. If we're looking at a situation where people are literally inputting information into a spreadsheet (names, in your example), it can be very problematic. A better interface for data entry can be built. It can use lookups against existing data to ensure that the names being input are names that are already in the system, or, conversely, that a name isn't in the system and needs to be added. This part of your question is a HUGE topic and could involve a lot of discussion.
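
    For example, here is a minimal T-SQL sketch of that lookup idea; the table and column names are hypothetical. A foreign key against a master name list means a time entry can only reference a therapist who already exists in the system:

        -- Hypothetical master list of therapists; data entry must
        -- reference it, so unknown names are rejected at insert time.
        CREATE TABLE dbo.Therapist
        (
            TherapistID   int IDENTITY(1,1) PRIMARY KEY,
            TherapistName nvarchar(100) NOT NULL UNIQUE
        );

        CREATE TABLE dbo.TimeEntry
        (
            TimeEntryID int IDENTITY(1,1) PRIMARY KEY,
            TherapistID int NOT NULL
                REFERENCES dbo.Therapist (TherapistID),  -- lookup enforced here
            ServiceDate date NOT NULL,
            HoursBilled decimal(5,2) NOT NULL
        );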

    Personally, I'd say, no, a data warehouse is probably not the right solution here (although, I'm more than willing to listen to why I'm wrong on this one). It seems more like you need a front-end database with well designed input screens to ensure that you have referential integrity on your data.

    I hope some of this helps. Welcome again to SSC.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Oh, and if you have specific follow-up questions, don't hesitate to open another thread if necessary; more people will see the question that way. That said, as much as possible we try to keep discussion of one topic on one thread. However, it's Liberty Hall, so have fun.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • RoxyLightning wrote:

    Therapists who work for this client must enter their time in the client's system AND our system. I help our billing specialist compare the data from the 2 systems to make sure they match.

    Herein lies a serious problem with data integrity, as you're finding out.

    I guess my question would be... if you have enough access to the client data to build a comparison between the entries made in both systems, why not make it easier on everyone including the folks having to make double entries?

    In other words, stop requiring the people making time entries to make identical time entries on the two systems.  Have them make their entries on the required client system and import/distribute the data to your system.  That would eliminate a whole bunch of work for everyone involved and make the process much less prone to error.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Grant, thank you for your thoughtful reply. This will help me formulate better questions in my next post. My goal is to find better solutions and learn new techniques at the same time.


    To Jeff Moden: Thank you. I've tried to contact someone on the IT staff for the other "billing system" and they didn't seem to want to help me. In the beginning, I just wanted them to query the data and send it to me in a spreadsheet, but they said no.

    Your idea makes complete sense and I think it is doable, but it is not in my power to get that cooperation.

    What I do now is download one of their reports to a .csv file and save only the fields I need. I then have to do subtotaling, etc.; it is time consuming.

    Since I want so much to improve my skills (and my resume), I'm trying to think of ways to do this that would force me to learn something new; that is why I brought up the data warehouse.


  • You're approaching it all in the perfect manner: 1) find solutions to the problems at hand, and 2) use those solutions as a way to improve your skills.

    Keep going. Don't hesitate to reach out here. We'll try to help.


    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • A data warehouse is used to enforce business rules. It also normalizes the data.

    As Jeff points out, entering data into one system and then populating the other ensures the work is both done and consistent.

    If you have direct access to the data sources and are manually manipulating the data in Excel, you may want to look at automating. At a bare minimum, the first thing I see is a daily audit query to ensure you have entries in both systems. VBA, or Power BI and M, might be worth a look.
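
    As a sketch of that audit, assuming the two systems' entries have already been loaded into two tables (all names here are hypothetical), a FULL OUTER JOIN surfaces rows that exist in one system but not the other:

        -- Entries present in one system but missing from the other,
        -- matched on therapist, student, and service date.
        SELECT COALESCE(o.TherapistName, c.TherapistName) AS TherapistName,
               COALESCE(o.StudentName,   c.StudentName)   AS StudentName,
               COALESCE(o.ServiceDate,   c.ServiceDate)   AS ServiceDate,
               CASE WHEN o.TherapistName IS NULL THEN 'Missing from our system'
                    WHEN c.TherapistName IS NULL THEN 'Missing from client system'
               END AS Discrepancy
        FROM dbo.OurEntries o
        FULL OUTER JOIN dbo.ClientEntries c
          ON  o.TherapistName = c.TherapistName
          AND o.StudentName   = c.StudentName
          AND o.ServiceDate   = c.ServiceDate
        WHERE o.TherapistName IS NULL
           OR c.TherapistName IS NULL;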

    If names are an issue, any way to ensure names are chosen from the same lists would be helpful. What appears to be the challenge is who owns each system. Both have needs of their own, but both also benefit from the systems being in sync. Your leverage to get change lies in selling the benefits.

  • Ah, yes. You don’t have any control over the front-end data collection or the third-party data feeds you’re getting. You must work with whatever is sent to you.

    Rather than building a data warehouse, it sounds like you're looking to create an ETL process and data manipulation that consolidates and compares the data in a reporting database.

    Here are some ideas for handling this:

    1. Create a database for receiving, manipulating, consolidating and reporting the data from the source systems.

    • Within the database, you’ll want to have a staging table for each data source, with columns that match the data that’s being received. The corresponding staging table will be truncated (or all rows deleted) and repopulated each time you import data from a data source.
    • Next, you’ll have one or more secondary tables for consolidating the data from the staging tables and manipulating the consolidated data. Use stored procedures to move the data from the staging tables to the secondary table(s), apply business rules, validate the data, etc.
    • Include a “source” column in the secondary table(s) so you can track the data source for each row.
    • Use schemas in the database to identify objects (tables, stored procedures, etc.) related to each data source or process. For example, you will have a schema for the objects related to your time entry system, another schema for the client time entry system, and a third schema for the consolidated data. If a new client comes online, create a new schema and objects for that client, and merge that client’s data into the consolidated data.
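
    A minimal sketch of that layout (the schema, table, and column names are illustrative only, not a prescription):

        -- One schema per source, plus one for the consolidated data.
        CREATE SCHEMA OurTime;
        GO
        CREATE SCHEMA ClientTime;
        GO
        CREATE SCHEMA Consolidated;
        GO

        -- Staging table shaped to match the incoming client file;
        -- truncated and reloaded on every import.
        CREATE TABLE ClientTime.StageTimeEntry
        (
            TherapistName nvarchar(100) NULL,
            StudentName   nvarchar(100) NULL,
            ServiceDate   date          NULL,
            HoursBilled   decimal(5,2)  NULL
        );
        GO

        -- Consolidated table with a "source" column for row lineage.
        CREATE TABLE Consolidated.TimeEntry
        (
            SourceSystem  varchar(20)   NOT NULL,  -- 'OurTime' or 'ClientTime'
            TherapistName nvarchar(100) NOT NULL,
            StudentName   nvarchar(100) NOT NULL,
            ServiceDate   date          NOT NULL,
            HoursBilled   decimal(5,2)  NOT NULL
        );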

    2.  Build an SSIS process to import the data from each data source to its associated staging table, but don’t do extensive transformations or data manipulation in SSIS.

    • Grant Fritchey is 100 percent correct, in my opinion, that SSIS basics are easy to pick up, and then it can get hard. I suggest using SSIS only to truncate the staging tables and load the data into them. You can use the SSMS data import wizard to generate a basic SSIS package that truncates and loads a staging table, and then save the package that the Wizard generates to see how it works. Recognize that this is typical “Wizard code”, which means that it’s not what you would write for a proper SSIS process (the Wizard has less-than-friendly naming conventions, for starters).
    • Review SSIS tutorials for ideas about building truncate and load SSIS packages. Compare their suggestions to the “Wizard code” and decide how you want to build your packages.
    • Start with one SSIS package for each data-source to staging-table load, rather than building one large, “does lots of things” package that loads all data.
    • You can skip columns when importing a csv file in SSIS. After you’ve converted your report to a csv, the SSIS package can be coded to import only those columns you need, rather than having to choose the columns as you’re creating the csv from the report.
    • Determine how you’re going to deploy and execute your SSIS packages. This can range from storing them in the file system on your PC and running them from the SSIS designer, to using DTExec.exe to execute them on your PC, to deploying them to an SSIS catalog on a SQL Server instance and executing them with SQL Agent jobs. Strive to keep it simple and easy to execute whenever needed.
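
    If you do eventually deploy to an SSIS catalog, a package can be started from T-SQL, which a SQL Agent job step can wrap. This is just a sketch; the folder, project, and package names are placeholders:

        -- Start a catalog-deployed package.
        DECLARE @execution_id bigint;

        EXEC SSISDB.catalog.create_execution
            @folder_name  = N'BillingCompare',       -- placeholder
            @project_name = N'ClientImports',        -- placeholder
            @package_name = N'LoadClientStage.dtsx', -- placeholder
            @execution_id = @execution_id OUTPUT;

        EXEC SSISDB.catalog.start_execution @execution_id;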

    3.  Use T-SQL for data manipulation after the data is staged in the database.

    • In my opinion, SSIS is great for loading tables, basic data mapping and transformations, and overall control of a load process. After that, I’ve found that T-SQL is usually a better tool for extensive data manipulation. I say this, in part, because I find T-SQL to be easier to maintain than SSIS packages, more so if the SSIS packages are large and complicated. As Grant says, others may disagree.
    • Build a stored procedure for each data operation you want to execute. For example, you will have a stored procedure to insert or merge data from a staging table to the consolidation table, and another procedure to execute your logic for identifying and flagging therapists that don’t appear to match across the systems.
    • Consider building one or more “control” stored procedures that execute a group of individual stored procedures in the desired sequence.
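
    A rough sketch of that pattern (the procedure and table names are made up for illustration, reusing the staging and consolidated tables sketched above):

        -- Move staged client rows into the consolidated table.
        CREATE PROCEDURE ClientTime.MoveStageToConsolidated
        AS
        BEGIN
            SET NOCOUNT ON;

            DELETE FROM Consolidated.TimeEntry
            WHERE SourceSystem = 'ClientTime';

            INSERT INTO Consolidated.TimeEntry
                (SourceSystem, TherapistName, StudentName, ServiceDate, HoursBilled)
            SELECT 'ClientTime', TherapistName, StudentName, ServiceDate, HoursBilled
            FROM ClientTime.StageTimeEntry
            WHERE TherapistName IS NOT NULL;  -- basic validation goes here
        END;
        GO

        -- "Control" procedure: run the individual steps in sequence.
        CREATE PROCEDURE dbo.RunBillingCompare
        AS
        BEGIN
            SET NOCOUNT ON;
            EXEC OurTime.MoveStageToConsolidated;    -- hypothetical sibling proc
            EXEC ClientTime.MoveStageToConsolidated;
            EXEC dbo.FlagUnmatchedEntries;           -- hypothetical matching logic
        END;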

    4.  Design for your desired output.

    • The good news is that you are intimately (painfully?) familiar with the desired output. Think in terms of how you want to query, summarize, and present the final data, and have your final data table(s) structured to support this.
    • Build the process so it can be repeated on demand, whenever you get new data and need to process it.

    5.  Decide if you need to archive.

    • Your process may only need to be executed on-demand with no need to save results for historical purposes. This is good because it keeps things simpler.
    • If there is a need to keep historical results, don’t keep them in your processing tables. Instead, create archive tables into which the results are inserted, and an archive batch table that stores an archive batch ID and corresponding archive date. Build an archive process that generates a new archive batch ID and includes the batch ID in each row that’s inserted into the archive tables, so you can tell when the rows were archived.
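
    A sketch of that archive pattern (again, names are illustrative; dbo.TimeEntryArchive is assumed to mirror the results table plus the batch ID):

        -- One row per archive run; the identity value is the batch ID.
        CREATE TABLE dbo.ArchiveBatch
        (
            ArchiveBatchID int IDENTITY(1,1) PRIMARY KEY,
            ArchivedAt     datetime2(0) NOT NULL DEFAULT SYSDATETIME()
        );
        GO

        CREATE PROCEDURE dbo.ArchiveResults
        AS
        BEGIN
            SET NOCOUNT ON;

            DECLARE @batchId int;

            INSERT INTO dbo.ArchiveBatch DEFAULT VALUES;
            SET @batchId = SCOPE_IDENTITY();

            -- Copy current results into the archive, stamped with the batch ID.
            INSERT INTO dbo.TimeEntryArchive
                (ArchiveBatchID, SourceSystem, TherapistName,
                 StudentName, ServiceDate, HoursBilled)
            SELECT @batchId, SourceSystem, TherapistName,
                   StudentName, ServiceDate, HoursBilled
            FROM Consolidated.TimeEntry;
        END;
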
  • Each system likely has its own unique key for a Therapist and Student. Your system likely has a Client key.

    If a Therapist is entering a record into your system, is the record tagged with their ID? Or is someone else entering the record at times, and can only enter transactions for this client?

    Surrogate keys are one way to tie the data together when pulling it into a corporate database.

    It also sounds like there is a bit more than Therapist and Student being matched up.

    And I imagine it is not only matching those, but also identifying 'missing' entries.

  • yes

  • Thank you ALL for these responses!  This is great.   I'm going to carefully read through all of this and then return with more questions.

    RL

  • The client time system: if they had a trigger to write new time records to a file that you had access to, that would simplify a lot.

    We used triggers to capture new or changed records to build our data warehouse nightly. Triggers also isolated us from changes to the ERP system during upgrades. This was on an AS/400 and was similar to the CDC functionality of SQL Server, but much more targeted.
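
    A minimal SQL Server version of that idea might look like this (table names are hypothetical); an AFTER INSERT trigger copies new rows to a capture table that a downstream process reads and clears:

        -- Capture table the nightly process can read and empty.
        CREATE TABLE dbo.TimeEntryCapture
        (
            TimeEntryID int          NOT NULL,
            CapturedAt  datetime2(0) NOT NULL DEFAULT SYSDATETIME()
        );
        GO

        CREATE TRIGGER dbo.trg_TimeEntry_Capture
        ON dbo.TimeEntry
        AFTER INSERT
        AS
        BEGIN
            SET NOCOUNT ON;
            INSERT INTO dbo.TimeEntryCapture (TimeEntryID)
            SELECT TimeEntryID
            FROM inserted;  -- only the newly inserted rows
        END;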

    We tended to keep things simple, but gave a lot of thought to changes. Data needs change over time, and the more data they can get easily, the more they tend to want.

    One of the main differences between data warehousing and applications is that applications are concerned with transactions and performance, while data warehouses are designed to normalize and structure data for reporting purposes. Good design is key to scalable performance.

    Also, the memory needs show quite a different pattern. Transactional use tends to be steady, while data warehousing peaks very high during the build, and you need to be very aware of this peak need. I remember the network admin trying to average our usage over 24 hours. I told him to try to log into our server during our nightly process, then get back to me. The next day, I explained that should there be an issue where we got a late start, either from something on the ERP side or ours, our window was gone and everyone from the shop floor to the CEO would have to wait 2 or 3 hours for our build to finish. He then asked me if I wanted more server than I was requesting.

    If you have Enterprise Edition of SQL Server, SSRS allows you to create parameterized reports that are emailed out on a schedule. We used this to email custom reports to over 100 sales offices in about a minute. The only issue I had with this was when we changed to a remote data center: reports failed randomly, and it took me several days to prove to the Outlook admin that we were being throttled. No, the volume we produced was not spam, but the load was indeed intense.

    You seem bright, inquisitive, and willing to learn. Great qualities to have in this field, it will get you far.

  • RoxyLightning wrote:

    Grant, thank you for your thoughtful reply. This will help me formulate better questions in my next post. My goal is to find better solutions and learn new techniques at the same time.

    To Jeff Moden: Thank you. I've tried to contact someone on the IT staff for the other "billing system" and they didn't seem to want to help me. In the beginning, I just wanted them to query the data and send it to me in a spreadsheet, but they said no.

    Your idea makes complete sense and I think it is doable, but it is not in my power to get that cooperation.

    What I do now is download one of their reports to a .csv file and save only the fields I need. I then have to do subtotaling, etc.; it is time consuming.

    Since I want so much to improve my skills (and my resume), I'm trying to think of ways to do this that would force me to learn something new; that is why I brought up the data warehouse.

    You don't need a "data warehouse" for this.  Just a couple of tables to do a comparison.

    You don't need SSIS for this.  It can all be done using T-SQL.

    You may not even need to manually download the file.  With a little T-SQL prestidigitation, we can have SQL Server do the download for you.

    And, you certainly don't need to do all that "sub-totaling" manually.  It can all be done in T-SQL using BULK INSERT to load the data into a staging table and then have T-SQL do all of the necessary "sub-totaling" and comparisons with the other source, which would also go through a similar process.
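
    For instance, a sketch of that load step (the file path and staging table are placeholders, and the options assume a simple comma-delimited file with a header row):

        -- Clear the staging table, then load the downloaded CSV.
        TRUNCATE TABLE ClientTime.StageTimeEntry;

        BULK INSERT ClientTime.StageTimeEntry
        FROM 'C:\Imports\ClientTime.csv'
        WITH
        (
            FIRSTROW        = 2,    -- skip the header row
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            TABLOCK
        );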

    Of course, some of that is dependent on the .CSV file. We can do some serious magic to handle just about anything you need, but it's a whole lot easier if the file is consistent in format each time it's downloaded and the rows consistently have the same number of "fields".

    Interested?


    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

