Creating a small banking data warehouse

  • Hi all,

    Let me first introduce myself. I'm new to the site. My reason for joining is that I have just started a BI job and am learning how to create data warehouses. As an exercise I'd like to create one of my own, and I'm hoping that you will guide me through the process. I have chosen to build a small data warehouse containing my banking transactions, on which I will be able to run some reports.

    Input:

    - CSV file with columns: Date, Name, Accountnr, Accountnr2, TransactionType, Comment, Debit/Credit, Amount

    Now I have to create something in the database that evaluates every line against a combination of the available fields to determine which category the transaction belongs to, for example household, rent, saving, etc.

    Instead of just adding all the rules to one huge CASE WHEN, I'm looking for a good way to design and maintain this. Are there any examples you know of that I can read up on?

    Thanks!

  • First, it's unclear whether you know the basic concepts of dimensional modeling, requirements gathering, or ETL at this point. So before you start building anything, I highly recommend you read this book:

    The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition, by Ralph Kimball

    https://www.amazon.com/Data-Warehouse-Toolkit-Definitive-Dimensional/dp/1118530802

    Looking ahead, the SQL Server Standard or Enterprise Edition stack contains all the essential tools you need for a basic DW implementation. You can handle loading facts and slowly changing dimensions from source data with SSIS.

    Handling Slowly Changing Dimensions in SSIS

    https://blogs.msdn.microsoft.com/mattm/2009/11/13/handling-slowly-changing-dimensions-in-ssis/

    You can also search YouTube or Pluralsight for videos on dimensional modeling and SSIS.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • First, welcome - you've definitely come to the right place. SQLServerCentral.com is a goldmine.

    Ditto everything Eric said. I would also highly recommend:

    Star Schema: The Complete Reference

    Next, regarding the best way to maintain your DW, the best advice I can give is to have a good naming standard ready; I can't stress this enough. It's one of those data warehousing topics that gets too little attention, IMHO. Spending a little extra time deciding how to name your tables, columns, views, procs, etc. will make your life much easier going forward.

    Regarding the CASE statement - you are correct, you don't want your business logic to live in CASE statements. There's a lot wrong with that. For one, let's say you need to add new business rules/logic... If it's living in a stored proc then you need to update it, redeploy to DEV, QA... PROD... That's too much work for a simple change.
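
    Purely as an illustration, the kind of hard-coded CASE being discouraged here might look something like the sketch below (the column names come from the CSV layout above; the table name and the rules themselves are made up):

    -- Illustrative only: business rules hard-coded in a CASE expression.
    -- Every new rule means editing this code and redeploying the procedure.
    SELECT t.*,
           CASE
               WHEN t.Comment LIKE 'NS%'                    THEN 'Public transport'
               WHEN t.TransactionType = 'Transfer'
                    AND t.Accountnr2 = 'NL00BANK0123456789' THEN 'Saving'       -- placeholder account
               WHEN t.[Name] LIKE '%Energy%'                THEN 'Utilities'
               -- ...and so on, one WHEN per rule...
               ELSE 'Uncategorised'
           END AS SpendingCategory
    FROM dbo.transactions AS t;   -- hypothetical table holding the imported CSV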

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hi, thanks for the advice. I am somewhat acquainted with concepts such as dimensional modelling and have basic SQL skills. However, I've noticed that it is difficult to fully understand the theory without some practice, which is the main reason for starting this exercise. For now I'd like to avoid SSIS where possible, since strengthening my SQL skills has a higher priority. I'm also hoping to find out some limitations of SQL compared to SSIS for certain tasks (e.g. slowly changing dimensions).

    For this model I see the following dimensions:

    1. Account

    2. Transaction Type

    3. Spending category (household, rent, income etc)

    What I have done so far:

    - I've created a staging table called stg_transactions. I'm planning to use SSIS later on to load the table, but for now the stg table is an exact replica of the file, which I can import using the import wizard in SQL Server. All datatypes are varchar.

    - I've created a datastore table called ds_transactions. A stored procedure inserts the data based on a NOT EXISTS and then truncates the staging table. Besides varchar for names, accounts and comments, I've used date to store the date and money for the amounts.
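
    A rough sketch of what such a procedure could look like (column names and datatypes are assumptions based on the CSV layout; the duplicate check is simplified and NULLs in the compared columns would need extra handling):

    -- Sketch only: move new rows from staging to the datastore, converting datatypes,
    -- then empty the staging table for the next file. All names are assumed.
    CREATE PROCEDURE dbo.usp_load_ds_transactions
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.ds_transactions
               (TransactionDate, [Name], Accountnr, Accountnr2, TransactionType, Comment, DebitCredit, Amount)
        SELECT CONVERT(date, s.[Date], 105),        -- style 105 = dd-mm-yyyy; adjust to the file format
               s.[Name], s.Accountnr, s.Accountnr2, s.TransactionType, s.Comment,
               s.[Debit/Credit],
               CONVERT(money, s.Amount)
        FROM dbo.stg_transactions AS s
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.ds_transactions AS d
                          WHERE d.TransactionDate = CONVERT(date, s.[Date], 105)
                            AND d.Accountnr       = s.Accountnr
                            AND d.Accountnr2      = s.Accountnr2
                            AND d.Amount          = CONVERT(money, s.Amount)
                            AND d.Comment         = s.Comment);   -- simplistic duplicate check

        TRUNCATE TABLE dbo.stg_transactions;
    END;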

    I am wondering the following:

    - Am I right to believe that the name/description and comment columns are not part of the dimensions?

    - The amounts in the export are always positive; another column indicates whether the amount is added to or subtracted from the account. In the stored procedure loading the final fct_transactions I'll generate an additional column with the appropriate positive or negative values (a small sketch of this follows after these questions).

    - Is this a proper path to loading the fct?

    - Now how should I go about generating the dimensions?

    1. Account can be generated from a SELECT DISTINCT, but I suppose determining who the beneficiary is is something that needs to be done manually, since the Name/description field is not unique for every account.

    2. Transaction type. The file shows a code and name for each transaction type. I should be able to generate this one automatically.

    3. The spending categories should be created manually in a table. But where and how should I create the rules that determine which transaction belongs to which category? Each transaction has to be evaluated differently; for example, the account column for a public transport transaction is NULL, so I have to determine the category based on the comment field. That comment field is in turn NULL for other transactions, such as transfers to savings accounts.
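
    As a small sketch of the debit/credit point above, the signed amount could be derived while loading the fact table like this (the 'D'/'C' flag values and all column names are assumptions):

    -- Sketch: derive a signed amount from the unsigned Amount and the debit/credit flag.
    INSERT INTO dbo.fct_transactions (TransactionDate, Accountnr, Accountnr2, SignedAmount)
    SELECT d.TransactionDate,
           d.Accountnr,
           d.Accountnr2,
           CASE WHEN d.DebitCredit = 'D' THEN -d.Amount ELSE d.Amount END
    FROM dbo.ds_transactions AS d;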

  • momentueel (12/16/2016)


    Hi all,

    Let me first introduce myself. I'm new to the site. My reason for joining is that I have just started a BI job and am learning how to create data warehouses.

    Hello and welcome aboard!

    I am a bit curious... how did you get through an interview and land a BI job without much knowledge of data warehouses?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Haha, thanks for asking 😀 It's a traineeship. The company's aim was to find trainees from business studies who can be schooled to become BI consultants. I'm trying to do some studying on the side to speed up that process a little. Until now I've mostly been given reporting tasks, using databases created by others.

  • momentueel (12/16/2016)


    Haha, thanks for asking 😀 It's a traineeship. The company's aim was to find trainees from business studies who can be schooled to become BI consultants. I'm trying to do some studying on the side to speed up that process a little. Until now I've mostly been given reporting tasks, using databases created by others.

    Absolutely awesome in two areas... 1) not many companies do such a thing and I think that's great and 2) you've certainly embraced the idea that some side study will be incredibly worthwhile. My hat's off to you. Wish there were more in the world like you. You'll do very well.

    --Jeff Moden

  • momentueel (12/16/2016)


    Hi all,

    Let me first introduce myself. I'm new to the site. My reason for joining is that I have just started a BI job and am learning how to create data warehouses. As an exercise I'd like to create one of my own, and I'm hoping that you will guide me through the process. I have chosen to build a small data warehouse containing my banking transactions, on which I will be able to run some reports.

    Input:

    - CSV file with columns: Date, Name, Accountnr, Accountnr2, TransactionType, Comment, Debit/Credit, Amount

    Now I have to create something in the database that evaluates every line against a combination of the available fields to determine which category the transaction belongs to, for example household, rent, saving, etc.

    Instead of just adding all the rules to one huge CASE WHEN, I'm looking for a good way to design and maintain this. Are there any examples you know of that I can read up on?

    Thanks!

    You could create a "rules" table with unique Accountnr, Accountnr2, TransactionType combinations and the transaction category and then do a join to the original data and let the join sort it out. That would also make it so no code changes were necessary. Just add to the "rules" table and you're done. I might even make the lookup a persisted computed column in the transaction table so that you're not having to recalculate that which will never change.
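
    A rough sketch of that idea (the rules table name and datatypes are made up; the key columns come from the CSV described above):

    -- Sketch of the "rules" table: one row per known combination and its category.
    CREATE TABLE dbo.CategoryRules
    (
        Accountnr       varchar(34) NOT NULL,
        Accountnr2      varchar(34) NOT NULL,
        TransactionType varchar(50) NOT NULL,
        Category        varchar(50) NOT NULL,
        CONSTRAINT PK_CategoryRules PRIMARY KEY CLUSTERED (Accountnr, Accountnr2, TransactionType)
    );

    -- Categorising is then just a join; adding a rule is an INSERT, not a code change.
    SELECT t.*, r.Category
    FROM dbo.ds_transactions AS t
    LEFT JOIN dbo.CategoryRules AS r
           ON  r.Accountnr       = t.Accountnr
           AND r.Accountnr2      = t.Accountnr2
           AND r.TransactionType = t.TransactionType;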

    --Jeff Moden

  • Jeff Moden (12/16/2016)


    Absolutely awesome in two areas... 1) not many companies do such a thing and I think that's great and 2) you've certainly embraced the idea that some side study will be incredibly worthwhile. My hat's off to you. Wish there were more in the world like you. You'll do very well.

    Thanks Jeff! Both for the compliment and the advice 😉

    You could create a "rules" table with unique Accountnr, Accountnr2, TransactionType combinations and the transaction category and then do a join to the original data and let the join sort it out. That would also make it so no code changes were necessary.

    This sounds great. The thing I don't understand yet is how to implement rules where I have to determine the category from the comment field, based on a partial match of the comment. For example, all my public transport transactions have NS somewhere in the comment. Would I then create the comment value in the rule as '%NS%'? Meanwhile, the accounts for those transactions could all be different. Should I then create a rule for every account? Because if I use an OR in the join I might run into trouble with other rules that determine the category based on a combination of comment and account.

    I might even make the lookup a persisted computed column in the transaction table so that you're not having to recalculate that which will never change.

    OK, this one is still over my head 😀 I hadn't heard of a persisted computed column before, but from a quick Google search I understand it is created within the table itself. My initial idea was to set up a couple of stored procedures: first, one that loads the data from stg to ds, making sure no duplicates exist while fixing the datatypes; second, one that first empties the fct, then selects the data from ds, computes the category and finally inserts it into the fct table. This way the fct (including the history) is always up to date with the latest categories.

    Since you're bringing up the computed column I wonder what the preferred method would be.
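
    For reference, a persisted computed column in general looks something like the sketch below. It uses the debit/credit sign rather than the category lookup, because the expression of a computed column can only reference other columns of the same row; all names here are assumptions.

    -- Illustration of a persisted computed column: computed once, stored with the row,
    -- and kept up to date by the engine whenever Amount or DebitCredit change.
    ALTER TABLE dbo.ds_transactions
        ADD SignedAmount AS
            CASE WHEN DebitCredit = 'D' THEN -Amount ELSE Amount END
        PERSISTED;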

  • I ran into a first example of the problem while creating a query to test whether I can apply the rules via a join.

    SELECT *
    FROM dbo.ds_banktransacties AS ds
    LEFT JOIN dbo.dim_Rules AS r
           ON ISNULL(ds.Name_Description, '') LIKE ISNULL(r.Name_Description, ISNULL(ds.Name_Description, ''))
          AND ISNULL(ds.Account, '')          LIKE ISNULL(r.Account,          ISNULL(ds.Account, ''))
          AND ISNULL(ds.ContraAccount, '')    LIKE ISNULL(r.ContraAccount,    ISNULL(ds.ContraAccount, ''))
          AND ISNULL(ds.Code, '')             LIKE ISNULL(r.Code,             ISNULL(ds.Code, ''))
          AND ISNULL(ds.Add_Sub, '')          LIKE ISNULL(r.Add_Sub,          ISNULL(ds.Add_Sub, ''))
          AND ISNULL(ds.MutationType, '')     LIKE ISNULL(r.MutationType,     ISNULL(ds.MutationType, ''))
          AND ISNULL(ds.Comment, '')          LIKE ISNULL(r.Comment,          ISNULL(ds.Comment, ''))

    I have created the rules table as follows. Since the set of columns needed differs per rule, I have added all of them. NULL values do not join properly, so I've added the ISNULLs to the query: for an empty rule column, the ISNULL resolves to the transaction's own value, which makes sure the join still works.

    INSERT INTO dbo.dim_Rules
        (Cat_ID, Name_Description, Account, ContraAccount, Code, Add_Sub, MutationType, Comment)
    VALUES ('4', NULL, NULL, NULL, NULL, NULL, NULL, 'NS%')
         , ('4', 'NS%', NULL, NULL, NULL, NULL, NULL, NULL);

    The issue arises for the following examples from the data.

    1: The transaction can be recognised only by the first 2 characters of the Name_Description field. -> This goes well since the second rule takes care of that.

    2: The transaction can be recognised only by the first 2 characters of the Comment field. -> This goes well since the first rule takes care of that.

    3: The transaction contains both NS in the first 2 characters of the Name_Description as well as the Comment field. -> Now both rules apply, resulting in a duplicate transaction.

    Am I going about this the wrong way? What could be options to solve this issue?

    Thanks!

  • momentueel (12/16/2016)


    For this model I see the following dimensions:

    1. Account

    2. Transaction Type

    3. Spending category (household, rent, income etc)

    What have I done:

    - I've created a staging table called stg_transactions...Datatypes are only varchar.

    - I've created a datastore table called ds_transactions. A stored procedure inserts the data based on a NOT EXISTS and then truncates the staging table. Besides varchar for names, accounts and comments, I've used date to store the date and money for the amounts.

    Sounds like a good start. Something else to consider is that the vast majority of data warehouses have a date dimension table, with one row per calendar day. It may not be strictly necessary in your case, because you probably don't have an accounting month or calendar hierarchy separate from the standard calendar, but it's good practice to have one so it will be more familiar when you do need a more complex one:

    http://www.kimballgroup.com/2004/02/design-tip-51-latest-thinking-on-time-dimension-tables/
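
    A minimal sketch of a simple date dimension and a set-based way to populate it (all names and the date range are illustrative):

    -- Sketch: a small date dimension with one row per calendar day.
    CREATE TABLE dbo.dim_Date
    (
        DateKey   int         NOT NULL PRIMARY KEY,   -- e.g. 20161216
        FullDate  date        NOT NULL,
        [Year]    int         NOT NULL,
        [Month]   int         NOT NULL,
        MonthName varchar(20) NOT NULL,
        [Day]     int         NOT NULL,
        DayName   varchar(20) NOT NULL
    );

    -- Populate ten years of dates in one set-based statement (numbers/tally approach).
    WITH N AS
    (
        SELECT TOP (3653) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
        FROM sys.all_columns AS a
        CROSS JOIN sys.all_columns AS b
    )
    INSERT INTO dbo.dim_Date (DateKey, FullDate, [Year], [Month], MonthName, [Day], DayName)
    SELECT CONVERT(int, CONVERT(char(8), c.d, 112)),
           c.d,
           YEAR(c.d), MONTH(c.d), DATENAME(month, c.d),
           DAY(c.d), DATENAME(weekday, c.d)
    FROM N
    CROSS APPLY (SELECT DATEADD(day, N.n, CAST('20100101' AS date)) AS d) AS c;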

    I am wondering the following:

    - Am I right to believe that the name/description and comment columns are not part of the dimensions?

    Dimension tables are for any descriptive information, so name and description DO belong in them. The comment column probably depends on what kind of information is in it. You usually want to keep your fact table narrow, with fewer columns.

    - The amounts in the export are always positive; another column indicates whether the amount is added to or subtracted from the account. In the stored procedure loading the final fct_transactions I'll generate an additional column with the appropriate positive or negative values.

    - Is this a proper path to loading the fct?

    Yes, you want to do that so the amounts can simply be added together in your fact table. You don't want to have to interpret on the fly whether an amount is a debit or a credit.

    - Now how should I go about generating the dimensions?

    1. Account can be generated from a SELECT DISTINCT, but I suppose determining who the beneficiary is is something that needs to be done manually, since the Name/description field is not unique for every account.

    Data cleansing can be one of the trickier parts of your data warehouse. If you don't have good, consistent names for the beneficiary, you may want some kind of mapping table that you can use to translate those multiple name/descriptions to a single, specific beneficiary.
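
    A minimal sketch of such a mapping table (the names and the use of LIKE patterns are assumptions):

    -- Sketch: map many raw name/description values (or patterns) to one clean beneficiary.
    CREATE TABLE dbo.map_Beneficiary
    (
        RawNamePattern varchar(100) NOT NULL PRIMARY KEY,   -- value or LIKE pattern as found in the file
        Beneficiary    varchar(100) NOT NULL                -- the single clean name to report on
    );

    SELECT t.*,
           COALESCE(m.Beneficiary, t.Name_Description) AS Beneficiary
    FROM dbo.ds_transactions AS t
    LEFT JOIN dbo.map_Beneficiary AS m
           ON t.Name_Description LIKE m.RawNamePattern;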

    2. Transaction type. The file shows a code and name for each transaction type. I should be able to generate this one automatically.

    3. The spending categories should be created manually in a table. But where and how should I create the rules that determine which transaction belongs to which category? Each transaction has to be evaluated differently; for example, the account column for a public transport transaction is NULL, so I have to determine the category based on the comment field. That comment field is in turn NULL for other transactions, such as transfers to savings accounts.

    This is probably another data cleansing situation. In most cases you may be able to tell the category from the beneficiary, e.g. the electric bill belongs to the category Utilities, but some things will be harder to classify, such as a credit card payment that covers food, entertainment and other categories in one bank transaction, unless you're getting that broken down as well.

  • Thanks Chris! Good to hear I'm somewhat on the right track. Good point regarding the date dimension: I realise that all the databases I have worked with indeed have one, so it will be good practice to add one myself.

    Regarding the data cleansing: I believe that will be my main issue here. I have run into some problems, which I described in a previous post. I am hoping to find out what the appropriate strategy is for determining the transaction categories, and where and how I should implement it.
