Considerations for Azure for Business Intelligence

  • I've just started a new position as a BI developer and am planning out my strategy for implementing a BI solution. As I see it, this will almost certainly involve a dimensional data warehouse populated with SSIS, with SSRS and SSAS running reports and cubes, and possibly PowerPivot, QlikView, etc.

    The company is now looking to move its database to the cloud and is getting a consultant in next week to discuss the options. As I know very little about Azure, I'm looking for things I should be aware of that will impact my plans regarding the DW and Azure.

    Is this straightforward, or are there issues I should be aware of?

    Thanks

  • What will be the size of the database?

    ---------------------------------------------------
    "Thare are only 10 types of people in the world:
    Those who understand binary, and those who don't."

  • Hard to be specific as our systems are changing just now, but I would definitely say on the smaller side: 10-20 GB.

  • The future for data warehousing is column-store and in-memory tables. Most organisations with a BI store of under 2TB will not need to bother with complexities such as SSAS because in-memory can do the aggregations on the fly within acceptable response times.

    All of the big 3 vendors now have variations of column-store and in-memory tables, and there are a number of enterprise-quality niche BI databases with these features as well.

    This adds up to saying that any data warehouse based on a tabular database is no longer fit for purpose. This does not mean that it has suddenly broken. It means that tabular is to steam trucks what column-store is to diesel trucks. Both can carry stuff. But diesel can carry an order of magnitude more than steam for an order of magnitude less overhead cost.

    One of the impacts we will see over (say) the next 4 years to 2018 is a drastic simplification of BI stacks. Most of the work we have traditionally done to give end users the performance they need is no longer necessary: in-memory can take the load and give the aggregates on demand. Another impact is likely to be a puncturing of the big-data bubble, as organisations realise that 2 TB is too small to count as big data and that a shift to big-data techniques is not really needed at that scale.

    Relating all this to the OP: as SQL 2014 goes into RTM, look out for what Microsoft announces for Azure. I have no inside track on Microsoft's plans, but anyone who has had the time to look has seen what Hekaton can do within SQL 2014, and can work out what capabilities this will add to Azure. Regarding whether you should host your data in your normal DC or in the cloud, treat both options as equally worthy from a technical viewpoint and look at what is most cost-effective for you.
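
    If you want a feel for what Hekaton looks like, the sketch below shows the rough SQL 2014 syntax for a memory-optimised table. The database, file path, table and column names are all invented for illustration, and the bucket count would need sizing for your own data.

        -- Rough sketch: a SQL Server 2014 (Hekaton) memory-optimised table.
        -- The database first needs a memory-optimised filegroup; all names
        -- and paths here are made up for illustration.
        ALTER DATABASE SalesDW
            ADD FILEGROUP SalesDW_InMemory CONTAINS MEMORY_OPTIMIZED_DATA;

        ALTER DATABASE SalesDW
            ADD FILE (NAME = 'SalesDW_InMemory_1',
                      FILENAME = 'C:\SQLData\SalesDW_InMemory_1')
            TO FILEGROUP SalesDW_InMemory;

        CREATE TABLE dbo.FactSalesHot
        (
            SaleID    BIGINT        NOT NULL
                      PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
            SaleDate  DATE          NOT NULL,
            ProductID INT           NOT NULL,
            Amount    DECIMAL(18,2) NOT NULL
        )
        WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);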

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Informative response, thanks 🙂

    When you say "column-store and in-memory tables" what exactly do you mean, im somewhat new to DW.

  • Column-store, in-memory, and tabular are all about how the data is physically stored. You may also see people talking about shared-nothing and shared-disk, which is also about how data is stored but at a higher level.

    Tabular format is where data is physically stored in rows. In a table with 10 columns and 10 rows, you will have 100 data items physically stored. Data compression can reduce the amount of space needed to store this data, but you still have 100 data items stored.

    Column format is where the row is broken down into columns, and only unique values for each column are stored. So with the table with 10 columns and 10 rows, if all values are unique you still get 100 data items stored, but if every row has the same value in a column then you only store 10 data items for the whole table (one per column). You also get the same compression capabilities as row format on top of this to save space.

    On space savings alone you typically see a 3:1 saving for compressed tabular format data, but typically a 10:1 saving for column format data. This means that if it took 10 separate disk read requests to get your uncompressed tabular data, it would take only 3 read requests to get compressed tabular data and only 1 read request to get compressed column data. In other words, column format typically gets your data back faster even before any further clever bits come into play.
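
    As a rough illustration of trying both formats on the same table in SQL Server 2014 (the table name is made up, and the actual ratios depend entirely on your data):

        -- Sketch only: comparing compressed tabular (row) storage with column storage.
        -- Ratios like 3:1 and 10:1 are typical rather than guaranteed.

        -- Tabular format with page compression - the 3:1 case above.
        ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
        EXEC sp_spaceused 'dbo.FactSales';

        -- Column format - a clustered columnstore index replaces the row storage.
        -- In SQL 2014 the table must have no other indexes when you create it.
        CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;
        EXEC sp_spaceused 'dbo.FactSales';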

    However, column format data is (for almost all types of column format DBMSs) automatically indexed on every column as part of the data storage process. This is needed so that the DBMS can quickly reassemble rows for presentation to the user. This means that you no longer have to worry about which columns to index to improve performance - they are all indexed.

    For a simple SELECT * FROM table you do not see the benefits of column format storage, but as soon as you add a WHERE clause things start to hum. Add a SUM(column) and a GROUP BY and column format means you no longer get a coffee break between starting your query on a 50 million row table and getting the result.

    At my place we are still in the early stages of exploiting column format, but we typically get a simple SUM(column) with GROUP BY query on a 25 million row table returning in 1 to 2 seconds. More complex stuff takes a bit longer, but our users are happy to wait 15 seconds for a detailed breakdown of our market penetration month by month over the past 2 years. When this data was in tabular format they had to plan other work to do while the query was running.
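
    The kind of query I mean looks something like this (the table and column names are invented, but the shape is typical):

        -- Sketch of a typical aggregate query that benefits from column format.
        -- Against a clustered columnstore index this runs in batch mode and
        -- comes back in seconds rather than a coffee break.
        SELECT   ProductID,
                 SUM(Amount) AS TotalSales
        FROM     dbo.FactSales
        WHERE    SaleDate >= '2012-01-01'
        GROUP BY ProductID
        ORDER BY TotalSales DESC;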

    In-memory takes things a step further. All the big 3 (SQL, Oracle, DB2) take the same approach, which is also shared by Hana, Redshift, Postgres and many other DBMSs, in that only column format tables are held in memory. Holding a table in memory is different to having a portion of it in the bufferpool, as the access mechanisms are completely different and have a shorter CPU instruction path.

    A clever guy at SAP worked out in the mid-noughties that if you make your data transfer size the same as your CPU cache size, then transfer between memory and cache can be done in a single CPU instruction. It also takes fewer CPU instructions to query the data if everything needed for a given compare is held in the CPU cache. This led to the Hana database, and the other vendors have taken this idea on for their own products. (Don't ask me about patents and licensing; if anything was being ripped off, we would know.)

    The end result is that if you combine column-format with in-memory you typically get a response time between 1 and 2 orders of magnitude faster than you get from tabular format. The impact of this is that aggregation can almost always be done at query time while producing results fast enough to avoid the need to pre-aggregate into a cube.
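
    To make that concrete, the sketch below shows a memory-optimised table with a columnstore index on it. This particular syntax needs a SQL Server version later than 2014 (it arrived with SQL Server 2016), and all the names are invented for illustration.

        -- Sketch: memory-optimised table combined with a clustered columnstore index.
        -- Needs SQL Server 2016 or later; table and column names are made up.
        CREATE TABLE dbo.FactSalesMemory
        (
            SaleID    BIGINT        NOT NULL PRIMARY KEY NONCLUSTERED,
            SaleDate  DATE          NOT NULL,
            ProductID INT           NOT NULL,
            Amount    DECIMAL(18,2) NOT NULL,
            INDEX CCI_FactSalesMemory CLUSTERED COLUMNSTORE
        )
        WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);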

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • This is very interesting 🙂

    I'm doing some reading on this now; I hope you don't mind answering any other questions this produces. But I wanted to thank you anyway for the info so far. 🙂

  • Thanks EdVassie for that response. Very interesting stuff.
