What's your infrastructure?

  • I work for a small company that's starting to get more into analytics with SQL Server 2008 R2 and even NoSQL solutions. We built and manage our own infrastructure in-house, but the question of "cloud" solutions versus what we currently run is starting to come up. However, there are so many options at various costs, and there doesn't seem to be much information on popular solutions for the number crunching we specifically do, as opposed to generic applications.

    So, I wanted to ask the community for their input or helpful articles. Does anyone have any good papers or articles about infrastructure solutions for companies that do analytics, statistics (machine learning), or similar?

    From what I've gathered thus far, high-performance I/O workloads (heavy reading and writing to memory) still belong on bare-metal servers or private setups, mainly because scaling up to that level of performance is very expensive on most generic "cloud" offerings. Others have said that databases in the cloud, especially for analytics, really aren't there yet unless you're going to shell out the $$$.

    Thoughts?

  • What will be the size of your analytics databases? The cloud has some limitations when handling large databases.

    ---------------------------------------------------
    "Thare are only 10 types of people in the world:
    Those who understand binary, and those who don't."

  • I think you need to define more about what type of analytics you want to do, and the product set you currently use to do it. There are many ways to slice and dice analytics, so it is hard to give specific advice without knowing the question you are asking.

    To give some general advice, the IT industry is currently in a phase of disruption over how BI can best be delivered. The introduction of column-orientated and in-memory features by the big 3 - SQL Server, Oracle, DB2 - in 2014 has made these technologies mainstream.

    It can be said that the row-orientated BI database is no longer fit for purpose. It still works just the same as before, but in-memory is the diesel truck compared to the row-orientated steam truck. You can shift stuff in both a steam truck and a diesel truck, but you get an order of magnitude more speed along with an order of magnitude less hardware investment with diesel compared to steam. Also, because in-memory is often 'fast enough' when calculating aggregates at query time, the need to have SSAS cubes in the mix is often eliminated.
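
    As a rough sketch of what this looks like in practice (the table and column names here are hypothetical, not from a real system), SQL Server 2014 lets you convert a fact table to a clustered columnstore index, after which aggregate queries run in batch mode over compressed column segments:

    -- Convert a row-store fact table to columnstore (SQL Server 2014).
    -- Object names are illustrative only.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales
        ON dbo.FactSales;

    -- An aggregate-at-query-time pattern that benefits from batch mode,
    -- often removing the need for a pre-built SSAS cube:
    SELECT ProductKey,
           SUM(SalesAmount) AS TotalSales,
           COUNT(*)         AS SaleCount
    FROM   dbo.FactSales
    GROUP  BY ProductKey;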

    However, the big 3 still ask big money for these features, leaving a big opening for products that just provide this BI capability at a lower cost. My organisation has chosen Redshift for its future BI database because we get much the same column-orientated and in-memory performance provided by SQL 2014 Enterprise, but at lower than Standard Edition costs.

    As a simplistic example of Redshift performance, a GROUP BY query combining SUM, MAX, and COUNT returns in under 2 seconds from a 28 million row Redshift table on a 2-node XL cluster. SQL 2014 EE using in-memory should give about the same performance, but at a higher cost. If you find the right offering, cloud can be the cheapest way to go.
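
    For context, the query shape being described is nothing exotic; something along these lines (with made-up table and column names) is what returns in under 2 seconds on that cluster:

    -- Hypothetical illustration of the GROUP BY / SUM / MAX / COUNT
    -- pattern described above, not the actual production query.
    SELECT region,
           SUM(order_total) AS total_revenue,
           MAX(order_total) AS largest_order,
           COUNT(*)         AS order_count
    FROM   orders
    GROUP  BY region;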

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • Thanks for the feedback.

    Right now, I'm dealing with an OLAP database that will grow to about 1 TB per year. This goes up or down depending on business growth and stacks year after year. Archiving data is planned, but we want to see how much we can handle first, before packing up data and shoving it away.

    This is just one data source, and one we don't control. As it's in OLAP structure, it's pretty much a data mart that could potentially be dumped into a data warehouse when more data sources are defined. But for now, we opted to simply build a reporting database where we process the raw data, cleanse it, and apply business logic in a denormalized fashion to improve query and aggregation performance, on top of building OLAP cubes for reporting.

    Similar to a traditional warehouse, when new data arrives, it's processed in batch overnight. As we do attribution, the bottleneck is that we have to match all data together across 3 fact tables and 30 dimensions for reporting.
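
    To give a feel for the join shape (all table names invented for illustration; the real job spans 3 fact tables and 30 dimensions), the nightly batch has to reconcile something like:

    -- Reduced sketch of the attribution matching: multiple fact tables
    -- joined together plus dimension lookups for reporting. A real
    -- query would pre-aggregate each fact table first to avoid fan-out.
    SELECT dc.CampaignName,
           SUM(fi.Impressions) AS Impressions,
           SUM(fk.Clicks)      AS Clicks,
           SUM(fv.Revenue)     AS Revenue
    FROM   dbo.FactImpressions AS fi
    JOIN   dbo.FactClicks      AS fk ON fk.CampaignKey = fi.CampaignKey
                                    AND fk.DateKey     = fi.DateKey
    JOIN   dbo.FactConversions AS fv ON fv.CampaignKey = fi.CampaignKey
                                    AND fv.DateKey     = fi.DateKey
    JOIN   dbo.DimCampaign     AS dc ON dc.CampaignKey = fi.CampaignKey
    GROUP  BY dc.CampaignName;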

  • we have to match all data together across 3 fact tables and 30 dimensions for reporting

    I feel your pain. It is because we have similar problems that we are moving to column-store and in-memory - it is so much faster.

    We are also looking to redesign some of our schemas to make them more column-store friendly. In many cases it is possible to roll dimension tables into the fact table to give a much wider table that requires few or no joins for reporting. The very low penalty in column-store of doing this rollup is more than offset by avoiding the need to do joins.
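
    A minimal sketch of that rollup, assuming hypothetical FactSales / DimProduct / DimCustomer tables, is to materialise the dimension attributes into one wide table and put a columnstore index on it:

    -- Roll dimension attributes into a wide, join-free reporting table.
    -- All object names are illustrative.
    SELECT f.OrderDate,
           f.SalesAmount,
           p.ProductName,
           p.Category,
           c.CustomerName,
           c.Region
    INTO   dbo.WideFactSales
    FROM   dbo.FactSales   AS f
    JOIN   dbo.DimProduct  AS p ON p.ProductKey  = f.ProductKey
    JOIN   dbo.DimCustomer AS c ON c.CustomerKey = f.CustomerKey;

    -- Columnstore compression keeps the storage penalty of the
    -- repeated dimension values very low.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_WideFactSales
        ON dbo.WideFactSales;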

    Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • EdVassie (5/21/2014)

    we have to match all data together across 3 fact tables and 30 dimensions for reporting

    We are also looking to redesign some of our schemas to make them more column-store friendly. In many cases it is possible to roll dimension tables into the fact table to give a much wider table that requires few or no joins for reporting. The very low penalty in column-store of doing this rollup is more than offset by avoiding the need to do joins.

    We are thinking about that too in order to make the ETL process smoother, but at that point, I think it may just be easier to use a NoSQL solution. I could just dump it into one document and feed that into SQL Server for analyzing. It would be completely denormalized and the start of a data warehouse.

    My only problem with doing that is that I have no experience with NoSQL. This is just how I understand the tech. :w00t:
