March 3, 2014 at 9:12 am
We are designing a reporting data warehouse that will consolidate data from numerous on-premises SQL Server 2000 and 2008 systems into a single warehouse database in a hosted SQL Server 2012 environment.
The hosted environment is home primarily to our new SaaS product: a web-based software solution built using a service-oriented architecture (SOA) and a message bus.
The new system receives its data from the legacy systems via a message queue fed by polling: small programs installed on the legacy servers run SELECT queries every few seconds to check for changes in the data. If changes are detected, a message is placed on the queue and the data makes its way into the new system.
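As a rough sketch of that polling technique (all table names, version columns, and data here are hypothetical stand-ins for the legacy tables and the real message queue, not the actual system):

```python
from collections import deque

# Hypothetical in-memory stand-ins for a legacy table and the message queue.
legacy_table = [
    {"id": 1, "version": 3, "payload": "order A"},
    {"id": 2, "version": 5, "payload": "order B"},
]
message_queue = deque()

def poll_for_changes(last_version):
    """Emulate the polling program: select rows changed since the last
    observed version and enqueue one message per changed row."""
    changed = [row for row in legacy_table if row["version"] > last_version]
    for row in changed:
        message_queue.append({"id": row["id"], "payload": row["payload"]})
    # Advance the high-water mark so the same change is not re-sent.
    return max((row["version"] for row in changed), default=last_version)

last_seen = 0
last_seen = poll_for_changes(last_seen)  # picks up both existing rows
legacy_table[0]["version"] = 7           # simulate an update to row 1
last_seen = poll_for_changes(last_seen)  # picks up only the updated row
```

Note that each poll issues a full scan-style query against the source table, which is exactly where the blocking concern described below comes from.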
So far this method is working as a short-term solution, but I have concerns about the technique: the legacy systems are already struggling with concurrency, the data model is poorly designed, and the systems suffer from excessive blocking. Adding NOLOCK hints to the polling queries is out of the question because of the risk of dirty reads.
The reporting data warehouse we are going to build requires us to capture more data than we do today, which in turn increases my concerns about our current approach.
The architects want the warehouse to receive all of its data through the message bus, just as the individual services currently do. This doesn't sound like the best solution to me, but I need strong evidence to support any alternatives I might suggest to them.
My questions are as follows:
1. What alternative methods would you recommend for capturing changes in the legacy SQL Server 2000 databases that would be less impactful on the production systems? e.g. triggers, SQL Server replication, etc.
2. Is message queuing a good choice for replicating large amounts of data into a warehouse for reporting in near real-time? What are some alternative approaches and what are the trade-offs?
March 4, 2014 at 4:48 am
You are severely limiting what you can do by continuing with SQL 2000 and SQL 2008. You will end up with a system whose design is obsolete before it is implemented; its characteristics will include slow, costly change cycles and poor performance compared to a greenfield design started today.
I realise that you have to start with the reality of what exists in your organisation, but if your organisation thinks it is saving money by staying on really old software it has got its sums wrong.
The general availability of column-orientated databases, both from the traditional big three vendors and from smaller players such as Amazon with Redshift, is a major point of disruption in BI design. Some of the reasons for this are:
a) The query workload that can be performed on a single column-orientated DB instance is probably two orders of magnitude greater than can be performed on a single tabular (row-orientated) BI DB instance.
b) Given an equivalent amount of processing power and workload, a column-orientated DB can typically return BI-type query results in about two orders of magnitude less time than a tabular BI instance.
c) A typical column-orientated DB will collapse most dimensions into the fact tables, resulting in wide fact tables that seldom need to be joined to other data and very few dimensions to join with. This can drastically simplify the ETL logic needed to populate the database.
d) Taking all of the above together, you are looking at a massive reduction in the cost and complexity of hosting a BI database. For most databases under 2TB, a column-orientated DB instance will give satisfactory performance without the need for SSAS or any other form of pre-aggregation.
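Point c) above, collapsing dimensions into the fact table, can be illustrated with a tiny hypothetical example (the table and column names are made up, not from the poster's system):

```python
# Hypothetical star-schema rows: a fact table plus one product dimension.
fact_sales = [
    {"date_key": 20140301, "prod_key": 1, "qty": 2},
    {"date_key": 20140302, "prod_key": 2, "qty": 5},
]
dim_product = {
    1: {"prod_name": "Widget", "category": "Hardware"},
    2: {"prod_name": "Gadget", "category": "Electronics"},
}

# Denormalise during ETL: each fact row carries its dimension attributes,
# producing a wide fact table that queries can scan without joins.
wide_fact = [{**row, **dim_product[row["prod_key"]]} for row in fact_sales]
```

The resulting wide rows trade storage for simplicity; a column store compresses the repeated dimension values well, which is why this pattern is cheap there but wasteful in a row store.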
It is worth searching for 'Agile BI' to research techniques for exploiting the new BI landscape. The end result may be a system that moves your organisation ahead of its competition, rather than one whose fundamental design acts as a brake on what it can do.
Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.
When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
November 18, 2014 at 6:53 pm
Quick feedback on the "The architects want the warehouse to receive all its data through the message bus like the individual services are currently" situation: I think this is a pretty bad idea. Data warehousing systems are usually tuned to ingest data via bulk imports, or at least "micro-batches" for closer-to-real-time reporting scenarios. With one-record-at-a-time inserts (one per message), each record has to contribute net new data to dimensions and facts, which imposes severe overhead on a data warehouse. Perhaps the messages could be queued up in an OLTP-tuned environment, something like AWS Aurora, which can ingest 6+ million records per minute, and then micro-batched into the data warehouse. That's my 5 cents on the subject.
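A minimal sketch of that micro-batching idea, assuming hypothetical thresholds and an in-memory queue standing in for the staging environment (the names and batch sizes are illustrative, not a real implementation):

```python
import time
from collections import deque

BATCH_SIZE = 500        # assumed thresholds; tune for the real workload
MAX_WAIT_SECONDS = 5.0

warehouse_loads = []    # stand-in for bulk loads into the warehouse

def micro_batch(queue, batch_size=BATCH_SIZE, max_wait=MAX_WAIT_SECONDS):
    """Drain the queue into a batch and 'bulk load' it in one shot,
    instead of issuing one warehouse insert per message."""
    batch, deadline = [], time.monotonic() + max_wait
    while queue and len(batch) < batch_size and time.monotonic() < deadline:
        batch.append(queue.popleft())
    if batch:
        warehouse_loads.append(batch)   # one bulk insert per batch
    return len(batch)

# 1200 queued messages become three bulk loads instead of 1200 inserts.
q = deque({"id": i} for i in range(1200))
while q:
    micro_batch(q)
```

The size/time thresholds bound both the load on the warehouse and the reporting latency, which is the trade-off the poster is pointing at.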