This article is part of the Stairway Series: Stairway to StreamInsight
Microsoft StreamInsight™ is designed to assist in developing Complex Event Processing (CEP) applications in .NET. It is appropriate for stream sources such as those in manufacturing applications or financial trading applications. StreamInsight provides the means to monitor, manage, and mine several sources simultaneously for conditions, trends, exceptions, opportunities, and defects almost instantly. It is ideal for performing low-latency analytics on the events and triggering response actions, and for mining historical data to continuously refine and improve definitions of alerting conditions. Johan provides a simple explanation of the system in a series of practical articles.
This article is the first in a series of articles describing Microsoft SQL Server 2008 R2 StreamInsight. As the first article, it gives a background on the technology and the application areas where it could be used.
What really is StreamInsight?
StreamInsight is a platform for Complex Event Processing (CEP) that enables the processing of high speed data streams from multiple data sources. It is one of the new components of SQL Server 2008 R2. Example application areas for CEP include financial applications, fraud detection, manufacturing, surveillance of internet traffic and real-time business intelligence.
Why the need for StreamInsight?
Working with streaming data / complex event processing is quite a different task from working with traditional databases. Some characteristic differences are:
- Continuous data streams, rather than finite stored data sets
- Continuous standing queries, rather than one-time queries run on demand
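The contrast between a one-time query and a standing query can be sketched in a few lines. This is an illustrative, language-neutral sketch (StreamInsight queries are actually written in .NET/LINQ), not product code:

```python
# Illustrative contrast: a one-shot query runs once over a finite data set,
# while a standing query stays active and emits a result per arriving event.

def one_shot_max(stored_rows):
    """Traditional database style: one pass over a finite, stored data set."""
    return max(stored_rows)

def standing_max(event_stream):
    """CEP style: a continuous query that yields an updated result per event."""
    current = float("-inf")
    for event in event_stream:
        current = max(current, event)
        yield current  # result is pushed out as each event arrives

print(one_shot_max([3, 1, 4]))        # runs once and returns 4
print(list(standing_max([3, 1, 4])))  # emits continuously: [3, 3, 4]
```

The essential difference is who drives the computation: the client pulls a one-shot answer, while the standing query is pushed results by the arriving data.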
Another way to look at stream processing versus traditional databases is represented by the figure below.
Stream processing and complex event processing have been research areas for academic institutions during the last decade, resulting in several new query languages and products in recent years. Large players within this field include StreamBase, IBM, Tibco and Oracle. With the release of StreamInsight as part of SQL Server 2008 R2, Microsoft will join this field.
Challenges that are faced within the field of stream processing and complex event processing include:
- Performance and scalability - including techniques such as clustering and distributed processing
- Out-of-order data arrival
- Data quality problems - duplicate data, retroactive corrections
- Time discontinuities such as daylight savings and leap seconds
- Combining real-time and historical data
A CEP Engine takes care of at least some of these problems, making building robust stream processing applications much easier.
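One of the listed challenges, out-of-order arrival, has a common textbook answer: buffer events briefly and release them in timestamp order once a "watermark" guarantees no earlier event can still show up. This is a minimal sketch of that general technique, not StreamInsight's actual mechanism:

```python
import heapq

def reorder(events, max_delay):
    """events: iterable of (timestamp, payload), possibly out of order.
    Assumes no event arrives more than max_delay after its timestamp."""
    buffer = []
    latest = float("-inf")
    for ts, payload in events:
        latest = max(latest, ts)
        heapq.heappush(buffer, (ts, payload))
        # Everything older than the watermark is now safe to emit in order.
        while buffer and buffer[0][0] <= latest - max_delay:
            yield heapq.heappop(buffer)
    while buffer:  # input exhausted: flush the rest in order
        yield heapq.heappop(buffer)

arrived = [(1, "a"), (3, "c"), (2, "b"), (4, "d")]
print(list(reorder(arrived, max_delay=2)))  # restored to timestamp order
```

The trade-off is latency versus correctness: a larger `max_delay` tolerates later stragglers but holds results back longer, which is exactly the kind of problem a CEP engine handles so the application does not have to.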
CEP - Complex Event Processing
Streaming data can be seen as a series of events. The events can be organized in a hierarchy with lower-level primitive events directly emanating from the data sources and higher-level complex events describing inferred information from the lower-level events. For example the primitive events could be stock quotes. Patterns in the stock quotes could be represented as complex events and higher order "patterns of patterns" could be represented as complex events even further up in the hierarchy. Even the absence of primitive events, such as no stock quotes in a while for a particular symbol, could trigger the generation of a complex event.
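The "absence of events" case above is worth making concrete: a complex event is inferred from what did not happen. This is a hypothetical illustration in Python (a real StreamInsight query would express it declaratively in LINQ), using made-up symbols and times:

```python
def absence_alerts(quotes, timeout, now):
    """quotes: (time, symbol) primitive events; return symbols that have
    been silent for longer than timeout as of time `now`."""
    last_seen = {}
    for t, symbol in quotes:
        last_seen[symbol] = t
    # The complex event is inferred from the ABSENCE of primitive events:
    return [sym for sym, t in last_seen.items() if now - t > timeout]

quotes = [(0, "MSFT"), (5, "ORCL"), (9, "MSFT")]
print(absence_alerts(quotes, timeout=3, now=10))  # ORCL: silent since t=5
```

Note that no single incoming event triggers the alert; it is the passage of time without an event that does, which is why absence detection needs an engine with its own notion of time.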
Microsoft defines CEP as the continuous and incremental processing of event streams from multiple sources based on declarative query and pattern specifications with near-zero latency. I believe the meaning of "continuous and incremental processing" (as opposed to finite and set-based) is made clear earlier in this article. The requirement of "declarative query and pattern specifications" may need some more explanation. It means that the CEP engine should have a query language where you can express complex events easily and concisely (e.g. when any share price drops below its 100 day average).
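To show what that example query involves under the hood, here is a hedged imperative sketch of "price drops below its trailing average". The window size and prices are invented for the demo (a 3-point window stands in for the 100-day one), and a declarative engine would let you state just the condition rather than the bookkeeping:

```python
from collections import deque

def below_average_alerts(prices, window=100):
    """Yield (day, price) whenever a price falls below the trailing
    average of the previous `window` prices."""
    recent = deque(maxlen=window)  # automatically drops the oldest price
    for day, price in enumerate(prices):
        if len(recent) == recent.maxlen and price < sum(recent) / len(recent):
            yield day, price
        recent.append(price)

prices = [10, 10, 10, 7, 12]
print(list(below_average_alerts(prices, window=3)))  # day 3: 7 < avg(10,10,10)
```

The deque-based bookkeeping is exactly the incremental state a CEP engine maintains for you when you write the equivalent query declaratively.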
EDA - Event Driven Architecture
Event Driven Architecture (EDA) is an architecture pattern complementary to Service Oriented Architecture (SOA). Although there is no widely agreed upon definition of EDA (or SOA), what distinguishes EDA is that it is centered on an asynchronous push-based communications model. Events are produced by event emitters (or agents) and are received by event consumers (or sinks). Events can trigger the execution of services that may result in new events being generated. The architecture is loosely coupled because the emitters do not need to know anything about how consumers react to their events. It is also well distributed because an event can be almost anything and exist almost anywhere.
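The loose coupling in that push-based model can be sketched with a toy event bus. This is an assumed, minimal illustration (not any specific product's API): the emitter publishes without referencing any consumer, and a consumer may itself emit new events:

```python
class EventBus:
    """Toy publish/subscribe bus illustrating EDA's loose coupling."""
    def __init__(self):
        self.consumers = {}

    def subscribe(self, event_type, handler):
        self.consumers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, payload):
        for handler in self.consumers.get(event_type, []):
            handler(payload)  # push to consumers; emitter knows nothing of them

bus = EventBus()
log = []
# A consumer of "order_placed" that emits a new "invoice_created" event:
bus.subscribe("order_placed", lambda order: bus.publish("invoice_created", order))
bus.subscribe("invoice_created", lambda order: log.append(order))
bus.publish("order_placed", {"id": 1})
print(log)  # the original emitter never referenced either consumer
```

Adding or removing consumers changes nothing in the emitter's code, which is the property that makes EDA systems easy to distribute and evolve.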
Events in EDA can be either primitive or complex events. The definition of an event in EDA is that an event is a "significant change in state", which I think is a good enough definition that there is no need to go deeper into philosophical subtleties. EDA makes use of both CEP and Simple Event Processing (SEP). Typical tasks for Simple Event Processing include filtering and routing of primitive events.
Currently there is a discussion about SOA 2.0, which is a combination of EDA and SOA - Event Driven SOA.
Business Intelligence and related technologies
Business Intelligence (BI) provides historical, current and predictive views of business information. The aim is to support better decision making. Historically, BI has been focused on historical information because of the time it takes to extract data from multiple sources and ensure data consistency in the underlying data warehouse. An example of a BI architecture with a data warehouse is pictured below.
Building data warehouses used to be very expensive and took a long time. Many projects never finished due to the complexity of the task. Still I think data warehouses are the superior way to ensure that you get "a single version of the truth" and well-known definitions together with good query performance.
Using historical data together with data mining, Business Intelligence systems extrapolate patterns and trends to make predictions about the future. For example, an insurance company could predict the death risk for customers based on demographic parameters. The success of such a prediction depends on having the right parameters and a long enough history. That works very well with a data warehouse since there is no requirement for having the absolutely latest data. The extrapolation relies on patterns found in a long history, and on the assumption that history repeats itself.
Because of the high costs and processing times, BI was earlier mostly aimed at senior management. The content was strategic information such as key performance indicators (KPIs), which would be reported on a weekly basis or even less often. This is typical for BI projects emanating from the Corporate Performance Management (CPM) idea of starting with business strategies and breaking them down in a top-down approach to measures that support the strategies. CPM uses historical data rather than CEP.
Operational Intelligence (OI) is complementary to CPM because successful companies are dependent on both good strategies and good operations. If you have an accident you need to act fast, not wait for the next management meeting. Operational Intelligence is very similar to Real-Time Business Intelligence. In fact the only substantial difference I have heard of is that Operational Intelligence is activity-centric while Real-Time BI is data-centric. Operational Intelligence solutions commonly make use of CEP.
Business Activity Monitoring (BAM) deals with the monitoring of business processes. BAM originated from Business Process Management (BPM), which is the art of modeling, monitoring and automating business processes. BPM could be described as a process for improving other processes. BAM products have moved towards the use of CEP and complementary information outside the processes they monitor. BAM is also very similar to OI and Real-Time Business Intelligence. The only significant difference I can think of is that BAM is primarily focused on modeled processes.
An example of a BAM, OI or real-time BI solution (or whatever you call it...) using CEP is pictured below.
Historical data is recorded because it helps in testing and tuning CEP queries. The StreamInsight architecture makes it easy to run queries on historical data instead of live data. Actually it is not a bad idea to let a data warehouse handle the historical and reference data, as pictured below.
Why make StreamInsight a part of SQL Server?
One question I have seen in discussion forums is why StreamInsight is part of SQL Server. The simple answer I have heard from Torsten Grabs at Microsoft (during his speech at PDC 09) is that it is all about processing data.
If you compare StreamInsight with other SQL Server components such as Integration Services, you will find it is not so different. Both StreamInsight and Integration Services deal with data flows. Integration Services has data sources (like StreamInsight input adapters), transformations (queries) and destinations (output adapters). The main difference is that Integration Services is aimed at batch processing of finite data sets rather than continuously running queries. Maybe CEP-based ETL processing will become a new buzzword in the future?
StreamInsight will be available in two versions: standard edition and premium edition. As a rough guideline, the premium edition is recommended when you have more than 5000 events per second.
The premium edition will be included as part of SQL Server 2008 R2 Data Center Edition. That edition will be offered on per-processor pricing only. The estimated retail price is around $60,000 per processor.
The standard edition of StreamInsight will be included in the Standard and Enterprise editions of SQL Server 2008 R2.
This is the first article, describing the background of Complex Event Processing and StreamInsight. As you can see, Complex Event Processing is complementary to traditional database applications. With StreamInsight you get a platform for easier development of high performance applications for algorithmic trading, manufacturing, click-stream processing and fraud detection using familiar tools like Visual Studio, C# and .NET.
The next article will go into more detail about how to get started with StreamInsight.
About the Author
Johan Åhlén is an independent Business Intelligence professional, all-round data enthusiast, inventor, musician, serial entrepreneur and information archeologist. He has coded since he was 9, around the time MS DOS 2.0 was released. Johan has been working with SQL Server since version 6.5 and is one of the leaders of the Swedish SQL Server User Group. Although his work is often in the borderland between technology and business (or psychology), he is eagerly following new technologies with his computer full of geeky CTPs. Feel free to contact Johan through his blog.
This article is part of a series that is designed to help you quickly understand and begin using StreamInsight. You can view the entire series here.