The term ‘Data in Motion’ is widely used to describe the velocity at which large volumes of data flow into corporate applications across a broad range of industries. Big Data also needs to be Fast Data: fast in terms of processing potentially millions of messages per second, while also accommodating multiple data formats across all of these sources.
Stream Processing refers to the ability to process these messages (typically sensor or machine data) as soon as they hit the wire, with a high event-throughput-to-query ratio.
Some common use cases are: processing set-top box data from converged media products in the Cable & Telco space to analyze device performance or outage scenarios; fraud detection in the credit card industry; stock ticker and market feed data regarding other financial instruments that must be analyzed in split-second time; smart meter data in the Utilities space; and driver monitoring in transportation. Depending on the nature of the business event, the data stream pattern may be batch or real-time.
While the structure and the ingress velocity of all this data vary depending on the industry and the use case, what stays common is that these data streams must all be analyzed and reacted to as soon as possible, with an eye toward business advantage.
The leading Big Data stream processing technology is Storm. More on its architecture and design in subsequent posts. If you do not currently have a stream-based processing solution in place, Storm is a great choice – one that is robust, proven in the field, and well supported as part of the Hortonworks HDP.
A range of reference use cases is listed here –
http://storm.apache.org/documentation/Powered-By.html
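For readers new to Storm, here is a minimal sketch of what a topology looks like in Java: a spout that emits synthetic sensor readings and a bolt that flags readings above a threshold the moment they arrive. This is an illustration rather than production code – the class and field names (SensorTopology, SensorSpout, ThresholdBolt, deviceId, reading) are my own, and the packages shown follow Storm 1.0 and later (org.apache.storm); earlier releases used backtype.storm instead.

```java
// A minimal Storm topology sketch: SensorSpout fabricates readings,
// ThresholdBolt flags any reading above a threshold as it arrives.
// All names are hypothetical, for illustration only.
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SensorTopology {

    // Spout: the source of the stream. A real spout would pull from
    // Kafka, JMS, a socket, etc.; this one fabricates readings.
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private Random random;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            this.random = new Random();
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("device-" + random.nextInt(10),
                                      random.nextDouble() * 200.0));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "reading"));
        }
    }

    // Bolt: per-message processing, applied as each tuple arrives.
    public static class ThresholdBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            double reading = tuple.getDoubleByField("reading");
            if (reading > 150.0) {  // illustrative alert threshold
                collector.emit(new Values(tuple.getStringByField("deviceId"), reading));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("deviceId", "reading"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensors", new SensorSpout(), 4);       // 4 spout executors
        builder.setBolt("threshold", new ThresholdBolt(), 8)     // 8 bolt executors
               .shuffleGrouping("sensors");                      // random tuple routing

        // In-process cluster for local testing; production deploys via StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("sensor-topology", new Config(), builder.createTopology());
    }
}
```

Note the per-component parallelism hints on setSpout and setBolt: this is what lets a topology fan work out horizontally across a cluster, a point we will return to below.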
Now, on the surface a lot of this sounds awfully close to Complex Event Processing (CEP). For those of you who have been in the IT industry for a while, CEP is a mature application domain, and Wikipedia defines it as follows:
“Complex Event Processing, or CEP, is primarily an event processing concept that deals with the task of processing multiple events with the goal of identifying the meaningful events within the event cloud. CEP employs techniques such as detection of complex patterns of many events, event correlation and abstraction, event hierarchies, and relationships between events such as causality, membership, and timing, and event-driven processes.”
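To make that definition concrete, here is a toy, hand-rolled Java sketch of the kind of correlation a CEP engine performs: two atomic card-swipe events are combined into a compound “possible fraud” event when the same card is used in two different cities within a short time window. A real CEP engine would express this declaratively as a temporal rule; all names here are illustrative.

```java
// Toy illustration of CEP-style event correlation: atomic Swipe events
// are correlated into a compound "possible fraud" event. Hand-rolled
// for illustration only; not how a real CEP engine is implemented.
import java.util.HashMap;
import java.util.Map;

public class FraudCorrelator {

    // Atomic event: one card swipe.
    static class Swipe {
        final String cardId;
        final String city;
        final long timestampMillis;
        Swipe(String cardId, String city, long timestampMillis) {
            this.cardId = cardId;
            this.city = city;
            this.timestampMillis = timestampMillis;
        }
    }

    private static final long WINDOW_MILLIS = 10 * 60 * 1000;  // 10-minute window
    private final Map<String, Swipe> lastSwipeByCard = new HashMap<>();

    // Returns true when this swipe, correlated with the previous swipe on
    // the same card, forms the compound "possible fraud" event.
    boolean onSwipe(Swipe swipe) {
        Swipe previous = lastSwipeByCard.put(swipe.cardId, swipe);
        return previous != null
                && !previous.city.equals(swipe.city)
                && swipe.timestampMillis - previous.timestampMillis < WINDOW_MILLIS;
    }

    public static void main(String[] args) {
        FraudCorrelator correlator = new FraudCorrelator();
        correlator.onSwipe(new Swipe("4111-1111", "New York", 0));
        boolean suspicious = correlator.onSwipe(new Swipe("4111-1111", "London", 5 * 60 * 1000));
        System.out.println("Possible fraud: " + suspicious);   // prints true
    }
}
```

A CEP engine generalizes this pattern: windows, joins, and causality are declared as rules rather than hand-coded, which is exactly the “detection of complex patterns” the definition above refers to.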
However, these two technologies differ along the following lines –
- Target use cases – CEP as a technology and product category has a very valid use case when one can define analytics (based on simple business rules) over pre-modeled, discrete (rather than continuous) events. Big Data stream processing, on the other hand, enables one to apply a range of complex analytics that can infer across multiple aspects of every message.
- Scale – Compared to Big Data stream processing, CEP supports small to medium scale: you are typically talking a few tens of thousands of messages, at latencies on the order of seconds to sub-seconds (depending on the CEP product). Big Data stream processing, on the other hand, operates across hundreds of thousands of messages every second while supporting a range of latencies. For instance, Hortonworks has benchmarked Storm at processing one million 100-byte messages per second per node. Stream processing technologies have thus been proven at web scale, whereas CEP is still a niche and high-end capability, depending on the vertical (e.g. low-latency trading) or on the availability of enlightened architects and developers who get the concept.
- Variety of data – CEP operates on structured data types (as opposed to stream processing, which is agnostic to the arriving data format), where the schema of the arriving events has been predefined in some kind of fact model, e.g. a Java POJO (a sketch of such a fact model appears below). The business logic is then defined using a vendor-provided API.
- Type of analytics – CEP’s strong suit is its ability to take atomic events and correlate them into compound events (via the API). Most of the commercially viable CEP engines thus provide a rules-based language for defining these patterns, which are temporal and spatial in nature, e.g. JBoss Drools DRL. Big Data stream processing, on the other hand, supports a superset of such analytical capabilities: MapReduce, predictive analytics, machine learning, advanced visualization, and more.
- Scalability – CEP engines are typically small multi-node installs, with vertical scalability being the core model for scaling clusters in production. Stream processing, on the other hand, provides scalable, auto-healing clusters across typical two-CPU, dual-core boxes; the model of scalability and resilience is horizontal.
Having said all of this, it is important to note that these two technologies (the Hadoop ecosystem and CEP) can also co-exist. Many customers are looking to build “live” data marts where data is kept in memory as it streams in, a range of queries is continuously applied, and real-time views are then shown to end users. Microsoft, for instance, has combined its StreamInsight CEP engine with Apache Hadoop, where MapReduce is used to run scaled-out reducers performing complex event processing over the data partitions.
Source – http://blogs.msdn.com/b/microsoft_business_intelligence1/archive/2012/02/22/big-data-hadoop-and-streaminsight.aspx
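As promised under the “Variety of data” point above, here is what a CEP fact model typically looks like: a plain Java POJO whose typed fields define the schema of the arriving events, against which rules (e.g. in Drools DRL) are then written. The class and field names are hypothetical, echoing the smart meter use case from earlier in this post.

```java
// Hypothetical fact model for a CEP engine: the event schema is
// predefined as a Java POJO, and temporal/spatial rules are written
// against these typed fields.
public class MeterReading {
    private String meterId;
    private double kilowattHours;
    private long readAtMillis;

    public MeterReading() { }   // no-arg constructor, as many engines require

    public String getMeterId() { return meterId; }
    public void setMeterId(String meterId) { this.meterId = meterId; }

    public double getKilowattHours() { return kilowattHours; }
    public void setKilowattHours(double kilowattHours) { this.kilowattHours = kilowattHours; }

    public long getReadAtMillis() { return readAtMillis; }
    public void setReadAtMillis(long readAtMillis) { this.readAtMillis = readAtMillis; }
}
```

Contrast this with the Storm bolt shown earlier, which pulled fields out of an untyped tuple: that is the format-agnosticism of stream processing in a nutshell.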