Sunday, February 27, 2011

An Introduction to S4

What is S4 ?
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data, as described by S4.io. 

Need Assessment of S4
The emergence of commercial search engines and new applications such as real-time
search, high frequency trading, and social networks demands a streaming paradigm that can process the unbounded stream of events that flows into the system. Since the Map Reduce systems operate on static data by scheduling batch jobs, the renowned Hadoop platform delivers its best for batch processing systems. Therefore there is a clear need for a highly scalable stream computing solution that can operate at high data rates and process massive amounts of data.
Yahoo's open source product S4 (Simple Scalable Streaming System) is one such solution.

Digging a bit more into S4
S4 affirms to support many features and substantiates itself as :
• Decentralized : All nodes of an S4 cluster are symmetric and there is no single point of failure.
• Scalable : Throughput increases linearly with number of nodes.
• Partially fault-tolerant : Fault tolerant because
- cluster management system reroutes event from failed to other servers 
and Partially because
- state of the Processing Elements running on the failed server might be lost in case persistent
storage has not been used.
• Elastic : computing load automatically gets distributed
• Expandable : a simple API has been provided
•Object Oriented : POJOs used for internode communication.
  
Basic Components of S4 

• Processing Element (PE)
Basic computational unit which can send and receive messages called Events. Upon receiving any
event a PE can :
- emit one or more events which may be consumed by other PEs
- publish results, possibly to an external data store or consume.

• Processing Node (PN)
The logical hosts to PEs which can :
- Listen to events
- Dispatch events with the help of the communication layer
- Emit output events
- and execute operations on the incoming events.

• Adapter
The events being fed to the S4 cluster for processing need to be translated into S4
compatible events and similarly the events received from an S4 cluster have to be made
understandable to the client. The Client I/O Stub solves this purpose whereas the
Adapter injects events into the S4 cluster and receives from it via the Communication
Layer.

Existing solutions and S4
The S4 design is not new in the industry as it implements the Actor framework. Erlang
and Scala already have a similar implementation. But the power of mixing in Zookeeper
and a pluggeable architecture can set S4 apart from previous frameworks.
At the moment Yahoo has open sources a handful set of applications. My next post would be an attempt to help you out in creating an S4 application.

No comments:

Post a Comment