Monday, February 28, 2011

Running an S4 Application

As mentioned in my previous post, this post is an attempt to help you run the sample S4 appilcations open sourced by Yahoo.
 
Prerequisites
  • Linux (tried on ubuntu 9.04)
  • Java 1.6
  • Maven
  • S4 Communication Layer
  • S4 Core

Setting up the S4 environment

In order to build an S4 application of your own, you need to setup the S4 core , comm and examples(optional) directories as follows:

1. Create a directory /S4/s4image at some location of your choice (assuming  /home/abc)
  •  mkdir S4;cd S4/;mkdir s4image;cd s4image/
2.   In this s4image directory, we can either download the S4 core tarball or build S4 core ourself. Going by the latter option,
  •     git clone https://github.com/s4/core.git
  •     git clone https://github.com/s4/comm.git
   You can check core and comm folders created in the s4image directory.

3. Building the comm and core folder
  •     cd comm
  •      mvn install:install-file -DgroupId=org.apache.hadoop -DartifactId=zookeeper -Dversion=3.1.1 -Dpackaging=jar -Dfile=lib/zookeeper-3.1.1.jar
  •     mvn install
  •     cd ../core
  •      mvn install:install-file -DgroupId=com.esotericsoftware -DartifactId=kryo -Dversion=1.01 -Dpackaging=jar -Dfile=lib/kryo-1.01.jar
  •     mvn install:install-file -DgroupId=com.esotericsoftware -DartifactId=reflectasm -Dversion=0.8 -Dpackaging=jar -Dfile=lib/reflectasm-0.8.jar
  •      mvn install:install-file -DgroupId=com.esotericsoftware -DartifactId=minlog -Dversion=1.2 -Dpackaging=jar -Dfile=lib/minlog-1.2.jar
  •     mvn assembly:assembly install
After successfully running these, you can locate the /core/target/s4_core-0.3.0.0.tar.gz
  •     cd target/
  •     tar xzf s4_core-*.tar.gz
4. Building the examples folder
  •     cd /home/abc/S4/s4image
  •     git clone https://github.com/s4/examples.git
An examples directory should have been created along with comm and core.

Running the examples

To have an understanding of the structure of an S4 app, running a few examples is recommended. For a start, try to run the speech01 example:
  •     cd /home/abc/S4/s4image/examples/speech01
  •     mvn assembly:assembly install
 will create a target folder in speech01
  •     cd /home/abc/S4/s4image/core/target/s4_apps/
  •     tar xzf /home/abc/S4/s4image/examples/speech01/target/speech01-*.tar.gz
  •     cd /home/abc/S4/s4image/core/target/bin
  •     ./s4_start.sh &
  •     head -10 /home/abc/S4/s4image/examples/testinput/speeches.txt | \       ./generate_load.sh -x -r 2 -u /home/abc/S4/s4image/core/target/s4_apps/speech01/lib/speech01-0.0.0.1.jar -

In my next post, would be on "how to build your own S4 application". Any suggestions would be highly appreciated.




Sunday, February 27, 2011

An Introduction to S4

What is S4 ?
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data, as described by S4.io. 

Need Assessment of S4
The emergence of commercial search engines and new applications such as real-time
search, high frequency trading, and social networks demands a streaming paradigm that can process the unbounded stream of events that flows into the system. Since the Map Reduce systems operate on static data by scheduling batch jobs, the renowned Hadoop platform delivers its best for batch processing systems. Therefore there is a clear need for a highly scalable stream computing solution that can operate at high data rates and process massive amounts of data.
Yahoo's open source product S4 (Simple Scalable Streaming System) is one such solution.

Digging a bit more into S4
S4 affirms to support many features and substantiates itself as :
• Decentralized : All nodes of an S4 cluster are symmetric and there is no single point of failure.
• Scalable : Throughput increases linearly with number of nodes.
• Partially fault-tolerant : Fault tolerant because
- cluster management system reroutes event from failed to other servers 
and Partially because
- state of the Processing Elements running on the failed server might be lost in case persistent
storage has not been used.
• Elastic : computing load automatically gets distributed
• Expandable : a simple API has been provided
•Object Oriented : POJOs used for internode communication.
  
Basic Components of S4 

• Processing Element (PE)
Basic computational unit which can send and receive messages called Events. Upon receiving any
event a PE can :
- emit one or more events which may be consumed by other PEs
- publish results, possibly to an external data store or consume.

• Processing Node (PN)
The logical hosts to PEs which can :
- Listen to events
- Dispatch events with the help of the communication layer
- Emit output events
- and execute operations on the incoming events.

• Adapter
The events being fed to the S4 cluster for processing need to be translated into S4
compatible events and similarly the events received from an S4 cluster have to be made
understandable to the client. The Client I/O Stub solves this purpose whereas the
Adapter injects events into the S4 cluster and receives from it via the Communication
Layer.

Existing solutions and S4
The S4 design is not new in the industry as it implements the Actor framework. Erlang
and Scala already have a similar implementation. But the power of mixing in Zookeeper
and a pluggeable architecture can set S4 apart from previous frameworks.
At the moment Yahoo has open sources a handful set of applications. My next post would be an attempt to help you out in creating an S4 application.