Friday, November 11, 2011

Twitter's Storm : Real-time Hadoop

The data processing ecosystem started to experience scarcity of solutions that could process the rising volumes of structured and unstructured data. The traditional database management systems in no way could attain the required level of performance.
This massively growing data necessitated two kinds of processing solutions due to the nature of their source. One was “ batch processing ” that could perform functions on enormous volumes of stored data and the other was “ realtime processing ” that could continuously query the incoming data and stream out the results.

In this scenario, Hadoop proved to be a savior and brilliantly covered the first aspect of processing i.e. the batch processing but a reliable solution for realtime processing was yet to be conceived that could perform as well as hadoop does in its own sphere.

STORM somewhere seems to put an end to this search.

About Storm

Storm is a distributed, fault-tolerant stream processing system.” as stated by its developers. It can be called “Hadoop of Realtime” as it fulfills all the requirements of a realtime processing platform. Parallel realtime computation is now lot more easy with Storm in the picture. It is meant for :
  • Stream processing : process messages and update a variety of databases.
  • Continuous computation : do continuous computation and stream out the results as they're computed.
  • Distributed RPC : parallelize an intense query so that you can compute it in realtime.

  • Topology : It is a graph of computation. All nodes have a processing role to play. As we submit jobs in hadoop, in storm we submit topologies which continue executing until they are closed(shut down).

  • Modes of Operation :
    - Local Mode : When topologies are developed and tested on local machine.
    - Remote Mode : When topologies are developed and tested on remote cluster.

  • Nimbus : In case of a cluster, the master node is called Nimbus. To run topologies on the cluster our local machine communicates with nimbus which in turn assigns jobs to all the cluster nodes.

  • Stream : It is an unbounded stream of tuples which are processed in parallel in a distributed manner. Every stream has an id.

  • Spout : The source of streams in a topology. Generally takes obtains input streams from an external source and emits them into a topology. It can emit multiple streams each of a different definition. A spout can be reliable (capable of resending if the stream has not been processed by the topology) or unreliable (the spout emits and forgets about the stream).

  • Bolt : Consumes any number of streams from Spout and processes them to generate output stream. In case of complex computations there can be multiple bolts.

  • Storm client : Installing the Storm release locally gives a storm client which is used to communicate with remote clusters. It is run from /storm_setup/bin/storm


  1. nice summary. Here's d detailed version

  2. Nice tutorial., very help full for me., Thank you..:)

  3. ITaskHook can be quite useful in getting info about your spouts and bolts. For example, if you wish to track the event failures in tuples, if could be quite useful.