What is Oozie?
“Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows,” as stated by cloudera.com. The term 'Oozie' is Burmese for an elephant rider or handler, more commonly known by the Indian term 'mahout'. The name is an apt one, since this workflow engine likewise controls all the Hadoop jobs that are part of its workflow.
Oozie comes in two flavors:
- Cloudera distribution of Oozie
- Yahoo! distribution of Oozie
and you can use either in combination with Apache or Cloudera Hadoop, as below:
- Cloudera Oozie + Cloudera Hadoop
- Cloudera Oozie + Apache Hadoop
- Yahoo! Oozie + Cloudera Hadoop
- Yahoo! Oozie + Apache Hadoop
The first combination works just fine, with a very straightforward installation from Debian packages. One of the major differences between the two distributions is Hive support: the Yahoo! distribution of Oozie has no built-in support for running Hive actions (although patches have been added that provide it), whereas the Cloudera distribution supports running Hive and Sqoop jobs and even includes sample workflow applications for them in its set of examples.
Why Do We Need Oozie?
Let's start with some stats: according to the Oozie presentation at the Hadoop Summit in June, there are over 4,800 workflow applications deployed within Yahoo! at the moment, with the largest workflow containing 2,000 actions.
It is very difficult to manage and run such workflows repeatedly without a workflow engine that can automate these jobs. And it is not only the large deployments: even in many small applications we need a controller that can smoothly execute a given set of jobs and notify us only on failure or when user intervention is needed. If those jobs are Hadoop jobs, Oozie is the best choice one can make. Its key features:
- Runs a series of MapReduce, Hive, Pig, Java, and script actions as a single workflow job
- Allows regular scheduling of workflow jobs
- Workflows are written in an XML file and expressed as a Directed Acyclic Graph (DAG) of actions
- Supported action types: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, Hive, Sqoop, and sub-workflow
- Supports variables and functions for parameterizing workflows
- Supports decision nodes, allowing a workflow to branch at runtime
- Interval job scheduling can be triggered by time, by input-data availability, or both
- Runs as a server (multiple users, multiple workflows)
- Actions run in the Hadoop cluster as the user who submitted the workflow
- Uses a SQL/Derby database for persistence; a workflow's state is in memory only while doing a state transition
- In case of a fail-over, running workflows continue from their current state
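The XML-plus-DAG model described above is easiest to see in a concrete workflow definition. The sketch below is a minimal single-action workflow, not a complete application: the workflow and action names, and the `${...}` parameters (resolved from a properties file at submission time) are illustrative placeholders, not values from this post.

```xml
<!-- workflow.xml: a minimal one-action workflow (sketch; the names
     "demo-wf" and "wordcount" and all ${...} parameters are hypothetical) -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
    <!-- the start node points at the first action of the DAG -->
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- each action declares a transition for success and for error -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The directory containing workflow.xml is placed on HDFS, and the job is then submitted with something like `oozie job -oozie http://localhost:11000/oozie -config job.properties -run`, where job.properties supplies values for `${jobTracker}`, `${nameNode}`, and the other parameters (the host and port here are the usual defaults and may differ in your setup).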
Hope this post gave you some idea of what exactly Oozie is all about. In my next post, I'll cover the steps to install Oozie and get started with it.