What is Oozie?
“Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute Hadoop jobs, such as MapReduce, Pig, Hive, Sqoop, HDFS operations, and sub-workflows,” as stated by cloudera.com. The term 'Oozie' is Burmese for an elephant rider or handler, more commonly known by the Indian term 'mahout'. The name is an apt one, since this workflow engine likewise controls all the Hadoop jobs that are part of its workflow.
Oozie comes in two flavors:
- Cloudera distribution of Oozie
- Yahoo! distribution of Oozie
and you can use either in combination with Apache or Cloudera Hadoop, as below:
- Cloudera Oozie + Cloudera Hadoop
- Cloudera Oozie + Apache Hadoop
- Yahoo! Oozie + Cloudera Hadoop
- Yahoo! Oozie + Apache Hadoop
The first combination works just fine, with a very straightforward installation from Debian packages. One of the major differences between the two distributions is Hive support: the Yahoo! distribution of Oozie has no built-in support for running Hive actions (although patches have been added that provide it), whereas the Cloudera distribution supports running Hive and Sqoop jobs and even includes sample workflow applications for them in its set of examples.
Why Do We Need Oozie?
Let's start with some stats: according to the Oozie presentation at the Hadoop Summit in June, there are over 4,800 workflow applications deployed within Yahoo! at the moment, with the largest workflow containing 2,000 actions.
It is very difficult to manage and run such workflows repeatedly without a workflow engine that can automate these jobs. And it is not only the large deployments: even in many small applications we need a controller that can smoothly execute a given set of jobs and notify us only on failure or when user intervention is needed. If those jobs are Hadoop jobs, Oozie is the best choice one can make. Its key features:
- Runs a series of MapReduce, Hive, Pig, Java, and script actions as a single workflow job
- Allows regular scheduling of workflow jobs
- Workflows are written in an XML file and expressed as a Directed Acyclic Graph (DAG) of actions
- Supported action types: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, Hive, Sqoop, and sub-workflow
- Supports variables and functions for parameterizing workflows
- Supports decision nodes, allowing a workflow to branch at runtime
- Interval job scheduling can be triggered by time, by input-data availability, or both
- Runs as a server (multiple users, multiple workflows)
- Actions run in the Hadoop cluster as the user who submitted the workflow
- Uses a SQL/Derby database for persistence; a workflow's state is in memory only while doing a state transition
- In case of a fail-over, running workflows continue from their current state
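The XML-plus-DAG model described above is easiest to see in a concrete workflow definition. The sketch below is a minimal single-action workflow, not a complete application: the workflow and action names, and the `${...}` parameters (resolved from a properties file at submission time) are illustrative placeholders, not values from this post.

```xml
<!-- workflow.xml: a minimal one-action workflow (sketch; the names
     "demo-wf" and "wordcount" and all ${...} parameters are hypothetical) -->
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
    <!-- the start node points at the first action of the DAG -->
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- each action declares a transition for success and for error -->
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The directory containing workflow.xml is placed on HDFS, and the job is then submitted with something like `oozie job -oozie http://localhost:11000/oozie -config job.properties -run`, where job.properties supplies values for `${jobTracker}`, `${nameNode}`, and the other parameters (the host and port here are the usual defaults and may differ in your setup).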
Hope this post gave you some idea of what exactly Oozie is all about. In my next post, I'll cover the steps to install Oozie and get started with it.