Wednesday, November 21, 2012

Setting up a Zookeeper Cluster

ZooKeeper is a distributed, open-source, high-performance coordination service for distributed applications. A running Zookeeper cluster is a prerequisite for kicking off many installations.
This post walks you through setting up a Zookeeper cluster.

 

Prerequisites

Java JDK installed on all nodes of the cluster.

 

Setting up the cluster

The following installation steps have been tested on Ubuntu 10.04, 10.10, 11.04 and 12.04.

1. Download a ZooKeeper release and extract it at a location of your choice. Releases can be downloaded from:
http://hadoop.apache.org/zookeeper/releases.html

2. Create a configuration file named zoo.cfg in the conf folder of the extracted setup (the server scripts look for conf/zoo.cfg by default) and add the following entries:


tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

Here tickTime is ZooKeeper's basic time unit in milliseconds, and initLimit and syncLimit are expressed as multiples of it. Port 2888 is used by followers to connect to the leader and port 3888 is used for leader election; these can be changed as long as every machine's zoo.cfg uses the same values. The id after "server." must be a number between 1 and 255, while the server name (zoo1) can be chosen by the user. Using the above entries as a sample, a file named "myid" must next be created in the path specified by dataDir, containing just one entry: the numeric id of that server.
To make it clearer, if we are using 3 systems with IPs 192.192.192.191, 192, 193,
where zoo1 designates 192.192.192.191, zoo2 designates 192.192.192.192 and zoo3 designates 192.192.192.193,
then
the machine 192.192.192.191 should contain a file called myid at /var/zookeeper/ (or the value of dataDir specified in zoo.cfg) containing the following entry
1
Similarly, machines 192.192.192.192 and 192.192.192.193 should have the entries 2 and 3 respectively.
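Note that ZooKeeper expects the myid file to hold only the numeric server id. Assuming dataDir=/var/zookeeper/ as above, the file can be created like this (run the matching command on each machine):

```shell
# On zoo1 (192.192.192.191): create the data directory and write this node's id
sudo mkdir -p /var/zookeeper
echo 1 | sudo tee /var/zookeeper/myid

# On zoo2 (192.192.192.192):  echo 2 | sudo tee /var/zookeeper/myid
# On zoo3 (192.192.192.193):  echo 3 | sudo tee /var/zookeeper/myid
```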

3. Update the /etc/hosts file on each machine to add the host names used in the zookeeper configuration. This is needed so that each system can resolve which machines zoo1, zoo2 and zoo3 refer to.
After the update, the /etc/hosts file on each system in the cluster would have a similar set of entries:

192.192.192.191   zoo1
192.192.192.192   zoo2                                                                     
192.192.192.193   zoo3

4. This completes the configuration part. Next, cd to the zookeeper home directory and start the cluster by running

bin/zkServer.sh start                                                                          
command on each system.
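To verify that the ensemble came up correctly, a couple of quick checks can be run (zoo1 is the sample host name from the configuration above; on a healthy three-node cluster one server reports "leader" and the other two "follower"):

```shell
# Ask the local server for its role in the ensemble
bin/zkServer.sh status

# Probe any server over its client port with ZooKeeper's built-in
# "ruok" four-letter command; a live server replies "imok"
echo ruok | nc zoo1 2181
```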
And you have a running Zookeeper cluster at your disposal. 
Good Luck !!!

Monday, November 19, 2012

Building Java Action in Oozie


A Java application is one of the job types that can be run as part of an Oozie workflow. Here, I focus on how to create an oozie workflow that executes a java action. The oozie Java application folder has three components :

  1. lib folder
  2. workflow.xml file
  3. job.properties file

We shall take them one by one :

Steps to create the 'lib' folder :

The lib folder should contain the compiled class files of your application along with all the jars/files that were required to compile them.
  1. To start with, place your .java file in a directory structure mapping the package it belongs to (for eg. place Fetch.java at com/jayati/sampleapp/Fetch.java if Fetch.java belongs to the package com.jayati.sampleapp) and compile it.
  2. Create a folder with the desired application name (assuming appName) and create a lib folder in it. Copy the directory structure created in step 1 to appName/lib and remove the .java file from it. So now we have appName/lib/com/jayati/sampleapp/Fetch.class
  3. Place all the jars/files that were required to compile your java class in the lib folder, parallel to the com folder.
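The three steps above can be sketched as shell commands. The jars/ directory holding the dependency jars is an assumption; substitute wherever your jars actually live:

```shell
# Step 1: compile Fetch.java in its package directory against the dependency jars
javac -cp "jars/*" com/jayati/sampleapp/Fetch.java

# Step 2: create the application folder with a lib folder inside,
# copy the package tree in, and drop the .java source
mkdir -p appName/lib
cp -r com appName/lib/
find appName/lib -name '*.java' -delete

# Step 3: place the dependency jars parallel to the com folder
cp jars/*.jar appName/lib/
```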
workflow.xml

The workflow.xml defines a sequence of actions that would be executed in the workflow. In this example, we have just one java action to be executed.
In case of a java action, we need to specify the job tracker, name node and the main java class name. Assuming the action name as 'java-node', the .xml file would look like :


<workflow-app xmlns="uri:oozie:workflow:0.1" name="appName-wf">
    <start to="java-node"/>
    <action name="java-node">
        <java>
            <job-tracker>localhost:9001</job-tracker>
            <name-node>hdfs://localhost:9000</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>default</value>
                </property>
            </configuration>
            <main-class>com.jayati.sampleapp.Fetch</main-class>
            <java-opts>-Denv=stg -DPP=DB_PASSPHRASE</java-opts>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

You'll need to replace the jobTracker and nameNode host/port values in case they differ in your hadoop configuration, and then place this xml in the appName/ folder.

job.properties

This file lists the values of all the variables used in workflow.xml, such as jobTracker, nameNode etc. But since we have used literal values directly in the xml, our job.properties consists of a single line of content and looks like :


oozie.wf.application.path=hdfs://localhost:9000/hadoopfs_path/appName            

where hadoopfs_path is the path of the folder in hdfs where this application folder will be placed. Copy the above file to appName/
This finishes building a workflow containing one java action. To run this application on oozie, you can refer to one of my previous blogs, "Try On Oozie".
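For completeness, uploading the application folder and submitting the workflow typically look like the following (the HDFS path matches the one used in job.properties above; the Oozie server URL with its default port 11000 is an assumption, adjust it to your setup):

```shell
# Copy the application folder to the path referenced by oozie.wf.application.path
hadoop fs -put appName hdfs://localhost:9000/hadoopfs_path/appName

# Submit and start the workflow job
oozie job -oozie http://localhost:11000/oozie -config appName/job.properties -run
```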