Monday, September 7, 2015

Deploying the Spark-0.7.x Cluster in Standalone Mode


To deploy the Spark cluster in Standalone mode, run the following script from the Spark setup on the cluster's Master node:

bin/start-all.sh

If everything is fine, the Spark Master web UI should be accessible on port 8083 of the Master node (or on whichever port is specified in "spark-env.sh"; if no port is specified explicitly, the default is 8080).

Run a Spark Application


Simple Spark applications such as Word Count are available in "spark-0.7.x/examples/src/main/java/spark/examples".

Most of them expect a "master" argument specifying where to run the application: it should be "local" when running locally on a single machine, and "spark://MASTER_IP:PORT" (for example spark://192.10.0.3:7077) when running in Standalone mode.
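
For example, the Word Count example could be launched from the Spark setup directory roughly as follows (a sketch assuming the "./run" helper script shipped with Spark 0.7.x and the "spark.examples.JavaWordCount" example class; the input path is just a placeholder):
./run spark.examples.JavaWordCount spark://192.10.0.3:7077 /path/to/input.txt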

The job submitted should appear on the Spark Master's Web UI as well.

Deploy the Mesos Cluster


Spark can also be run over a cluster managed by Mesos. For this, a Mesos cluster needs to be deployed; the steps for doing so are available in the "Setting up a Mesos Cluster" post.

To start a Mesos Cluster, execute the command below on the Mesos Master:
/usr/local/sbin/mesos-start-cluster.sh

If all goes well, Mesos's web UI should be visible on port 8080 of the master machine.

Run a Spark Application over Mesos


To run Spark applications over the Mesos cluster, specify "mesos://MASTER_IP:PORT" as the value of the "master" argument, for example "mesos://192.10.0.3:5050".
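
For example, the Word Count example from the previous section could be launched the same way, just with a Mesos master URL (again a sketch assuming the "./run" script and example class; the input path is a placeholder):
./run spark.examples.JavaWordCount mesos://192.10.0.3:5050 /path/to/input.txt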

The job submitted should appear on the Mesos Master's Web UI as well.

Friday, September 4, 2015

Setting up a Mesos-0.9.0 Cluster

Apart from running in Standalone mode, Spark can also run on clusters managed by Apache Mesos. "Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks." 

In addition to Spark, Hadoop, MPI and Hypertable also support running on clusters managed by Apache Mesos. 

Listed below are the steps for deploying a Mesos cluster. These should be run on the node that is to act as the Master node of the Mesos cluster.

1. Mesos 0.9.0-incubating can be downloaded from:
http://archive.apache.org/dist/incubator/mesos/mesos-0.9.0-incubating/                                                                                                   
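
For example, the archive could be fetched directly with wget (assuming the tarball name used in step 2 below):
wget http://archive.apache.org/dist/incubator/mesos/mesos-0.9.0-incubating/mesos-0.9.0-incubating.tar.gz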

2. Extract Mesos setup
tar -xvzf  mesos-0.9.0-incubating.tar.gz                                                                                                                                                                                                    

3. Change the current working directory to the extracted Mesos setup for its compilation.
cd mesos-0.9.0                                                                                                                                                                                                                                                         

While configuring Mesos, the JAVA_HOME to be used needs to be specified. This can be done by passing the command-line option "--with-java-home" to the configure command, as shown below:
./configure --with-java-home=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64                                                                                                  

After configuring, run the following two commands:
make 
sudo make install                                                                                                                                                                                        

The following files and directories should have been created, if the above steps have been executed successfully:
/usr/local/lib/libmesos.so 
/usr/local/sbin/mesos-daemon.sh  
/usr/local/sbin/mesos-slave             
/usr/local/sbin/mesos-start-masters.sh  
/usr/local/sbin/mesos-stop-cluster.sh 
/usr/local/sbin/mesos-stop-slaves.sh
/usr/local/sbin/mesos-master     
/usr/local/sbin/mesos-start-cluster.sh  
/usr/local/sbin/mesos-start-slaves.sh   
/usr/local/sbin/mesos-stop-masters.sh
/usr/local/var/mesos/conf/mesos.conf.template
/usr/local/var/mesos/deploy                                                                                                                                                                                                    
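
A quick way to check this is to list the installed libraries and scripts, for example:
ls /usr/local/lib/libmesos.so /usr/local/sbin/mesos-*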

4. Add the MESOS_NATIVE_LIBRARY variable declaration to "conf/spark-env.sh" in Spark's "conf" directory as shown below:
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so                                                                                                                                           

5. Copy the Mesos setup to the same location on all the nodes to be included in the Mesos cluster, or simply set up Mesos on each of them by running:
cd mesos-0.9.0
sudo make install                                                                                                                                                                                                                                           
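
For example, the extracted setup could be copied over to another node with scp before building it there (the user name and the slave IP 192.10.0.4 below are just placeholders):
scp -r mesos-0.9.0 user@192.10.0.4:~/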

This completes the process of setting up a Mesos Cluster.

Configure Mesos for deployment


1. On the Mesos cluster's Master node, edit the file "/usr/local/var/mesos/deploy/masters" to list the IP of the Master node, and "/usr/local/var/mesos/deploy/slaves" to list the IPs of the slaves.
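
For example, with the Master at 192.10.0.3 and two slaves, "masters" would contain:
192.10.0.3

and "slaves" would contain:
192.10.0.1
192.10.0.2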

2. On all nodes of the Mesos Cluster, edit "/usr/local/var/mesos/conf/mesos.conf" and add the line master=HOST:5050, where HOST is the IP of the Mesos Cluster's master node.
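
For example, with the Master at 192.10.0.3, the line would be:
master=192.10.0.3:5050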

This is the end of the Configuration Phase.

Good luck.

Setting up Spark-0.7.x in Standalone Mode


A Spark cluster in Standalone mode consists of one Spark Master and multiple Spark Worker processes. Standalone mode can be used both on a single local machine and on a cluster. This mode does not require any external resource manager such as Mesos.



To deploy a Spark Cluster in Standalone mode, the following steps need to be executed on any one of the nodes.

1. Download the spark-0.7.x setup from: 
http://spark.apache.org/downloads.html

2. Extract the Spark setup
tar -xzvf spark-0.7.x-sources.tgz

3. Spark requires Scala's bin directory to be present in the PATH variable of the Linux machine. Scala 2.9.3 for Linux can be downloaded from:
http://www.scala-lang.org/downloads

4. Extract the Scala setup  
tar -xzvf scala-2.9.3.tgz

5. Export the Scala home by appending the following line into "~/.bashrc" (for CentOS) or "/etc/environment" (for Ubuntu)
export SCALA_HOME=/location_of_extracted_scala_setup/scala-2.9.3

6. Spark can be compiled with "sbt" or built using Maven. This post uses the former method because of its simplicity. To compile, change directory to the extracted Spark setup and execute the following command:
sbt/sbt package

7. Create a file (if not already present) called "spark-env.sh" in Spark’s "conf" directory, by copying "conf/spark-env.sh.template", and add the SCALA_HOME variable declaration to it as described below:
export SCALA_HOME=<path to Scala directory>

The Web UI port for the Spark Master and Worker can also be optionally specified by appending the following to "spark-env.sh"
export SPARK_MASTER_WEBUI_PORT=8083
export SPARK_WORKER_WEBUI_PORT=8084

8. To specify the nodes that will act as Workers, list their IPs in "conf/slaves". For a cluster containing two worker nodes with IPs 192.10.0.1 and 192.10.0.2, "conf/slaves" would contain:
192.10.0.1
192.10.0.2

This completes the setup process on one node. 

To set up Spark on the other nodes of the cluster, copy the Spark and Scala setups to the same locations on each of the remaining nodes.
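
For example, using scp (the user name, the worker IP from step 8, and the target paths below are just placeholders):
scp -r spark-0.7.x user@192.10.0.1:/location_of_extracted_spark_setup/
scp -r scala-2.9.3 user@192.10.0.1:/location_of_extracted_scala_setup/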

Lastly, edit the /etc/hosts file on all the nodes to add the "IP HostName" entries of all the other nodes in the cluster.
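
For example, on a three-node cluster the entries might look like this (the hostnames are hypothetical):
192.10.0.3  master
192.10.0.1  worker1
192.10.0.2  worker2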

Hope that helps !!

Spark Overview

Spark is a cluster computing framework, i.e. a framework that uses multiple workstations, multiple storage devices, and redundant interconnections to form what appears to be a single, highly available system.

Spark offers the following features:

- open-source under BSD licence 
- in-memory processing
- multi-language APIs in Scala, Java and Python
- rich array of parallel operators
- runnable on Apache Mesos, YARN, Amazon EC2 or in standalone mode
- best suited for highly iterative jobs
- efficient for interactive data mining jobs

The Need for Spark


Spark is a result of the rapid, ongoing developments in the Big-Data world. "Big Data" is a term used to describe data characterized by the three V's: Volume, Variety and Velocity. Storing and processing Big Data calls for specialized frameworks. Hadoop is a predominantly established software framework, composed of a file system called the Hadoop Distributed File System (HDFS) and Map-Reduce. Spark is a strong contender to Map-Reduce: thanks to its in-memory processing capability, Spark delivers lightning-fast results compared to Map-Reduce.

When to use Spark?


Apart from common data processing applications, Spark is particularly well suited to applications in which in-memory operations make up a major share of the processing. The following are the types of applications in which Spark specializes:

1. Iterative algorithms
2. Interactive data mining

Who is using Spark?


Spark was developed at the UC Berkeley AMPLab. Apart from Berkeley, which runs large-scale applications such as spam filtering and traffic prediction on Spark, 14 other companies, including Conviva and Quantifind, have contributed to Spark.