Friday, June 26, 2015

Start-up script for an installed Apache Storm Cluster

If you have installed a Storm cluster using my shell scripts from the previous blogs, or even otherwise, this script will save you from manually visiting each node and starting the appropriate service (nimbus/supervisor/UI) there. All you have to do is grab a remote machine and run the script. The script will ask for the required information and your cluster will be up. Also, the script should work equally well on both Ubuntu and CentOS.
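Since the script gathers everything through `read`, you can also drive it non-interactively by redirecting a here-document to its stdin. Here's a minimal sketch of that pattern (the function name and the path/IP values are hypothetical, just to illustrate):

```shell
#!/bin/bash
# Minimal sketch of the prompt/read pattern the start-up script uses;
# the here-document supplies the answers non-interactively.
ask_and_start() {
  read -r stormDir     # e.g. the Storm setup path
  read -r nimbusIP     # e.g. the nimbus machine's IP
  echo "would run: $stormDir/bin/storm nimbus on $nimbusIP"
}
ask_and_start <<'EOF'
/opt/stormSetup/storm-0.9.0-wip4
192.168.1.10
EOF
```

The same redirection works on the real script, as long as the answers appear in the exact order the prompts ask for them.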


#!/bin/bash

# Directory name
echo "Enter the path to the Storm setup on the machines (for example, /opt/stormSetup/storm-0.9.0-wip4):"
read -e stormDir

# Read the usernames and IPs of all the Storm cluster nodes
echo "Enter the IP of the nimbus machine :"
read -e stormNimbus
clusterMachineIPs[0]=$stormNimbus

echo "Enter the username of nimbus :"
read -e usernameNimbus
clusterMachineUsernames[0]=$usernameNimbus

# Read the supervisors
echo "Enter the number of supervisor machines:"
read -e n
for (( i = 1; i <= n; i++ ))
do
    echo "Enter the IP of storm supervisor machine $i:"
    read -e stormSupervisor
    clusterMachineIPs[i]=$stormSupervisor
    echo "Enter the username of machine ${clusterMachineIPs[i]}:"
    read -e username
    clusterMachineUsernames[i]=$username
done

# Start nimbus on the nimbus machine
# (sshpass assumes every node's password is 'root'; change it to match your cluster)
sshpass -p root ssh -o StrictHostKeyChecking=no $usernameNimbus@$stormNimbus "$stormDir/bin/storm nimbus" &

# Start the supervisor nodes
for (( i = 1; i <= n; i++ ))
do
    sshpass -p root ssh -o StrictHostKeyChecking=no ${clusterMachineUsernames[i]}@${clusterMachineIPs[i]} "$stormDir/bin/storm supervisor" &
done

# Start the UI on the nimbus machine
sshpass -p root ssh -o StrictHostKeyChecking=no $usernameNimbus@$stormNimbus "$stormDir/bin/storm ui" &


Visit the UI in your browser after a few minutes. Hope it shows up fine. Cheers!

Installation Script for Apache Storm on CentOS

CentOS and Ubuntu are two famous Linux distributions used pretty widely. My last post shares an installation script for a Storm cluster on Ubuntu machines, and this one is for CentOS. The few usage rules are just the same as for Ubuntu, but I'll restate them here. Although the script should work for other versions of Apache Storm, it has been tested with storm-0.9.0-wip7. The script has embedded descriptive messages for each input it expects from you. The installation is done in the '/opt' folder of the machines, in a sub-directory of your choice. Make sure the user installing the cluster has admin rights on the /opt folder. The script also takes care of installing all the required dependencies. To use the script for versions other than the supported one, you need to replace the "storm-0.9.0-wip7" occurrences in the script with your Storm version.
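Retargeting the script at a different Storm version is a one-line edit. A quick sketch (the file name and the target version 0.9.1 are hypothetical; here a one-line stand-in file is created purely to demonstrate the substitution):

```shell
# Create a one-line stand-in for the installer, just for demonstration;
# in practice you would run sed on your saved copy of the script.
printf 'scp -r -q $setupsLocation/storm-0.9.0-wip7 user@host:/opt/dir\n' > install-storm.sh

# Replace every occurrence of the supported version string in place.
sed -i 's/storm-0\.9\.0-wip7/storm-0.9.1/g' install-storm.sh

cat install-storm.sh   # the line now references storm-0.9.1
```

The dots in the version are escaped in the sed pattern so they match literally rather than as "any character".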


#!/bin/bash
# Local FS Setups location
echo "Enter the location of the setups folder. For example '/home/abc/storminstallation/setups'"
read -e setupsLocation
# Directory Name
echo "Enter the directory name"
read -e realTimePlatformDir
# Path escaped for use as a sed replacement string later in the script
rtpLocalDir="\/opt\/$realTimePlatformDir\/storm\/storm_temp"
rtpLocalDirMake=/opt/$realTimePlatformDir/storm/storm_temp
echo $rtpLocalDir;
echo "Enter the IP of nimbus machine :"
read -e stormNimbus;
array[0]=$stormNimbus


# Read supervisor
echo "Enter the number of supervisor machines";
read -e n;
for ((  i = 1 ;  i <= n;  i++  ))
do
echo "Enter the IP of storm supervisor machine $i:"
read -e stormSupervisor;
array[i]=$stormSupervisor
done

# Read zookeeper
echo "Enter the number of machines in the zookeeper cluster";
read -e m;
for ((  i = 1 ;  i <= m;  i++  ))
do
echo "Enter the IP of zookeeper machine $i:"
read -e zkServer;
zkEntry="- \""$zkServer"\""
zKArray=$zKArray","$zkEntry
done

# Copy the required setups to all the storm machines
for ((  i = 1 ;  i <= n+1;  i++  ))
do
echo "Enter the username of machine ${array[i-1]}"
read -e username
echo "Username:"$username
if [ "$username" = "root" ]; then
echo 'root';
yamlFilePath="/root/.storm";
else
echo $username;
yamlFilePath="/home/$username/.storm";
fi
echo "the storm.yaml file would be formed at : $yamlFilePath";
echo "Enter the value for JAVA_HOME to be set on the machine ${array[i-1]}"
read -e javaHome;
echo 'JAVA_HOME would be set to :'$javaHome;
ssh -t $username@${array[i-1]} "if [ ! -d /opt/$realTimePlatformDir ]; then
    sudo mkdir /opt/$realTimePlatformDir;
    sudo chown -R $username: /opt/$realTimePlatformDir;
    mkdir /opt/$realTimePlatformDir/storm;
    mkdir $rtpLocalDirMake;
    mkdir $yamlFilePath;
  fi"
     
scp -r -q $setupsLocation/storm-0.9.0-wip7 $username@${array[i-1]}:/opt/$realTimePlatformDir/storm/storm-0.9.0-wip7
     
ssh -t $username@${array[i-1]} "sed -i 's/ZOOKEEPER_IPS/$zKArray/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip7/conf/storm.yaml;
sed -i 's/,/\n/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip7/conf/storm.yaml;
sed -i 's/NIMBUS_IP/$stormNimbus/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip7/conf/storm.yaml;
sed -i 's/LOCAL_DIR/$rtpLocalDir/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip7/conf/storm.yaml;
cp /opt/$realTimePlatformDir/storm/storm-0.9.0-wip7/conf/storm.yaml $yamlFilePath;
sudo yum install -y git;
sudo yum install -y libuuid-devel;"

ssh -t $username@${array[i-1]} "cd /opt/$realTimePlatformDir/storm;
wget http://download.zeromq.org/zeromq-2.1.7.tar.gz;
tar -xzf zeromq-2.1.7.tar.gz
cd zeromq-2.1.7
./configure
make
sudo make install

cd ..
export JAVA_HOME=$javaHome;
echo \$JAVA_HOME;
git clone https://github.com/nathanmarz/jzmq.git
cd jzmq
./autogen.sh
./configure
make
sudo make install"

done


Hope this helps. My next post shares a small start-up script for the installed Storm cluster.

Installation Script for Apache Storm on Ubuntu

One of my blogs here describes the steps for manual installation of a Storm cluster. To make things more convenient for you, here's an installation script that you can use for setting up a Storm cluster on Linux machines. Although the script should work for other versions of Apache Storm, it has been tested with storm-0.9.0-wip4. The script has embedded descriptive messages for each input it expects from you. The installation is done in the '/opt' folder of the machines, in a sub-directory of your choice. Make sure the user installing the cluster has admin rights on the /opt folder. The script also takes care of installing all the required dependencies. To use the script for versions other than the supported one, you need to replace the "storm-0.9.0-wip4" occurrences in the script with your Storm version.
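One practical note: the script connects to each machine repeatedly over ssh/scp, so you will be prompted for passwords many times unless key-based login is set up first. A minimal sketch (the key path, username and node IP below are hypothetical):

```shell
# Generate a key pair once on the machine driving the installation.
# A temporary path is used here purely for illustration; in practice
# you would typically use the default ~/.ssh/id_rsa.
keyfile="$(mktemp -d)/id_rsa_storm"
ssh-keygen -t rsa -N "" -f "$keyfile" -q
ls "$keyfile" "$keyfile.pub"

# Then copy the public key to every node in the cluster, e.g.:
# ssh-copy-id -i "$keyfile.pub" storm@192.168.1.11
```

After that, the repeated `ssh -t` and `scp` calls in the installer go through without password prompts.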


#!/bin/bash
# Local FS Setups location
echo "Enter the location of the setups folder. For example '/home/abc/storminstallation/setups'"
read -e setupsLocation
# Directory Name
echo "Enter the directory name"
read -e realTimePlatformDir
# Path escaped for use as a sed replacement string later in the script
rtpLocalDir="\/opt\/$realTimePlatformDir\/storm\/storm_temp"
rtpLocalDirMake=/opt/$realTimePlatformDir/storm/storm_temp
echo $rtpLocalDir;
echo "Enter the IP of nimbus machine :"
read -e stormNimbus;
array[0]=$stormNimbus


# Read supervisor
echo "Enter the number of supervisor machines";
read -e n;
for ((  i = 1 ;  i <= n;  i++  ))
do
echo "Enter the IP of storm supervisor machine $i:"
read -e stormSupervisor;
array[i]=$stormSupervisor
done

# Read zookeeper
echo "Enter the number of machines in the zookeeper cluster";
read -e m;
for ((  i = 1 ;  i <= m;  i++  ))
do
echo "Enter the IP of zookeeper machine $i:"
read -e zkServer;
zkEntry="- \""$zkServer"\""
zKArray=$zKArray","$zkEntry
done

# Copy the required setups to all the storm machines
for ((  i = 1 ;  i <= n+1;  i++  ))
do
echo "Enter the username of machine ${array[i-1]}"
read -e username
echo "Username:"$username
if [ "$username" = "root" ]; then
echo 'root';
yamlFilePath="/root/.storm";
else
echo $username;
yamlFilePath="/home/$username/.storm";
fi
echo "the storm.yaml file would be formed at : $yamlFilePath";
echo "Enter the value for JAVA_HOME to be set on the machine ${array[i-1]}"
read -e javaHome;
echo 'JAVA_HOME would be set to :'$javaHome;
ssh -t $username@${array[i-1]} "if [ ! -d /opt/$realTimePlatformDir ]; then
           sudo mkdir /opt/$realTimePlatformDir;
           sudo chown -R $username: /opt/$realTimePlatformDir;
           mkdir /opt/$realTimePlatformDir/storm;
           mkdir $rtpLocalDirMake;
           mkdir $yamlFilePath;
        fi"
     
scp -r -q $setupsLocation/storm-0.9.0-wip4 $username@${array[i-1]}:/opt/$realTimePlatformDir/storm/storm-0.9.0-wip4
     
ssh -t $username@${array[i-1]} "sed -i 's/ZOOKEEPER_IPS/$zKArray/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip4/conf/storm.yaml;
sed -i 's/,/\n/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip4/conf/storm.yaml;
sed -i 's/NIMBUS_IP/$stormNimbus/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip4/conf/storm.yaml;
sed -i 's/LOCAL_DIR/$rtpLocalDir/g' /opt/$realTimePlatformDir/storm/storm-0.9.0-wip4/conf/storm.yaml;
cp /opt/$realTimePlatformDir/storm/storm-0.9.0-wip4/conf/storm.yaml $yamlFilePath;
sudo apt-get install -y git;
sudo apt-get install -y uuid-dev;"

ssh -t $username@${array[i-1]} "cd /opt/$realTimePlatformDir/storm;
wget http://download.zeromq.org/zeromq-2.1.7.tar.gz;
tar -xzf zeromq-2.1.7.tar.gz
cd zeromq-2.1.7
./configure
make
sudo make install

cd ..
export JAVA_HOME=$javaHome;
echo \$JAVA_HOME;
git clone https://github.com/nathanmarz/jzmq.git
cd jzmq
./autogen.sh
./configure
make
sudo make install"

done


Yep, done! Hope it helped. My next post shares the installation script for CentOS, and the one after that a small start-up script for the installed Storm cluster.

Start-up script for an installed Apache Zookeeper Cluster

If you have an installed Zookeeper-3.3.5 cluster, this script will save you from manually visiting each node and starting the zkServer there. All you have to do is grab a remote machine and run the script. The script will ask for the required information and your cluster will be up. Also, the script should work equally well on both Ubuntu and CentOS.


#!/bin/bash

# Directory Name
echo "Enter the zookeeper setup location :"
read -e zookeeperSetupLocation

# Read the ips and the usernames of all the machines in the zookeeper cluster
echo "Enter the number of machines in the cluster";
read -e n;
for (( i = 1; i <= n; i++ ))
do
    echo "Enter the IP of cluster machine $i:"
    read -e zkServer
    zkServerIPs[i]=$zkServer
    echo "Enter the username of machine ${zkServerIPs[i]}:"
    read -e username
    zkServerUsernames[i]=$username
done

# Start up the cluster
for (( i = 1; i <= n; i++ ))
do
    ssh ${zkServerUsernames[i]}@${zkServerIPs[i]} "$zookeeperSetupLocation/zookeeper-3.3.5/bin/zkServer.sh start;"
done


Thanks !!

Installation Script for Apache Zookeeper-3.3.5 on Linux

One of my previous blogs describes how to set up a Zookeeper cluster manually. Here's a quicker route: an installation script for the same. Store the content below in a .sh file, run it on your machine, and you can install a zookeeper cluster on a set of remote machines. All you need as a prerequisite is the zookeeper-3.3.5 setup at a common location on the machine you run the script from. You can also use this script for other versions with a bit of modification (replace the version hard-coded in the script with yours; it shouldn't have been hard-coded, I know.. my bad).



#!/bin/bash

# Pieces of each "server.N=zooN:2888:3888" entry for zoo.cfg; NEW_LINE is a
# placeholder that is later sed-replaced with a real newline
zkServerEntryPart1="server.";
zkServerEntryPart2="=zoo";
zkServerEntryPart3=":2888:3888NEW_LINE";
zkServerEntry="";

# Local FS Setup location
echo "Enter the path of the folder in which the zookeeper setup is stored. For example '/home/abc/setups'"
read -e setupsLocation
# Directory Name
echo "Enter the directory path where zookeeper is to be installed : "
read -e zookeeperSetupLocation

# Read the number of zookeeper servers
echo "Enter the number of machines in the zookeeper cluster : ";
read -e n;

# Read the zookeeper server details
for ((  i = 1 ;  i <= n;  i++  ))
do
# obtain the zookeeper server ips of all machines in the cluster
echo "Enter the IP of zookeeper machine $i:"
read -e zookeeperServer;
zookeeperServerIPList[i]=$zookeeperServer
temp="$zkServerEntryPart1$i$zkServerEntryPart2$i$zkServerEntryPart3";
zkServerEntry="$zkServerEntry$temp";

# obtain the usernames for all the zookeeper servers
echo "Enter the username of machine ${zookeeperServerIPList[i]}"
read -e username
userNameList[i]=$username
done

# Copy the setup to all the zookeeper machines
for ((  i = 1 ;  i <= n;  i++  ))
do
echo "Enter the data directory location of zookeeper for the machine ${zookeeperServerIPList[i]}"
read -e dataDir
     
# create the required folders on the machines
ssh -t ${userNameList[i]}@${zookeeperServerIPList[i]} "if [ ! -d $zookeeperSetupLocation ]; then
sudo mkdir $zookeeperSetupLocation;
sudo chown -R ${userNameList[i]}: $zookeeperSetupLocation;
fi
if [ ! -d $dataDir ]; then
sudo mkdir $dataDir;
sudo chown -R ${userNameList[i]}: $dataDir;
fi"

# copy the zookeeper setup at the specified location on the machines
scp -r -q $setupsLocation/zookeeper-3.3.5 ${userNameList[i]}@${zookeeperServerIPList[i]}:$zookeeperSetupLocation/zookeeper-3.3.5

# create and configure the 'zoo.cfg' and 'myid' files
ssh -t ${userNameList[i]}@${zookeeperServerIPList[i]} "touch $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
echo dataDir=$dataDir >> $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
echo syncLimit=2 >> $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
echo initLimit=5 >> $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
echo clientPort=2181 >> $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
echo $zkServerEntry >> $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
sed -i 's/NEW_LINE/\n/g' $zookeeperSetupLocation/zookeeper-3.3.5/conf/zoo.cfg;
touch $dataDir/myid;
echo $i >> $dataDir/myid;"

# update the /etc/hosts file on all the nodes
hostFileEntry="${zookeeperServerIPList[i]} zoo$i";
for (( j = 1; j <= n; j++ ))
do
ssh -t ${userNameList[j]}@${zookeeperServerIPList[j]} "sudo cp /etc/hosts /etc/hosts.bak;
sudo cp /etc/hosts /etc/hosts1;
sudo chmod 777 /etc/hosts1;
echo \"$hostFileEntry\" >> /etc/hosts1;
sudo mv /etc/hosts1 /etc/hosts;"
done
done
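To make the result concrete, here is what the generated zoo.cfg looks like for a hypothetical three-node cluster (hosts zoo1..zoo3) with data directory /var/zookeeper/data, once the NEW_LINE placeholders have been replaced with real newlines (printed via a here-document purely for illustration):

```shell
# Expected shape of the zoo.cfg the script writes on each node
# (hypothetical hostnames and data directory):
zooCfg=$(cat <<'EOF'
dataDir=/var/zookeeper/data
syncLimit=2
initLimit=5
clientPort=2181
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
EOF
)
echo "$zooCfg"
```

Each node also gets a myid file in its data directory containing just its server number (1, 2 or 3), which is how it finds its own entry in this list.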

Hope it helped. My next blog post is about a small script to start up the installed Zookeeper cluster.

Monday, June 15, 2015

Install and run Augustus on CentOS

Hello folks! If you are visiting this blog you definitely know what Augustus is all about, but just in case, here's a short introduction taken directly from its makers:

Augustus is an open source system for building and scoring statistical models designed to work with data sets that are too large to fit into memory

Although the Augustus documentation is an elaborate and wonderful source of guidelines and information, this blog presents a crisp and condensed bunch of steps you can use to install Augustus and try one of the examples. So open a terminal on your machine and try the following steps:

Step 1. Python 2.6 needs to be installed on the machine as a prerequisite. If it is already installed, check your Python version by typing the command "python" in the terminal. If it is 2.6, the output should resemble:


Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.                                         
>>>                                                                                                                                                                                                                                           

Step 2. Run


sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm                                                                       

Step 3. Run 


sudo yum install numpy                                                                                                                                                                                                                                 

Step 4. Enter the python shell and import


> python2.6
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.                                         
>>> import numpy
>>> numpy.__version__
'1.4.1'
>>>                                                                                                                                                                                                                                         

Step 5. Execute


mkdir AUGUSTUS_INSTALLATION                                                                                                                                                                                                                  

Step 6. Execute


wget http://augustus.googlecode.com/files/Augustus-0.4.4.0.tar.gz                                                                                                                                                                                  

Step 7. Execute 


tar -xzvf Augustus-0.4.4.0.tar.gz                                                                                                                                                                                                      

Step 8. Execute


cd Augustus-0.4.4.0   # check that the setup has a "bin" folder containing files like "AugustusPMMLConsumer"

Step 9. Run


sudo python2.6 setup.py install                                                                                                                                                                                                                                                                  

The last segment of the command output should resemble:


byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/baseline/tools/__init__.py to __init__.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/baseline/tools/userInitializeModels.py to userInitializeModels.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/baseline/tools/xml_fifo_io2.py to xml_fifo_io2.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/baseline/tools/fake_score_handler.py to fake_score_handler.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/baseline/tools/userMySQLInterface.py to userMySQLInterface.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/clustering/__init__.py to __init__.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/clustering/producer/Producer.py to Producer.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/clustering/producer/__init__.py to __init__.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/tree/__init__.py to __init__.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/tree/producer/Producer.py to Producer.pyc
byte-compiling /usr/lib/python2.6/site-packages/augustus/modellib/tree/producer/__init__.py to __init__.pyc
running install_scripts
copying build/scripts-2.6/__setpath.py -> /usr/bin
copying build/scripts-2.6/userInitializeConfigs -> /usr/bin
copying build/scripts-2.6/realpmml -> /usr/bin
copying build/scripts-2.6/unitable -> /usr/bin
copying build/scripts-2.6/AugustusBaselineProducer -> /usr/bin
copying build/scripts-2.6/fake_event_source -> /usr/bin
copying build/scripts-2.6/AugustusClusteringProducer -> /usr/bin
copying build/scripts-2.6/runfifo -> /usr/bin
copying build/scripts-2.6/AugustusTreeProducer -> /usr/bin
copying build/scripts-2.6/munge -> /usr/bin
copying build/scripts-2.6/fake_score_handler -> /usr/bin
copying build/scripts-2.6/userInitializeModels -> /usr/bin
copying build/scripts-2.6/AugustusPMMLConsumer -> /usr/bin
copying build/scripts-2.6/AugustusNaiveBayesProducer -> /usr/bin
copying build/scripts-2.6/userBuildMySQL -> /usr/bin
changing mode of /usr/bin/__setpath.py to 755
changing mode of /usr/bin/userInitializeConfigs to 755
changing mode of /usr/bin/realpmml to 755
changing mode of /usr/bin/unitable to 755
changing mode of /usr/bin/AugustusBaselineProducer to 755
changing mode of /usr/bin/fake_event_source to 755
changing mode of /usr/bin/AugustusClusteringProducer to 755
changing mode of /usr/bin/runfifo to 755
changing mode of /usr/bin/AugustusTreeProducer to 755
changing mode of /usr/bin/munge to 755
changing mode of /usr/bin/fake_score_handler to 755
changing mode of /usr/bin/userInitializeModels to 755
changing mode of /usr/bin/AugustusPMMLConsumer to 755
changing mode of /usr/bin/AugustusNaiveBayesProducer to 755
changing mode of /usr/bin/userBuildMySQL to 755
running install_egg_info
Writing /usr/lib/python2.6/site-packages/Augustus-0.4.4.0-py2.6.egg-info

Step 10. Running


python2.6                                                                                                                                                                                                                                                                              

should return


Python 2.6.5 (r265:79063, Apr 9 2010, 11:16:46)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.                                         
>>> import augustus.const as AUGUSTUS_CONST
>>> AUGUSTUS_CONST._AUGUSTUS_VER
'0.4.2.0'
>>>                                                                                                                                                                                                                                                                                                                      

Installation is complete.

Step 11. To run an example:


> cd AUGUSTUS_INSTALLATION/Augustus-0.4.4.0/examples/basic                                     
> python2.6 top-ten.py ../auto/data/training.nab                                                                                                                                                                            

Sample output:


Field: Date
      ( 5.38500%) '2000-12-14'
      ( 5.27600%) '2000-12-29'
      ( 5.10800%) '2000-12-22'
      ( 5.04100%) '2000-12-18'
      ( 4.99900%) '2000-12-04'
      ( 4.90000%) '2000-12-07'
      ( 4.83400%) '2000-12-03'
      ( 4.76000%) '2000-12-05'
      ( 4.71900%) '2000-12-11'
      ( 4.56300%) '2000-12-26'
Field: Color
      (28.86200%) 'Black'
      (24.45700%) 'Blue'
      (23.75900%) 'Green'
      (22.92200%) 'Red'
Field: Automaker
      (21.98000%) 'Mazda'
      (21.35700%) 'BMW'
      (19.88900%) 'Toyota'
      (18.56900%) 'Volvo'
      (18.20500%) 'Audi'                                                                                                                                                                                                                                                                                                                 

That’s it.  Hope it helped.

Installing SparkMLlib on Linux and Running SparkMLlib implementations

SparkMLlib is Spark's scalable machine learning library. It ships with Apache Spark, runs on any Hadoop2/YARN cluster without any pre-installation, and consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.

The key features of SparkMLlib include:

1. Scalability
2. Performance
3. User-friendly APIs
4. Integration with Spark and its other components

There is nothing special about installing MLlib; it is already included in Spark. So if your machine already has Spark installed and running, there is nothing extra to do for SparkMLlib. You can follow this link to install Spark in standalone mode if not already done.

Running Logistic Regression on SparkMllib


Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually continuous, by estimating probabilities. Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
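In terms of the knobs used in the run below (--algorithm LR --regType L2 --regParam 1.0), binary logistic regression fits a weight vector w by minimizing the averaged logistic loss plus an L2 penalty. Up to MLlib's constant-factor conventions for scaling the penalty, the objective is:

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^{\top} x_i}\right)
\;+\; \lambda \,\lVert w \rVert_2^2,
\qquad y_i \in \{-1, +1\},\quad \lambda = \texttt{regParam}.
```

The fitted model then predicts the probability of the positive class as 1 / (1 + e^(-w·x)).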

Spark provides the 'spark-submit' script to submit jobs to the Spark cluster. The jar spark-assembly-*-cdh*-hadoop*-cdh*.jar comprises all the algorithm implementations.

We shall now run Logistic Regression as below:

Step-1: Export the required environment variables



export JAVA_HOME='your_java_home'                                                                                            
export SPARK_HOME='your_spark_home'

Step-2: Gather the dataset to run the algorithm on



mkdir ~/SparkMLlib
cd ~/SparkMLlib/
wget https://sites.google.com/site/jayatiatblogs/attachments/sample_binary_classification_data.txt                       

Now that you have the data set, copy it to HDFS.


hdfs dfs -mkdir -p /user/${USER}/classification_data
hdfs dfs -put -f $HOME/SparkMLlib/sample_binary_classification_data.txt /user/${USER}/classification_data/                                                                             

Step-3: Submit the job to run Logistic Regression using the 'spark-submit' script



$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master local[2] \
  $SPARK_HOME/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar \
  --algorithm LR --regType L2 --regParam 1.0 \
  /user/${USER}/classification_data/sample_binary_classification_data.txt

If all works fine, you should see the following after a long log message:


Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.                                                                                                                 

Let's do some cleanup of your HDFS.


hdfs dfs -rm -r -skipTrash /user/${USER}/classification_data                                                          

You can run the other implementations of SparkMLlib as well in a similar fashion with the required data.

Good luck.