Sunday, May 26, 2013

Installing RabbitMQ over Ubuntu/CentOS

This post is going to walk you through the steps to install RabbitMQ on:
  • an Ubuntu machine
  • a CentOS machine
Installing RabbitMQ on Ubuntu

Step-I: Get the setup from
http://www.rabbitmq.com/install-debian.html                                                                           
Version: rabbitmq-server_3.0.2-1_all.deb

Step-II: Install the .deb package using
sudo dpkg -i rabbitmq-server_3.0.2-1_all.deb                                                                           

Step-III: Start or stop the RabbitMQ server/broker using
sudo /etc/init.d/rabbitmq-server start
sudo /etc/init.d/rabbitmq-server stop

Step-IV: Check the status of the server using
sudo rabbitmqctl status
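As a quick sanity check (assuming the default node on localhost), you can also list the queues on the freshly installed broker:

# Start the broker and query it; a fresh install should report a running
# node and an empty queue list.
sudo /etc/init.d/rabbitmq-server start
sudo rabbitmqctl status
sudo rabbitmqctl list_queues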

Installing RabbitMQ on CentOS

Step-I: Get the setup from
http://www.rabbitmq.com/install-rpm.html                                                                             
Version: rabbitmq-server-3.0.2-1.noarch.rpm

If your machine runs an EL5 release (CentOS 5.x; you can check with the command "lsb_release -a"), enable the EPEL repository by running:
su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm'

If it runs an EL6 release (CentOS 6.x; again, check with "lsb_release -a"), run instead:
su -c 'rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm'

Once EPEL is enabled, packages from it can be installed with "su -c 'yum install <package>'" (the 'yum install foo' line in the stock EPEL instructions uses 'foo' only as a placeholder package name).
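If you'd rather not check the release by hand, here is a rough sketch that picks the matching EPEL release RPM automatically (it assumes lsb_release is available and reuses the URLs quoted above, which may have moved since):

# Detect the CentOS major version and enable the matching EPEL repository.
major=$(lsb_release -rs | cut -d. -f1)
case "$major" in
  5) url=http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm ;;
  6) url=http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm ;;
  *) echo "Unhandled release: $(lsb_release -rs)" >&2; exit 1 ;;
esac
su -c "rpm -Uvh $url"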

Step-II: Get the Erlang repository using:
sudo wget -O /etc/yum.repos.d/epel-erlang.repo http://repos.fedorapeople.org/repos/peter/erlang/epel-erlang.repo

Step-III: Install Erlang using:
sudo yum install erlang                                                                                                              

Step-IV: You need to import a signing key for RabbitMQ, using the command:
sudo rpm --import http://www.rabbitmq.com/rabbitmq-signing-key-public.asc                                                                 

Step-V: Install the RPM downloaded in Step-I using:
sudo yum install rabbitmq-server-3.0.2-1.noarch.rpm                                                                   

Step-VI: Start/Stop rabbitmq server using
sudo /sbin/service rabbitmq-server start
sudo /sbin/service rabbitmq-server stop                                                                                           

Some Extra Notes
  • If you ever feel the need to clear all messages from the broker, run the following commands (note that force_reset returns the node to a fresh state, wiping all queues, exchanges and other configuration):
rabbitmqctl stop_app
rabbitmqctl force_reset
/etc/init.d/rabbitmq-server stop
/etc/init.d/rabbitmq-server start                                                                                                  
  • If you need to override some RabbitMQ server parameters beyond the defaults, for example "disk_free_limit", create a file called “rabbitmq.config” and place it in “/etc/rabbitmq”; the server reads it at startup. Here’s a sample config file for your ready reference (a bare integer value for disk_free_limit is interpreted in bytes):
[
  {rabbit, [{disk_free_limit, 1000}]}
].
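RabbitMQ also accepts a memory-relative form for this limit; as a small optional variation (not required by the steps above), the commands below write such a config in place and restart the broker to pick it up:

# Set the disk free limit to the machine's total RAM using the
# {mem_relative, 1.0} form, then restart the broker.
sudo tee /etc/rabbitmq/rabbitmq.config > /dev/null <<'EOF'
[
  {rabbit, [{disk_free_limit, {mem_relative, 1.0}}]}
].
EOF
sudo /etc/init.d/rabbitmq-server restart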


All the very best !!!

Running Weka's Logistic Regression using Command Line

Running Weka’s algorithms from the command line requires only a very simple Weka setup. All you need is to download the latest release of WEKA from the official Weka website.

Next, you’ll need to unzip this setup, which gives you a directory named “weka-3-6-9” (for the release used here). We will call it WEKA_HOME for reference in this blog post.

You might want to run Weka’s logistic regression algorithm on two types of input data:
  • the sample data files in ARFF format already available in “WEKA_HOME/data”
  • data files that you already have in CSV format, for example the donut.csv file provided by Mahout for running its Logistic Regression

Running LR over ARFF files

We would be using the file “WEKA_HOME/data/weather.nominal.arff” for running the algorithm. cd to WEKA_HOME and run the following command

java -cp ./weka.jar weka.classifiers.functions.Logistic -t ./data/weather.nominal.arff -T ./data/weather.nominal.arff -d /some_location_on_your_machine/weather.nominal.model.arff

which should generate the trained model at “/some_location_on_your_machine/weather.nominal.model.arff” and the console output should look something like:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                                    Class
Variable                              yes
=========================================
outlook=sunny                    -45.2378
outlook=overcast                  57.5375
outlook=rainy                     -5.9067
temperature=hot                   -8.3327
temperature=mild                  44.8546
temperature=cool                 -45.4929
humidity                         118.1425
windy                             72.9648
Intercept                        -89.2032

Odds Ratios...
                                    Class
Variable                              yes
=========================================
outlook=sunny                           0
outlook=overcast      9.73275593611619E24
outlook=rainy                      0.0027
temperature=hot                    0.0002
temperature=mild     3.020787521374072E19
temperature=cool                        0
humidity            2.0353933107400553E51
windy                4.877521304260806E31

Time taken to build model: 0.12 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Error on test data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no                                                                                                                                            

Here the three arguments mean:

  • -t <name of training file> : Sets training file.
  • -T <name of test file> : Sets test file. If missing, a cross-validation will be performed on the training data.
  • -d <name of output file> : Sets model output file. In case the filename ends with '.xml', only the options are saved to the XML file, not the model.

For help on all available arguments, try running the following command from WEKA_HOME:

java -cp ./weka.jar weka.classifiers.functions.Logistic -h                                                              
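The model written with -d can also be reloaded later instead of retraining. Here is a small sketch using Weka's generic -l (load model) option together with the paths from the run above:

# Evaluate a previously saved model on a test ARFF file without retraining.
java -cp ./weka.jar weka.classifiers.functions.Logistic \
  -l /some_location_on_your_machine/weather.nominal.model.arff \
  -T ./data/weather.nominal.arff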

Running LR over CSV files

For running Weka’s LR over a CSV file, you’ll need to convert it into ARFF format using a converter provided by WEKA. Using the command line on Linux, here are the steps:

Step-I: Convert the data from CSV to ARFF format by running the following command from WEKA_HOME:
java -cp ./weka.jar weka.core.converters.CSVLoader someCSVFile.csv > outputARFFFile.arff                                                                     

Step-II: Run the NumericToNominal filter over the arff file
java -cp ./weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i outputARFFFile.arff -o outputARFFFile.nominal.arff                                                           

Step-III: Run the classifier over the outputARFFFile.nominal.arff
java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.nominal.arff -T outputARFFFile.nominal.arff -d outputARFFFile.nominal.model.arff                                        

You might encounter an exception stating

"Cannot handle unary class!"                                                                                                               

To resolve this, apply the attribute filter to eliminate the attribute that has the same value for all the records in the file:

java -cp ./weka.jar weka.filters.AttributeFilter -i outputARFFFile.nominal.arff -o outputARFFFile.filtered.nominal.arff -R 8                                        

where the value passed to “-R” is the index of the attribute to be eliminated and will vary depending on your input ARFF file.
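To find the right index for -R, it helps to number the attribute declarations in the ARFF header; a quick sketch:

# Number the @attribute lines; the printed position is the index to pass to -R.
grep -i "@attribute" outputARFFFile.nominal.arff | cat -n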

After this, try running the classifier on the obtained “outputARFFFile.filtered.nominal.arff” file as in:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.filtered.nominal.arff -T outputARFFFile.filtered.nominal.arff -d outputARFFFile.nominal.model.arff                                 

The output should look much like what we got when running the classifier over the provided sample data above.

With these steps, you are ready to play with WEKA. Go for it. Cheers !!!

Saturday, May 25, 2013

Running Mahout's Logistic Regression



Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post walks through everything required to run it. To install Mahout on your machine, you can refer to my previous post.

Logistic Regression executes in two major phases:
  • Train the model: build a model from training data; this model can then be used to classify new input data, i.e. test data.
  • Test the model: evaluate the model generated in the previous phase by classifying test data and measuring the accuracy, scores and confusion matrix.

Steps for running Mahout’s LR

Step-I: Get the input data file called donut.csv, which is present in the Mahout setup; for your ready reference I have also shared a copy.

Step-II: Next, cd to MAHOUT_HOME. We will run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model for us using the “donut.csv” file that we provide as training data. Here’s the command to be run from within MAHOUT_HOME:


bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

If the Mahout version is 0.7 you are likely to face the error below:


Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver

    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 1 more

Don’t worry, all you need to do is:


export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar 

After editing the CLASSPATH as mentioned above, the command should run successfully and print something like:


color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
      Intercept Term -0.01559
                  xy -0.01559
                  yy -0.01559
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -0.015590929     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000 
13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)

The most important parameters influencing the training process are:

  • "--passes": the number of times to pass over the input data
  • "--lambda": the amount of coefficient decay to use
  • "--rate": the learning rate

You can vary the values of these three parameters and observe the change in the algorithm’s performance; a rough sketch of such an experiment follows below. Also, you should now find that the model has been created at the location you specified in the command.
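For example, here is a rough sketch (output locations are placeholders) that retrains the model with different numbers of passes so that the resulting models can be compared:

# Train one model per pass count; each run prints its coefficients and
# writes a separate model file under /tmp.
for p in 1 10 100; do
  bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic \
    --passes $p --rate 1 --lambda 0.5 \
    --input loc_of_file/donut.csv --features 21 \
    --output /tmp/donut_${p}pass.model \
    --target color --categories 2 \
    --predictors x y xx xy yy a b c --types n n
done

Each model file can then be fed to RunLogistic in Step-III to compare the resulting AUC values.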

Step-III: Now it’s time to run the classifier using the model trained in Step-II. As test data we will use the same donut.csv file that we used for training; alternatively, you can split the file in some ratio, e.g. 70-30, train the model on the 70% portion and test on the 30% portion (a quick way to do such a split is sketched after the sample output below). Here’s the command for testing the model and running the classifier:


bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv  --model loc_of_model/donut.model --auc --scores --confusion

which should print an output something like:


"target","model-output","log-likelihood"
0,0.496,-0.685284
0,0.490,-0.674055
0,0.491,-0.675162
1,0.495,-0.703361
1,0.493,-0.706289
0,0.495,-0.683275
0,0.496,-0.685282
0,0.492,-0.677191
1,0.494,-0.704222
0,0.495,-0.684107
0,0.496,-0.684765
1,0.494,-0.705209
0,0.491,-0.675272
1,0.495,-0.703438
0,0.496,-0.685121
0,0.496,-0.684886
0,0.490,-0.672500
0,0.495,-0.682445
0,0.496,-0.684872
1,0.495,-0.703070
0,0.490,-0.672511
0,0.495,-0.683643
0,0.492,-0.677610
1,0.492,-0.708915
0,0.496,-0.684744
1,0.494,-0.704766
0,0.492,-0.677496
1,0.492,-0.708679
0,0.496,-0.685222
1,0.495,-0.703604
0,0.492,-0.677846
0,0.490,-0.672702
0,0.492,-0.676980
0,0.494,-0.681450
1,0.495,-0.702845
0,0.493,-0.679049
0,0.496,-0.684262
1,0.493,-0.706564
1,0.495,-0.704016
0,0.490,-0.672624
AUC = 0.52
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.7, -0.4], [-0.7, -0.5]]
13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)                                                                          

Similarly, you can try this on a variety of data sets that you might have; on a different data set I have seen classification accuracy of up to 93%.
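If you want to try the 70-30 split mentioned in Step-III, here is a rough way to do it on the command line (it assumes donut.csv has a header row plus 40 data rows, as in the run above):

# Keep the header in both files; put the first 28 data rows in the training
# file and the remaining 12 in the test file.
head -n 1 donut.csv > header.csv
tail -n +2 donut.csv | head -n 28 | cat header.csv - > donut-train.csv
tail -n +2 donut.csv | tail -n 12 | cat header.csv - > donut-test.csv

Train on donut-train.csv and pass donut-test.csv as the --input to RunLogistic.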
All the best !!!

Installing Mahout on Linux


Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although Mahout’s real power shows only on large HDFS data, it also supports running algorithms on local filesystem data, which helps you get a feel for how Mahout algorithms are run.

Installing Mahout on Linux

Before you can run any Mahout algorithm, you need a Mahout installation ready on your Linux machine, which can be set up in either of the two ways described below:

Method I- Extracting the tarball


Yes, it is that simple. Just download the latest Mahout release from the Apache Mahout download page.
Extract the downloaded tarball using:

tar -xzvf /path_to_downloaded_tarball/mahout-distribution-0.x.tar.gz
This should result in a folder named /path_to_downloaded_tarball/mahout-distribution-0.x
Now, you can run any of the algorithms using the script “bin/mahout” present in the extracted folder. For testing your installation, you can also run 

bin/mahout                                                                                                                                                                      
without any other arguments.

Method II- Building Mahout


1. Prerequisites for Building Mahout
 -   Java JDK 1.6
 -   Maven 2.2 or higher (http://maven.apache.org/)

Install Maven and Subversion using the following commands:
sudo apt-get install maven2                                                                

sudo apt-get install subversion                                                                                                    

2. Create a directory where you want to check out the Mahout code; we’ll call it MAHOUT_HOME here:
mkdir MAHOUT_HOME
cd MAHOUT_HOME                                                                                                              


3. Use Subversion to check out the code:
svn co http://svn.apache.org/repos/asf/mahout/trunk                                                                     

4. Compiling
cd MAHOUT_HOME

mvn -DskipTests install                                                                                                           

5. Setting the environment variables
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

export MAHOUT_HOME=/location_of_checked_out_mahout        
export PATH=$PATH:$MAHOUT_HOME                                                                             


After following either of the above methods, you can now run any of the available Mahout algorithms with appropriate arguments. Note that algorithms can run over HDFS data or local filesystem data. To run over data on your local filesystem, set an environment variable named “MAHOUT_LOCAL” to anything other than an empty string; that forces Mahout to run locally even if HADOOP_CONF_DIR and HADOOP_HOME are set.
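For instance, a minimal sketch of running locally (the checkout location is a placeholder):

# Force Mahout to read from the local filesystem even if Hadoop variables are set.
export MAHOUT_LOCAL=true        # any non-empty string works
cd /location_of_checked_out_mahout
bin/mahout                      # with no arguments, this lists the available programs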
 
To plunge into Mahout by actually running an algorithm, you can refer to my next post. Hope this proved to be a good starter for you.
All the best !!!

Wednesday, May 22, 2013

Kafka Monitoring using JMX-JMXTrans-Ganglia


Monitoring Kafka clusters using Ganglia is a matter of a few steps. This blog post lists those steps, assuming that you have your Kafka cluster ready.

Step-I: Set up JMXTrans on all the machines of the Kafka cluster, as done for the Storm cluster in the previous post.

Step-II: In the Kafka setup, edit the “kafka-run-class.sh” script by adding the following line to it:

KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false "

Step-III: Also edit the “kafka-server-start.sh” script in the Kafka setup to set the JMX port to 9999 by adding the following line:

export JMX_PORT=${JMX_PORT:-9999}                                      
Now, on all the nodes of the cluster where you have performed the above steps, you can run the following JSON file with JMXTrans. The sample below uses the KeyOutWriter to dump metrics to local files; switch the output writer to the GangliaWriter (as shown in the Storm post below) to have the metrics reported to your Ganglia server.
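Before wiring up JMXTrans, it is worth confirming that the broker’s JMX port is actually open. A quick check (the script and config file names below are the usual ones in a Kafka setup and may differ in yours):

# Start the broker with the edited scripts, then verify that the JVM is
# listening on the JMX port set above (9999).
bin/kafka-server-start.sh config/server.properties &
sleep 5
netstat -tlnp 2>/dev/null | grep 9999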

Sample JSON file

Run the code below in the form of a json file using the following command:

/usr/share/jmxtrans/jmxtrans.sh start /path_to_sample_json/example.json       

Note: Please change the paths of output files in the code below to paths accessible on your cluster machines.

{
  "servers" : [ {
    "port" : "9999",        <--- Defined Kafka JMX Port
    "host" : "127.0.0.1",   <--- Kafka Server
    "queries" : [ {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
        "settings" : {
          "outputFile" : "/home/jayati/JMXTrans/kafkaStats/bufferPool_direct_stats.txt",
          "v31" : false
        }
      } ],
      "obj" : "java.nio:type=BufferPool,name=direct",
      "resultAlias" : "bufferPool.direct",
      "attr" : [ "Count", "MemoryUsed", "Name", "ObjectName", "TotalCapacity" ]
    }, {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
        "settings" : {
          "outputFile" : "/home/jayati/JMXTrans/kafkaStats/bufferPool_mapped_stats.txt",
          "v31" : false
        }
      } ],
      "obj" : "java.nio:type=BufferPool,name=mapped",
      "resultAlias" : "bufferPool.mapped",
      "attr" : [ "Count", "MemoryUsed", "Name", "ObjectName", "TotalCapacity" ]
    }, {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
        "settings" : {
          "outputFile" : "/home/jayati/JMXTrans/kafkaStats/kafka_log4j_stats.txt",
          "v31" : false
        }
      } ],
      "obj" : "kafka:type=kafka.Log4jController",
      "resultAlias" : "kafka.log4jController",
      "attr" : [ "Loggers" ]
    }, {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.KeyOutWriter",
        "settings" : {
          "outputFile" : "/home/jayati/JMXTrans/kafkaStats/kafka_socketServer_stats.txt",
          "v31" : false
        }
      } ],
      "obj" : "kafka:type=kafka.SocketServerStats",
      "resultAlias" : "kafka.socketServerStats",
      "attr" : [ "AvgFetchRequestMs", "AvgProduceRequestMs", "BytesReadPerSecond", "BytesWrittenPerSecond", "FetchRequestsPerSecond", "MaxFetchRequestMs", "MaxProduceRequestMs", "NumFetchRequests", "NumProduceRequests", "ProduceRequestsPerSecond", "TotalBytesRead", "TotalBytesWritten", "TotalFetchRequestMs", "TotalProduceRequestMs" ]
    } ],
    "numQueryThreads" : 2
  } ]
}
Get high on the Ganglia graphs showing your Kafka Cluster metrics. :) 
All the best !!!

Sunday, May 19, 2013

Storm Monitoring using JMX-JMXTrans-Ganglia

Though Storm supports a full-fledged UI, in applications where Ganglia is used as a kind of universal tool for displaying the metrics of the nodes across clusters of the various technologies involved, it becomes essential to have the Storm cluster nodes report their metrics to Ganglia as well.

Since, as of now, Storm has no in-built support for Ganglia monitoring like we have in Hadoop, HBase etc., we need to achieve this using JMXTrans. This post is about how to set up JMXTrans and configure the Storm cluster nodes so that the target can be achieved.

Setting up JMXTrans: 

Follow the steps below to set up JMXTrans so that it acts as a bridge between the Storm cluster nodes and Ganglia, forming the reporting channel to Ganglia.
  • Obtain the setup jmxtrans_20121016-175251-ab6cfd36e3-1_all.deb and extract it on all the machines of the Storm cluster
  • Copy /path_to_extracted_jmx_setup/jmxtrans_20121016-175251-ab6cfd36e3-01_all/data/usr/share/jmxtrans to /usr/share/jmxtrans 
  • Now any .json file can be run using the following command:
/usr/share/jmxtrans/jmxtrans.sh start /path_to_json/example.json       
  • And jmxtrans can be stopped using 
/usr/share/jmxtrans/jmxtrans.sh stop                                                  

To configure the Storm daemons to expose their metrics over JMX, add the following entries to the node’s storm.yaml (conf/storm.yaml), as described per daemon below.


“storm.yaml”- Supervisor Nodes


Add the following to the "conf/storm.yaml" on all the supervisor nodes of the cluster


worker.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=1%ID%"

With this, the JSON files can pull worker metrics from ports 16700, 16701, 16702 and 16703 (1%ID% expands to 1 followed by the worker's port, e.g. 16700 for a worker on port 6700). Also add the following to report the metrics of the JVM running the supervisor itself:


supervisor.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=any_open_port_number"

“storm.yaml”- Nimbus


For Nimbus, specify just the following; the presence of the above entries will not affect its performance.


nimbus.childopts: " -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=any_open_port_number"

The rest is done in the JSON files. The Storm cluster machines do not need a Ganglia monitoring daemon (gmond) running locally on every node; they can also report to a remote gmond. Finally, run the JSON files as mentioned above on each of the Storm cluster nodes you want to monitor; a minimal launch sketch follows the sample files below.


Sample JSON- Storm Workers



{
  "servers" : [ {
    "port" : "16700",       <--- Defined Storm JMX Port
    "host" : "127.0.0.1",   <--- Storm Worker
    "queries" : [ {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings" : {
          "groupName" : "workerMemory",
          "host" : "ip_of_gmond_server",
          "port" : 8649,
          "v3.1" : false
        }
      } ],
      "obj" : "java.lang:type=ClassLoading",
      "attr" : [ "LoadedClassCount", "UnloadedClassCount" ]
    } ],
    "numQueryThreads" : 2
  } ]
}
Sample JSON- Storm Supervisors



{
  "servers" : [ {
    "port" : "assigned_port_no, e.g. 10000",   <--- Defined Storm JMX Port
    "host" : "127.0.0.1",                      <--- Storm Supervisor
    "queries" : [ {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings" : {
          "groupName" : "SuperVisorMemory",
          "host" : "ip_of_gmond_server",
          "port" : 8649,
          "v3.1" : false
        }
      } ],
      "obj" : "java.lang:type=Memory",
      "resultAlias" : "supervisor",
      "attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ]
    } ],
    "numQueryThreads" : 2
  } ]
}

Sample JSON- Storm Nimbus



{
  "servers" : [ {
    "port" : "assigned_port_no",   <--- Defined Storm JMX Port
    "host" : "127.0.0.1",          <--- Storm Nimbus
    "queries" : [ {
      "outputWriters" : [ {
        "@class" : "com.googlecode.jmxtrans.model.output.GangliaWriter",
        "settings" : {
          "groupName" : "NimbusMemory",
          "host" : "ip_of_gmond_server",
          "port" : 8649,
          "v3.1" : false
        }
      } ],
      "obj" : "java.lang:type=Memory",
      "resultAlias" : "nimbus",
      "attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ]
    } ],
    "numQueryThreads" : 2
  } ]
}
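As a final check, here is a minimal sketch (the JSON path is a placeholder) for launching JMXTrans with one of these files on a node and confirming that it is running:

# Start jmxtrans with a sample JSON file and confirm the process is up.
/usr/share/jmxtrans/jmxtrans.sh start /path_to_json/storm_nimbus.json
ps aux | grep [j]mxtrans        # the jmxtrans java process should show up
# The metrics should then appear under the configured groupName in the Ganglia UI.
# To stop reporting later: /usr/share/jmxtrans/jmxtrans.sh stop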

With the help of the above JSON sample files, your Storm cluster nodes can start reporting their metrics to the Ganglia web UI.
All the best !!!