
Monday, June 15, 2015

Installing sparkling-water and Running sparkling-water's Deep Learning

Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on each node in the Spark cluster and access data stored in data structures of Spark and H2O.

Sparkling Water provides transparent integration for the H2O engine and its machine learning algorithms into the Spark platform, enabling:

1. Use of H2O algorithms in Spark workflow
2. Transformation between H2O and Spark data structures
3. Use of Spark RDDs as input for H2O algorithms
4. Transparent execution of Sparkling Water applications on top of Spark

To install Sparkling Water, a Spark installation is a prerequisite. If you haven't set one up yet, you can follow this link to install Spark in standalone mode.

Installing Sparkling Water


Create a working directory for Sparkling Water


mkdir $HOME/SparklingWater
cd $HOME/SparklingWater/                                                                                                                                                                     

Clone the Sparkling Water repository


git clone https://github.com/0xdata/sparkling-water.git                                                                                                                                                         
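
Depending on the release you check out, the example scripts may expect the project to be built first. Sparkling Water ships a Gradle wrapper, so a build along these lines should do (a sketch; -x test simply skips the test suite):

cd sparkling-water
./gradlew build -x test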

Running Deep Learning on Sparkling Water


Deep Learning is a new area of Machine Learning research which is closer to Artificial Intelligence. Deep Learning algorithms are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher level features are derived from lower level features to form a hierarchical representation. They are part of the broader machine learning field of learning representations of data. Also they learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

1. Download a prebuilt Spark package. This is needed since the existing Spark installation directory is read-only and the examples we shall run need to write to the Spark folder.


wget http://www.apache.org/dyn/closer.cgi/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.3.tgz
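
The download is a gzipped tarball, so extract it into the working directory before pointing SPARK_HOME at it (the directory name assumes the hadoop2.3 prebuilt package from above):

tar -xzf spark-1.2.0-bin-hadoop2.3.tgz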

2. Export the Spark home


export SPARK_HOME="$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3"

3. Run the DeepLearningDemo example from Sparkling Water. It runs Deep Learning on a subset of the airlines dataset (see sparkling-water/examples/smalldata/allyears2k_headers.csv.gz).


bin/run-example.sh DeepLearningDemo                                                                                                                                                        

4. In the long logs of the running job, look for snippets like the following:


Sparkling Water started, status of context:
Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (0,127.0.0.1,54325)
  (1,127.0.0.1,54327)
  (2,127.0.0.1,54321)
  ------------------------
Output of jobs

===> Number of all flights via RDD#count call: 43978
===> Number of all flights via H2O#Frame#count: 43978
===> Number of flights with destination in SFO: 1331
====>Running DeepLearning on the result of SQL query
                                                                                                                                                                

To stop the job, press Ctrl+C. Logs similar to the above provide a lot of information about the job. You can also try running the other algorithm implementations in the same way.

Good Luck.

Running Naive Bayes Classification algorithm using Weka

Wikipedia says, "Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable."

Weka also provides a Naive Bayes classification implementation. Running Weka's algorithms from the command line requires only a very simple Weka setup. All you need is to download the latest release of WEKA (3-6-12 being the latest stable one at the time of writing). Some useful links working at the time of writing this post are:

http://prdownloads.sourceforge.net/weka/weka-3-6-12.zip

or

http://sourceforge.net/projects/weka/files/weka-3-6/3.6.12/weka-3-6-12.zip/download

Next, you'll need to unzip this setup, which will give you a directory named “weka-3-6-12”. We will call it WEKA_HOME for reference in this blog post.
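
The commands below reference $WEKA_HOME, so it helps to actually export it as an environment variable (the path is an assumption; adjust it to wherever you unzipped the archive):

unzip weka-3-6-12.zip
export WEKA_HOME=$HOME/weka-3-6-12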

We shall be proceeding step-by-step here onwards.


Step-1: Download a dataset to run the classification on


The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). You can read more about the dataset here: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

So, first we shall create a folder to store our dataset and then download it.



mkdir ~/WekaDataSet
cd ~/WekaDataSet
wget http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/bank.zip                                   
unzip bank.zip


Step-2: Convert the data from CSV format to ARFF


First we shall create a subset of the entire dataset so as to do a quick test. You can run the test on the entire dataset or other datasets as well later on.



cd bank
head -1000 bank-full.csv > bank-subset.csv
java -cp $WEKA_HOME/weka.jar weka.core.converters.CSVLoader bank-subset.csv > bank-subset-preprocessed.arff

You should see a file called 'bank-subset-preprocessed.arff' in the 'bank' folder.


Step-3: Convert the Numeric data to Nominal using Weka's utility


Weka's filter called 'NumericToNominal' is meant for turning numeric attributes into nominal ones. Unlike discretization, it simply takes all numeric values and adds them to the list of nominal values of that attribute. This is useful after CSV imports to force certain attributes to become nominal, e.g., a class attribute containing values from 1 to 5.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i bank-subset-preprocessed.arff -o bank-subset-preprocessed.nominal.arff

Step-4: Divide a part of the data as train and test data


Let's keep all 1000 records in the training dataset. We shall use another Weka utility called RemovePercentage; its -P option specifies the percentage of instances to remove.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 0 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-train.nominal.arff

For the test dataset we shall use 40 percent of the data, so the -P option needs to be 60.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 60 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-test.nominal.arff
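
If you would rather have training and test sets that don't overlap, RemovePercentage also has (as far as I recall) a -V flag that inverts the selection, so the two commands below would give complementary 60/40 splits (the output file names are just illustrative):

java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 40 -i bank-subset-preprocessed.nominal.arff -o bank-train-60.arff
java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 40 -V -i bank-subset-preprocessed.nominal.arff -o bank-test-40.arff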

Step-5: Train the model


Using the Naive Bayes Classifier of Weka "weka.classifiers.bayes.NaiveBayes", we shall first train the model.
-t option: Specify the location of the train data file
-d option: Specify the name and location of the model file you wish to be generated



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -t bank-subset-preprocessed-train.nominal.arff -d bank-subset-preprocessed-model.arff

Step-6: Test the model


This is the final step. We would test the model for accuracy using the same classifier but with a different option set.
-T option: Specify the location of the test data file
-l option: Specify the location of the created model file



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -T bank-subset-preprocessed-test.nominal.arff -l bank-subset-preprocessed-model.arff

That's it. You can also try the same with different percentages and different datasets.

Hope it helped.

Installing H2O and Running ML Implementations of H2O

H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use.


As per its description, it intelligently combines unique features not currently found in other machine learning platforms, including:

1. Best of Breed Open Source Technology: H2O leverages the most popular open source products like Apache Hadoop and Spark to give customers the flexibility to solve their most challenging data problems.
2. Easy-to-use WebUI and Familiar Interfaces: Set up and get started quickly using either H2O's intuitive web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
3. Data Agnostic Support for all Common Database and File Types: Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
4. Massively Scalable Big Data Analysis: Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O's rapid in-memory distributed parallel processing.
5. Real-time Data Scoring: Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Installing H2O on Linux


Installing H2O on your Linux machine (this section was tested with CentOS 6.6) is very straightforward. Follow the steps below:


#Create a local directory for installation
mkdir H2O
cd H2O
#Download the latest release of H2O
wget http://h2o-release.s3.amazonaws.com/h2o/rel-noether/4/h2o-2.8.4.4.zip
#Unzip the downloaded file
unzip h2o-2.8.4.4.zip
cd h2o-2.8.4.4
#Start H2O
java -jar h2o.jar
                                                                                                                                                             

You should see a log like the one below:


INFO WATER: ----- H2O started -----
INFO WATER: Build git branch: rel-noether
INFO WATER: Build git hash: 4089ab3911999c73dcb611ab2f51cfc9bb86898b
INFO WATER: Build git describe: jenkins-rel-noether-4
INFO WATER: Build project version: 2.8.4.4
INFO WATER: Built by: 'jenkins'
INFO WATER: Built on: 'Sat Feb  7 13:39:20 PST 2015'
INFO WATER: Java availableProcessors: 16
INFO WATER: Java heap totalMemory: 1.53 gb
INFO WATER: Java heap maxMemory: 22.75 gb
INFO WATER: Java version: Java 1.7.0_75 (from Oracle Corporation)
INFO WATER: OS   version: Linux 2.6.32-504.3.3.el6.x86_64 (amd64)
INFO WATER: Machine physical memory: 102.37 gb
                                                                                                                                                               

You can access the Web UI at http://localhost:54321
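
By default the JVM picks the heap size, which may be too small for larger datasets. A hedged sketch for giving H2O more memory and an explicit port (the -port flag is, to the best of my knowledge, supported by this H2O release; adjust the values to your machine):

java -Xmx4g -jar h2o.jar -port 54321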

Running H2O's GLM function on R


We shall be running H2O's GLM on R here. We could also have done it without R using only the Linux command line. But I found it easier this way. 

GLM stands for Generalized Linear Model, a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than the normal distribution.

If you don't have R already installed on your linux box, follow this link.

So we shall perform a couple of tasks to get GLM running on H2O.

Install H2O on R


You have installed H2O and then R; now we need to install the H2O package in R.


Open the R shell by typing "R" in your terminal and then enter the following commands there.   
install.packages("RCurl");
install.packages("rjson");
install.packages("statmod");
install.packages("survival");
q()

Now in your linux terminal type:


cd /location_of_your_H2O_setup/h2o-2.8.4.4
R
install.packages("location_of_your_H2O_setup/h2o-2.8.4.4/R/h2o_2.8.4.4.tar.gz", repos = NULL, type = "source")
library(h2o)
q()
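
To quickly confirm that R can see the package after installation, a one-liner from the Linux shell (just an illustrative check):

Rscript -e 'packageVersion("h2o")'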

If all went fine, congratulate yourself. You have H2O and R and H2O on R installed :-)

Running a Demo


H2O packages examples to demonstrate how its algorithm implementations work. GLM is one of those demos. It uses a small dataset called prostate.csv that ships with the h2o package as its input. This demo performs logistic regression on prostate cancer data.

All you have to do is:


cd /location_of_your_H2O_setup/h2o-2.8.4.4
R                                                                                                                                                           
demo(h2o.glm)

You should see logs like the below:


demo(h2o.glm)

        demo(h2o.glm)
        ---- ~~~~~~~
> # This is a demo of H2O's GLM function
> # It imports a data set, parses it, and prints a summary
> # Then, it runs GLM with a binomial link function using 10-fold cross-validation
> # Note: This demo runs H2O on localhost:54321
> library(h2o)
> localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
Successfully connected to http://localhost:54321
R is connected to H2O cluster:
    H2O cluster uptime:         1 hours 45 minutes
    H2O cluster version:        2.8.4.4
    H2O cluster name:           jayati.tiwari
    H2O cluster total nodes:    1
    H2O cluster total memory:   22.75 GB
    H2O cluster total cores:    16
    H2O cluster allowed cores:  16
    H2O cluster healthy:        TRUE
> prostate.hex = h2o.uploadFile(localH2O, path = system.file("extdata", "prostate.csv", package="h2o"), key = "prostate.hex")
  |======================================================================| 100%
> summary(prostate.hex)
 ID               CAPSULE          AGE             RACE        
 Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
 1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
 Median :190.50   Median :0.0000   Median :67.00   Median :1.000
 Mean   :190.50   Mean   :0.4026   Mean   :66.04   Mean   :1.087
 3rd Qu.:285.25   3rd Qu.:1.0000   3rd Qu.:71.00   3rd Qu.:1.000
 Max.   :380.00   Max.   :1.0000   Max.   :79.00   Max.   :2.000
 DPROS           DCAPS           PSA               VOL          
 Min.   :1.000   Min.   :1.000   Min.   :  0.300   Min.   : 0.00
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  5.000   1st Qu.: 0.00
 Median :2.000   Median :1.000   Median :  8.725   Median :14.25
 Mean   :2.271   Mean   :1.108   Mean   : 15.409   Mean   :15.81
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 17.125   3rd Qu.:26.45
 Max.   :4.000   Max.   :2.000   Max.   :139.700   Max.   :97.60
 GLEASON      
 Min.   :0.000
 1st Qu.:6.000
 Median :6.000
 Mean   :6.384
 3rd Qu.:7.000
 Max.   :9.000

> prostate.glm = h2o.glm(x = c("AGE","RACE","PSA","DCAPS"), y = "CAPSULE", data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
  |======================================================================| 100%
> print(prostate.glm)
IP Address: localhost
Port      : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__ba962660a263d41ab4531103562b4422
Coefficients:
      AGE      RACE     DCAPS       PSA Intercept
 -0.01104  -0.63136   1.31888   0.04713  -1.10896
Normalized Coefficients:
      AGE      RACE     DCAPS       PSA Intercept
 -0.07208  -0.19495   0.40972   0.94253  -0.33707
Degrees of Freedom: 379 Total (i.e. Null);  375 Residual
Null Deviance:     512.3
Residual Deviance: 461.3  AIC: 471.3
Deviance Explained: 0.09945
 Best Threshold: 0.328
Confusion Matrix:
        Predicted
Actual   false true   Error
  false    127  100 0.44053
  true      51  102 0.33333
  Totals   178  202 0.39737

AUC =  0.6887507 (on train)
Cross-Validation Models:
Nonzeros       AUC Deviance Explained
Model 1         4 0.6532738          0.8965221
Model 2         4 0.6316527          0.8752008
Model 3         4 0.7100840          0.8955293
Model 4         4 0.8268698          0.9099155
Model 5         4 0.6354167          0.9079152
Model 6         4 0.6888889          0.8881883
Model 7         4 0.7366071          0.9091687
Model 8         4 0.6711310          0.8917893
Model 9         4 0.7803571          0.9178481
Model 10        4 0.7435897          0.9065831
> myLabels = c(prostate.glm@model$x, "Intercept")
> plot(prostate.glm@model$coefficients, xaxt = "n", xlab = "Coefficients", ylab = "Values")
> axis(1, at = 1:length(myLabels), labels = myLabels)
> abline(h = 0, col = 2, lty = 2)
> title("Coefficients from Logistic Regression\n of Prostate Cancer Data")
> barplot(prostate.glm@model$coefficients, main = "Coefficients from Logistic Regression\n of Prostate Cancer Data")

Great ! Your demo ran fine.

Starting H2O from R


Before we try running GLM from the R shell, we need to start H2O. We shall achieve this from within the R shell itself.


R                                                                                                                                                             
library(h2o)
localH2O <- h2o.init(ip = "localhost", port = 54321, max_mem_size = "4g")

You should see something like:


Successfully connected to http://localhost:54321
                                                                                                                                                                  

R is connected to H2O cluster:
    H2O cluster uptime:         2 hours 3 minutes 
    H2O cluster version:        2.8.4.4 
    H2O cluster name:           jayati.tiwari 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   22.75 GB 
    H2O cluster total cores:    16 
    H2O cluster allowed cores:  16 
    H2O cluster healthy:        TRUE 

This starts H2O. 

Running H2O's GLM from R


In the same R shell continue to run the GLM example now.


prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)

These commands should produce output like the following on your terminal:


|======================================================================| 100%
IP Address: localhost 
Port      : 54321 
Parsed Data Key: prostate.hex 
GLM2 Model Key: GLMModel__8efb9141cab4671715fc8319eae54ca8
Coefficients:
      AGE      RACE     DCAPS       PSA Intercept 
 -0.01104  -0.63136   1.31888   0.04713  -1.10896 
Normalized Coefficients:
      AGE      RACE     DCAPS       PSA Intercept 
 -0.07208  -0.19495   0.40972   0.94253  -0.33707 
Degrees of Freedom: 379 Total (i.e. Null);  375 Residual
Null Deviance:     512.3
Residual Deviance: 461.3  AIC: 471.3
Deviance Explained: 0.09945 
 Best Threshold: 0.328
Confusion Matrix:
        Predicted
Actual   false true   Error
  false    127  100 0.44053
  true      51  102 0.33333
  Totals   178  202 0.39737
AUC =  0.6887507 (on train) 
Cross-Validation Models:
Nonzeros       AUC Deviance Explained
Model 1         4 0.6532738          0.8965221
Model 2         4 0.6316527          0.8752008
Model 3         4 0.7100840          0.8955293
Model 4         4 0.8268698          0.9099155
Model 5         4 0.6354167          0.9079152
Model 6         4 0.6888889          0.8881883
Model 7         4 0.7366071          0.9091687
Model 8         4 0.6711310          0.8917893
Model 9         4 0.7803571          0.9178481
Model 10        4 0.7435897          0.9065831

As you can see, the model has been fit, and metrics such as the confusion matrix and AUC tell you how accurate it is.

Hope it helped !!


Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm involves two phases:

1. Training Phase: When the algorithm learns from the input data and creates a model for reference.
2. Testing Phase: When the algorithm predicts results based on what it has learned, as stored in the model.

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.
2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.

There is a wide range of machine learning libraries that provide implementations of various classes of algorithms. In my coming posts, we shall evaluate the following open-source machine learning APIs on performance, scalability, range of algorithms provided, and extensibility.

1. H2O
2. SparkMLlib
3. Sparkling Water
4. Weka


In the following posts, we shall install each of the above libraries and run one algorithm implementation that is available in all of them. This will give us an insight into ease of use and execution, as well as the performance and accuracy of each implementation.

Sunday, May 26, 2013

Running Weka's Logistic Regression using Command Line

Running Weka's algorithms from the command line requires only a very simple Weka setup. All you need is to download the latest release of WEKA. One of the useful links working at the time of writing this post is:


Next, you'll need to unzip this setup, which will give you a directory named “weka-3-6-9”. We will call it WEKA_HOME for reference in this blog post.

You might want to run Weka’s logistic regression algorithm on two types of input data.
  • One is the sample data files in ARFF format already available in “WEKA_HOME/data”
  • The other is data files you already have in CSV format, for example the donut.csv file provided by Mahout for running its logistic regression.

Running LR over ARFF files

We will use the file “WEKA_HOME/data/weather.nominal.arff” to run the algorithm. cd to WEKA_HOME and run the following command:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t WEKA_HOME/data/weather.nominal.arff -T WEKA_HOME/data/weather.nominal.arff -d /some_location_on_your_machine/weather.nominal.model.arff

which should generate the trained model at “/some_location_on_your_machine/weather.nominal.model.arff” and the console output should look something like:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                                    Class
Variable                              yes
=========================================
outlook=sunny                    -45.2378
outlook=overcast                  57.5375
outlook=rainy                     -5.9067
temperature=hot                   -8.3327
temperature=mild                  44.8546
temperature=cool                 -45.4929
humidity                         118.1425
windy                             72.9648
Intercept                        -89.2032

Odds Ratios...
                                    Class
Variable                              yes
=========================================
outlook=sunny                           0
outlook=overcast      9.73275593611619E24
outlook=rainy                      0.0027
temperature=hot                    0.0002
temperature=mild     3.020787521374072E19
temperature=cool                        0
humidity            2.0353933107400553E51
windy                4.877521304260806E31

Time taken to build model: 0.12 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Error on test data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no                                                                                                                                            

Here the three arguments mean:

  • -t <name of training file> : Sets training file.
  • -T <name of test file> : Sets test file. If missing, a cross-validation will be performed on the training data.
  • -d <name of output file> : Sets model output file. In case the filename ends with '.xml', only the options are saved to the XML file, not the model.

For help on all available arguments, try running the following command from WEKA_HOME:

java -cp ./weka.jar weka.classifiers.functions.Logistic -h                                                              
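
If you drop the -T option, Weka evaluates the model by cross-validation on the training data instead; the number of folds can, as far as I recall, be set with the standard -x evaluation option. A hedged sketch:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t WEKA_HOME/data/weather.nominal.arff -x 10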

Running LR over CSV files

To run Weka's LR over a CSV file, you'll first need to convert it into ARFF format using a converter provided by WEKA. Using the Linux command line, here are the steps:

Step-I: Convert the data into ARFF format. To convert from CSV to ARFF, run the following command from WEKA_HOME:
java -cp ./weka.jar weka.core.converters.CSVLoader someCSVFile.csv > outputARFFFile.arff                                                                     

Step-II: Run the NumericToNominal filter over the arff file
java -cp ./weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i outputARFFFile.arff -o outputARFFFile.nominal.arff                                                           

Step-III: Run the classifier over the outputARFFFile.nominal.arff
java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.nominal.arff -T outputARFFFile.nominal.arff -d outputARFFFile.nominal.model.arff                                        

You might encounter an exception stating

"Cannot handle unary class!"                                                                                                               

To resolve this, apply the attribute filter and eliminate the attribute which has the same value for all the records in the file, using:

java -cp ./weka.jar weka.filters.AttributeFilter -i outputARFFFile.nominal.arff -o outputARFFFile.filtered.nominal.arff -R 8                                        

where the value of “-R” will vary depending on your input file; it is the index of the attribute to be eliminated in the input ARFF file.

After this, try running the classifier on the obtained “outputARFFFile.filtered.nominal.arff” file as in:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.filtered.nominal.arff -T outputARFFFile.filtered.nominal.arff -d outputARFFFile.nominal.model.arff                                 

The output should look much like what we got when running the classifier over the provided sample data above.

With these steps, you are ready to play with WEKA. Go for it. Cheers !!!

Saturday, May 25, 2013

Running Mahout's Logistic Regression



Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post provides all that is required to run it. To install Mahout on your machine, you can refer to my previous post.

Logistic Regression executes in two major phases:
  • Train the model: This step creates a model from some training data; the model can then be used to classify new input data, i.e., test data.
  • Test the model: This step tests the model generated in step 1 by classifying the test data and measuring the accuracy, scores and confusion matrix.

Steps for running Mahout’s LR

Step-I: Get the input data file called donut.csv, which is present in the Mahout setup. For your ready reference I have also shared it; you can download it from here.

Step-II: Next, cd to MAHOUT_HOME. Here we will run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model for us using the “donut.csv” file that we provide as training data. Here's the command to be run from within MAHOUT_HOME:


bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

If the Mahout version is 0.7 you are likely to face the error below:


Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver

    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 1 more

Don’t worry, all you need to do is:


export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar 

After editing the CLASSPATH as mentioned above the command should run successfully and print something like:


color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
      Intercept Term -0.01559
                  xy -0.01559
                  yy -0.01559
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -0.015590929     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000 
13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)

The most important parameters influencing the execution of the Training process are:

"--passes": the number of times to pass over the input data
"--lambda": the amount of coefficient decay to use
"--rate": the learning rate

You can vary the values of these three parameters and observe how the algorithm's behavior changes (see the sketch below). Also, you should now see that the model has been created at the location you specified in the command.
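
For example, you could let the learner make more passes over this tiny dataset with a smaller lambda; the values below are purely illustrative, not tuned:

bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 100 --rate 50 --lambda 0.001 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n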

Step-III: Now it's time to run the classifier using the model trained in Step-II. As test data we will use the same donut.csv file that we used for training, or you can split the file in some ratio, e.g., 70:30, and use the 70% file for training the model and the 30% file for testing (a splitting sketch is given at the end of this post). Here's the command for testing the model and running the classifier:


bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv  --model loc_of_model/donut.model --auc --scores --confusion

which should print an output something like:


"target","model-output","log-likelihood"
0,0.496,-0.685284
0,0.490,-0.674055
0,0.491,-0.675162
1,0.495,-0.703361
1,0.493,-0.706289
0,0.495,-0.683275
0,0.496,-0.685282
0,0.492,-0.677191
1,0.494,-0.704222
0,0.495,-0.684107
0,0.496,-0.684765
1,0.494,-0.705209
0,0.491,-0.675272
1,0.495,-0.703438
0,0.496,-0.685121
0,0.496,-0.684886
0,0.490,-0.672500
0,0.495,-0.682445
0,0.496,-0.684872
1,0.495,-0.703070
0,0.490,-0.672511
0,0.495,-0.683643
0,0.492,-0.677610
1,0.492,-0.708915
0,0.496,-0.684744
1,0.494,-0.704766
0,0.492,-0.677496
1,0.492,-0.708679
0,0.496,-0.685222
1,0.495,-0.703604
0,0.492,-0.677846
0,0.490,-0.672702
0,0.492,-0.676980
0,0.494,-0.681450
1,0.495,-0.702845
0,0.493,-0.679049
0,0.496,-0.684262
1,0.493,-0.706564
1,0.495,-0.704016
0,0.490,-0.672624
AUC = 0.52
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.7, -0.4], [-0.7, -0.5]]
13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)                                                                          

Similarly, you can try it on a variety of datasets that you might have. I have seen up to 93% classification accuracy on a different dataset.
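
If you want to try the 70:30 split mentioned in Step-III, a minimal shell sketch along these lines should work (the row counts and file names are illustrative, and I assume donut.csv keeps its header line in both parts):

#Keep the header, then the first 28 of the 40 data rows, for training
head -1 donut.csv > donut-train.csv
tail -n +2 donut.csv | head -n 28 >> donut-train.csv
#Keep the header plus the remaining 12 data rows for testing
head -1 donut.csv > donut-test.csv
tail -n +2 donut.csv | tail -n 12 >> donut-test.csv
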
All the best !!!

Installing Mahout on Linux


Mahout is a collection of highly scalable machine learning algorithms for very large data sets. Although the real power of Mahout shows only on large HDFS data, Mahout also supports running algorithms on local filesystem data, which can help you get a feel for how to run Mahout algorithms.

Installing Mahout on Linux

Before you can run any Mahout algorithm, you need a Mahout installation on your Linux machine, which can be set up in either of the two ways described below:

Method I- Extracting the tarball


Yes, it is that simple. Just download the latest Mahout release.
Extract the downloaded tarball using:

tar -xzvf /path_to_downloaded_tarball/mahout-distribution-0.x.tar.gz
This should result in a folder named /path_to_downloaded_tarball/mahout-distribution-0.x.
Now, you can run any of the algorithms using the script “bin/mahout” present in the extracted folder. For testing your installation, you can also run 

bin/mahout                                                                                                                                                                      
without any other arguments.

Method II- Building Mahout


1. Prerequisites for Building Mahout
 -   Java JDK 1.6
 -   Maven 2.2 or higher (http://maven.apache.org/)

Install Maven and Subversion using the following commands:
sudo apt-get install maven2                                                                

sudo apt-get install subversion                                                                                                    

2. Create a directory where you want to check out the Mahout code; we'll call it MAHOUT_HOME here:
mkdir MAHOUT_HOME
cd MAHOUT_HOME                                                                                                              


3. Use Subversion to check out the code:
svn co http://svn.apache.org/repos/asf/mahout/trunk                                                                     

4. Compiling
cd MAHOUT_HOME

mvn -DskipTests install                                                                                                           

5. Setting the environment variables
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

export MAHOUT_HOME=/location_of_checked_out_mahout        
export PATH=$PATH:$MAHOUT_HOME                                                                             


After following either of the above methods, you can run any of the available Mahout algorithms with appropriate arguments. Note that you can run the algorithms over HDFS data or local file system data. To run algorithms over data on your local file system, set an environment variable named “MAHOUT_LOCAL” to anything other than an empty string; that forces Mahout to run locally even if HADOOP_CONF_DIR and HADOOP_HOME are set (see the sketch below).
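
For instance, a quick local smoke test could look like this (the value “true” is arbitrary; any non-empty string works):

export MAHOUT_LOCAL=true
cd $MAHOUT_HOME
bin/mahout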
 
To plunge into Mahout by trying out running an algorithm, you can refer to my next post. Hope this proved to be a good starter for you. 
All the best !!!