Jayati Tiwari: Running Mahout's Logistic Regression

Saturday, May 25, 2013

Running Mahout's Logistic Regression

Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post walks through everything required to run it. To install Mahout on your machine, you can refer to my previous post.

Logistic Regression executes in two major phases:

Train the model: This step creates a model from training data. The model can then be used to classify other input data, that is, the test data.

Test the model: This step tests the model generated in step 1 by classifying the test data and evaluating the results: the accuracy (AUC), the scores and the confusion matrix.

Steps for running Mahout’s LR

Step-I: Get the input data file called donut.csv, which is present in the Mahout setup. For your ready reference I have also shared it. You can download it from here.

Step-II: Next, cd to MAHOUT_HOME. Here we will run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model for us using the “donut.csv” file that we provide as training data. Here’s the command to be run from within MAHOUT_HOME:

bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

If the Mahout version is 0.7 you are likely to face the error below:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver

    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 1 more

Don’t worry, all you need to do is:

export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar
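
If the error persists after the export, it can help to confirm that the jar you added really contains the missing class. The following is a minimal Python sketch and not part of Mahout; the helper name jar_has_class is made up for illustration, and it relies only on the fact that a jar is a zip archive.

```python
import zipfile

def jar_has_class(jar_path, class_name):
    """Return True if the jar (a zip archive) contains the given class.

    class_name uses dots, e.g. "org.apache.hadoop.util.ProgramDriver".
    """
    # Inside a jar, a class is stored as a path ending in ".class".
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, calling jar_has_class with the hadoop-core jar path from the export above and "org.apache.hadoop.util.ProgramDriver" should return True if that jar is the right one for your setup.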

After editing the CLASSPATH as mentioned above the command should run successfully and print something like:

color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
      Intercept Term -0.01559
                  xy -0.01559
                  yy -0.01559
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -0.015590929     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000
13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)

The most important parameters influencing the execution of the Training process are:

"--passes": the number of times to pass over the input data
"--lambda": the amount of coefficient decay to use
"--rate": the learning rate

You can vary the values of these 3 parameters and see how the results of the algorithm change. Also, the model file should now exist at the location you specified in the command.
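
To build some intuition for what the three parameters control, here is a toy SGD loop in Python. This is a sketch under stated assumptions, not Mahout's implementation: the function train_sgd, the toy data and the way the decay is applied (a simple multiplicative shrink after each step) are all made up for illustration, and Mahout's own trainer differs in its details.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_sgd(rows, passes=1, rate=0.1, lam=0.0):
    """Train a toy logistic regression with plain SGD.

    rows   : list of (features, target) pairs, target in {0, 1}
    passes : how many times to loop over the data (cf. --passes)
    rate   : step size of each update (cf. --rate)
    lam    : per-step shrinkage of the weights (cf. --lambda)
    Returns the weights, with the intercept in the last slot.
    """
    weights = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(passes):
        for features, target in rows:
            x = list(features) + [1.0]  # constant 1.0 feeds the intercept
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xj in enumerate(x):
                # Step along the gradient, then shrink toward zero.
                weights[j] += rate * (target - p) * xj
                weights[j] *= 1.0 - rate * lam
    return weights

# Toy data (made up): one feature, class 1 when the feature is positive.
toy = [([1.0], 1), ([-1.0], 0)] * 50
one_pass = train_sgd(toy, passes=1)
five_passes = train_sgd(toy, passes=5)
print(one_pass, five_passes)
```

On this toy data, more passes push the feature weight further from zero, while a non-zero lam pulls it back. That is the kind of trade-off you are exploring when you vary the three options on the command line.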

Step-III: Now it’s time to run the classifier using the model trained in Step-II. As test data we will use the same donut.csv file that we used for training. Alternatively, you can split the file in some ratio, e.g. 70-30, and use the 70% file for training the model and the 30% file for testing. Here’s the command for testing the model and running the classifier:

bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv --model loc_of_model/donut.model --auc --scores --confusion

This should print output something like:

"target","model-output","log-likelihood"
0,0.496,-0.685284
0,0.490,-0.674055
0,0.491,-0.675162
1,0.495,-0.703361
1,0.493,-0.706289
0,0.495,-0.683275
0,0.496,-0.685282
0,0.492,-0.677191
1,0.494,-0.704222
0,0.495,-0.684107
0,0.496,-0.684765
1,0.494,-0.705209
0,0.491,-0.675272
1,0.495,-0.703438
0,0.496,-0.685121
0,0.496,-0.684886
0,0.490,-0.672500
0,0.495,-0.682445
0,0.496,-0.684872
1,0.495,-0.703070
0,0.490,-0.672511
0,0.495,-0.683643
0,0.492,-0.677610
1,0.492,-0.708915
0,0.496,-0.684744
1,0.494,-0.704766
0,0.492,-0.677496
1,0.492,-0.708679
0,0.496,-0.685222
1,0.495,-0.703604
0,0.492,-0.677846
0,0.490,-0.672702
0,0.492,-0.676980
0,0.494,-0.681450
1,0.495,-0.702845
0,0.493,-0.679049
0,0.496,-0.684262
1,0.493,-0.706564
1,0.495,-0.704016
0,0.490,-0.672624
AUC = 0.52
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.7, -0.4], [-0.7, -0.5]]
13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)
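
The three columns of the scores output can be read as the actual class, the model's estimated probability of class 1, and the log of the probability the model gave to the actual class. The Python sketch below recomputes the third column and the confusion matrix from the first two. The reading of the confusion matrix (rows are the predicted class, columns the actual class, with a 0.5 threshold) is an assumption inferred from the sample output, and the function names are made up for illustration.

```python
import math

# (target, model-output) pairs copied from the sample output above.
scores = [
    (0, 0.496), (0, 0.490), (0, 0.491), (1, 0.495), (1, 0.493),
    (0, 0.495), (0, 0.496), (0, 0.492), (1, 0.494), (0, 0.495),
    (0, 0.496), (1, 0.494), (0, 0.491), (1, 0.495), (0, 0.496),
    (0, 0.496), (0, 0.490), (0, 0.495), (0, 0.496), (1, 0.495),
    (0, 0.490), (0, 0.495), (0, 0.492), (1, 0.492), (0, 0.496),
    (1, 0.494), (0, 0.492), (1, 0.492), (0, 0.496), (1, 0.495),
    (0, 0.492), (0, 0.490), (0, 0.492), (0, 0.494), (1, 0.495),
    (0, 0.493), (0, 0.496), (1, 0.493), (1, 0.495), (0, 0.490),
]

def log_likelihood(target, p):
    # Log of the probability the model assigned to the actual class.
    return math.log(p) if target == 1 else math.log(1.0 - p)

def confusion(pairs, threshold=0.5):
    # matrix[predicted][actual], predicting 1 only above the threshold.
    matrix = [[0, 0], [0, 0]]
    for target, p in pairs:
        predicted = 1 if p > threshold else 0
        matrix[predicted][target] += 1
    return matrix

for target, p in scores[:4]:
    print(target, p, round(log_likelihood(target, p), 4))
print(confusion(scores))
```

The recomputed log-likelihoods differ slightly from the printed ones in the third decimal place because the model-output column is rounded to three digits. Every model-output here is below 0.5, so all 40 rows are predicted as class 0: 27 of them correctly and 13 incorrectly.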

Similarly, you can try this on a variety of data sets that you might have. I have seen up to 93% classification accuracy on a different data set.
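
The 70-30 split suggested in Step-III can be done with a few lines of Python. This is a minimal sketch, not a Mahout feature: the function name split_rows and the sample rows are made up, and it assumes the first line of the CSV is a header (the training command refers to columns by name, so both files need it).

```python
import random

def split_rows(lines, train_fraction=0.7, seed=42):
    """Split CSV lines into (train, test), repeating the header in both."""
    header, rows = lines[0], list(lines[1:])
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = round(len(rows) * train_fraction)
    return [header] + rows[:cut], [header] + rows[cut:]

# Made-up stand-in for a 40-row CSV file with a header line.
lines = ["x,y,color"] + ["%d,%d,%d" % (i, i * i, i % 2) for i in range(40)]
train, test = split_rows(lines)
print(len(train) - 1, "training rows,", len(test) - 1, "test rows")
```

To use it on a real file, read the lines of donut.csv into a list, call split_rows, and write the two returned lists to separate files. You can then pass the first file to TrainLogistic and the second to RunLogistic via --input.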

All the best !!!

24 comments:

Unknown, June 28, 2013 at 12:33 AM
Hi,
I tried to run the regression as per your steps. However, I am not able to read the data from Hadoop when I specify it in the input path, although the same file is read fine from the OS. Can you please help with the steps for reading data from Hadoop into Mahout directly? I am using Mahout 0.7 on Cloudera CDH4.2.

Thank you.

Aparnesh Gaurav, August 12, 2013 at 5:53 AM
Thanks, it was quite helpful.

Unknown, August 25, 2013 at 4:55 PM
Thanks, Jayati!! Everything went well!!

Aparnesh Gaurav, September 2, 2013 at 4:04 AM
Jayati,

Here it's mentioned that the csv/model are being put in local file system.

Q1. When we run Mahout algorithms on it, does it internally move those files to HDFS, or is it operating in non-Hadoop mode?
Q2. In case it's not running in a distributed fashion, why does it ask for Hadoop jars on its build path?

Regards,
Aparnesh

Unknown, September 13, 2013 at 3:40 PM
I'm trying to understand the "scores" output.

"target","model-output","log-likelihood"
0,0.496,-0.685284

Does this mean the model saw 0, the model predicted 0.496 (a very, very slight lean to 0 over 1, or unfilled over filled) and a 68.5% chance of being accurate? Or how do I understand these 3 column values?

Ndaru Purnomo, September 22, 2013 at 9:12 PM
Thanks for the tutorial, I managed to get the same output.

Just want to ask, what does the data inside donut.csv represent? Which one is the outcome (what are we predicting)?

Anshuman, January 13, 2014 at 5:56 AM
What's new here?? This example is already covered in detail in Mahout in Action.

BigData, April 8, 2014 at 3:15 PM
Nice tutorial Jayanti! The training command went through...when I open donut.model i see some garbled data like @E @^@^@^@^@^@AnB
When I run the STEP III, I see the following error message:
Unexpected mahout/trunk/logreg/donut.model while processing --help|--quiet|--auc|--scores|--confusion||--model

What do you think is happening here?

Unknown, April 12, 2014 at 3:35 PM
What if I want to train and test on a CSV file with two columns, "word" and "class", where each "class" value shows whether the corresponding "word" is positive or negative? Can you give me those two terminal commands? Thank you so much!

Unknown, April 12, 2014 at 3:36 PM
Besides, I could successfully run this tutorial with donut.csv.

Anonymous, August 12, 2014 at 10:31 PM
Hi,
Thank you very much for the tutorial! :) I got the same result.
I wonder if there is a Mahout command to custom-split data into training and testing sets. Do you have any idea about it?

Padmamani, August 18, 2014 at 2:07 AM
Hi Jayati,

I have been trying to classify the 20 newsgroups data using the SGD algorithm but am getting the following error.

[cloudera@localhost classify_scripts]$ mahout trainTestSGD.trainTestSGD.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar
14/08/17 23:24:13 WARN driver.MahoutDriver: Unable to add class: trainTestSGD.trainTestSGD.TrainNewsGroups
java.lang.ClassNotFoundException: trainTestSGD.trainTestSGD.TrainNewsGroups
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
14/08/17 23:24:13 WARN driver.MahoutDriver: No trainTestSGD.trainTestSGD.TrainNewsGroups.props found on classpath, will use command-line arguments only
Unknown program 'trainTestSGD.trainTestSGD.TrainNewsGroups' chosen.

Can you please help me with this in any way?

Unknown, February 2, 2015 at 12:36 AM
Have you implemented any other classification algorithm?

Unknown, February 27, 2015 at 10:38 AM
Can we do logistic regression with input data in Mahout? Can we integrate its result with Hadoop on Ubuntu?

Unknown, April 13, 2017 at 10:56 PM
After reading this blog i very strong in this topics and this blog really helpful to all... explanation are very clear so very easy to understand... thanks a lot for sharing this blog
