## Saturday, May 25, 2013

### Running Mahout's Logistic Regression

Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post walks through everything required to run it. To install Mahout on your machine, you can refer to my previous post.

Logistic Regression executes in two major phases:

- Train the model: build a model from training data; this model can then be used to classify new input, i.e. test data.
- Test the model: evaluate the model generated in the previous step by classifying test data and measuring the accuracy, scores, and confusion matrix.

Steps for running Mahout’s LR

Step-I: Get the input data file called donut.csv, which is present in the Mahout setup. For your ready reference I have also shared it; you can download it from here.

Step-II: Next, cd to MAHOUT_HOME. Here we will run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model using the “donut.csv” file we provide as training data. Here’s the command to run from within MAHOUT_HOME:

 bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

If your Mahout version is 0.7, you are likely to face the error below:

 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
 Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
     at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
     at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
     at java.security.AccessController.doPrivileged(Native Method)
     at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
     at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
     ... 1 more

Don’t worry, all you need to do is:

 export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar
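If the jar sits at a different path in your setup, you can locate it before exporting; a sketch, where the default MAHOUT_HOME path is just an illustration:

```shell
# Adjust this to wherever your Mahout distribution is unpacked.
MAHOUT_HOME=${MAHOUT_HOME:-$HOME/mahout-distribution-0.7}

# If the jar lives elsewhere in your setup, locate it first with:
#   find "$MAHOUT_HOME" -name 'hadoop-core-*.jar'
export CLASSPATH="${CLASSPATH:-}:${MAHOUT_HOME}/lib/hadoop/hadoop-core-0.20.204.0.jar"
```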

After editing the CLASSPATH as mentioned above, the command should run successfully and print something like:

 color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
       Intercept Term -0.01559
                   xy -0.01559
                   yy -0.01559
     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000
     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000
    -0.015590929     0.000000000     0.000000000     0.000000000     0.000000000
     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000
     0.000000000
 13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)

The most important parameters influencing the execution of the Training process are:

"--passes": the number of times to pass over the input data
"--lambda": the amount of coefficient decay to use
"--rate": the learning rate

You can vary the values of these three parameters and observe how the algorithm’s performance changes. You will also find that the model has been created at the location you specified in the command.
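One convenient way to compare settings is to generate one training run per lambda value; a sketch where the input/output paths are the same placeholders as in the command above (it only prints the commands so you can inspect them before running — drop the `echo` to execute them):

```shell
# Print one TrainLogistic invocation per lambda value; remove "echo"
# to actually run them. Paths are placeholders, as in Step-II.
for lambda in 0.1 0.5 1.0; do
  echo bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic \
    --passes 1 --rate 1 --lambda "$lambda" \
    --input loc_of_file/donut.csv --features 21 \
    --output loc_of_models/donut-lambda-"$lambda".model \
    --target color --categories 2 \
    --predictors x y xx xy yy a b c --types n n
done
```

Each model lands in its own file, so the AUC of each can be compared afterwards with RunLogistic.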

Step-III: Now it’s time to run the classifier using the model trained in Step-II. As test data we will use the same donut.csv file we used for training; alternatively, you can split the file in some ratio, e.g. 70-30, and use the 70% portion to train the model and the 30% portion to test it. Here’s the command for testing the model and running the classifier:

 bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv  --model loc_of_model/donut.model --auc --scores --confusion

which should print output along these lines:

 "target","model-output","log-likelihood" 0,0.496,-0.685284 0,0.490,-0.674055 0,0.491,-0.675162 1,0.495,-0.703361 1,0.493,-0.706289 0,0.495,-0.683275 0,0.496,-0.685282 0,0.492,-0.677191 1,0.494,-0.704222 0,0.495,-0.684107 0,0.496,-0.684765 1,0.494,-0.705209 0,0.491,-0.675272 1,0.495,-0.703438 0,0.496,-0.685121 0,0.496,-0.684886 0,0.490,-0.672500 0,0.495,-0.682445 0,0.496,-0.684872 1,0.495,-0.703070 0,0.490,-0.672511 0,0.495,-0.683643 0,0.492,-0.677610 1,0.492,-0.708915 0,0.496,-0.684744 1,0.494,-0.704766 0,0.492,-0.677496 1,0.492,-0.708679 0,0.496,-0.685222 1,0.495,-0.703604 0,0.492,-0.677846 0,0.490,-0.672702 0,0.492,-0.676980 0,0.494,-0.681450 1,0.495,-0.702845 0,0.493,-0.679049 0,0.496,-0.684262 1,0.493,-0.706564 1,0.495,-0.704016 0,0.490,-0.672624 AUC = 0.52 confusion: [[27.0, 13.0], [0.0, 0.0]] entropy: [[-0.7, -0.4], [-0.7, -0.5]] 13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)

Similarly, you can try this on a variety of data sets that you might have. I have seen up to 93% classification accuracy on a different data set.
All the best!!!

### Comments

1. Hi,
I tried to run the regression as per your steps; however, I am not able to read the data from Hadoop when I mention it in the input path, although the same file is read fine from the local OS. Can you please help with any steps for reading data from Hadoop into Mahout directly? I am using Mahout 0.7 on Cloudera CDH4.2.

Thank you.

1. Is there any error message you could send?
That would help me understand your problem.

2. It says that the input file does not exist, but I can see the file when I execute the hdfs dfs -tail command.

3. By any chance, have you set the MAHOUT_LOCAL environment variable? If it is set, the command searches for the input path on the local fs instead of HDFS.
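A quick way to check and clear it from the shell (a sketch):

```shell
# Print the current value (if any), then unset it so the Mahout
# driver falls back to looking at HDFS rather than the local fs.
echo "MAHOUT_LOCAL=${MAHOUT_LOCAL:-<not set>}"
unset MAHOUT_LOCAL
```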

2. Thanks, it was quite helpful.

3. Thanks Jayati!! Everything went well!!

4. Jayati,

Here it's mentioned that the csv/model are put on the local file system.

Q1. When we run Mahout algorithms on it, does it internally move those files to HDFS, or is it operating in non-Hadoop mode?
Q2. If it's not running in distributed fashion, why does it ask for Hadoop jars on its build path?

Regards,
Aparnesh

1. Hi,

The steps in the blog are for running the algorithm on the local fs.

The Hadoop jars are required because the algorithm uses various Hadoop data types and classes, such as SequenceFile, during execution, even though it's not running on HDFS.

Jayati

5. I'm trying to understand the "scores" output.

"target","model-output","log-likelihood"
0,0.496,-0.685284

Does this mean the model saw 0, the model predicted 0.496 (a very, very slight lean to 0 over 1, or unfilled over filled), and there is a 68.5% chance of being accurate? Or how do I understand these 3 column values?

1. Your interpretation is correct to a great extent. "target" is what the output should have been, "model-output" is what the model predicted, and for log-likelihood I've found the following on the web:

"The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is L(θ | x) = P(x | θ)."

Hope that makes it clear.
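To make the third column concrete: the numbers in the output are consistent with the per-row Bernoulli log-likelihood t·ln(p) + (1−t)·ln(1−p), where t is the target and p the model output. Checking the first scores row above (t=0, p=0.496) with awk:

```shell
# ln(1 - 0.496) = ln(0.504): the log of the probability the model
# assigned to the true class 0. It comes out to about -0.6852, matching
# the -0.685284 in the scores output up to rounding of the printed p.
awk 'BEGIN { t = 0; p = 0.496; printf "%.4f\n", t * log(p) + (1 - t) * log(1 - p) }'
```

So the column is not a "percent chance of being accurate"; it is the (always negative) log-probability of the observed target under the model.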

6. Thanks for the tutorial, I managed to get the same output.

Just want to ask, what does the data inside donut.csv represent? Which one is the outcome (what are we predicting)?

1. The data in donut.csv is sample/practice data only. We are trying to predict the value of the field "color" based upon the values of the other fields, called predictor variables, which in our example are "x y xx xy yy a b c".

7. What's new here?? This example is already covered in detail in Mahout in Action.


9. Nice tutorial Jayati! The training command went through... when I open donut.model I see some garbled data like @E @^@^@^@^@^@AnB
When I run Step-III, I see the following error message:
Unexpected mahout/trunk/logreg/donut.model while processing --help|--quiet|--auc|--scores|--confusion||--model

What do you think is happening here?

1. I am using Mahout 0.9

10. What if I want to train and test a csv file with two columns, "word" and "class", where each "class" shows whether the corresponding "word" is positive or negative? Can you give me those two terminal commands? Thank you so much!

11. Besides, I could successfully run this tutorial with donut.csv.

12. Hi,
Thank you very much for the tutorial! :) I got the same result.
I wonder if there is a Mahout command to split data into training and testing sets. Do you have any idea about it?

1. Thanks.

I am not aware of such a command in Mahout. Would have to check.

13. Hi Jayati,

I have been trying to classify 20 newsgroup data using SGD algorithm but getting the following error.

[cloudera@localhost classify_scripts]$mahout trainTestSGD.trainTestSGD.TrainNewsGroups${WORK_DIR}/20news-bydate/20news-bydate-train/
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar
14/08/17 23:24:13 WARN driver.MahoutDriver: Unable to add class: trainTestSGD.trainTestSGD.TrainNewsGroups
java.lang.ClassNotFoundException: trainTestSGD.trainTestSGD.TrainNewsGroups
at java.security.AccessController.doPrivileged(Native Method)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
14/08/17 23:24:13 WARN driver.MahoutDriver: No trainTestSGD.trainTestSGD.TrainNewsGroups.props found on classpath, will use command-line arguments only
Unknown program 'trainTestSGD.trainTestSGD.TrainNewsGroups' chosen.

14. Have you implemented any other classification algorithm?

15. Can we do logistic regression with input data in Mahout? Can we integrate its result with Hadoop on Ubuntu?
