Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post lists everything required to run it. To install Mahout on your machine, you can refer to my previous post.
Logistic Regression executes in two major phases:
- Train the model: This step creates a model from some training data. The model can then be used to classify other input data, that is, the test data.
- Test the model: This step tests the model generated in step 1 by classifying the test data and reporting the accuracy, scores and confusion matrix.
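To make the testing phase concrete, here is a minimal Python sketch of how accuracy and a confusion matrix can be derived from (target, score) pairs. The `evaluate` function, the 0.5 threshold, the sample data and the matrix orientation (rows are predicted class, columns are actual class) are all assumptions for illustration, not Mahout's code.

```python
def evaluate(rows, threshold=0.5):
    """rows: iterable of (target, score); score is the predicted probability of class 1."""
    # confusion[predicted][actual]; this orientation is an assumption
    confusion = [[0, 0], [0, 0]]
    for target, score in rows:
        predicted = 1 if score >= threshold else 0
        confusion[predicted][target] += 1
    correct = confusion[0][0] + confusion[1][1]
    total = sum(sum(row) for row in confusion)
    return confusion, correct / total


# Hypothetical scores, not taken from donut.csv
sample = [(0, 0.2), (0, 0.3), (1, 0.8), (1, 0.4)]
confusion, accuracy = evaluate(sample)
print(confusion, accuracy)
```

In this toy sample three of the four rows land on the diagonal, so the accuracy is 0.75.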
Steps for running Mahout’s LR
Step-I: Get the input data file, donut.csv, which ships with the Mahout setup. For your ready reference I have also shared it. You can download it from here.
Step-II: Next, cd to MAHOUT_HOME. Here we run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model for us using the “donut.csv” file as training data. Here’s the command to be run from within MAHOUT_HOME:
bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 \
  --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model \
  --target color --categories 2 --predictors x y xx xy yy a b c --types n n
If the Mahout version is 0.7 you are likely to face the error below:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
... 1 more
Don’t worry, all you need to do is:
export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar
After editing the CLASSPATH as mentioned above the command should run successfully and print something like:
color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
Intercept Term -0.01559
xy -0.01559
yy -0.01559
0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 -0.015590929 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)
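As a rough guide to reading the `color ~ ...` line, the following Python sketch treats the printed coefficients as the weights of an ordinary logistic model: a weighted sum passed through the sigmoid function. This is an assumption for illustration; Mahout hashes features into a 21-slot vector internally, so its exact scoring may differ. The input row is hypothetical and not taken from donut.csv.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients copied from the training output printed above
coefficients = {"Intercept Term": -0.01559, "xy": -0.01559, "yy": -0.01559}


def predict(row):
    """Return the modelled probability of class 1 for one row of predictor values."""
    z = coefficients["Intercept Term"]
    for name, weight in coefficients.items():
        if name != "Intercept Term":
            z += weight * row[name]
    return sigmoid(z)


# Hypothetical predictor values, for illustration only
p = predict({"xy": 0.5, "yy": 0.5})
print(round(p, 3))  # prints 0.492
```

With coefficients this close to zero, the prediction stays near 0.5 for any small input, which is what one would expect from a model trained with a single pass.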
The most important parameters influencing the execution of the training process are:
"--passes": the number of times to pass over the input data
"--lambda": the amount of coefficient decay to use
"--rate": the learning rate
You can vary the values of these three parameters and observe the change in the performance of the algorithm. Also, the model will now have been created at the location you specified in the command.
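To get a feel for what these three parameters do, here is a minimal Python sketch of plain stochastic gradient descent for logistic regression with an L2-style weight decay. It is not Mahout's actual update rule; the `train` function, the decay form and the toy data are assumptions chosen only to show the role of each parameter.

```python
import math


def train(rows, passes, rate, lam):
    """rows: list of (features, target); features is a list of floats, target is 0 or 1."""
    weights = [0.0] * len(rows[0][0])
    for _ in range(passes):  # like --passes: number of sweeps over the data
        for features, target in rows:
            z = sum(w * x for w, x in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            error = target - p
            for i, x in enumerate(features):
                # like --rate: step size; like --lambda: pulls each weight toward zero
                weights[i] += rate * (error * x - lam * weights[i])
    return weights


# Hypothetical toy data: first feature is a constant 1.0 acting as the intercept
rows = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 1.5], 1), ([1.0, -1.0], 0)]
plain = train(rows, passes=10, rate=0.1, lam=0.0)
decayed = train(rows, passes=10, rate=0.1, lam=0.5)
print(plain, decayed)
```

On this toy data the weight learned with decay ends up smaller in magnitude than the one learned without it, which is the general effect of a larger decay value: coefficients are held closer to zero.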
Step-III: Now it’s time to run the classifier using the model that was trained in Step-II. As test data we use the same donut.csv file that we used for training. Alternatively, you can split the file in some ratio, e.g. 70-30, and use the 70% file for training the model and the 30% file for testing. Here’s the command for testing the model and running the classifier:
bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv \
  --model loc_of_model/donut.model --auc --scores --confusion
which should print an output something like:
"target","model-output","log-likelihood"
0,0.496,-0.685284
0,0.490,-0.674055
0,0.491,-0.675162
1,0.495,-0.703361
1,0.493,-0.706289
0,0.495,-0.683275
0,0.496,-0.685282
0,0.492,-0.677191
1,0.494,-0.704222
0,0.495,-0.684107
0,0.496,-0.684765
1,0.494,-0.705209
0,0.491,-0.675272
1,0.495,-0.703438
0,0.496,-0.685121
0,0.496,-0.684886
0,0.490,-0.672500
0,0.495,-0.682445
0,0.496,-0.684872
1,0.495,-0.703070
0,0.490,-0.672511
0,0.495,-0.683643
0,0.492,-0.677610
1,0.492,-0.708915
0,0.496,-0.684744
1,0.494,-0.704766
0,0.492,-0.677496
1,0.492,-0.708679
0,0.496,-0.685222
1,0.495,-0.703604
0,0.492,-0.677846
0,0.490,-0.672702
0,0.492,-0.676980
0,0.494,-0.681450
1,0.495,-0.702845
0,0.493,-0.679049
0,0.496,-0.684262
1,0.493,-0.706564
1,0.495,-0.704016
0,0.490,-0.672624
AUC = 0.52
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.7, -0.4], [-0.7, -0.5]]
13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)
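The three score columns can be checked by hand. The Python sketch below assumes that "model-output" is the predicted probability of class 1 and that "log-likelihood" is the natural log of the probability the model assigned to the true class; that reading is an inference from the numbers above, not something confirmed from Mahout's source. The three rows are copied from the output above.

```python
import math


def log_likelihood(target, model_output):
    """Natural log of the probability assigned to the class that actually occurred."""
    p_true = model_output if target == 1 else 1.0 - model_output
    return math.log(p_true)


# (target, model-output, reported log-likelihood), copied from the output above
rows = [(0, 0.496, -0.685284), (1, 0.495, -0.703361), (0, 0.490, -0.674055)]
differences = []
for target, output, reported in rows:
    computed = log_likelihood(target, output)
    differences.append(abs(computed - reported))
    print(target, output, round(computed, 4), reported)
```

The computed values agree with the reported ones to about three decimal places; the remaining gap comes from the model output being printed with only three digits. Also, every score above is below 0.5, so with a 0.5 cut-off (an assumption) every row is classified as 0, which is consistent with the confusion matrix: 27 rows with target 0 and 13 with target 1 in the first row, and nothing in the second.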
Similarly, you can try this on a variety of data sets that you might have. I have seen up to 93% classification accuracy on a different data set.
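For the 70-30 split mentioned in Step-III, a small script outside Mahout is enough. Here is a minimal Python sketch; the `split_csv` function, the file names and the fixed seed are assumptions for illustration. It keeps the header row in both output files, since the training command refers to columns by name.

```python
import csv
import random


def split_csv(source, train_path, test_path, train_fraction=0.7, seed=42):
    """Shuffle the data rows of a CSV file and write them to a train and a test file."""
    with open(source, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = round(len(rows) * train_fraction)
    for path, part in ((train_path, rows[:cut]), (test_path, rows[cut:])):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # both files keep the column names
            writer.writerows(part)
    return cut, len(rows) - cut


# Example call with hypothetical output names:
# split_csv("donut.csv", "donut_train.csv", "donut_test.csv")
```

The train file would then go to the `--input` of TrainLogistic and the test file to the `--input` of RunLogistic.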
All the best !!!
Hi,
I tried to run the regression as per your steps. However, I am not able to read the data from Hadoop when I give it in the input path, although the same file is read fine from the OS. Can you please help with the steps for reading data from Hadoop into Mahout directly? I am using Mahout version 0.7 on Cloudera CDH4.2.
Thank you.
Is there any error message you could send?
That would help me understand your problem.
It says that the input file does not exist. But I can see the file when I execute the hdfs dfs -tail command.
By any chance, have you set the MAHOUT_LOCAL env. variable? If that is set, the command searches for the input path on the local fs instead of HDFS.
Thanks, it was quite helpful.
Thanks Jayati!! Everything went well!!
Jayati,
Here it's mentioned that the csv/model are being put in the local file system.
Q1. When we run Mahout algorithms on it, does it internally move those files to HDFS, or is it operating in non-Hadoop mode?
Q2. In case it's not running in distributed fashion, why does it ask for Hadoop jars on its build path?
Regards,
Aparnesh
Hi,
The steps in the blog are for running the algorithm on the local fs.
The Hadoop jars are required because the algorithm uses various Hadoop data types and classes, such as SequenceFile, during execution, even though it's not running on HDFS.
Jayati
I'm trying to understand the "scores" output.
"target","model-output","log-likelihood"
0,0.496,-0.685284
Does this mean the model saw 0, the model predicted 0.496 (very very very slight lean to 0 over 1 or unfilled to filled) and 68.5% chance of being accurate? Or how do I understand these 3 column values?
Your interpretation is correct to a great extent. "target" is what the output should have been, "model-output" is what the model predicted, and for log-likelihood I've found the following on the web:
"The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is \mathcal{L}(\theta |x) = P(x | \theta)."
Hope that makes it clear.
Thanks for the tutorial, I managed to get the same output.
Just want to ask, what does the data inside donut.csv represent? Which one is the outcome (what are we predicting)?
The data in donut.csv is sample/practice data only. We are trying to predict the value of the field "color" based upon the values of the other fields, called predictor variables, which in our example are "x y xx xy yy a b c".
What's new here?? This example is already present in Mahout in Action in detail.
Nice tutorial Jayati! The training command went through... when I open donut.model I see some garbled data like @E @^@^@^@^@^@AnB
When I run Step-III, I see the following error message:
Unexpected mahout/trunk/logreg/donut.model while processing --help|--quiet|--auc|--scores|--confusion||--model
What do you think is happening here?
I am using Mahout 0.9
What if I want to train and test a csv file with two columns, "word" and "class", where each "class" shows whether the corresponding "word" is positive or negative? Can you give me those two terminal commands? Thank you so much!
Besides, I could successfully run this tutorial with donut.csv.
Hi,
Thank you very much for the tutorial! :) I got the same result.
I wonder if there is a mahout command to custom split data into training and testing sets. Do you have any idea about it?
Thanks.
I am not aware of such a command in Mahout. Would have to check.
Hi Jayati,
I have been trying to classify the 20 newsgroups data using the SGD algorithm but am getting the following error.
[cloudera@localhost classify_scripts]$ mahout trainTestSGD.trainTestSGD.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar
14/08/17 23:24:13 WARN driver.MahoutDriver: Unable to add class: trainTestSGD.trainTestSGD.TrainNewsGroups
java.lang.ClassNotFoundException: trainTestSGD.trainTestSGD.TrainNewsGroups
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
14/08/17 23:24:13 WARN driver.MahoutDriver: No trainTestSGD.trainTestSGD.TrainNewsGroups.props found on classpath, will use command-line arguments only
Unknown program 'trainTestSGD.trainTestSGD.TrainNewsGroups' chosen.
Can you please help me with this in any way?
Have you implemented any other classification algorithm?
Can we do logistic regression with input data in Mahout? Can we integrate the result of it with Hadoop on Ubuntu?