Saturday, May 25, 2013

Running Mahout's Logistic Regression



Logistic Regression (SGD) is one of the algorithms available in Mahout. This blog post walks through everything required to run it. To install Mahout on your machine, you can refer to my previous post.

Logistic Regression executes in two major phases:
  • Train the model: This step builds a model from some training data; the model can then be used to classify any new input data, or rather, test data.
  • Test the model: This step tests the model generated in step 1 by classifying the test data and reporting the accuracy, scores and confusion matrix.

Steps for running Mahout’s LR

Step-I: Get the input data file, donut.csv, which ships with the Mahout setup. For your ready reference I have also shared it; you can download it from here.

Step-II: Next, cd to MAHOUT_HOME. Here we run the “org.apache.mahout.classifier.sgd.TrainLogistic” class, which trains the model for us using the “donut.csv” file that we provide as training data. Here’s the command to be run from within MAHOUT_HOME:


bin/mahout org.apache.mahout.classifier.sgd.TrainLogistic --passes 1 --rate 1 --lambda 0.5 --input loc_of_file/donut.csv --features 21 --output any_loc_on_your_machine/donut.model --target color --categories 2 --predictors x y xx xy yy a b c --types n n

If the Mahout version is 0.7 you are likely to face the error below:


Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:96)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.ProgramDriver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 1 more

Don’t worry, all you need to do is:


export CLASSPATH=${CLASSPATH}:your_MAHOUT_HOME/mahout-distribution-0.7/lib/hadoop/hadoop-core-0.20.204.0.jar 

After editing the CLASSPATH as mentioned above, the command should run successfully and print something like:


color ~ -0.016*Intercept Term + -0.016*xy + -0.016*yy
      Intercept Term -0.01559
                  xy -0.01559
                  yy -0.01559
    0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000    -0.015590929     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000     0.000000000 
13/05/26 02:14:02 INFO driver.MahoutDriver: Program took 588 ms (Minutes: 0.0098)

The most important parameters influencing the execution of the Training process are:

"--passes": the number of times to pass over the input data
"--lambda": the amount of coefficient decay to use
"--rate": the learning rate

You can vary the values of these three parameters and observe how the performance of the algorithm changes. The model file should now also exist at the location you specified in the command.
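
To get a feel for what these three parameters do, here is a minimal Python sketch of logistic regression trained with stochastic gradient descent. It is not Mahout's implementation: the function names (train_sgd, predict), the toy data and the simple L2-style weight decay standing in for --lambda are all my own assumptions, chosen only to show where passes, rate and lambda enter the update.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Probability of class 1; weights[0] is the intercept."""
    x = [1.0] + list(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train_sgd(rows, passes=1, rate=1.0, lam=0.5):
    """Plain SGD over (features, label) pairs (illustrative, not Mahout's rule).

    passes: number of sweeps over the data
    rate:   step size of each update
    lam:    strength of the decay pulling non-intercept weights toward zero
    """
    n_features = len(rows[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(passes):
        for features, label in rows:
            x = [1.0] + list(features)
            error = label - predict(weights, features)
            for i, xi in enumerate(x):
                # Assumed L2-style decay; the intercept is left undecayed.
                decay = lam * weights[i] if i > 0 else 0.0
                weights[i] += rate * (error * xi - decay)
    return weights


# Hypothetical one-feature data set: small x is class 0, large x is class 1.
toy = [((0.0,), 0), ((0.2,), 0), ((0.8,), 1), ((1.0,), 1)]
for passes in (1, 50):
    w = train_sgd(toy, passes=passes, rate=0.5, lam=0.001)
    print(passes, [round(predict(w, f), 3) for f, _ in toy])
```

Running it with a different number of passes, rate or decay shows how the predicted probabilities move, which is the same kind of experiment suggested above for the Mahout command.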

Step-III: Now it’s time to run the classifier using the model that was trained in Step-II. As test data we use the same donut.csv file that we used for training. Alternatively, you can split the file in some ratio, e.g. 70-30, and use the 70% file for training the model and the 30% file for testing. Here’s the command for testing the model and running the classifier:


bin/mahout org.apache.mahout.classifier.sgd.RunLogistic --input loc_of_file/donut.csv --model loc_of_model/donut.model --auc --scores --confusion

which should print output something like:


"target","model-output","log-likelihood"
0,0.496,-0.685284
0,0.490,-0.674055
0,0.491,-0.675162
1,0.495,-0.703361
1,0.493,-0.706289
0,0.495,-0.683275
0,0.496,-0.685282
0,0.492,-0.677191
1,0.494,-0.704222
0,0.495,-0.684107
0,0.496,-0.684765
1,0.494,-0.705209
0,0.491,-0.675272
1,0.495,-0.703438
0,0.496,-0.685121
0,0.496,-0.684886
0,0.490,-0.672500
0,0.495,-0.682445
0,0.496,-0.684872
1,0.495,-0.703070
0,0.490,-0.672511
0,0.495,-0.683643
0,0.492,-0.677610
1,0.492,-0.708915
0,0.496,-0.684744
1,0.494,-0.704766
0,0.492,-0.677496
1,0.492,-0.708679
0,0.496,-0.685222
1,0.495,-0.703604
0,0.492,-0.677846
0,0.490,-0.672702
0,0.492,-0.676980
0,0.494,-0.681450
1,0.495,-0.702845
0,0.493,-0.679049
0,0.496,-0.684262
1,0.493,-0.706564
1,0.495,-0.704016
0,0.490,-0.672624
AUC = 0.52
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.7, -0.4], [-0.7, -0.5]]
13/05/26 02:16:19 INFO driver.MahoutDriver: Program took 474 ms (Minutes: 0.0079)                                                                          
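
The three columns and the summary lines can be reproduced by hand, which helps when reading this output. The Python sketch below is only an illustration under assumptions: the helper names are hypothetical, the 0.5 threshold is assumed, and the [predicted][actual] layout of the confusion matrix is inferred from the sample output rather than taken from Mahout's code. It uses the first five rows of the scores above.

```python
import math

# First five (target, model-output) pairs copied from the sample scores.
scores = [(0, 0.496), (0, 0.490), (0, 0.491), (1, 0.495), (1, 0.493)]


def log_likelihood(target, p):
    """Log of the probability the model assigned to the true class."""
    return math.log(p) if target == 1 else math.log(1.0 - p)


def confusion(pairs, threshold=0.5):
    """2x2 counts indexed as [predicted][actual] (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for target, p in pairs:
        predicted = 1 if p > threshold else 0
        matrix[predicted][target] += 1
    return matrix


def auc(pairs):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in pairs if t == 1]
    neg = [p for t, p in pairs if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


for target, p in scores:
    print(target, p, round(log_likelihood(target, p), 3))
print("confusion:", confusion(scores))
print("AUC:", round(auc(scores), 2))
```

For the first row, log(1 - 0.496) is about -0.685, which agrees closely with the printed log-likelihood (the printed model-output is itself rounded). So the third column is the log of the probability the model gave to the correct class, not a percentage. Since every model-output in the sample is below 0.5, all 40 rows are predicted as class 0, which is consistent with the confusion line showing 27 and 13 in the first row and zeros in the second.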

Similarly, you can try this on any other data sets that you might have. I have seen up to 93% classification accuracy on a different data set.
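
For the 70-30 split suggested in Step-III, one option is a small script outside Mahout. The following Python sketch is an assumption of mine, not a Mahout feature: split_file, the output file names and the fixed seed are hypothetical. It copies lines verbatim, so the header row and quoting stay exactly as they are in the source file.

```python
import random


def split_file(src, train_path, test_path, train_fraction=0.7, seed=42):
    """Shuffle the data lines of src and write two files that both keep the header.

    Returns the number of data lines written to each file.
    """
    with open(src) as f:
        header, *rows = [line for line in f.read().splitlines() if line.strip()]
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    cut = round(len(rows) * train_fraction)
    for path, part in ((train_path, rows[:cut]), (test_path, rows[cut:])):
        with open(path, "w") as f:
            f.write("\n".join([header] + part) + "\n")
    return cut, len(rows) - cut
```

Calling split_file("donut.csv", "donut_train.csv", "donut_test.csv") would then give one file to pass to TrainLogistic via --input and another to pass to RunLogistic.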
All the best !!!

25 comments:

  1. Hi,
    I tried to run the regression as per your steps. However, I am not able to read the data from Hadoop when I mention it in the input path, although the same file is read fine from the OS. Can you please help with the steps for reading data from Hadoop into Mahout directly? I am using Mahout 0.7 on Cloudera CDH4.2.

    Thank you.

    Replies
    1. Is there any error message you could send?
      That would help me understand your problem.

    2. It says that the input file does not exist, but I can see the file when I execute the hdfs dfs -tail command.

    3. By any chance, have you set the MAHOUT_LOCAL environment variable? If it is set, the command looks for the input path on the local fs instead of HDFS.

  2. Thanks Jayati!! Everything went well!!

  3. Jayati,

    Here it's mentioned that the csv/model are being put in local file system.

    Q1. When we run Mahout algorithms on them, does it internally move those files to HDFS, or is it operating in non-Hadoop mode?
    Q2. In case it's not running in a distributed fashion, why does it ask for Hadoop jars on its build path?

    Regards,
    Aparnesh

    Replies
    1. Hi,

      The steps in the blog are for running the algorithm on local fs.

      The Hadoop jars are required because the algorithm uses various Hadoop data types and classes, such as SequenceFile, during execution, even though it's not running on HDFS.

      Jayati

  4. I'm trying to understand the "scores" output.

    "target","model-output","log-likelihood"
    0,0.496,-0.685284

    Does this mean the model saw 0, the model predicted 0.496 (a very, very slight lean to 0 over 1, or unfilled to filled) and there is a 68.5% chance of it being accurate? Or how do I understand these 3 column values?

    Replies
    1. Your interpretation is correct to a great extent. "target" is what the output should have been, "model-output" is what the model predicted, and for log-likelihood I've found the following on the web:

      "The likelihood of a set of parameter values, θ, given outcomes x, is equal to the probability of those observed outcomes given those parameter values, that is \mathcal{L}(\theta |x) = P(x | \theta)."

      Hope that makes it clear.

  5. Thanks for the tutorial, I managed to get the same output.

    Just want to ask, what does the data inside donut.csv represent? Which one is the outcome (what are we predicting)?

    Replies
    1. The data in donut.csv is sample/practice data only. We are trying to predict the value of the field "color" based upon the values of other fields called predictor variables, which in our example are "x y xx xy yy a b c".

  6. What's new here?? This example is already covered in detail in Mahout in Action.

  7. This comment has been removed by the author.

  8. Nice tutorial Jayati! The training command went through... When I open donut.model I see some garbled data like @E @^@^@^@^@^@AnB
    When I run the STEP III, I see the following error message:
    Unexpected mahout/trunk/logreg/donut.model while processing --help|--quiet|--auc|--scores|--confusion||--model


    What do you think is happening here?

  9. What if I want to train and test on a csv file with two columns, "word" and "class", where each "class" shows whether the corresponding "word" is positive or negative? Can you give me those two terminal commands? Thank you so much!

  10. Besides, I could successfully run this tutorial with donut.csv.

  11. Hi,
    Thank you very much for the tutorial! :) I got the same result.
    I wonder if there is a Mahout command to custom split data into training and testing sets. Do you have any idea about it?

    Replies
    1. Thanks.

      I am not aware of such a command in Mahout. Would have to check.

  12. Hi Jayati,

    I have been trying to classify the 20 newsgroups data using the SGD algorithm but I am getting the following error.

    [cloudera@localhost classify_scripts]$ mahout trainTestSGD.trainTestSGD.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
    MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.4.0-job.jar
    14/08/17 23:24:13 WARN driver.MahoutDriver: Unable to add class: trainTestSGD.trainTestSGD.TrainNewsGroups
    java.lang.ClassNotFoundException: trainTestSGD.trainTestSGD.TrainNewsGroups
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:169)
    at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:129)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
    14/08/17 23:24:13 WARN driver.MahoutDriver: No trainTestSGD.trainTestSGD.TrainNewsGroups.props found on classpath, will use command-line arguments only
    Unknown program 'trainTestSGD.trainTestSGD.TrainNewsGroups' chosen.

    Can you please help me with this in any way?

  13. Have you implemented any other classification algorithm?

  14. Can we do logistic regression with input data in Mahout? Can we integrate the result of it with Hadoop on Ubuntu?
