
Monday, June 15, 2015

Running Naive Bayes Classification algorithm using Weka

Wikipedia says, "Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable."
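To make the independence assumption concrete, here is a minimal sketch of a naive Bayes classifier over made-up toy data (illustrative only, not Weka's implementation, and with no smoothing for unseen values):

```python
from collections import defaultdict

# Toy training data: each instance is ({feature: value}, class_label).
train = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]

# Count class frequencies and per-class feature-value frequencies.
class_counts = defaultdict(int)
feature_counts = defaultdict(int)  # key: (label, feature, value)
for features, label in train:
    class_counts[label] += 1
    for f, v in features.items():
        feature_counts[(label, f, v)] += 1

def predict(features):
    total = len(train)
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        # P(class) times the product of P(feature=value | class):
        # the product form IS the naive independence assumption.
        score = count / total
        for f, v in features.items():
            score *= feature_counts[(label, f, v)] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict({"outlook": "sunny", "windy": "yes"}))  # → stay
```

The classifier never looks at feature combinations, only at each feature's value counts per class, which is exactly what makes it "naive".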

Weka also provides an implementation of the Naive Bayes classification algorithm. Running Weka's algorithms from the command line requires only a very simple setup: all you need is to download the latest release of Weka (3-6-12 being the latest stable one at the time of writing). Some download links working at the time of writing this post are:

http://prdownloads.sourceforge.net/weka/weka-3-6-12.zip

or

http://sourceforge.net/projects/weka/files/weka-3-6/3.6.12/weka-3-6-12.zip/download

Next, you'll need to unzip this archive, which will give you a directory named "weka-3-6-12". We shall call it WEKA_HOME for reference in this blog post.

We shall proceed step by step from here onwards.


Step-1: Download a dataset to run the classification on


The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The classification goal is to predict whether the client will subscribe to a term deposit (variable y). You can read more about the dataset here: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

So, first we shall create a folder to store our dataset and then download it.



mkdir ~/WekaDataSet
cd ~/WekaDataSet
wget http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/bank.zip                                   
unzip bank.zip


Step-2: Convert the data from CSV format to ARFF


First we shall create a subset of the entire dataset so that we can do a quick test. You can run the test on the entire dataset, or on other datasets, later on.



cd bank
head -1000 bank-full.csv > bank-subset.csv
java -cp $WEKA_HOME/weka.jar weka.core.converters.CSVLoader bank-subset.csv > bank-subset-preprocessed.arff

You should see a file called 'bank-subset-preprocessed.arff' in the 'bank' folder.
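Conceptually, the conversion just prepends an ARFF header that describes each column. A rough sketch of the idea (not Weka's actual code; note that the bank CSV files are semicolon-separated, hence the delimiter below):

```python
import csv, io

def csv_to_arff(csv_text, relation="bank-subset"):
    """Naive CSV-to-ARFF sketch: a column is numeric if every value
    parses as a float, otherwise it becomes a nominal value list."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=";"))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        values = [r[i] for r in data]
        try:
            [float(v) for v in values]
            lines.append("@attribute %s numeric" % name)
        except ValueError:
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

sample = '"age";"job"\n30;"services"\n33;"admin."'
print(csv_to_arff(sample))
```

Weka's real CSVLoader does a more careful job of type inference, but the resulting ARFF file has exactly this shape: @relation, one @attribute line per column, then @data.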


Step-3: Convert the Numeric data to Nominal using Weka's utility


Weka's 'NumericToNominal' filter is meant for turning numeric attributes into nominal ones. Unlike discretization, it simply takes all the numeric values and adds them to the list of nominal values of that attribute. It is useful after CSV imports to force certain attributes to become nominal, e.g. a class attribute containing values from 1 to 5.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i bank-subset-preprocessed.arff -o bank-subset-preprocessed.nominal.arff

Step-4: Divide the data into train and test datasets


Let's keep all 1000 records in the training dataset. We shall use another Weka utility called RemovePercentage; its -P option specifies the percentage of instances we wish to remove.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 0 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-train.nominal.arff

For the test dataset we shall use 40 percent of the data, so the -P option needs to be 60.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 60 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-test.nominal.arff
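A quick sanity check of the split sizes these two commands should produce, assuming RemovePercentage drops the first P percent of the instances (a sketch of the arithmetic only; Weka's exact rounding may differ slightly):

```python
def remove_percentage(instances, p):
    # Keep everything after the first p percent, mimicking the -P option.
    cut = round(len(instances) * p / 100.0)
    return instances[cut:]

data = list(range(1000))             # stand-in for our ~1000 instances
train = remove_percentage(data, 0)   # -P 0: nothing removed
test = remove_percentage(data, 60)   # -P 60: 40% left for testing
print(len(train), len(test))  # → 1000 400
```

Note that the test set here is a subset of the training set, which inflates the measured accuracy; for a real evaluation you would use RemovePercentage's -V option (invert selection) to get disjoint splits.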

Step-5: Train the model


Using Weka's Naive Bayes classifier, "weka.classifiers.bayes.NaiveBayes", we shall first train the model.
-t option: Specify the location of the train data file
-d option: Specify the name and location of the model file you wish to be generated



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -t bank-subset-preprocessed-train.nominal.arff -d bank-subset-preprocessed-model.arff

Step-6: Test the model


This is the final step. We shall test the model's accuracy using the same classifier but with a different set of options.
-T option: Specify the location of the test data file
-l option: Specify the location of the created model file



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -T bank-subset-preprocessed-test.nominal.arff -l bank-subset-preprocessed-model.arff
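Weka's evaluation summary includes a confusion matrix; the reported accuracy is just the diagonal count over the total. A small sketch with made-up counts:

```python
def accuracy(confusion):
    # Fraction of instances on the diagonal (correctly classified).
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 2x2 matrix: rows are actual classes, columns are predictions.
matrix = [[350, 20],
          [10, 20]]
print(accuracy(matrix))  # → 0.925
```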

That's it. You can also try the same with different percentages and different datasets.

Hope it helped.

Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm consists of two phases:

1. Training Phase: The algorithm learns from the input data and creates a model for reference.
2. Testing Phase: The algorithm predicts results based on its learning, as stored in the model.

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.
2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.

There is a wide range of machine learning libraries that provide implementations of various classes of algorithms. In my coming posts, we shall be evaluating the following open-source machine learning APIs on performance, scalability, range of algorithms provided and extensibility.

1. H2O
2. SparkMLlib
3. Sparkling Water
4. Weka


In the following posts, we shall install each of the above libraries and run one implementation of an algorithm that is available in all of them. This will give us insight into ease of use and execution, performance, and accuracy.

Sunday, May 26, 2013

Running Weka's Logistic Regression using Command Line

Running Weka's algorithms from the command line requires only a very simple setup: all you need is to download the latest release of Weka. One of the useful links working at the time of writing this post is:


Next, you'll need to unzip this archive, which will give you a directory named "weka-3-6-9". We shall call it WEKA_HOME for reference in this blog post.

You might want to run Weka’s logistic regression algorithm on two types of input data.
  • One is the sample data files in ARFF format already available in “WEKA_HOME/data”
  • The other is some data files you already have in CSV format, for example the donut.csv file provided by Mahout for running its Logistic Regression over it.

Running LR over ARFF files

We shall use the file "WEKA_HOME/data/weather.nominal.arff" for running the algorithm. cd to WEKA_HOME and run the following command:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t ./data/weather.nominal.arff -T ./data/weather.nominal.arff -d /some_location_on_your_machine/weather.nominal.model.arff

which should generate the trained model at “/some_location_on_your_machine/weather.nominal.model.arff” and the console output should look something like:

Logistic Regression with ridge parameter of 1.0E-8
Coefficients...
                                    Class
Variable                              yes
=========================================
outlook=sunny                    -45.2378
outlook=overcast                  57.5375
outlook=rainy                     -5.9067
temperature=hot                   -8.3327
temperature=mild                  44.8546
temperature=cool                 -45.4929
humidity                         118.1425
windy                             72.9648
Intercept                        -89.2032

Odds Ratios...
                                    Class
Variable                              yes
=========================================
outlook=sunny                           0
outlook=overcast      9.73275593611619E24
outlook=rainy                      0.0027
temperature=hot                    0.0002
temperature=mild     3.020787521374072E19
temperature=cool                        0
humidity            2.0353933107400553E51
windy                4.877521304260806E31

Time taken to build model: 0.12 seconds
Time taken to test model on training data: 0.01 seconds

=== Error on training data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

=== Error on test data ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %
Kappa statistic                          1  
Mean absolute error                      0  
Root mean squared error                  0  
Relative absolute error                  0.0002 %
Root relative squared error              0.0008 %
Total Number of Instances               14  

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no                                                                                                                                            

Here the three arguments mean:

  • -t <name of training file> : Sets training file.
  • -T <name of test file> : Sets test file. If missing, a cross-validation will be performed on the training data.
  • -d <name of output file> : Sets model output file. In case the filename ends with '.xml', only the options are saved to the XML file, not the model.
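Incidentally, the "Odds Ratios" block in the output is simply e raised to each coefficient. A quick check against the outlook=overcast row:

```python
import math

# Coefficient and odds ratio reported for outlook=overcast in the output above.
coef = 57.5375
odds_ratio = math.exp(coef)

# The relative difference from Weka's reported value should be tiny.
reported = 9.73275593611619e24
assert abs(odds_ratio - reported) / reported < 1e-3
print("odds ratio for outlook=overcast:", odds_ratio)
```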

For help on all available arguments, try running the following command from WEKA_HOME:

java -cp ./weka.jar weka.classifiers.functions.Logistic -h                                                              

Running LR over CSV files

For running Weka's LR over a CSV file, you'll need to convert it into ARFF format using a converter provided by Weka. Using the command line on Linux, here are the steps:

Step-I: Convert the data into ARFF format. To convert from CSV to ARFF, run the following command from WEKA_HOME:
java -cp ./weka.jar weka.core.converters.CSVLoader someCSVFile.csv > outputARFFFile.arff                                                                     

Step-II: Run the NumericToNominal filter over the arff file
java -cp ./weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i outputARFFFile.arff -o outputARFFFile.nominal.arff                                                           

Step-III: Run the classifier over the outputARFFFile.nominal.arff
java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.nominal.arff -T outputARFFFile.nominal.arff -d outputARFFFile.nominal.model.arff                                        

You might encounter an exception stating

"Cannot handle unary class!"                                                                                                               

To resolve this, apply the attribute filter to eliminate the attribute which has the same value for all the records in the file:

java -cp ./weka.jar weka.filters.AttributeFilter -i outputARFFFile.nominal.arff -o outputARFFFile.filtered.nominal.arff -R 8                                        

where the value of "-R" would vary depending upon your input file and the index of the attribute to be eliminated in the input ARFF file.
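To figure out which index to pass to -R, you can scan the data rows for columns that hold a single distinct value. A small sketch (hypothetical helper, returning the 1-based indices Weka expects):

```python
def unary_attribute_indices(rows):
    """Return 1-based indices of columns where every row holds the same value."""
    indices = []
    for i in range(len(rows[0])):
        if len({row[i] for row in rows}) == 1:
            indices.append(i + 1)  # Weka's -R option is 1-based
    return indices

rows = [["a", "1", "x"],
        ["b", "1", "y"],
        ["c", "1", "x"]]
print(unary_attribute_indices(rows))  # → [2]
```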

After this, try running the classifier on the obtained “outputARFFFile.filtered.nominal.arff” file as in:

java -cp ./weka.jar weka.classifiers.functions.Logistic -t outputARFFFile.filtered.nominal.arff -T outputARFFFile.filtered.nominal.arff -d outputARFFFile.nominal.model.arff                                 

The output should appear somewhat like we got when running the classifier over the provided sample data mentioned above.

With these steps, you are ready to play with WEKA. Go for it. Cheers !!!