Monday, June 15, 2015

Running Naive Bayes Classification algorithm using Weka

Wiki says, "Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable."

Weka also provides a Naive Bayes Classification algorithm implementation. Running Weka’s algorithms from command line, requires a very simple setup of Weka to be in place. All you need is to download latest (3-6-12 being the latest stable one) release of WEKA. Some useful links working at the time of writing this post is:

http://prdownloads.sourceforge.net/weka/weka-3-6-12.zip

or

http://sourceforge.net/projects/weka/files/weka-3-6/3.6.12/weka-3-6-12.zip/download

Next, you’ll need to unzip this setup, which would give you a directory with name “weka-3-6-12”. We would call it WEKA_HOME for reference in this blog post.

We shall be proceeding step-by-step here onwards.


Step-1: Download a dataset to run the classification on


The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.
The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y). You can read more about the dataset here http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing

So, first we shall create a folder to store our dataset and then download it.



mkdir ~/WekaDataSet
cd ~/WekaDataSet
wget http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/bank.zip                                   
unzip bank.zip


Step-2: Convert the data in CSV data format to ARFF


First we shall create a subset of the entire dataset so as to do a quick test. You can run the test on the entire dataset or other datasets as well later on.



cd bank
head -1000 bank-full.csv >> bank-subset.csv
java -cp $WEKA_HOME/weka.jar weka.core.converters.CSVLoader bank-subset.csv > bank-subset-preprocessed.arff

You should see a file called 'bank-subset-preprocessed.arff' in the 'bank' folder.


Step-3: Convert the Numeric data to Nominal using Weka's utility


Weka's filter called 'NumericToNominal' is meant for turning numeric attributes into nominal ones. Unlike discretization, it just takes all numeric values and adds them to the list of nominal values of that attribute. Useful after CSV imports, to enforce certain attributes to become nominal, e.g., the class attribute, containing values from 1 to 5.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i bank-subset-preprocessed.arff -o bank-subset-preprocessed.nominal.arff

Step-4: Divide a part of the data as train and test data


Let's keep the entire 1000 records in the train dataset. We shall be using another utility from Weka called RemovePercentage. In the option -P we need to specify the percentage we wish to remove.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 0 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-train.nominal.arff

For the test dataset we shall be using 40 percent of the dataset and the -p option needs to be 60.



java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 60 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-test.nominal.arff

Step-5: Train the model


Using the Naive Bayes Classifier of Weka "weka.classifiers.bayes.NaiveBayes", we shall first train the model.
-t option: Specify the location of the train data file
-d option: Specify the name and location of the model file you wish to be generated



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -t bank-subset-preprocessed-train.nominal.arff -d bank-subset-preprocessed-model.arff

Step-6: Test the model


This is the final step. We would test the model for accuracy using the same classifier but with a different option set.
-T option: Specify the location of the test data file
-l option: Specify the location of the created model file



java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -T bank-subset-preprocessed-test.nominal.arff -l bank-subset-preprocessed-model.arff

That's it. You can also try the same with different percentages and different datasets.

Hope it helped.

No comments:

Post a Comment