Wiki says, "Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable."
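Stated as a formula, the independence assumption in the quote lets the posterior over classes factor into one term per feature. The following is the standard textbook form, with C for the class and x_1, ..., x_n for the feature values; the notation is chosen here for illustration and is not taken from the Weka documentation:

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{y} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(x_i \mid C = c)
```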
Weka also provides an implementation of the Naive Bayes classification algorithm. Running Weka’s algorithms from the command line requires only a very simple Weka setup. All you need is to download the latest release of WEKA (3-6-12 being the latest stable one at the time of writing). Some useful links, working at the time of writing this post, are:
http://prdownloads.sourceforge.net/weka/weka-3-6-12.zip
or
http://sourceforge.net/projects/weka/files/weka-3-6/3.6.12/weka-3-6-12.zip/download
Next, unzip the archive, which gives you a directory named “weka-3-6-12”. We will refer to it as WEKA_HOME in this blog post.
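The commands in the later steps read the Weka location from a shell variable of that name, so it helps to export it once per terminal session. A minimal sketch, assuming the archive was unzipped in your home directory (the path and the check_weka_home helper are illustrative, not part of Weka):

```shell
# Assumed location; change it to wherever you unzipped weka-3-6-12.zip.
export WEKA_HOME="$HOME/weka-3-6-12"

# Succeeds only if weka.jar exists directly under the given directory.
check_weka_home() {
    [ -f "$1/weka.jar" ]
}

if check_weka_home "$WEKA_HOME"; then
    echo "Using $WEKA_HOME/weka.jar"
else
    echo "weka.jar not found under $WEKA_HOME" >&2
fi
```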
We shall proceed step by step from here onwards.
The dataset used here relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Two Weka utilities are used in the steps below. The filter 'NumericToNominal' (Step-3) turns numeric attributes into nominal ones. Unlike discretization, it simply takes all numeric values and adds them to the list of nominal values of that attribute. This is useful after CSV imports to force certain attributes to become nominal, e.g., a class attribute containing values from 1 to 5.
The filter 'RemovePercentage' (Step-4) removes a given percentage of the instances; its -P option specifies the percentage to remove. For the train dataset we keep all the records of the subset, so -P is 0.
Step-1: Download a dataset to run the classification on
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). You can read more about the dataset here: http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing
So, first we shall create a folder to store our dataset and then download it.
mkdir ~/WekaDataSet
cd ~/WekaDataSet
wget http://mlr.cs.umass.edu/ml/machine-learning-databases/00222/bank.zip
unzip bank.zip
Step-2: Convert the data from CSV format to ARFF
First we shall create a subset of the entire dataset so as to do a quick test. You can run the test on the entire dataset or other datasets as well later on.
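cd bank
# Take the first 1000 lines (header row included); '>' overwrites, so re-running does not append duplicates
head -1000 bank-full.csv > bank-subset.csv
java -cp $WEKA_HOME/weka.jar weka.core.converters.CSVLoader bank-subset.csv > bank-subset-preprocessed.arff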
You should see a file called 'bank-subset-preprocessed.arff' in the 'bank' folder.
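CSVLoader splits fields on commas by default. If the generated ARFF declares a single attribute whose name is the whole header row, the CSV is probably semicolon-delimited, which, as far as I know, is how the UCI bank files are distributed. Here is a hedged sketch for checking and converting the subset before re-running CSVLoader. The helper names and the bank-subset-commas.csv file name are made up for illustration, and the tr approach assumes no field contains a literal comma or semicolon:

```shell
# Print ";" if the header line contains a semicolon, otherwise ",".
detect_delimiter() {
    case "$(head -n 1 "$1")" in
        *";"*) echo ";" ;;
        *)     echo "," ;;
    esac
}

# Write a comma-delimited copy of a semicolon-delimited file.
semicolons_to_commas() {
    tr ';' ',' < "$1" > "$2"
}

if [ -f bank-subset.csv ] && [ "$(detect_delimiter bank-subset.csv)" = ";" ]; then
    semicolons_to_commas bank-subset.csv bank-subset-commas.csv
    echo "Wrote bank-subset-commas.csv; pass this file to CSVLoader instead."
fi
```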
Step-3: Convert the Numeric data to Nominal using Weka's utility
java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i bank-subset-preprocessed.arff -o bank-subset-preprocessed.nominal.arff
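To confirm the filter did what Step-3 intends, you can look at the attribute declarations in the ARFF header: numeric attributes are declared with the type numeric, while nominal ones list their values in braces. A small sketch using standard shell tools (the count_numeric_attributes helper is hypothetical, not a Weka command):

```shell
# Count header lines of the form "@attribute <name> numeric" (case-insensitive).
count_numeric_attributes() {
    grep -ci '^@attribute.*[[:space:]]numeric[[:space:]]*$' "$1"
}

for f in bank-subset-preprocessed.arff bank-subset-preprocessed.nominal.arff; do
    if [ -f "$f" ]; then
        echo "$f: $(count_numeric_attributes "$f") numeric attribute(s)"
    fi
done
```

After the conversion, the count for the .nominal.arff file should be zero if every numeric attribute was turned into a nominal one.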
Step-4: Split the data into train and test datasets
java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 0 -i bank-subset-preprocessed.nominal.arff -o bank-subset-preprocessed-train.nominal.arff
For the test dataset we shall be using 40 percent of the dataset, so the -P option needs to be 60.
java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 60 -i bank-subset-preprocessed.nominal.arff -o bank-subset-preprocessed-test.nominal.arff
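A quick way to sanity-check the split is to count the data rows in each ARFF file, i.e. the non-empty, non-comment lines after the @data marker. This is a sketch with a hypothetical count_instances helper; it assumes one instance per line, which is how Weka writes ARFF files:

```shell
# Count non-empty, non-comment lines that follow the @data marker.
count_instances() {
    awk '
        in_data && NF > 0 && $0 !~ /^%/ { n++ }
        tolower($0) ~ /^@data/         { in_data = 1 }
        END                            { print n + 0 }
    ' "$1"
}

for f in bank-subset-preprocessed-train.nominal.arff \
         bank-subset-preprocessed-test.nominal.arff; do
    if [ -f "$f" ]; then
        echo "$f: $(count_instances "$f") instances"
    fi
done
```

With -P 60, the test file should hold roughly 40 percent of the rows found in the train file.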
Step-5: Train the model
-t option: Specify the location of the train data file
-d option: Specify the name and location of the model file you wish to be generated
java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -t bank-subset-preprocessed-train.nominal.arff -d bank-subset-preprocessed-model.arff
Step-6: Test the model
-T option: Specify the location of the test data file
-l option: Specify the location of the created model file
java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -T bank-subset-preprocessed-test.nominal.arff -l bank-subset-preprocessed-model.arff
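Steps 5 and 6 can be bundled into one small script so that the file names are defined in a single place. The following is only a sketch: the run and train_and_test helpers and the DRY_RUN switch are made up for this post, while the java command lines are the same ones used in the two steps.

```shell
WEKA_JAR="${WEKA_HOME:-.}/weka.jar"
TRAIN=bank-subset-preprocessed-train.nominal.arff
TEST=bank-subset-preprocessed-test.nominal.arff
MODEL=bank-subset-preprocessed-model.arff

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Train on $TRAIN and save the model, then evaluate the saved model on $TEST.
train_and_test() {
    run java -cp "$WEKA_JAR" weka.classifiers.bayes.NaiveBayes -t "$TRAIN" -d "$MODEL" &&
    run java -cp "$WEKA_JAR" weka.classifiers.bayes.NaiveBayes -T "$TEST" -l "$MODEL"
}
```

Running `DRY_RUN=1 train_and_test` only prints the two commands; calling `train_and_test` on its own executes them.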
That's it. You can also try the same with different percentages and different datasets.
Hope it helped.