Saturday, July 25, 2015

Mahout’s Naïve Bayes: Train Phase

Mahout’s Naïve Bayes classification algorithm executes in two phases:
  1. Train Phase: Trains a model using pre-processed training data
  2. Test Phase: Classifies pre-processed documents with the help of the model
This blog post provides a code-level understanding of the training process in the algorithm; the test phase is covered in the next post. The Mahout command "trainnb" is used to train a Naïve Bayes model.

"trainnb" command in Mahout

For reference, above is the structure of the train data directory specified as an input (similar to the one used in data pre-processing).

Command Line options for “trainnb”

Generic Options

 -archives <paths>: comma-separated archives to be unarchived on the compute machines
 -conf <configuration file>: specify an application configuration file
 -D <property=value>: use value for given property
 -files <paths>: comma-separated files to be copied to the map reduce cluster
 -fs <local|namenode:port>: specify a namenode
 -jt <local|jobtracker:port>: specify a job tracker
 -libjars <paths>: comma-separated jar files to include in the classpath
 -tokenCacheFile <tokensFile>: name of the file with the tokens

Job-Specific Options

  --input (-i) input: Path to job input directory
  --output (-o) output: The directory pathname for output
  --labels (-l) labels: Comma-separated list of labels to include in training
  --extractLabels (-el): Extract the labels from the input
  --alphaI (-a) alphaI: Smoothing parameter
  --trainComplementary (-c): Train complementary?
  --labelIndex (-li) labelIndex: The path to store the label index in
  --overwrite (-ow): If present, overwrite the output directory before running job
  --help (-h): Print out help
  --tempDir tempDir: Intermediate output directory
  --startPhase startPhase: First phase to run
  --endPhase endPhase: Last phase to run
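
As an illustration, the job-specific options listed here can be combined into a single "trainnb" invocation. The sketch below only assembles and prints the command rather than running it. The three paths are hypothetical placeholders, not values from this post, and it assumes the mahout launcher script is on your PATH when you do execute the command.

```shell
# Hypothetical paths; substitute the locations used in your own setup.
TRAIN_VECTORS="/tmp/mahout-work/train-vectors"   # pre-processed training data (input)
MODEL_DIR="/tmp/mahout-work/model"               # directory the trained model is written to
LABEL_INDEX="/tmp/mahout-work/labelindex"        # path the label index is stored in

# -el extracts the labels from the input; -ow overwrites an existing output directory.
TRAINNB_CMD="mahout trainnb -i $TRAIN_VECTORS -o $MODEL_DIR -li $LABEL_INDEX -el -ow"

# Print the assembled command for inspection.
# To actually run it: eval "$TRAINNB_CMD"
echo "$TRAINNB_CMD"
```

Appending -c to the same command requests complementary training, and -a followed by a value sets the smoothing parameter.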

Flow of execution of the "trainnb" command

Hope it helped!
