Friday, July 24, 2015

Understanding Data Pre-processing in Mahout – Part I

The two most common commands for pre-processing training or test data before running Mahout algorithms are:
  • seqdirectory: Turns the raw text files in a directory into Mahout sequence files.
  • seq2sparse: Creates TF-IDF weighted vectors from the sequence files produced by “seqdirectory”.
This blog post describes the first command, “seqdirectory”, down to the code level; the next post will focus on the second one.

Let’s start then!

Command Options

First, have a look at the command-line options that can be used with “seqdirectory”. The first group consists of the generic Hadoop options, which are not specific to this job:

 -archives <paths>: Comma-separated archives to be unarchived on the compute machines.
 -conf <configuration file>: Specify an application configuration file.
 -D <property=value>: Use value for the given property.
 -files <paths>: Comma-separated files to be copied to the MapReduce cluster.
 -fs <local|namenode:port>: Specify a namenode.
 -jt <local|jobtracker:port>: Specify a job tracker.
 -libjars <paths>: Comma-separated jar files to include in the classpath.
 -tokenCacheFile <tokensFile>: Name of the file with the tokens.

Job-specific Options

  --input (-i) input: Path to the job input directory.
  --output (-o) output: The directory pathname for output.
  --overwrite (-ow): If present, overwrite the output directory before running the job.
  --chunkSize (-chunk) chunkSize: The chunk size in megabytes. Defaults to 64.
  --fileFilterClass (-filter) fileFilterClass: The name of the class to use for file parsing.
  --keyPrefix (-prefix) keyPrefix: The prefix to be prepended to the key.
  --charset (-c) charset: The name of the character encoding of the input files.
  --tempDir tempDir: Intermediate output directory.
  --startPhase startPhase: First phase to run.
  --endPhase endPhase: Last phase to run.
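
To see how the job-specific options fit together, here is a minimal Python sketch that assembles a “seqdirectory” command line from the short flags listed above. The helper name build_seqdirectory_command and the example paths are hypothetical, and the sketch assumes the mahout launcher script is on the PATH; it only builds the argument list and does not run anything.

```python
def build_seqdirectory_command(input_dir, output_dir, charset="UTF-8",
                               chunk_mb=64, overwrite=False, key_prefix=None):
    """Assemble an argument list for `mahout seqdirectory`.

    Hypothetical helper: it only uses the short flags from the option
    list above (-i, -o, -c, -chunk, -prefix, -ow).
    """
    cmd = [
        "mahout", "seqdirectory",
        "-i", input_dir,          # --input: directory containing raw text files
        "-o", output_dir,         # --output: directory for the sequence files
        "-c", charset,            # --charset: encoding of the input files
        "-chunk", str(chunk_mb),  # --chunkSize: chunk size in megabytes
    ]
    if key_prefix:
        cmd += ["-prefix", key_prefix]  # --keyPrefix: prepended to every key
    if overwrite:
        cmd.append("-ow")               # --overwrite: clear the output directory first
    return cmd


if __name__ == "__main__":
    # Example paths are placeholders, not taken from the post.
    args = build_seqdirectory_command("/data/raw-text", "/data/seqfiles",
                                      overwrite=True)
    print(" ".join(args))
```

The resulting list can be passed to subprocess.run as-is, or joined with spaces and pasted into a shell.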

Code-level Explanation

The following figure breaks the internal execution of the command into steps, and then breaks each step down into tasks, to give a clear picture of what happens behind the scenes.

[Figure: Code-level understanding of the "seqdirectory" command]
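
As a rough mental model of what the command does with its input, here is a small Python sketch. It is not Mahout's actual code. It assumes that each input file becomes one record whose key is the key prefix plus the file's path relative to the input directory and whose value is the decoded file content, and that records are grouped into chunks of roughly the configured size. The function names are hypothetical, and plain Python lists stand in for the sequence files.

```python
import os


def directory_to_records(root, charset="utf-8", key_prefix=""):
    """Yield (key, value) pairs, one per file under `root`.

    Assumption: key = key_prefix + "/" + path relative to root,
    value = file content decoded with `charset`.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            with open(full_path, encoding=charset) as handle:
                yield key_prefix + "/" + rel_path, handle.read()


def split_into_chunks(records, chunk_size_mb=64, charset="utf-8"):
    """Group records into lists of at most about `chunk_size_mb` megabytes.

    Each list stands in for one output chunk file. A single record larger
    than the limit still gets a chunk of its own.
    """
    limit = chunk_size_mb * 1024 * 1024
    chunks, current, size = [], [], 0
    for key, value in records:
        record_size = len(key.encode(charset)) + len(value.encode(charset))
        if current and size + record_size > limit:
            chunks.append(current)  # close the full chunk, start a new one
            current, size = [], 0
        current.append((key, value))
        size += record_size
    if current:
        chunks.append(current)
    return chunks
```

Calling split_into_chunks(directory_to_records("some/input/dir")) returns a list of chunks, each a list of (key, text) pairs. This mirrors how the --keyPrefix, --charset and --chunkSize options shape the output.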

The next post describes the "seq2sparse" command in a similar fashion. I hope this helped.
