The two most common commands used to pre-process training or test data when running Mahout algorithms are:
- seqdirectory: Turns raw text files in a directory into Mahout sequence files.
- seq2sparse: Creates TF-IDF weighted vectors from the sequence files produced by “seqdirectory”.
Let’s start then!
Command Options
First, have a look at the command-line options that can be used with “seqdirectory”.
-archives <paths>: comma separated archives to be unarchived on the compute machines.
-conf <configuration file>: specify an application configuration file.
-D <property=value>: use value for given property.
-files <paths>: comma separated files to be copied to the map reduce cluster.
-fs <local|namenode:port>: specify a namenode.
-jt <local|jobtracker:port>: specify a job tracker.
-libjars <paths>: comma separated jar files to include in the classpath.
-tokenCacheFile <tokensFile>: name of the file with the tokens.
Job-specific Options
--input (-i) input: Path to job input directory.
--output (-o) output: The directory pathname for output.
--overwrite (-ow): If present, overwrite the output directory before running job.
--chunkSize (-chunk) chunkSize: The chunkSize in MegaBytes. Defaults to 64.
--fileFilterClass (-filter) fileFilterClass: The name of the class to use for file parsing.
--keyPrefix (-prefix) keyPrefix: The prefix to be prepended to the key.
--charset (-c) charset: The name of the character encoding of the input files.
--tempDir tempDir: Intermediate output directory.
--startPhase startPhase: First phase to run.
--endPhase endPhase: Last phase to run.
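To see how these options fit together, here is a small Python sketch that assembles a “seqdirectory” command line from the job-specific options listed above. It only builds and prints the command; it does not run Mahout. The helper name `build_seqdirectory_command` and the paths `/data/raw-text` and `/data/seqfiles` are made up for illustration, and the sketch assumes the launcher script is available on your PATH as `mahout`.

```python
import shlex


def build_seqdirectory_command(input_dir, output_dir, charset="UTF-8",
                               chunk_size_mb=64, key_prefix=None,
                               overwrite=False):
    """Assemble a `mahout seqdirectory` argument list (hypothetical helper).

    Only long-form options from the list above are used.
    """
    args = [
        "mahout", "seqdirectory",           # assumes `mahout` is on the PATH
        "--input", input_dir,               # directory holding the raw text
        "--output", output_dir,             # where the sequence files go
        "--charset", charset,               # encoding of the input files
        "--chunkSize", str(chunk_size_mb),  # chunk size in megabytes
    ]
    if key_prefix is not None:
        args += ["--keyPrefix", key_prefix]  # prepended to every key
    if overwrite:
        args.append("--overwrite")           # clear the output directory first
    return args


# Hypothetical paths, for illustration only.
cmd = build_seqdirectory_command("/data/raw-text", "/data/seqfiles",
                                 overwrite=True)
print(shlex.join(cmd))
```

If Mahout is installed, the same list could be handed to `subprocess.run(cmd, check=True)` to actually launch the job.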
Code-level Explanation
The following image divides the internal execution of the command into steps, and then breaks each step down into tasks, to give a clear picture of what happens behind the scenes.
Code-level understanding of the "seqdirectory" command
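As a rough mental model of what the command produces, the Python sketch below walks a directory, turns every file into a (key, text) record, and groups the records into size-limited chunks. This is not Mahout's code and does not write real sequence files. The function names, the key layout (prefix + "/" + relative path), and the chunking rule are all assumptions made for illustration, loosely mirroring the --keyPrefix, --charset, and --chunkSize options.

```python
import os
import tempfile


def directory_to_records(input_dir, charset="utf-8", key_prefix=""):
    """Yield one (key, text) pair per file under input_dir.

    Assumed key layout: key_prefix + "/" + path relative to input_dir.
    """
    for root, dirs, files in os.walk(input_dir):
        dirs.sort()  # visit sub-directories in a deterministic order
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, input_dir).replace(os.sep, "/")
            with open(path, encoding=charset) as handle:
                yield key_prefix + "/" + rel, handle.read()


def split_into_chunks(records, chunk_size_bytes):
    """Group records into lists of roughly chunk_size_bytes each.

    A new chunk starts when adding the next record would exceed the limit.
    """
    chunk, size = [], 0
    for key, text in records:
        record_size = len(key.encode("utf-8")) + len(text.encode("utf-8"))
        if chunk and size + record_size > chunk_size_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append((key, text))
        size += record_size
    if chunk:
        yield chunk


# Tiny demonstration on a throwaway directory with two text files.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("hello mahout")
    with open(os.path.join(tmp, "sub", "b.txt"), "w", encoding="utf-8") as f:
        f.write("second document")

    records = list(directory_to_records(tmp))
    chunks = list(split_into_chunks(records, chunk_size_bytes=20))

for key, text in records:
    print(key, "->", text)
print("number of chunks:", len(chunks))
```

The chunk limit is set to 20 bytes here only so that the two tiny files end up in separate chunks; the real option is expressed in megabytes and defaults to 64.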
The next post describes the "seq2sparse" command in a similar fashion. Hope this helped.