
Friday, July 24, 2015

Understanding Data Pre-processing in Mahout – Part II

Continuing from my previous post, which described the first of the two commonly used commands for data pre-processing in Mahout, this post covers the second one: “seq2sparse”.

The command expects sequence files as input, typically produced by the “seqdirectory” command. During its run, it creates several sub-directories in the output directory, such as tokenized-documents and tf-vectors. The flow described below explains the command’s execution in the order in which these sub-directories are formed. The input directory is assumed to contain two sub-directories, each holding two files.
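For illustration, a layout like the following would do (the directory and file names here are made up; any plain-text files work):

    input-docs/
    ├── dir1/
    │   ├── file1.txt
    │   └── file2.txt
    └── dir2/
        ├── file3.txt
        └── file4.txt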


Command Options


Usage:

 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
  --output <output> --input <input> --minDF <minDF> --maxDFSigma <maxDFSigma>
  --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
  --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
  --sequentialAccessVector --namedVector --logNormalize]

Options:                                                                        

--minSupport (-s) minSupport: (Optional) Minimum support. Default value: 2
--analyzerName (-a) analyzerName: The class name of the analyzer
--chunkSize (-chunk) chunkSize: The chunk size in megabytes. 100-10000 MB
--output (-o) output: The directory pathname for output.
--input (-i) input: Path to job input directory.
--minDF (-md) minDF: The minimum document frequency. Default is 1
--maxDFSigma (-xs) maxDFSigma: What portion of the tf (tf-idf) vectors to keep, expressed as a multiple of the standard deviation (sigma) of the document frequencies of these vectors. Can be used to remove really high-frequency terms. Expressed as a double value; 3.0 is a good value to specify. If the value is less than 0, no vectors are filtered out. Default is -1.0. Overrides maxDFPercent
--maxDFPercent (-x) maxDFPercent: The maximum percentage of documents a term may appear in (its DF). Can be used to remove really high-frequency terms. Expressed as an integer between 0 and 100. Default is 99. If maxDFSigma is also set, it overrides this value.
--weight (-wt) weight: The kind of weight to use. Currently TF or TFIDF
--norm (-n) norm: The norm to use, expressed as either a float or "INF" if you want to use the infinite norm. Must be greater than or equal to 0. The default is not to normalize
--minLLR (-ml) minLLR: (Optional) The minimum log-likelihood ratio (float). Default is 1.0
--numReducers (-nr) numReducers: (Optional) Number of reduce tasks. Default value: 1
--maxNGramSize (-ng) ngramSize: (Optional) The maximum size of ngrams to create (2 = bigrams, 3 = trigrams, etc.). Default value: 1
--overwrite (-ow): If set, overwrite the output directory
--help (-h): Print out help
--sequentialAccessVector (-seq): (Optional) Whether output vectors should be SequentialAccessVectors. True if set, else false
--namedVector (-nv): (Optional) Whether output vectors should be NamedVectors. True if set, else false
--logNormalize (-lnorm): (Optional) Whether output vectors should be log-normalized. True if set, else false
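To make these options concrete, here is what a typical invocation might look like. This is only a sketch: the paths are placeholders, and the flag values are illustrative choices rather than requirements.

    # -i: the sequence files produced by seqdirectory; -o: destination directory
    # -wt tfidf: use TF-IDF weighting; -ng 2: also generate bigrams
    # -ml 50: minimum log-likelihood ratio for keeping ngrams
    # -nv: emit NamedVectors (handy for debugging); -ow: overwrite existing output
    mahout seq2sparse -i output-seqdir -o output-sparse \
      -wt tfidf -ng 2 -ml 50 -nv -ow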

Code-level Explanation


[Figure: Code-level understanding of the "seq2sparse" command]
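If you would like to poke around the output yourself, commands along these lines should do it. The paths are placeholders, and the exact sub-directories depend on the options used (for instance, tfidf-vectors appears when TF-IDF weighting is chosen):

    # list the sub-directories seq2sparse created
    hadoop fs -ls output-sparse
    # dump a few records from the tokenized documents and the final vectors
    mahout seqdumper -i output-sparse/tokenized-documents
    mahout seqdumper -i output-sparse/tfidf-vectors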
Hope this helped!

Understanding Data Pre-processing in Mahout – Part I

The two most common commands used for pre-processing training or test data when running Mahout algorithms are:
  • seqdirectory: Turns raw text files in a directory into Mahout sequence files.
  • seq2sparse: Creates TF-IDF weighted vectors from the sequence files created by “seqdirectory”.
This blog post describes the first command, “seqdirectory”, down to the code level; the next post will focus on the second.
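Before diving in, here is a minimal sketch of the two-step pipeline on the command line, with placeholder paths:

    # step 1: raw text files -> Mahout sequence files
    mahout seqdirectory -i input-docs -o output-seqdir -ow
    # step 2: sequence files -> TF-IDF weighted vectors
    mahout seq2sparse -i output-seqdir -o output-sparse -wt tfidf -ow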

Let’s start then!

Command Options


First, have a look at the generic command-line options that can be used with “seqdirectory”.

 -archives <paths>: comma-separated archives to be unarchived on the compute machines
 -conf <configuration file>: specify an application configuration file
 -D <property=value>: use value for given property
 -files <paths>: comma-separated files to be copied to the map-reduce cluster
 -fs <local|namenode:port>: specify a namenode
 -jt <local|jobtracker:port>: specify a job tracker
 -libjars <paths>: comma-separated jar files to include in the classpath
 -tokenCacheFile <tokensFile>: name of the file with the tokens

Job-specific Options


  --input (-i) input: Path to job input directory.
  --output (-o) output:  The directory pathname for output.                      
  --overwrite (-ow): If present, overwrite the output directory before running the job
  --chunkSize (-chunk) chunkSize: The chunkSize in MegaBytes. Defaults to 64              
  --fileFilterClass (-filter) fileFilterClass: The name of the class to use for file parsing.
  --keyPrefix (-prefix) keyPrefix: The prefix to be prepended to the key                      
  --charset (-c) charset: The name of the character encoding of the input files.
  --tempDir tempDir: Intermediate output directory
  --startPhase startPhase: First phase to run          
  --endPhase endPhase: Last phase to run          
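Putting a few of these options together, a run might look like the following sketch (the paths and the key prefix are placeholders):

    # read raw text under input-docs, write sequence files in 64 MB chunks,
    # prefix every key with "doc-", and read the input files as UTF-8
    mahout seqdirectory -i input-docs -o output-seqdir \
      -ow -chunk 64 -prefix doc- -c UTF-8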

Code-level Explanation


The following image breaks the internal execution of the command into steps, and each step into tasks, to give a clear picture of what happens behind the scenes.

[Figure: Code-level understanding of the "seqdirectory" command]
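To verify the result, you can dump the generated chunk; each key should be a document’s path (plus the keyPrefix, if one was given) and each value that document’s raw text. A sketch, assuming the default chunk-0 file naming:

    # print the first few key/value pairs from the generated sequence file
    # (-n limits how many records are printed; drop it to dump everything)
    mahout seqdumper -i output-seqdir/chunk-0 -n 4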


The next post describes the "seq2sparse" command in a similar fashion. Hope this helped.