Jayati Tiwari: SparkMLlib

Monday, June 15, 2015

Installing SparkMLlib on Linux and Running SparkMLlib implementations

SparkMLlib is the machine learning library that ships with Apache Spark and can run on any Hadoop2/YARN cluster without any pre-installation. It is scalable and consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as the underlying optimization primitives.

The key features of SparkMLlib include:

1. Scalability
2. Performance
3. User-friendly APIs
4. Integration with Spark and its other components

There is nothing special about installing MLlib: it is already included in Spark. So if your machine already has Spark installed and running, there is nothing extra to do for Spark MLlib. If not, you can follow this link to install Spark in standalone mode.

Running Logistic Regression on SparkMLlib

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually continuous, by estimating probabilities. Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
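To make "estimating probabilities" concrete, here is a minimal sketch of the binomial case in plain Python. The coefficients and the 0.5 decision threshold are hypothetical values picked for illustration; they are not taken from any trained Spark model.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one observation."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.5]
bias = 0.1

p = predict_probability(weights, bias, [2.0, 1.0])
# Assumed decision rule: call it the positive class at probability >= 0.5.
label = 1 if p >= 0.5 else 0
print(p, label)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the observed outcomes as closely as possible.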

Spark provides the 'spark-submit' script to submit jobs to the Spark cluster. The jar spark-assembly-*-cdh*-hadoop*-cdh*.jar comprises all the algorithm implementations.

We shall now run Logistic Regression as follows:

Step-1: Export the required environment variables

export JAVA_HOME='your_java_home'
export SPARK_HOME='your_spark_home'

Step-2: Gather the dataset to run the algorithm on

mkdir ~/SparkMLlib
cd ~/SparkMLlib/
wget https://sites.google.com/site/jayatiatblogs/attachments/sample_binary_classification_data.txt

Now that you have the data set, copy it to HDFS.

hdfs dfs -mkdir -p /user/${USER}/classification_data
hdfs dfs -put -f $HOME/SparkMLlib/sample_binary_classification_data.txt /user/${USER}/classification_data/
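If you want to peek at the data before submitting the job, the sketch below parses one record. It assumes the file uses the LIBSVM text format ("label index:value index:value ..." with 1-based feature indices); the helper name parse_libsvm_line and the sample record are made up for illustration.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        # LIBSVM indices start at 1; shift to 0-based for Python use.
        features[int(index) - 1] = float(value)
    return label, features


# Made-up record, not copied from the downloaded file.
sample = "1 3:0.5 10:2.0"
label, features = parse_libsvm_line(sample)
print(label, features)
```

Only the non-zero features are listed in each record, which is why the result is kept as a sparse dictionary rather than a full list.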

Step-3: Submit the job to run Logistic Regression using the 'spark-submit' script

$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master local[2] \
  $SPARK_HOME/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar \
  --algorithm LR --regType L2 --regParam 1.0 \
  /user/${USER}/classification_data/sample_binary_classification_data.txt

If all works fine, you should see the following at the end of a long log output:

Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.
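To interpret the areaUnderROC figure: it can be read as the fraction of (positive, negative) example pairs in which the model scores the positive example higher, so 1.0 means the test examples were ranked perfectly. The sketch below computes it that way in plain Python on made-up scores. This is an equivalent pairwise formulation for illustration, not the code Spark runs internally.

```python
def area_under_roc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up scores: every positive outranks every negative.
perfect = area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
# Made-up scores: one positive (0.3) is ranked below one negative (0.8).
mixed = area_under_roc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0])
print(perfect, mixed)
```

A score of 0.5 corresponds to random ranking, so values close to 1.0 on a small sample dataset mostly indicate that the two classes are easy to separate.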

Finally, clean up the data you copied to HDFS.

hdfs dfs -rm -r -skipTrash /user/${USER}/classification_data

You can run the other SparkMLlib implementations in a similar fashion, given the data each one requires.

Good luck.

Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm consists of two phases:

1. Training Phase: When the algorithm learns from the input data and creates a model for reference.

2. Testing Phase: When the algorithm predicts the results based on its learnings stored in the model.
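The two phases can be sketched with a deliberately simple classifier. The example below uses a nearest-centroid rule, chosen only because it fits in a few lines, and the data points are made up for illustration.

```python
def train(examples):
    """Training phase: learn one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Testing phase: pick the label whose stored centroid is closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Made-up data: two well-separated groups labelled "a" and "b".
training_data = [([1.0, 1.0], "a"), ([1.0, 2.0], "a"),
                 ([8.0, 9.0], "b"), ([9.0, 8.0], "b")]
test_data = [([2.0, 1.0], "a"), ([8.0, 8.0], "b")]

model = train(training_data)
accuracy = sum(predict(model, f) == y for f, y in test_data) / len(test_data)
print(model, accuracy)
```

The model here is just the dictionary of centroids: everything the testing phase needs is stored in it, and the training data itself is no longer consulted.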

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.

2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.
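As a small contrast to the supervised case, the sketch below groups unlabelled numbers with a one-dimensional k-means loop. No outputs are given to the algorithm; it only sees the observations. The points, the starting centers and the fixed iteration count are assumptions made for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


# Made-up observations with no labels attached.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers = kmeans_1d(points, [0.0, 5.0])
print(centers)
```

The two centers settle near 1 and 9, which is structure the algorithm found in the observations without ever being told which group a point belongs to.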

There is a wide range of machine learning libraries that provide implementations of various classes of algorithms. In my coming posts, we shall be evaluating the following open-source machine learning APIs on performance, scalability, range of algorithms provided and extensibility.

1. H2O

2. SparkMLlib

3. Sparkling Water

4. Weka

In the following posts, we shall install each of the above libraries and run one implementation of an algorithm that is available in all of them. This will give us an insight into the ease of use and execution, the performance of the algorithm and its accuracy.