Monday, June 15, 2015

Installing Spark MLlib on Linux and Running Spark MLlib Implementations

Spark MLlib is a machine learning library that ships with Apache Spark and can run on any Hadoop 2/YARN cluster without any extra installation. It is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.

The key features of Spark MLlib include:

1. Scalability
2. Performance
3. User-friendly APIs
4. Integration with Spark and its other components

There is nothing special about installing MLlib: it is already included in Spark. So if your machine already has Spark installed and running, there is nothing more to do for Spark MLlib. If not, you can follow this link to install Spark in standalone mode.

Running Logistic Regression on Spark MLlib


Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually continuous, by estimating probabilities. Logistic regression can be binomial or multinomial. Binomial (binary) logistic regression deals with situations in which the observed outcome of the dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
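
Before turning to the packaged example, here is a minimal Scala sketch of binomial logistic regression against MLlib's RDD-based API, assuming a Spark 1.2-era setup; the application name, input path, and split ratio are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

object BinaryLRSketch {
  def main(args: Array[String]): Unit = {
    // spark-submit normally supplies the master; it is set here only so the
    // sketch is self-contained.
    val sc = new SparkContext(new SparkConf().setAppName("BinaryLRSketch").setMaster("local[2]"))

    // Load a LIBSVM-formatted dataset as an RDD[LabeledPoint].
    val data = MLUtils.loadLibSVMFile(sc, "classification_data/sample_binary_classification_data.txt")

    // Hold out 20% of the data for testing (illustrative split).
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Binomial logistic regression trained with L-BFGS.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

    // Fraction of held-out points classified correctly.
    val accuracy = test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
    println(s"Test accuracy = $accuracy")

    sc.stop()
  }
}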

Spark provides the 'spark-submit' script to submit jobs to the Spark cluster. The MLlib algorithm implementations themselves ship inside the Spark assembly jar, while runnable example drivers for them (including the BinaryClassification driver used below) are packaged in the examples jar, spark-examples-*-cdh*-hadoop*-cdh*.jar.

We will now run Logistic Regression as follows:

Step-1: Export the required environment variables



export JAVA_HOME='your_java_home'                                                                                            
export SPARK_HOME='your_spark_home'

Step-2: Gather the dataset to run the algorithm on



mkdir ~/SparkMLlib
cd ~/SparkMLlib/
wget https://sites.google.com/site/jayatiatblogs/attachments/sample_binary_classification_data.txt                       

Now that you have the dataset, copy it to HDFS.


hdfs dfs -mkdir -p /user/${USER}/classification_data
hdfs dfs -put -f $HOME/SparkMLlib/sample_binary_classification_data.txt /user/${USER}/classification_data/                                                                             
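
For reference, the sample file is in LIBSVM format, which MLlib's loaders parse into labeled, sparse feature vectors. Schematically, each line looks like

<label> <index1>:<value1> <index2>:<value2> ...

with one-based, ascending feature indices.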

Step-3: Submit the job to run Logistic Regression using the 'spark-submit' script



$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master local[2] \
  $SPARK_HOME/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar \
  --algorithm LR --regType L2 --regParam 1.0 \
  /user/${USER}/classification_data/sample_binary_classification_data.txt
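
Here --algorithm LR selects logistic regression, and --regType L2 with --regParam 1.0 adds an L2 (ridge) penalty of strength 1.0. Roughly, and with labels $y_i \in \{-1, +1\}$, the optimizer minimizes the L2-regularized logistic loss (the exact parametrization may differ slightly across Spark versions):

$$\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i\, w^{\top} x_i}\right) + \frac{\lambda}{2}\,\lVert w \rVert_2^2, \qquad \lambda = \texttt{regParam} = 1.0$$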

If all goes well, you should see the following at the end of a long log:


Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.                                                                                                                 
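
Both figures are areas under the precision-recall and ROC curves computed on a held-out test split; 1.0 means the model ranked every positive test example above every negative one on this small sample dataset. Below is a minimal sketch of how such metrics can be computed with MLlib's BinaryClassificationMetrics; the (score, label) pairs are made up, whereas in the example job they come from model predictions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

object AucSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AucSketch").setMaster("local[2]"))

    // Hypothetical (score, label) pairs: scores perfectly separate the classes.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)))

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println(s"Test areaUnderPR = ${metrics.areaUnderPR()}")
    println(s"Test areaUnderROC = ${metrics.areaUnderROC()}")

    sc.stop()
  }
}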

Finally, let's clean up HDFS:


hdfs dfs -rm -r -skipTrash /user/${USER}/classification_data                                                          

You can run the other Spark MLlib example implementations in a similar fashion, given the appropriate data.

Good luck.
