Spark MLlib is a machine learning library that ships with Apache Spark and can run on any Hadoop 2/YARN cluster without any additional installation. It is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as the underlying optimization primitives.
The key features of Spark MLlib include:
1. Scalability
2. Performance
3. User-friendly APIs
4. Integration with Spark and its other components
There is nothing special about installing MLlib; it ships as part of Spark. If your machine already has Spark installed and running, no extra setup is needed for Spark MLlib. You can follow this link to install Spark in standalone mode if you haven't already done so.
Running Logistic Regression on Spark MLlib
Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables, which are usually continuous, by estimating probabilities. Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").
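Besides the packaged example program used in this walkthrough, MLlib's logistic regression can also be invoked directly from Scala. Below is a minimal sketch of the binomial case, assuming Spark 1.2-era MLlib APIs (LogisticRegressionWithLBFGS, MLUtils.loadLibSVMFile); the object name is hypothetical and the data path is a placeholder for the LIBSVM-format file used later in this post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils

// Hypothetical driver object; adjust the data path to your environment.
object BinaryLogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BinaryLogisticRegressionSketch"))

    // Each line of a LIBSVM-format file holds a label and a sparse feature vector.
    val data = MLUtils.loadLibSVMFile(sc, "classification_data/sample_binary_classification_data.txt")

    // Hold out 40% of the points for testing.
    val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    training.cache()

    // Binomial case: two outcome types. For a multinomial problem with k
    // outcome types, setNumClasses(k) would be used instead.
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)

    // Predict a class label (0.0 or 1.0) for each held-out point.
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    val accuracy = predictionAndLabel.filter { case (p, l) => p == l }.count().toDouble / test.count()
    println(s"Test accuracy = $accuracy")

    sc.stop()
  }
}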
Spark provides the 'spark-submit' script (under $SPARK_HOME/bin) to submit jobs to the Spark cluster. The MLlib algorithm implementations ship in the Spark assembly jar (spark-assembly-*-cdh*-hadoop*-cdh*.jar), while the runnable example programs that drive them, such as the one used below, are packaged in the examples jar (spark-examples-*.jar).
We shall now run Logistic Regression as follows:
Step-1: Export the required environment variables
export JAVA_HOME='your_java_home'
export SPARK_HOME='your_spark_home'
Step-2: Gather the dataset to run the algorithm on
mkdir ~/SparkMLlib
cd ~/SparkMLlib/
wget https://sites.google.com/site/jayatiatblogs/attachments/sample_binary_classification_data.txt
Now that you have the data set, copy it to HDFS.
hdfs dfs -mkdir -p /user/${USER}/classification_data
hdfs dfs -put -f $HOME/SparkMLlib/sample_binary_classification_data.txt /user/${USER}/classification_data/
Step-3: Submit the job to run Logistic Regression using the 'spark-submit' script
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification \
  --master local[2] \
  $SPARK_HOME/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar \
  --algorithm LR --regType L2 --regParam 1.0 \
  /user/${USER}/classification_data/sample_binary_classification_data.txt
If all goes well, you should see the following at the end of a long stretch of log messages:
Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.
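These two figures are ranking metrics computed with MLlib's BinaryClassificationMetrics; values of 1.0 mean the model ranks every positive example above every negative one on this small sample set. As a rough sketch of how such numbers are produced (continuing from the hypothetical training snippet earlier, and assuming the model's default 0.5 decision threshold is cleared so that predict() returns raw scores):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Clearing the threshold makes predict() return raw scores rather than
// hard 0/1 labels, which is what ROC and precision-recall curves need.
model.clearThreshold()

val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Test areaUnderPR = ${metrics.areaUnderPR()}.")
println(s"Test areaUnderROC = ${metrics.areaUnderROC()}.")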
Finally, let's clean up the data we copied to HDFS:
hdfs dfs -rm -r -skipTrash /user/${USER}/classification_data
You can run the other algorithm implementations in Spark MLlib in a similar fashion, supplying the data each one requires.
Good luck.