Monday, June 15, 2015

Installing sparkling-water and Running sparkling-water's Deep Learning

Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on each node in the Spark cluster and access data stored in data structures of Spark and H2O.

Sparkling Water provides transparent integration for the H2O engine and its machine learning algorithms into the Spark platform, enabling:

1. Use of H2O algorithms in a Spark workflow
2. Transformation between H2O and Spark data structures
3. Use of Spark RDDs as input for H2O algorithms
4. Transparent execution of Sparkling Water applications on top of Spark

Sparkling Water requires an existing Spark installation. If you have not installed Spark yet, set it up in standalone mode before continuing.

Installing Sparkling Water


Create a working directory for Sparkling Water


mkdir -p "$HOME/SparklingWater"
cd "$HOME/SparklingWater"

Clone the Sparkling Water repository (on Linux)


git clone https://github.com/0xdata/sparkling-water.git

Running Deep Learning on Sparkling Water


Deep Learning is an area of machine learning research that aims to bring machine learning closer to artificial intelligence. Deep Learning algorithms learn multiple levels of features, or representations, of the data, often without supervision. Higher-level features are derived from lower-level ones, so the levels correspond to increasing levels of abstraction and form a hierarchy of concepts. These methods are part of the broader machine learning field of learning representations of data.

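1. Download a prebuilt Spark distribution into the working directory and unpack it. A private copy is needed because a system-wide Spark installation directory is typically read-only, and the examples we run need to write to the Spark folder. Download the archive file itself, not the mirror-selection page, and pick the build whose name matches the directory exported in the next step.


wget https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.3.tgz
tar -xzf spark-1.2.0-bin-hadoop2.3.tgz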

2. Export the Spark home


export SPARK_HOME="$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3"
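
One detail worth checking when exporting the path: the shell expands $HOME only inside double quotes (or with no quotes), not inside single quotes. The short sketch below prints both forms so the difference is visible. The variable names literal_path and expanded_path are made up for this illustration.

```shell
# Single quotes keep the text literal; double quotes let the shell expand $HOME.
literal_path='$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3'
expanded_path="$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3"

echo "single-quoted: $literal_path"
echo "double-quoted: $expanded_path"

# Export the expanded form so tools that read SPARK_HOME receive a real path.
export SPARK_HOME="$expanded_path"
```

If the single-quoted form is exported, SPARK_HOME contains the literal characters $HOME and scripts that use it will most likely fail to find Spark.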

3. Run the DeepLearningDemo example from the root of the Sparkling Water checkout (the sparkling-water directory created by the clone). It runs Deep Learning on a subset of the airlines dataset (sparkling-water/examples/smalldata/allyears2k_headers.csv.gz).


cd "$HOME/SparklingWater/sparkling-water"
bin/run-example.sh DeepLearningDemo
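
Before launching the demo, it can help to confirm that the pieces from the earlier steps are in place. The following is only a sketch: preflight is a hypothetical helper, not part of Sparkling Water, and it checks just two things taken from this post, namely that SPARK_HOME points to an existing directory and that bin/run-example.sh exists in the checkout.

```shell
# Hypothetical helper (not shipped with Sparkling Water):
# returns 0 when the environment looks ready, 1 otherwise.
preflight() {
    checkout_dir="$1"

    # SPARK_HOME must be set and must point to an existing directory.
    if [ -z "${SPARK_HOME:-}" ] || [ ! -d "$SPARK_HOME" ]; then
        echo "SPARK_HOME is not set or is not a directory" >&2
        return 1
    fi

    # The run script is expected under bin/ in the checkout.
    if [ ! -f "$checkout_dir/bin/run-example.sh" ]; then
        echo "bin/run-example.sh not found under $checkout_dir" >&2
        return 1
    fi

    return 0
}

# Example call, assuming the directory layout used in this post:
#   preflight "$HOME/SparklingWater/sparkling-water" && echo "ready to run"
```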

4. In the lengthy log output of the running job, look for snippets like the following:


Sparkling Water started, status of context:
Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (0,127.0.0.1,54325)
  (1,127.0.0.1,54327)
  (2,127.0.0.1,54321)
  ------------------------
Output of jobs

===> Number of all flights via RDD#count call: 43978
===> Number of all flights via H2O#Frame#count: 43978
===> Number of flights with destination in SFO: 1331
====>Running DeepLearning on the result of SQL query
                                                                                                                                                                

To stop the job, press Ctrl+C. Logs like those above provide a lot of information about the job. You can run the other algorithm examples in the same way.
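
Because the log is long, one way to spot the result lines is to filter for the "===>" markers. The sketch below assumes those lines start at the beginning of a line, as in the snippet above. result_lines is a hypothetical helper name, and demo.log in the usage comment is an arbitrary file name.

```shell
# Hypothetical helper: read a log on standard input and keep only the
# lines that begin with one or more "=" followed by ">".
result_lines() {
    grep -E '^=+>'
}

# Possible usage, assuming the demo prints to stdout or stderr:
#   bin/run-example.sh DeepLearningDemo 2>&1 | tee demo.log | result_lines
```

Saving the full output with tee keeps the complete log available while only the summary lines scroll past on screen.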

Good Luck.
