Showing posts with label SparklingWater.

Monday, June 15, 2015

Installing Sparkling Water and Running Sparkling Water's Deep Learning

Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on each node of the Spark cluster and to access data stored in both Spark and H2O data structures.

Sparkling Water provides transparent integration of the H2O engine and its machine learning algorithms into the Spark platform, enabling (a short sketch follows the list):

1. Use of H2O algorithms in a Spark workflow
2. Transformation between H2O and Spark data structures
3. Use of Spark RDDs as input for H2O algorithms
4. Transparent execution of Sparkling Water applications on top of Spark
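
As a rough illustration of points 2 and 3, the Scala snippet below shows how an H2OContext bridges the two worlds. This is a minimal sketch only: it assumes an existing SparkContext (sc), the Flight case class is made up, and the way the H2OContext is obtained as well as the conversion names (asH2OFrame, asRDD, H2OFrame) vary across Sparkling Water versions.


import org.apache.spark.h2o._

// Starting the H2OContext launches H2O services on each Spark executor.
val h2oContext = new H2OContext(sc).start()
import h2oContext._   // implicit conversions between Spark and H2O data structures

// Illustrative case class; any product type can be converted.
case class Flight(Year: Int, Dest: String, Distance: Int)
val rdd = sc.parallelize(Seq(Flight(2008, "SFO", 2704), Flight(2008, "LAX", 2475)))

// Spark RDD -> H2O frame, usable as input to H2O algorithms ...
val hf: H2OFrame = h2oContext.asH2OFrame(rdd)

// ... and back from H2O to a Spark RDD.
val backToRdd = h2oContext.asRDD[Flight](hf)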

Spark is a prerequisite for installing Sparkling Water. If you have not set it up already, you can follow this link to install Spark in standalone mode.

Installing Sparkling Water


Create a working directory for Sparkling Water


mkdir $HOME/SparklingWater
cd $HOME/SparklingWater/

Clone the Sparkling Water repository


git clone https://github.com/0xdata/sparkling-water.git

Running Deep Learning on Sparkling Water


Deep Learning is a relatively new area of Machine Learning research that aims to move Machine Learning closer to Artificial Intelligence. Deep Learning algorithms are based on the (unsupervised) learning of multiple levels of features or representations of the data: higher-level features are derived from lower-level features to form a hierarchical representation. They are part of the broader machine learning field of learning representations of data, and they learn multiple levels of representation that correspond to different levels of abstraction; the levels form a hierarchy of concepts.
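
In H2O's Scala API, which the DeepLearningDemo below drives, these stacked levels of representation correspond to the hidden layers of the network. The following is only a hedged sketch of training such a model: airlinesFrame is an assumed, already-loaded H2O frame (for example converted from a Spark RDD via the H2OContext), IsDepDelayed is the response column of the airlines dataset, and the exact imports and parameter field names may differ between H2O versions.


import hex.deeplearning.DeepLearning
import hex.deeplearning.DeepLearningModel.DeepLearningParameters

// Sketch only: airlinesFrame is assumed to be an H2O frame that is already loaded.
val dlParams = new DeepLearningParameters()
dlParams._train = airlinesFrame._key            // training frame
dlParams._response_column = "IsDepDelayed"      // column to predict
dlParams._hidden = Array(100, 100)              // two hidden layers = two learned levels of representation
dlParams._epochs = 5

// trainModel returns an H2O job; get() blocks until training finishes.
val dlModel = new DeepLearning(dlParams).trainModel.get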

1. Download a prebuilt Spark package and extract it. This is needed because the system-wide Spark installation directory is read-only, and the examples we are going to run need to write to the Spark folder.


wget https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.3.tgz
tar -xzf spark-1.2.0-bin-hadoop2.3.tgz

2. Export SPARK_HOME, pointing at the extracted Spark directory


export SPARK_HOME="$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3"

3. Run the DeepLearningDemo example from the cloned Sparkling Water repository. It runs DeepLearning on a subset of the airlines dataset (see sparkling-water/examples/smalldata/allyears2k_headers.csv.gz).


cd $HOME/SparklingWater/sparkling-water
bin/run-example.sh DeepLearningDemo

4. In the (long) logs of the running job, look for snippets similar to the following:


Sparkling Water started, status of context:
Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (0,127.0.0.1,54325)
  (1,127.0.0.1,54327)
  (2,127.0.0.1,54321)
  ------------------------
Output of jobs

===> Number of all flights via RDD#count call: 43978
===> Number of all flights via H2O#Frame#count: 43978
===> Number of flights with destination in SFO: 1331
====>Running DeepLearning on the result of SQL query

To stop the job, press Ctrl+C. Logs like those above provide a lot of information about the job. You can also run the other algorithm examples in the same way.

Good Luck.

Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using it to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm consists of two phases (a short sketch follows the list):

1. Training Phase: the algorithm learns from the input data and builds a model for later reference.
2. Testing Phase: the algorithm predicts results based on what it learned, as stored in the model.
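
As a concrete illustration of the two phases, here is a small, hedged Scala sketch using logistic regression from Spark MLlib (one of the libraries compared below). It assumes an existing SparkContext (sc) and uses the RDD-based MLlib API of the Spark 1.x line; the tiny dataset is made up.


import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Made-up labelled examples: (label, feature vector).
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.5)),
  LabeledPoint(1.0, Vectors.dense(3.0, 2.5))
))

// Training phase: learn a model from the labelled examples (100 iterations of SGD).
val model = LogisticRegressionWithSGD.train(training, 100)

// Testing phase: use the stored model to predict the label of an unseen input.
val prediction = model.predict(Vectors.dense(2.5, 2.0))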

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.
2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.

There is a wide range of machine learning libraries that provide implementations of various classes of algorithms. In the coming posts, we shall evaluate the following open-source machine learning libraries on performance, scalability, range of algorithms provided, and extensibility.

1. H2O
2. Spark MLlib
3. Sparkling Water
4. Weka


In the following posts, we shall install each of the above libraries and run one algorithm that is implemented in all of them. This will give us insight into ease of use and execution, as well as the performance and accuracy of each implementation.