Monday, April 27, 2015

Common Functions in R

Google "What's R", and you'll see there are many ways in which R has been defined. Put simply, it is first a programming language, and second, a language dedicated to statistical computing and graphics; it is not a general-purpose language like Java. Now the question is: what all comes under "statistical computing and graphics"? Wikipedia explains statistical computing as the interface between the mathematical science of statistics and computer science, and statistics as the study of the collection, analysis, interpretation, presentation, and organization of data. So basically, R is a programming language dedicated to performing exactly that study. And now you know why it is talked about so much: in this Big Data yug (meaning era), data is everything for everyone.

Hoping that you have R installed on your Linux/Windows machine, I intend to showcase a few common operations that you will almost always need. I have also written about how to perform all these operations using Hive in one of my previous posts. And if you don't have R installed yet, follow this link to get it done.

1. Sorting in R
2. Searching in R
3. Joins in R
4. Sampling in R
5. Calculating median in R
6. Calculating mean in R
7. Finding elements from one vector which don't exist in another vector in R

Open a new terminal on your machine and type 'R' to open an R shell. All the operations that follow will need the R shell.

Sorting a Vector in R

The sort() function in R takes a vector as input and returns it sorted, in ascending order by default (pass decreasing = TRUE for descending order).
#Create a vector
>a <- c(2,6,8,1,3,7,3,60,32)
#Display the vector
> a
[1]  2  6  8  1  3  7  3 60 32
#Sort the vector
> sort(a)
[1]  1  2  3  3  6  7  8 32 60                                                                 
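As an aside for terminal fans (and to parallel the Hive post mentioned above), the same ascending sort can be reproduced at the Linux shell with the sort utility; this is just an illustrative analogue, not part of the R workflow:

```shell
# Put the same numbers in a file, one per line
printf '2\n6\n8\n1\n3\n7\n3\n60\n32\n' > numbers.txt
# -n sorts numerically; plain sort would compare as text (1, 2, 3, 3, 32, 6, ...)
sort -n numbers.txt
# prints 1 2 3 3 6 7 8 32 60, one value per line
```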

Searching an element in Vector in R

The grep() function needs two parameters: the first is the pattern to search for, and the second is the vector to search in. It returns the index of every matching element. Note that grep() does pattern matching, so grep(3, a) would match both 3 and 32; for exact matches, use which(a == 60) instead.
#Search for 60 in the vector 'a'                                                          
>grep(60, a)
[1] 8

Joining Files in R

Joining in R can be done on files: first the CSV files are read into variables, and then the merge() function joins the two data sets. The only condition is that they share at least one column in common whose values can be used to join the files.
#Exit the R shell and in your terminal create two csv files as shown below:

vi ~/my_join_table.csv

#Paste the following content
id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
5,20,5555555

vi ~/my_table.csv

#Paste the following content
id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
5,John,add_5

#Reopen the R shell and load the two csv files to two variables. Replace "user_name" with your username.
mydata1 = read.csv("/home/user_name/my_join_table.csv", header=T)
mydata2 = read.csv("/home/user_name/my_table.csv", header=T)

#Merge the two files
myfulldata = merge(mydata1, mydata2)

#Display the data in the merged file
> myfulldata
  id age   phone  name  address
1  1  18 1111111   Ram add_1
2  2  19 2222222 Shyam add_2
3  3  17 3333333  Sita add_3
4  5  20 5555555  John  add_5
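For the curious, the same inner join can be reproduced at the Linux shell with the join utility, which also matches rows on a common field (both inputs must be sorted on it). This is purely an aside; the file contents below mirror the merged output above, minus headers:

```shell
# my_join_table-style data: id,age,phone (no header, sorted on id)
printf '1,18,1111111\n2,19,2222222\n3,17,3333333\n5,20,5555555\n' > t1.csv
# my_table-style data: id,name,address (no header, sorted on id)
printf '1,Ram,add_1\n2,Shyam,add_2\n3,Sita,add_3\n5,John,add_5\n' > t2.csv
# Join on the first field (the common id column), with ',' as the separator
join -t, t1.csv t2.csv
# 1,18,1111111,Ram,add_1
# 2,19,2222222,Shyam,add_2
# 3,17,3333333,Sita,add_3
# 5,20,5555555,John,add_5
```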

Sampling in R

The 'sample' function has the following syntax:

sample(x, size, replace = FALSE, prob = NULL)
OR, size = n, replace = FALSE, prob = NULL)


x: Either a vector of one or more elements from which to choose, or a positive integer. 
n: a positive number, the number of items to choose from. 
size: a non-negative integer giving the number of items to choose.
replace: Should sampling be with replacement?
prob: A vector of probability weights for obtaining the elements of the vector being sampled.
> x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

> sample(x, 5, replace = FALSE, prob = NULL)                             
[1] 7 2 4 4 1
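A shell-side analogue, for comparison only: GNU shuf draws random lines from a file, without replacement by default and with replacement via -r, much like the replace argument of sample():

```shell
# Same values as the vector x above, one per line
printf '4\n7\n2\n4\n9\n10\n55\n77\n1\n' > x.txt
# 5 draws without replacement, like sample(x, 5, replace = FALSE)
shuf -n 5 x.txt
# 5 draws with replacement, like sample(x, 5, replace = TRUE)
shuf -rn 5 x.txt
```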

Calculating median in R

The 'median' function returns the median of the elements in the vector.
> median(x)
[1] 7                                                                                                   

Calculating mean in R

Similarly, the 'mean' function is used to calculate the mean of all the elements of a vector.
> mean(x)                                                                                          
[1] 18.77778

Find what's in one vector and not in another using R

This is a set-difference (A-B) operation and can be accomplished using the setdiff() function, which takes two parameters. Note that the order of arguments matters: setdiff(x, y) returns the elements of x that are not in y.
x <- c(1,2,3,4)
y <- c(2,3,4)
> setdiff(x, y)
[1] 1
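The shell equivalent of this set difference, for the terminally inclined, is comm -23, which prints the lines unique to the first file (both files must be sorted):

```shell
printf '1\n2\n3\n4\n' > x.txt   # already sorted
printf '2\n3\n4\n' > y.txt
# -2 suppresses lines only in y.txt, -3 suppresses lines common to both
comm -23 x.txt y.txt
# prints: 1
```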

Hope that helps!

Friday, April 24, 2015

Running Naive Bayes Classification algorithm using Weka

Wiki says, "Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable."

Weka also provides a Naive Bayes classification algorithm implementation. Running Weka's algorithms from the command line requires only a very simple setup: all you need is to download the latest release of WEKA (3-6-12 being the latest stable one at the time of writing this post).


Next, you’ll need to unzip this setup, which gives you a directory named “weka-3-6-12”. We shall call it WEKA_HOME for reference in this blog post.

From here on, we shall proceed step by step.

Step-1: Download a dataset to run the classification on

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The classification goal is to predict whether the client will subscribe to a term deposit (variable y). You can read more about the dataset at the UCI Machine Learning Repository.

So, first we shall create a folder to store our dataset and then download it.

mkdir ~/WekaDataSet
cd ~/WekaDataSet

Step-2: Convert the data in CSV data format to ARFF

First we shall create a subset of the entire dataset so as to do a quick test. You can run the test on the entire dataset or other datasets as well later on.

cd bank
head -1000 bank-full.csv > bank-subset.csv
java -cp $WEKA_HOME/weka.jar weka.core.converters.CSVLoader bank-subset.csv > bank-subset-preprocessed.arff

You should see a file called 'bank-subset-preprocessed.arff' in the 'bank' folder.

Step-3: Convert the Numeric data to Nominal using Weka's utility

Weka's filter called 'NumericToNominal' is meant for turning numeric attributes into nominal ones. Unlike discretization, it just takes all numeric values and adds them to the list of nominal values of that attribute. Useful after CSV imports, to enforce certain attributes to become nominal, e.g., the class attribute, containing values from 1 to 5.

java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.attribute.NumericToNominal -i bank-subset-preprocessed.arff -o bank-subset-preprocessed.nominal.arff

Step-4: Divide a part of the data as train and test data

Let's keep all 1000 records in the training dataset. We shall use another Weka utility called RemovePercentage; its -P option specifies the percentage of instances we wish to remove.

java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 0 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-train.nominal.arff

For the test dataset we shall use 40 percent of the data, so the -P option needs to be 60.

java -cp $WEKA_HOME/weka.jar weka.filters.unsupervised.instance.RemovePercentage -P 60 -i bank-subset-preprocessed.nominal.arff  -o  bank-subset-preprocessed-test.nominal.arff
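Conceptually, RemovePercentage just splits off a fixed fraction of the instances (and its -V flag inverts the selection, which you can use to make train and test complementary rather than overlapping, as they do above). A toy head/tail sketch of such a complementary 60/40 split, with plain numbered lines standing in for ARFF instances; this is not a Weka command:

```shell
seq 1 10 > data.txt                      # 10 toy instances
total=$(wc -l < data.txt)
test_n=$((total * 40 / 100))             # 40% for testing -> 4 instances
head -n "$test_n" data.txt > test.txt    # first 40% as test
tail -n +"$((test_n + 1))" data.txt > train.txt   # remaining 60% as train
wc -l < train.txt                        # 6
wc -l < test.txt                         # 4
```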

Step-5: Train the model

Using the Naive Bayes Classifier of Weka "weka.classifiers.bayes.NaiveBayes", we shall first train the model.
-t option: Specify the location of the training data file
-d option: Specify the name and location of the model file you wish to be generated (despite the .arff extension used below, this is a serialized Weka model, not an ARFF data file)

java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -t bank-subset-preprocessed-train.nominal.arff -d bank-subset-preprocessed-model.arff

Step-6: Test the model

This is the final step. We would test the model for accuracy using the same classifier but with a different option set.
-T option: Specify the location of the test data file
-l option: Specify the location of the created model file

java -cp $WEKA_HOME/weka.jar weka.classifiers.bayes.NaiveBayes -T bank-subset-preprocessed-test.nominal.arff -l bank-subset-preprocessed-model.arff

That's it. You can also try the same with different percentages and different datasets.

Hope it helped.

Thursday, April 23, 2015

Installing sparkling-water and Running sparkling-water's Deep Learning

Sparkling Water is designed to be executed as a regular Spark application. It provides a way to initialize H2O services on each node in the Spark cluster and access data stored in data structures of Spark and H2O.

Sparkling Water provides transparent integration for the H2O engine and its machine learning algorithms into the Spark platform, enabling:

1. Use of H2O algorithms in Spark workflow
2. Transformation between H2O and Spark data structures
3. Use of Spark RDDs as input for H2O algorithms
4. Transparent execution of Sparkling Water applications on top of Spark

To install Sparkling Water, a Spark installation is a prerequisite. You can follow this link to install Spark in standalone mode if you haven't already done so.

Installing Sparkling Water

Create a working directory for Sparkling Water

mkdir $HOME/SparklingWater
cd $HOME/SparklingWater/                                                                                                                                                                       

Clone Sparkling Water for linux

git clone                                                                                                                                                           

Running Deep Learning on Sparkling Water

Deep Learning is a new area of Machine Learning research which is closer to Artificial Intelligence. Deep Learning algorithms are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher level features are derived from lower level features to form a hierarchical representation. They are part of the broader machine learning field of learning representations of data. Also they learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

1. Download a prebuilt Spark setup. This is needed because the Spark installation directory is read-only, and the examples we shall run need to write to the Spark folder.


2. Export the Spark home

export SPARK_HOME="$HOME/SparklingWater/spark-1.2.0-bin-hadoop2.3"

3. Run the DeepLearningDemo example from Sparkling Water. It runs DeepLearning on a subset of airlines dataset (see dataset here sparkling-water/examples/smalldata/allyears2k_headers.csv.gz).

bin/ DeepLearningDemo                                                                                                                                                          

4. In the long logs of the running job, try to see the following snippets:

Sparkling Water started, status of context:
Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
Output of jobs

===> Number of all flights via RDD#count call: 43978
===> Number of all flights via H2O#Frame#count: 43978
===> Number of flights with destination in SFO: 1331
====>Running DeepLearning on the result of SQL query

To stop the job, press Ctrl+C. Logs like the above provide a lot of information about the job. You can also try running the other algorithm implementations in the same way.

Good Luck.

Installing SparkMLlib on Linux and Running SparkMLlib implementations

SparkMLlib is a machine learning library which ships with Apache Spark and can run on any Hadoop2/YARN cluster without any pre-installation. It is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

The key features of SparkMLlib include:

1. Scalability
2. Performance
3. User-friendly APIs
4. Integration with Spark and its other components

There is nothing special about MLlib installation; it is already included in Spark. So if your machine already has Spark installed and running, there is nothing extra to do for Spark MLlib. You can follow this link to install Spark in standalone mode if not already done.

Running Logistic Regression on SparkMllib

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables, which are usually continuous, by estimating probabilities. Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types (for example, "dead" vs. "alive"). Multinomial logistic regression deals with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C").

Spark provides the 'spark-submit' script to submit jobs to the Spark cluster. The jar spark-assembly-*-cdh*-hadoop*-cdh*.jar comprises all the algorithm implementations.

We shall be now running Logistic Regression as below:

Step-1: Export the required environment variables

export JAVA_HOME='your_java_home'                                                                                            
export SPARK_HOME='your_spark_home'

Step-2: Gather the dataset to run the algorithm on

mkdir ~/SparkMLlib
cd ~/SparkMLlib/

Now that you have the data set, copy it to HDFS.

hdfs dfs -mkdir -p /user/${USER}/classification_data
hdfs dfs -put -f $HOME/SparkMLlib/sample_binary_classification_data.txt /user/${USER}/classification_data/                                                                               

Step-3: Submit the job to run Logistic Regression using the 'spark-submit' script

$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.mllib.BinaryClassification --master local[2] \
$SPARK_HOME/lib/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar --algorithm LR --regType L2 --regParam 1.0 /user/${USER}/classification_data/sample_binary_classification_data.txt

If all works fine, you should see the following at the end of a long log:

Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.                                                                                                                   

Let’s do some cleaning of your HDFS.

hdfs dfs -rm -r -skipTrash /user/${USER}/classification_data                                                          

You can run the other implementations of SparkMLlib as well in a similar fashion with the required data.

Good luck.

Wednesday, April 22, 2015

Installing H2O and Running ML Implementations of H2O

H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use.

As per its description, it intelligently combines unique features not currently found in other machine learning platforms, including:

1. Best of Breed Open Source Technology: H2O leverages the most popular open-source products like Apache Hadoop and Spark to give customers the flexibility to solve their most challenging data problems.
2. Easy-to-use WebUI and Familiar Interfaces: Set up and get started quickly using either H2O’s intuitive Web-based user interface or familiar programming environments like R, Java, Scala, Python, JSON, and through our powerful APIs.
3. Data Agnostic Support for all Common Database and File Types: Easily explore and model big data from within Microsoft Excel, R Studio, Tableau and more. Connect to data from HDFS, S3, SQL and NoSQL data sources. Install and deploy anywhere.
4. Massively Scalable Big Data Analysis: Train a model on complete data sets, not just small samples, and iterate and develop models in real-time with H2O’s rapid in-memory distributed parallel processing.
5. Real-time Data Scoring: Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment. Enjoy 10X faster scoring and predictions than the next nearest technology in the market.

Installing H2O on Linux

Installing H2O on your Linux machine (this section was tested with CentOS 6.6) is very straightforward. Follow the steps below:

#Create a local directory for installation
mkdir H2O
cd H2O
#Download the latest release of H2O
#Unzip the downloaded file
cd h2o-<version>
#Start H2O
java -jar h2o.jar

You must see a log like the below:

INFO WATER: ----- H2O started -----
INFO WATER: Build git branch: rel-noether
INFO WATER: Build git hash: 4089ab3911999c73dcb611ab2f51cfc9bb86898b
INFO WATER: Build git describe: jenkins-rel-noether-4
INFO WATER: Build project version:
INFO WATER: Built by: 'jenkins'
INFO WATER: Built on: 'Sat Feb  7 13:39:20 PST 2015'
INFO WATER: Java availableProcessors: 16
INFO WATER: Java heap totalMemory: 1.53 gb
INFO WATER: Java heap maxMemory: 22.75 gb
INFO WATER: Java version: Java 1.7.0_75 (from Oracle Corporation)
INFO WATER: OS   version: Linux 2.6.32-504.3.3.el6.x86_64 (amd64)
INFO WATER: Machine physical memory: 102.37 gb

You can access the Web UI at http://localhost:54321

Running H2O's GLM function on R

We shall be running H2O's GLM on R here. We could also have done it without R using only the Linux command line. But I found it easier this way. 

GLM stands for Generalized Linear Model, a flexible generalization of ordinary linear regression that allows for response variables with error distribution models other than the normal distribution.

If you don't have R already installed on your linux box, follow this link.

So we shall be performing couple of tasks to get GLM running on H2O.

Install H2O on R

You have installed H2O, then R; now we need to install the H2O package in R.

First, in your Linux terminal, move to your H2O setup directory:

cd /location_of_your_H2O_setup/h2o-<version>

Then open the R shell by typing "R" in your terminal and run the following command there, pointing it at the R package bundled with your H2O setup:

install.packages("location_of_your_H2O_setup/h2o-<version>", repos = NULL, type = "source")

If all went fine, congratulate yourself. You have H2O and R and H2O on R installed :-)

Running a Demo

H2O packages examples to demonstrate how its algorithm implementations work, and GLM is a part of those demos. The demo downloads data called prostate.csv from an official location on the web and uses it as input. It performs a logistic regression of prostate cancer data.

All you have to do is:

cd /location_of_your_H2O_setup/h2o-<version>

You should see logs like the below:


> # This is a demo of H2O's GLM function
> # It imports a data set, parses it, and prints a summary
> # Then, it runs GLM with a binomial link function using 10-fold cross-validation
> # Note: This demo runs H2O on localhost:54321
> library(h2o)
> localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
Successfully connected to http://localhost:54321
R is connected to H2O cluster:
    H2O cluster uptime:         1 hours 45 minutes
    H2O cluster version:
    H2O cluster name:           jayati.tiwari
    H2O cluster total nodes:    1
    H2O cluster total memory:   22.75 GB
    H2O cluster total cores:    16
    H2O cluster allowed cores:  16
    H2O cluster healthy:        TRUE
> prostate.hex = h2o.uploadFile(localH2O, path = system.file("extdata", "prostate.csv", package="h2o"), key = "prostate.hex")
  |======================================================================| 100%
> summary(prostate.hex)
 ID               CAPSULE          AGE             RACE        
 Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
 1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
 Median :190.50   Median :0.0000   Median :67.00   Median :1.000
 Mean   :190.50   Mean   :0.4026   Mean   :66.04   Mean   :1.087
 3rd Qu.:285.25   3rd Qu.:1.0000   3rd Qu.:71.00   3rd Qu.:1.000
 Max.   :380.00   Max.   :1.0000   Max.   :79.00   Max.   :2.000
 DPROS           DCAPS           PSA               VOL          
 Min.   :1.000   Min.   :1.000   Min.   :  0.300   Min.   : 0.00
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  5.000   1st Qu.: 0.00
 Median :2.000   Median :1.000   Median :  8.725   Median :14.25
 Mean   :2.271   Mean   :1.108   Mean   : 15.409   Mean   :15.81
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 17.125   3rd Qu.:26.45
 Max.   :4.000   Max.   :2.000   Max.   :139.700   Max.   :97.60
 GLEASON
 Min.   :0.000
 1st Qu.:6.000
 Median :6.000
 Mean   :6.384
 3rd Qu.:7.000
 Max.   :9.000

> prostate.glm = h2o.glm(x = c("AGE","RACE","PSA","DCAPS"), y = "CAPSULE", data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
  |======================================================================| 100%
> print(prostate.glm)
IP Address: localhost
Port      : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__ba962660a263d41ab4531103562b4422
      AGE      RACE     DCAPS       PSA Intercept
 -0.01104  -0.63136   1.31888   0.04713  -1.10896
Normalized Coefficients:
      AGE      RACE     DCAPS       PSA Intercept
 -0.07208  -0.19495   0.40972   0.94253  -0.33707
Degrees of Freedom: 379 Total (i.e. Null);  375 Residual
Null Deviance:     512.3
Residual Deviance: 461.3  AIC: 471.3
Deviance Explained: 0.09945
 Best Threshold: 0.328
Confusion Matrix:
Actual   false true   Error
  false    127  100 0.44053
  true      51  102 0.33333
  Totals   178  202 0.39737

AUC =  0.6887507 (on train)
Cross-Validation Models:
Nonzeros       AUC Deviance Explained
Model 1         4 0.6532738          0.8965221
Model 2         4 0.6316527          0.8752008
Model 3         4 0.7100840          0.8955293
Model 4         4 0.8268698          0.9099155
Model 5         4 0.6354167          0.9079152
Model 6         4 0.6888889          0.8881883
Model 7         4 0.7366071          0.9091687
Model 8         4 0.6711310          0.8917893
Model 9         4 0.7803571          0.9178481
Model 10        4 0.7435897          0.9065831
> myLabels = c(prostate.glm@model$x, "Intercept")
> plot(prostate.glm@model$coefficients, xaxt = "n", xlab = "Coefficients", ylab = "Values")
> axis(1, at = 1:length(myLabels), labels = myLabels)
> abline(h = 0, col = 2, lty = 2)
> title("Coefficients from Logistic Regression\n of Prostate Cancer Data")
> barplot(prostate.glm@model$coefficients, main = "Coefficients from Logistic Regression\n of Prostate Cancer Data")

Great! Your demo ran fine.

Starting H2O from R

Before we try running GLM from the R shell, we need to start H2O. We shall achieve this from within the R shell itself.

localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '4g')

You should see something like:

Successfully connected to http://localhost:54321
R is connected to H2O cluster:
    H2O cluster uptime:         2 hours 3 minutes 
    H2O cluster version: 
    H2O cluster name:           jayati.tiwari 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   22.75 GB 
    H2O cluster total cores:    16 
    H2O cluster allowed cores:  16 
    H2O cluster healthy:        TRUE 

This starts H2O. 

Running H2O's GLM from R

In the same R shell continue to run the GLM example now.

prostate.hex = h2o.importFile(localH2O, path = "", key = "prostate.hex")

h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)

This command should output the following on your terminal

|======================================================================| 100%
IP Address: localhost 
Port      : 54321 
Parsed Data Key: prostate.hex 
GLM2 Model Key: GLMModel__8efb9141cab4671715fc8319eae54ca8
      AGE      RACE     DCAPS       PSA Intercept 
 -0.01104  -0.63136   1.31888   0.04713  -1.10896 
Normalized Coefficients:
      AGE      RACE     DCAPS       PSA Intercept 
 -0.07208  -0.19495   0.40972   0.94253  -0.33707 
Degrees of Freedom: 379 Total (i.e. Null);  375 Residual
Null Deviance:     512.3
Residual Deviance: 461.3  AIC: 471.3
Deviance Explained: 0.09945 
 Best Threshold: 0.328
Confusion Matrix:
Actual   false true   Error
  false    127  100 0.44053
  true      51  102 0.33333
  Totals   178  202 0.39737
AUC =  0.6887507 (on train) 
Cross-Validation Models:
Nonzeros       AUC Deviance Explained
Model 1         4 0.6532738          0.8965221
Model 2         4 0.6316527          0.8752008
Model 3         4 0.7100840          0.8955293
Model 4         4 0.8268698          0.9099155
Model 5         4 0.6354167          0.9079152
Model 6         4 0.6888889          0.8881883
Model 7         4 0.7366071          0.9091687
Model 8         4 0.6711310          0.8917893
Model 9         4 0.7803571          0.9178481
Model 10        4 0.7435897          0.9065831

As you can see, you have predictions in place and the accuracy score as well.  

Hope it helped !!

Tuesday, April 21, 2015

Feature comparison of Machine Learning Libraries

Machine learning is a subfield of computer science stemming from research into artificial intelligence. It is a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model from example inputs and using that to make predictions or decisions, rather than following strictly static program instructions.

Every machine learning algorithm constitutes two phases:

1. Training Phase: When the algorithm learns from the input data and creates a model for reference.
2. Testing Phase: When the algorithm predicts the results based on what it has learned, as stored in the model.

Machine learning is categorized into:

1. Supervised Learning: In supervised learning, the model defines the effect one set of observations, called inputs, has on another set of observations, called outputs.
2. Unsupervised Learning: In unsupervised learning, all the observations are assumed to be caused by latent variables, that is, the observations are assumed to be at the end of the causal chain.

There is a wide range of machine learning libraries that provide implementations of various classes of algorithms. In my coming posts, we shall be evaluating the following open-source machine learning APIs on performance, scalability, range of algorithms provided, and extensibility.

1. H2O
2. SparkMLlib
3. Sparkling Water
4. Weka

In the following posts, we shall install each of the above libraries and run one algorithm implementation available in all of them. This will give us an insight into ease of use and execution, performance, and accuracy.

Commonly Used Hive Queries

Hello Readers, 

I feel extremely fortunate to be hearing from many of you on my personal email id, asking why I haven't blogged for so long. So, I am returning to my blog after about two years, determined to add a lot more content (which has piled up over these two years) that might be of help to anyone working on Big Data. Thanks for the motivation, friends. :) So here I go.

To start with, I chose to write about the most commonly used Hive queries. You can come back to this post any time you forget the syntax of any of them ;)

Just like in SQL, there is a common set of Hive queries that tend to be used the most. These include queries for operations like:

1. Sorting
2. Searching
3. Joins
4. Sampling
5. Calculating median
6. Calculating mean
7. Finding records from one table which don't exist in another

Assuming you already have a running Hadoop cluster with Hive set up on top of it, the following sections of the tutorial will help you understand how to perform these operations in Hive.

Data Set Preparation

Let's assume you have the following table 'my_table' in Hive:

id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
5,John,add_5

and another table 'my_join_table':

id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
5,20,5555555
Creating tables in Hive

To create tables like above in Hive, follow the steps below:

1. Create two files 'my_table.csv' and 'my_join_table.csv' and copy the respective comma separated content shown above in each file.

2. Place these files in HDFS

hdfs dfs -mkdir hive_tutorial_my_table                                                                                            
hdfs dfs -put my_table.csv hive_tutorial_my_table
hdfs dfs -mkdir hive_tutorial_my_join_table
hdfs dfs -put my_join_table.csv hive_tutorial_my_join_table

Yes, we have intentionally placed the two files in two different folders: to load a file into a Hive table using the CREATE TABLE statement, you need to specify a folder location; file paths do not work. And by default, Hive adds the data of all the files in that folder to the Hive table. Strange, right?

3. Create it!

Remember to replace the 'your_hdfs_user_name' with a valid name else you'll get a depressing error statement. ;)

create external table my_table (id int, name string, address string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_table' tblproperties ("skip.header.line.count"="1");

create external table my_join_table (id int, age int, phone string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_join_table' tblproperties ("skip.header.line.count"="1");

Cool. So now you have all that's required to try the queries.

Running the Hive Queries

Sorting in Hive

Sorting by Integer: By default, Hive sorts an integer column in the Ascending order.

Query: select id, age, phone from my_join_table order by age;                                                    

Sorting by String: And in the lexicographical order for String columns.

Query: select id, name, address from my_table order by name;                                                    

If you have a very large data set, the ORDER BY clause might take a very long time, because it pushes all data through a single reducer, which is unacceptable for large datasets. To optimize, use CLUSTER BY: it ensures each of N reducers gets a non-overlapping range of the data, and then sorts within that range at each reducer. You can go for this if you are okay with merging the multiple output files yourself.

Query: select id, name, address from my_table cluster by name;                                                      

Searching in Hive

Searching can be performed using the 'where' clause.

Query: select id, age, phone from my_join_table where age>18;                                                   
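Outside Hive, the same filter on the raw CSV could be written with awk; this is only an illustrative shell analogue, using rows in the id,age,phone layout of my_join_table:

```shell
printf '1,18,1111111\n2,19,2222222\n3,17,3333333\n5,20,5555555\n' > my_join_table.csv
# keep rows whose second field (age) is strictly greater than 18
awk -F, '$2 > 18 { print $1, $2, $3 }' my_join_table.csv
# 2 19 2222222
# 5 20 5555555
```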

Joins in Hive

Inner Join: The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.

Query: select,, a.address, b.age, from my_table a join my_join_table b on;

Left Join: The Left Outer Join keyword returns each row that satisfies the join of the first table with the second table. It also returns any rows from the first table that had no matching rows in the second table. The non-matching rows in the second table are returned as null values.

Query: select,, a.address, b.age, from my_table a left join my_join_table b on;

Right Join: The RIGHT JOIN keyword returns all rows from the second table, with the matching rows in the first table. The result is NULL in the left side when there is no match.

Query: select,, a.address, b.age, from my_table a right join my_join_table b on;

Full Outer Join: The FULL OUTER JOIN keyword returns all rows from the first table and from the second table. The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.

Query: select,, a.address, b.age, from my_table a full outer join my_join_table b on;

Sampling in Hive

Hive supports sampling by different parameters as follows:

Sampling by percentage:

Query: select id, name, address from my_table TABLESAMPLE(0.1 PERCENT);                          

Sampling by size:

Query: select id, name, address from my_table TABLESAMPLE(10M) s;                                       

Sampling by number of records:

Query: select id, name, address from my_table TABLESAMPLE(2 ROWS) s;                               

Sampling by Bucketing:

Query: create table my_join_table_bucketed(id INT, age INT, phone STRING) comment 'A bucketed copy of my_join_table' clustered by(age) into 4 buckets;
Query: insert into table my_join_table_bucketed select id, age, phone from my_join_table;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 3 OUT OF 4 ON rand()) s;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON age);

Calculating median

The percentile function, with the following syntax, can be used to calculate the median in Hive. It returns the exact pth percentile of a column in the group (it does not work with floating-point types). To calculate the median, p should be 0.5; in general, p can be any value between 0 and 1.

percentile(BIGINT col, p)                                                                                                                  

We shall use this to calculate the median of the column 'age' in 'my_join_table'.

Query: select percentile(cast(a.age as bigint), 0.5) AS age_median from my_join_table a;            
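For a quick cross-check outside Hive, the same median can be computed at the shell: sort the values and take the middle one (or the average of the middle two for an even count). For the four sample ages 18, 19, 17, 20 this gives 18.5, which is what the query above should return:

```shell
printf '18\n19\n17\n20\n' | sort -n |
awk '{ a[NR] = $1 }
     END { if (NR % 2) print a[(NR + 1) / 2];          # odd count: middle value
           else print (a[NR / 2] + a[NR / 2 + 1]) / 2  # even count: average of middle two
         }'
# prints 18.5
```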

Calculating mean

Calculating the mean of the values in a column is pretty straightforward with the help of the 'avg(col)' function.

Query: select avg(a.age) AS age_mean from my_join_table a;                                                      

Finding records from one table which don't exist in another

Now, this is the trickiest one. We intend to find those records of table A which do not exist in table B: a sort of A-B. For this, the schemas of the tables need not always be the same. The concept is simple: left-join the first table to the second, and keep only those records where the join column of the second table IS NULL. Those are exactly the rows present in the first table but missing from the second.

Add a row to my_table with an id not present in my_join_table.

Query: insert into my_table values (6, 'Jayati', 'add_6');

Query: select,, a.address from my_table a left join my_join_table b on where ( IS NULL);

Hope this tutorial gave you a good start with Hive!