H2O is an open source predictive analytics platform. Unlike traditional analytics tools, H2O provides a combination of extraordinary math and high performance parallel processing with unrivaled ease of use.
As per its description, it intelligently combines unique features not currently found in other machine learning platforms, including:
1. Best-of-Breed Open Source Technology: H2O leverages the most popular open-source products, like Apache Hadoop and Spark, to give customers the flexibility to solve their most challenging data problems.
2. Easy-to-use Web UI and Familiar Interfaces: Set up and get started quickly using either H2O's intuitive web-based user interface or familiar programming environments like R, Java, Scala, Python, and JSON, and through its powerful APIs.
3. Data-Agnostic Support for All Common Database and File Types: Easily explore and model big data from within Microsoft Excel, RStudio, Tableau, and more. Connect to data from HDFS, S3, SQL, and NoSQL data sources. Install and deploy anywhere.
4. Massively Scalable Big Data Analysis: Train models on complete data sets, not just small samples, and iterate and develop models in real time with H2O's rapid in-memory distributed parallel processing.
5. Real-time Data Scoring: Use the Nanofast Scoring Engine to score data against models for accurate predictions in just nanoseconds in any environment, with up to 10x faster scoring and predictions than the nearest competing technology.
Installing H2O on Linux
Installing H2O on your Linux machine (this section was tested with CentOS 6.6) is very straightforward. Follow the steps below:
# Create a local directory for installation
mkdir H2O
cd H2O

# Download the latest release of H2O
wget http://h2o-release.s3.amazonaws.com/h2o/rel-noether/4/h2o-2.8.4.4.zip

# Unzip the downloaded file
unzip h2o-2.8.4.4.zip
cd h2o-2.8.4.4

# Start H2O
java -jar h2o.jar
You should see a log like the following:
INFO WATER: ----- H2O started -----
INFO WATER: Build git branch: rel-noether
INFO WATER: Build git hash: 4089ab3911999c73dcb611ab2f51cfc9bb86898b
INFO WATER: Build git describe: jenkins-rel-noether-4
INFO WATER: Build project version: 2.8.4.4
INFO WATER: Built by: 'jenkins'
INFO WATER: Built on: 'Sat Feb 7 13:39:20 PST 2015'
INFO WATER: Java availableProcessors: 16
INFO WATER: Java heap totalMemory: 1.53 gb
INFO WATER: Java heap maxMemory: 22.75 gb
INFO WATER: Java version: Java 1.7.0_75 (from Oracle Corporation)
INFO WATER: OS version: Linux 2.6.32-504.3.3.el6.x86_64 (amd64)
INFO WATER: Machine physical memory: 102.37 gb
You can access the Web UI at http://localhost:54321
Running H2O's GLM Function from R
We shall run H2O's GLM from R here. It could also be done without R, using only the Linux command line, but I found this way easier.
GLM stands for Generalized Linear Model, a flexible generalization of ordinary linear regression that allows response variables to have error distributions other than the normal distribution.
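Concretely, a binomial GLM (logistic regression) models the log-odds of the response as a linear function of the predictors, and the inverse logit link turns that back into a probability. As a rough illustration (plain Python, not H2O), the sketch below applies the link function to the coefficients that the demo later in this post reports for the prostate data; the example patient's values are made up.

```python
import math

# Coefficients reported by the H2O GLM demo later in this post
coef = {"AGE": -0.01104, "RACE": -0.63136, "DCAPS": 1.31888, "PSA": 0.04713}
intercept = -1.10896

# Hypothetical patient (values chosen for illustration only)
patient = {"AGE": 65, "RACE": 1, "DCAPS": 1, "PSA": 10}

# Linear predictor: intercept + sum of coefficient * feature value (the log-odds)
eta = intercept + sum(coef[k] * patient[k] for k in coef)

# Inverse logit link turns the log-odds into a probability
p = 1.0 / (1.0 + math.exp(-eta))
print(round(p, 3))  # predicted probability of CAPSULE = 1
```

The same arithmetic is what the model applies to every row at scoring time.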
If you don't have R already installed on your Linux box, follow this link.
So we shall perform a couple of tasks to get GLM running on H2O.
Install H2O on R
You have installed H2O and R; now we need to install the H2O package in R.
Open the R shell by typing "R" in your terminal and then enter the following commands there.
install.packages("RCurl")
install.packages("rjson")
install.packages("statmod")
install.packages("survival")
q()
Now, in your Linux terminal, type:
cd /location_of_your_H2O_setup/h2o-2.8.4.4
R
install.packages("location_of_your_H2O_setup/h2o-2.8.4.4/R/h2o_2.8.4.4.tar.gz", repos = NULL, type = "source")
library(h2o)
q()
If all went fine, congratulate yourself. You now have H2O, R, and H2O on R installed :-)
Running a Demo
H2O packages examples that demonstrate how its algorithm implementations work, and GLM is part of those demos. The demo downloads a data set called prostate.csv from an authorized location on the web and uses it as input, performing logistic regression on the prostate cancer data.
All you have to do is:
cd /location_of_your_H2O_setup/h2o-2.8.4.4
R
demo(h2o.glm)
You should see logs like the following:
> demo(h2o.glm)

        demo(h2o.glm)
        ---- ~~~~~~~

> # This is a demo of H2O's GLM function
> # It imports a data set, parses it, and prints a summary
> # Then, it runs GLM with a binomial link function using 10-fold cross-validation
> # Note: This demo runs H2O on localhost:54321
> library(h2o)
> localH2O = h2o.init(ip = "localhost", port = 54321, startH2O = TRUE)
Successfully connected to http://localhost:54321
R is connected to H2O cluster:
    H2O cluster uptime:        1 hours 45 minutes
    H2O cluster version:       2.8.4.4
    H2O cluster name:          jayati.tiwari
    H2O cluster total nodes:   1
    H2O cluster total memory:  22.75 GB
    H2O cluster total cores:   16
    H2O cluster allowed cores: 16
    H2O cluster healthy:       TRUE
> prostate.hex = h2o.uploadFile(localH2O, path = system.file("extdata", "prostate.csv", package = "h2o"), key = "prostate.hex")
  |======================================================================| 100%
> summary(prostate.hex)
       ID              CAPSULE            AGE             RACE
 Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
 1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
 Median :190.50   Median :0.0000   Median :67.00   Median :1.000
 Mean   :190.50   Mean   :0.4026   Mean   :66.04   Mean   :1.087
 3rd Qu.:285.25   3rd Qu.:1.0000   3rd Qu.:71.00   3rd Qu.:1.000
 Max.   :380.00   Max.   :1.0000   Max.   :79.00   Max.   :2.000
     DPROS           DCAPS            PSA               VOL
 Min.   :1.000   Min.   :1.000   Min.   :  0.300   Min.   : 0.00
 1st Qu.:1.000   1st Qu.:1.000   1st Qu.:  5.000   1st Qu.: 0.00
 Median :2.000   Median :1.000   Median :  8.725   Median :14.25
 Mean   :2.271   Mean   :1.108   Mean   : 15.409   Mean   :15.81
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 17.125   3rd Qu.:26.45
 Max.   :4.000   Max.   :2.000   Max.   :139.700   Max.   :97.60
    GLEASON
 Min.   :0.000
 1st Qu.:6.000
 Median :6.000
 Mean   :6.384
 3rd Qu.:7.000
 Max.   :9.000
> prostate.glm = h2o.glm(x = c("AGE","RACE","PSA","DCAPS"), y = "CAPSULE", data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
  |======================================================================| 100%
> print(prostate.glm)
IP Address: localhost
Port      : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__ba962660a263d41ab4531103562b4422

Coefficients:
     AGE      RACE     DCAPS       PSA Intercept
-0.01104  -0.63136   1.31888   0.04713  -1.10896

Normalized Coefficients:
     AGE      RACE     DCAPS       PSA Intercept
-0.07208  -0.19495   0.40972   0.94253  -0.33707

Degrees of Freedom: 379 Total (i.e. Null); 375 Residual
Null Deviance:     512.3
Residual Deviance: 461.3  AIC: 471.3
Deviance Explained: 0.09945
Best Threshold: 0.328

Confusion Matrix:
        Predicted
Actual   false true   Error
 false     127  100 0.44053
 true       51  102 0.33333
 Totals    178  202 0.39737

AUC = 0.6887507 (on train)

Cross-Validation Models:
          Nonzeros       AUC Deviance Explained
Model 1          4 0.6532738          0.8965221
Model 2          4 0.6316527          0.8752008
Model 3          4 0.7100840          0.8955293
Model 4          4 0.8268698          0.9099155
Model 5          4 0.6354167          0.9079152
Model 6          4 0.6888889          0.8881883
Model 7          4 0.7366071          0.9091687
Model 8          4 0.6711310          0.8917893
Model 9          4 0.7803571          0.9178481
Model 10         4 0.7435897          0.9065831
> myLabels = c(prostate.glm@model$x, "Intercept")
> plot(prostate.glm@model$coefficients, xaxt = "n", xlab = "Coefficients", ylab = "Values")
> axis(1, at = 1:length(myLabels), labels = myLabels)
> abline(h = 0, col = 2, lty = 2)
> title("Coefficients from Logistic Regression\n of Prostate Cancer Data")
> barplot(prostate.glm@model$coefficients, main = "Coefficients from Logistic Regression\n of Prostate Cancer Data")
Great! Your demo ran fine.
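The demo reports AUC (area under the ROC curve), which can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. As a rough illustration of what that number means (plain Python on made-up scores, not H2O):

```python
from itertools import product

# Made-up predicted scores and true labels (1 = positive class)
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a win
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.75
```

Here one of the four positive/negative pairs is ranked wrongly, so AUC = 0.75; an AUC of 0.69 like the demo's means the model ranks cases better than chance (0.5) but far from perfectly (1.0).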
Starting H2O from R
Before we try running GLM from the R shell, we need to start H2O. We shall achieve this from within the R shell itself.
R
library(h2o)
localH2O <- h2o.init(ip = "localhost", port = 54321, startH2O = TRUE, max_mem_size = "4g")
You should see something like:
Successfully connected to http://localhost:54321
R is connected to H2O cluster:
H2O cluster uptime: 2 hours 3 minutes
H2O cluster version: 2.8.4.4
H2O cluster name: jayati.tiwari
H2O cluster total nodes: 1
H2O cluster total memory: 22.75 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 16
H2O cluster healthy: TRUE
This starts H2O.
Running H2O's GLM from R
In the same R shell, continue with the GLM example:
prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
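In this call, nfolds = 10 requests 10-fold cross-validation, and alpha = 0.5 is the elastic-net mixing parameter: 1 gives a pure lasso (L1) penalty, 0 a pure ridge (L2) penalty. As a rough sketch of the glmnet-style penalty being blended (plain Python, coefficient values made up):

```python
# Elastic-net penalty in the glmnet-style parameterization:
#   alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2
# alpha = 0.5 mixes the lasso and ridge penalties equally.
beta = [1.0, -2.0]   # made-up coefficient vector
alpha = 0.5

l1 = sum(abs(b) for b in beta)    # lasso term: 3.0
l2 = sum(b * b for b in beta)     # squared L2 norm: 5.0
penalty = alpha * l1 + (1 - alpha) / 2 * l2
print(penalty)  # 2.75
```

The lasso part drives small coefficients to exactly zero (feature selection), while the ridge part shrinks correlated coefficients together; alpha = 0.5 asks for some of both.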
These commands should produce output like the following on your terminal:
|======================================================================| 100%
IP Address: localhost
Port : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__8efb9141cab4671715fc8319eae54ca8
Coefficients:
AGE RACE DCAPS PSA Intercept
-0.01104 -0.63136 1.31888 0.04713 -1.10896
Normalized Coefficients:
AGE RACE DCAPS PSA Intercept
-0.07208 -0.19495 0.40972 0.94253 -0.33707
Degrees of Freedom: 379 Total (i.e. Null); 375 Residual
Null Deviance: 512.3
Residual Deviance: 461.3 AIC: 471.3
Deviance Explained: 0.09945
Best Threshold: 0.328
Confusion Matrix:
Predicted
Actual false true Error
false 127 100 0.44053
true 51 102 0.33333
Totals 178 202 0.39737
AUC = 0.6887507 (on train)
Cross-Validation Models:
Nonzeros AUC Deviance Explained
Model 1 4 0.6532738 0.8965221
Model 2 4 0.6316527 0.8752008
Model 3 4 0.7100840 0.8955293
Model 4 4 0.8268698 0.9099155
Model 5 4 0.6354167 0.9079152
Model 6 4 0.6888889 0.8881883
Model 7 4 0.7366071 0.9091687
Model 8 4 0.6711310 0.8917893
Model 9 4 0.7803571 0.9178481
Model 10 4 0.7435897 0.9065831
As you can see, you now have the model coefficients, the confusion matrix, and the AUC scores in place.
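The Error column of the confusion matrix is just the per-row misclassification rate, which we can verify from the counts above (plain Python; the numbers are copied from the output):

```python
# Confusion matrix counts from the GLM output above
#              Predicted
# Actual     false  true
row_false = [127, 100]   # actual false: 127 correct, 100 misclassified
row_true  = [51, 102]    # actual true:  51 misclassified, 102 correct

err_false = row_false[1] / sum(row_false)   # 100 / 227
err_true  = row_true[0] / sum(row_true)     # 51 / 153
err_total = (row_false[1] + row_true[0]) / (sum(row_false) + sum(row_true))

print(round(err_false, 5), round(err_true, 5), round(err_total, 5))
# 0.44053 0.33333 0.39737
```

These match the Error column line for line, confirming how the table is read.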
Hope it helped!