Jayati Tiwari: Common Functions in R

Google "What's R" and you'll find it defined in many ways. As I understand it, R is first of all a programming language. Secondly, it is designed primarily for statistical computing and graphics; it's not a general-purpose programming language like Java. Now the question is, what comes under statistical computing and graphics? Wikipedia describes statistical computing as the interface between the mathematical science of statistics and computer science, and statistics as the study of the collection, analysis, interpretation, presentation, and organization of data. So basically, R is a programming language dedicated to the collection, analysis, interpretation, presentation, and organization of data. And now you know why it is talked about so much: in this Big Data Yug (meaning era), data is everything for everyone.

Assuming you have R installed on your Linux/Windows machine, I want to showcase a few common operations that you will almost always need. I have also written about how to perform all of these operations using Hive in one of my previous posts. By the way, if you don't have R installed, follow this link (http://cran.r-project.org/doc/manuals/r-devel/R-admin.html) to get it done.

1. Sorting in R
2. Searching in R
3. Joins in R
4. Sampling in R
5. Calculating median in R
6. Calculating mean in R
7. Finding elements from one vector which don't exist in another vector in R

Open a new terminal on your machine and type 'R' to open an R shell. All the operations that follow will need the R shell.

Sorting a Vector in R

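The sort() function in R takes a vector as input and returns it sorted, in ascending order by default. It does not sort data frames; to sort the rows of a data frame, use order() on the column you want to sort by.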

#Create a vector
> a <- c(2, 6, 8, 1, 3, 7, 3, 60, 32)
#Display the vector
> a
[1]  2  6  8  1  3  7  3 60 32
#Sort the vector
> sort(a)
[1]  1  2  3  3  6  7  8 32 60
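
Two variations come up often: sorting in descending order, and sorting the rows of a data frame. Here is a minimal sketch of both. The data frame 'df' and its columns are made up purely for illustration; they are not part of the example above.

```r
# Same vector as in the example above
a <- c(2, 6, 8, 1, 3, 7, 3, 60, 32)

# Sort in descending order
desc <- sort(a, decreasing = TRUE)
print(desc)

# order() returns the positions that would put the vector in sorted order
idx <- order(a)
print(idx)

# Hypothetical data frame, used only for illustration
df <- data.frame(name = c("Ram", "Shyam", "Sita"), age = c(18, 19, 17))

# Reorder the rows by the 'age' column
df_sorted <- df[order(df$age), ]
print(df_sorted)
```

Indexing a data frame with the result of order() is what actually sorts its rows; sort() is only meant for vectors.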

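Searching for an Element in a Vector in R

The grep() function needs two parameters: the first is the pattern to search for and the second is the vector to search in. It returns the positions of the matching elements. Keep in mind that grep() matches patterns rather than exact values, so a search for 60 would also match elements such as 160 or 600.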

#Search for 60 in the vector 'a'
> grep(60, a)
[1] 8
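
Because grep() works on patterns, an exact-value search is often safer with which(), match() or %in%. The sketch below compares them; the vector 'b' is a made-up example added only to show the partial-match behaviour.

```r
# Same vector as in the example above
a <- c(2, 6, 8, 1, 3, 7, 3, 60, 32)

# Exact match: every position where the element equals 60
pos_exact <- which(a == 60)
print(pos_exact)

# match() returns the position of the first exact match (NA if absent)
pos_first <- match(3, a)
print(pos_first)

# %in% only answers whether the value is present at all
has_60 <- 60 %in% a
print(has_60)

# Hypothetical vector: grep() also hits values that merely contain "60"
b <- c(60, 160, 600, 6)
pos_pattern <- grep(60, b)
print(pos_pattern)
```

In the last call grep() reports positions 1, 2 and 3, because "60", "160" and "600" all contain the pattern "60".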

Joining Files in R

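Joining in R is done on data frames. First the CSV files are read into variables as data frames, and then the 'merge' function joins them. The only condition is that the two data frames share at least one column whose values can be used to join them. By default, merge() joins on all the columns the two data frames have in common and keeps only the rows whose key appears in both (an inner join).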

#Exit the R shell and, in your terminal, create two CSV files as shown below:

vi ~/my_join_table.csv

#Paste the following content
id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555

vi ~/my_table.csv

#Paste the following content
id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5

#Reopen the R shell and load the two CSV files into two variables. Replace "user_name" with your username.
mydata1 = read.csv("/home/user_name/my_join_table.csv", header=TRUE)
mydata2 = read.csv("/home/user_name/my_table.csv", header=TRUE)

#Merge the two data frames on their common column, id
myfulldata = merge(mydata1, mydata2)

#Display the merged data. The rows with id 4 and id 6 are dropped because each appears in only one of the two files.
myfulldata
  id age   phone  name address
1  1  18 1111111   Ram   add_1
2  2  19 2222222 Shyam   add_2
3  3  17 3333333  Sita   add_3
4  5  20 5555555  John   add_5
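
If you also want to keep the rows that have no partner in the other table, merge() accepts the 'all', 'all.x' and 'all.y' arguments. The sketch below rebuilds the two tables inline with read.csv(text = ...) so that it runs without any files on disk. The variable names 'ages' and 'people' are my own and not part of the example above.

```r
# Rebuild the two tables inline so the sketch runs without files on disk
ages <- read.csv(text = "id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555")

people <- read.csv(text = "id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5")

# Inner join: only ids present in both tables (the default behaviour)
inner_join <- merge(ages, people, by = "id")

# Left join: every row of 'ages'; missing names and addresses become NA
left_join <- merge(ages, people, by = "id", all.x = TRUE)

# Full outer join: every id from either table
full_join <- merge(ages, people, by = "id", all = TRUE)

print(inner_join)
print(left_join)
print(full_join)
```

Naming the key explicitly with by = "id" is a defensive habit: without it, merge() joins on every column name the two data frames share, which can surprise you when the tables gain extra columns later.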

Sampling in R

The 'sample' function has the following syntax:

sample(x, size, replace = FALSE, prob = NULL)
OR
sample.int(n, size = n, replace = FALSE, prob = NULL)

Arguments:

x: Either a vector of one or more elements from which to choose, or a positive integer, in which case the sample is drawn from 1:x.
n: a positive number, the number of items to choose from.
size: a non-negative integer giving the number of items to choose.
replace: Should sampling be with replacement?
prob: A vector of probability weights for obtaining the elements of the vector being sampled.

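> x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

#Draw 5 elements without replacement. The result is random, so your output will differ.
#The value 4 can show up twice because x contains it twice.
> sample(x, 5, replace = FALSE, prob = NULL)
[1] 7 2 4 4 1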
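
Since sample() is random, it helps to fix the seed when you need the same draw again. Here is a small sketch of that, together with sampling with replacement and with probability weights. The seed value and the weights are arbitrary choices made only for illustration.

```r
x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

# Fixing the seed makes the "random" draw repeatable
set.seed(42)
first_draw <- sample(x, 5)
set.seed(42)
second_draw <- sample(x, 5)
print(identical(first_draw, second_draw))

# With replacement, size may exceed the length of x
with_repl <- sample(x, 12, replace = TRUE)
print(with_repl)

# Illustrative weights: only 55 and 77 have a non-zero chance of being picked
weights <- c(0, 0, 0, 0, 0, 0, 0.5, 0.5, 0)
weighted <- sample(x, 4, replace = TRUE, prob = weights)
print(weighted)
```

No outputs are listed for the draws themselves because they depend on the seed and on the R version.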

Calculating median in R

The 'median' function returns the median of the elements in the vector.

> median(x)
[1] 7

Calculating mean in R

Similarly, the 'mean' function calculates the mean of all the elements of a vector.

> mean(x)
[1] 18.77778
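
Real data often has missing values, and both functions return NA as soon as the vector contains one. The sketch below shows the 'na.rm' argument that skips them, plus a trimmed mean, which is less affected by large values such as 55 and 77. The vector 'x_na' and the trim fraction are my own additions for illustration.

```r
x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

# Hypothetical vector with one missing value appended
x_na <- c(x, NA)

# Without na.rm the result is NA
mean_with_na <- mean(x_na)
print(mean_with_na)

# na.rm = TRUE drops the missing values before computing
mean_clean <- mean(x_na, na.rm = TRUE)
median_clean <- median(x_na, na.rm = TRUE)
print(mean_clean)
print(median_clean)

# Trimmed mean: drop the lowest and highest 20% of values before averaging
trimmed <- mean(x, trim = 0.2)
print(trimmed)
```

The gap between the median (7) and the mean (18.77778) above is itself a hint that a few large values are pulling the mean upwards.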

Find what's in one vector and not in another using R

This operation is the set difference A - B and can be accomplished using the setdiff() function, which takes two vectors and returns the elements of the first that do not appear in the second.

> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 4)
> setdiff(x, y)
[1] 1
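
Note that setdiff() is not symmetric: the order of the arguments matters. A short sketch with a slightly different 'y' (I added the value 5 so that both directions return something), along with the related intersect() and union() functions:

```r
x <- c(1, 2, 3, 4)
# y is modified from the example above: 5 is added for illustration
y <- c(2, 3, 4, 5)

# Elements of x that are not in y
only_in_x <- setdiff(x, y)
print(only_in_x)

# Elements of y that are not in x
only_in_y <- setdiff(y, x)
print(only_in_y)

# Elements present in both vectors
common <- intersect(x, y)
print(common)

# Elements present in either vector, without duplicates
either <- union(x, y)
print(either)
```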

Hope that helps!

Jayati Tiwari

Monday, April 27, 2015