Google "What's R" and you'll see that R has been defined in many ways. As I understand it, R is first of all a programming language. Secondly, it is built primarily for statistical computing and graphics; it is not meant to be a general-purpose language in the way Java is. Now the question is, "What all comes under statistical computing and graphics?" Wikipedia describes statistical computing as the interface between the mathematical science of statistics and computer science, and statistics as the study of the collection, analysis, interpretation, presentation, and organization of data. So basically, R is a programming language dedicated to the collection, analysis, interpretation, presentation, and organization of data. And now you know why it is talked about so much: in this Big Data Yug (meaning era), data is everything for everyone.
Assuming you have R installed on your Linux/Windows machine, I'd like to showcase a few common operations that you will almost always need. I have also written about how to perform all these operations using Hive in one of my previous posts. But hey, if you don't have R installed, follow this link (http://cran.r-project.org/doc/manuals/r-devel/R-admin.html) to get it done.
1. Sorting in R
2. Searching in R
3. Joins in R
4. Sampling in R
5. Calculating median in R
6. Calculating mean in R
7. Finding elements of one vector which don't exist in another vector in R
Open a new terminal on your machine and type 'R' to open an R shell. All the operations that follow will need the R shell.
Sorting a Vector in R
#Create a vector
> a <- c(2,6,8,1,3,7,3,60,32)
#Display the vector
> a
[1]  2  6  8  1  3  7  3 60 32
#Sort the vector
> sort(a)
[1]  1  2  3  3  6  7  8 32 60
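Two variations you will probably want sooner or later are sorting in descending order and getting the positions of the sorted elements instead of the values. Here is a small sketch using the same vector 'a'; the variable names desc and idx are mine, chosen only for illustration.

```r
a <- c(2, 6, 8, 1, 3, 7, 3, 60, 32)

# Sort from largest to smallest
desc <- sort(a, decreasing = TRUE)
print(desc)

# order() returns the positions that would sort the vector,
# so a[idx] gives the same result as sort(a)
idx <- order(a)
print(idx)
print(a[idx])
```

order() is the one to reach for when you need to sort one vector (or the rows of a data frame) by the values of another.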
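Searching for an element in a Vector in R
#Search for 60 in the vector 'a'. The result is the position of the match.
#Note: grep() matches patterns, so use which(a == 60) when you need an exact match.
> grep(60, a)
[1] 8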
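grep() works for 60 here, but it compares the values as text, which can surprise you with numbers. The sketch below, on the same vector 'a', shows where the pattern match and an exact comparison disagree; the variable names are only for illustration.

```r
a <- c(2, 6, 8, 1, 3, 7, 3, 60, 32)

# grep() treats the values as text, so the pattern 3 also matches 32
grep_hits <- grep(3, a)
print(grep_hits)    # positions 5, 7 and 9

# which() with == compares the values exactly
exact_hits <- which(a == 3)
print(exact_hits)   # positions 5 and 7

# match() returns the first position only; %in% returns TRUE or FALSE
first_hit <- match(60, a)
has_60 <- 60 %in% a
print(first_hit)
print(has_60)
```

For numeric vectors, which() or match() is usually the safer choice; grep() is better kept for character data.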
Joining Files in R
#Exit the R shell and, in your terminal, create two CSV files as shown below:
vi ~/my_join_table.csv
#Paste the following content
id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555
vi ~/my_table.csv
#Paste the following content
id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5
#Reopen the R shell and load the two CSV files into two variables. Replace "user_name" with your username.
mydata1 <- read.csv("/home/user_name/my_join_table.csv", header = TRUE)
mydata2 <- read.csv("/home/user_name/my_table.csv", header = TRUE)
#Merge the two files on their common column 'id'. By default only the ids present in both files are kept, so 4 and 6 are dropped.
myfulldata <- merge(mydata1, mydata2)
#Display the merged data
myfulldata
  id age   phone  name address
1  1  18 1111111   Ram   add_1
2  2  19 2222222 Shyam   add_2
3  3  17 3333333  Sita   add_3
4  5  20 5555555  John   add_5
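The merge above behaves like an inner join. If you also want the rows that have no partner in the other file, merge() takes the all, all.x and all.y arguments. Here is a minimal sketch with the same data typed inline (via read.csv(text = ...)) so that it runs without the two files; the names inner, left and full are mine.

```r
# Same rows as the two CSV files, loaded from inline text
mydata1 <- read.csv(text = "id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555")

mydata2 <- read.csv(text = "id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5")

# Inner join: only ids found in both tables
inner <- merge(mydata1, mydata2, by = "id")

# Left join: every row of mydata1; name and address are NA for id 6
left <- merge(mydata1, mydata2, by = "id", all.x = TRUE)

# Full outer join: every id from either table
full <- merge(mydata1, mydata2, by = "id", all = TRUE)

print(inner)
print(left)
print(full)
```

Naming the key with by = "id" is optional here, since 'id' is the only shared column, but it makes the intent clear when the tables share more than one column name.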
Sampling in R
sample(x, size, replace = FALSE, prob = NULL)
OR
sample.int(n, size = n, replace = FALSE, prob = NULL)
Arguments:
x: Either a vector of one or more elements from which to choose, or a positive integer.
n: a positive number, the number of items to choose from.
size: a non-negative integer giving the number of items to choose.
replace: Should sampling be with replacement?
prob: A vector of probability weights for obtaining the elements of the vector being sampled.
> x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)
#Draw 5 elements without replacement. The result is random, so your output will differ.
> sample(x, 5, replace = FALSE, prob = NULL)
[1] 7 2 4 4 1
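Since every call to sample() gives a different answer, it helps to know how to make a draw repeatable and what the replace and prob arguments do in practice. A small sketch follows; the seed value 42 and the weights in w are arbitrary choices of mine.

```r
x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

# Fixing the seed before each draw makes the "random" result repeatable
set.seed(42)
first <- sample(x, 5)
set.seed(42)
second <- sample(x, 5)
print(identical(first, second))   # TRUE

# With replacement, size may be larger than length(x)
with_repl <- sample(x, 12, replace = TRUE)
print(with_repl)

# prob takes one weight per element; here the last element is favoured
w <- c(rep(1, 8), 10)
weighted <- sample(x, 3, prob = w)
print(weighted)
```

Without replace = TRUE, asking for more elements than the vector holds is an error.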
Calculating median in R
> median(x)
[1] 7
Calculating mean in R
> mean(x)
[1] 18.77778
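Notice how far apart the two numbers are for the same vector: the two large values (55 and 77) pull the mean up, while the median stays put. Real data also tends to have missing values, which both functions need to be told about. A short sketch, where x_na is a made-up vector with one NA appended:

```r
x <- c(4, 7, 2, 4, 9, 10, 55, 77, 1)

m <- mean(x)
med <- median(x)
print(m)
print(med)

# A single missing value makes the result NA ...
x_na <- c(x, NA)
print(mean(x_na))

# ... unless the NAs are dropped with na.rm = TRUE
m_clean <- mean(x_na, na.rm = TRUE)
med_clean <- median(x_na, na.rm = TRUE)
print(m_clean)
print(med_clean)
```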
Find what's in one vector and not in another using R
> x <- c(1,2,3,4)
> y <- c(2,3,4)
#Elements of x that are not in y
> setdiff(x, y)
[1] 1
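One thing to keep in mind is that setdiff() is not symmetric, and that it drops duplicates. The sketch below uses slightly different vectors from the ones above (I added a 5 to y and made up z) just to make both points visible.

```r
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)   # 5 added for illustration

# The order of the arguments matters
only_x <- setdiff(x, y)
only_y <- setdiff(y, x)
print(only_x)   # 1
print(only_y)   # 5

# setdiff() returns unique values; %in% keeps duplicates
z <- c(1, 1, 2)
unique_missing <- setdiff(z, y)
all_missing <- z[!z %in% y]
print(unique_missing)   # 1
print(all_missing)      # 1 1
```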
Hope that helps!