Hello Readers,
I feel extremely fortunate that many of you have written to my personal email asking why I haven't blogged in so long. So I am returning to this blog after about two years, determined to add a lot more content (which has piled up over these two years) that might help anyone working on Big Data. Thanks for the motivation, friends. :) So here I go.
To start with, I chose to write up the most commonly used Hive queries. You can come back to this post any time you forget the syntax of any of them ;)
Just like in SQL, there is a common set of Hive queries that tend to be used the most. These include queries for operations like:
1. Sorting
2. Searching
3. Joins
4. Sampling
5. Calculating median
6. Calculating mean
7. Finding records from one table which don't exist in another
Assuming you have a running Hadoop cluster with Hive already set up on top of it, the following sections of this tutorial show how to perform each of these operations in Hive.
Data Set Preparation
Let's assume you have the following tables and data in Hive:
'my_table'
id,name,address
1,Ram,add_1
2,Shyam,add_2
3,Sita,add_3
4,Ali,add_4
5,John,add_5
and another table 'my_join_table':
id,age,phone
1,18,1111111
2,19,2222222
3,17,3333333
6,23,4444444
5,20,5555555
Creating tables in Hive
To create tables like the ones above in Hive, follow the steps below:
1. Create two files, 'my_table.csv' and 'my_join_table.csv', and copy the respective comma-separated content shown above into each file.
2. Place these files in HDFS
hdfs dfs -mkdir hive_tutorial_my_table
hdfs dfs -put my_table.csv hive_tutorial_my_table
hdfs dfs -mkdir hive_tutorial_my_join_table
hdfs dfs -put my_join_table.csv hive_tutorial_my_join_table
Yes, we have intentionally placed the two files in two different folders. When you point a Hive table at its data through the location clause of a create table statement, you have to specify a folder; file paths do not work. By default, Hive then reads the data of all the files in that folder into the table. Strange, right?
3. Create it!
Remember to replace 'your_hdfs_user_name' with a valid user name, or else you'll get a depressing error. ;)
create external table my_table (id int, name string, address string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_table' tblproperties ("skip.header.line.count"="1");
create external table my_join_table (id int, age int, phone string) row format delimited fields terminated by ',' lines terminated by '\n' location '/user/your_hdfs_user_name/hive_tutorial_my_join_table' tblproperties ("skip.header.line.count"="1");
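Before moving on, a quick sanity check can save some head-scratching. This is only a minimal sketch, assuming both create statements above succeeded: describing a table and reading a few rows should confirm that the header line was skipped and the columns were parsed as expected.

```sql
-- Show the column names and types of the new table
describe my_table;

-- Read a few rows; the header line "id,name,address" should not appear
select id, name, address from my_table limit 5;

-- Same check for the second table
select id, age, phone from my_join_table limit 5;
```

If a column comes back as NULL for every row, the field delimiter or the column type in the create statement is the first thing to check.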
Cool. So now you have all that's required to try the queries.
Running the Hive Queries
Sorting in Hive
Sorting by Integer: By default, Hive sorts an integer column in ascending order.
Query: select id, age, phone from my_join_table order by age;
Sorting by String: String columns are sorted in lexicographical order by default.
Query: select id, name, address from my_table order by name;
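To reverse the default order, ORDER BY also accepts the DESC keyword, and it can take more than one column. A small sketch on the same sample tables (the choice of columns here is just for illustration):

```sql
-- Oldest first: sort the integer column in descending order
select id, age, phone from my_join_table order by age desc;

-- Reverse lexicographical order on name, breaking ties by id
select id, name, address from my_table order by name desc, id asc;
```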
If you have a very large data set, the ORDER BY clause might take a very long time, because it pushes all the data through a single reducer to produce a total order. To optimize, use CLUSTER BY. CLUSTER BY sends all rows with the same value of the column to the same one of N reducers, and each reducer then sorts its own share. The output is therefore sorted within each reducer's file but not across files, so go for this only if you are okay with merging the multiple output files yourself.
Query: select id, name, address from my_table cluster by name;
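CLUSTER BY is shorthand for DISTRIBUTE BY plus SORT BY on the same column. Writing the two clauses separately is handy when you want to distribute on one column and sort on another, or sort in descending order. A hedged sketch on the sample tables (the column choices are only illustrative):

```sql
-- Same effect as "cluster by name"
select id, name, address from my_table distribute by name sort by name;

-- Distribute rows by id, but sort each reducer's share by age, oldest first
select id, age, phone from my_join_table distribute by id sort by age desc;
```

As with CLUSTER BY, the result is ordered only within each reducer's output, not globally.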
Searching in Hive
Searching can be performed using the 'where' clause.
Query: select id, age, phone from my_join_table where age>18;
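The 'where' clause takes more than simple comparisons. A few other common search patterns, sketched against the sample tables (the specific values are picked just for illustration):

```sql
-- Pattern match: names starting with 'S' (Shyam and Sita in the sample data)
select id, name, address from my_table where name like 'S%';

-- Range check: ages from 18 to 20, both ends included
select id, age, phone from my_join_table where age between 18 and 20;

-- Match against a fixed list of ids
select id, name, address from my_table where id in (1, 3, 5);
```

Note that LIKE is case-sensitive, so 's%' would not match the same rows.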
Joins in Hive
Inner Join: The JOIN (INNER JOIN) keyword selects rows from both tables only where there is a match between the join columns in both tables.
Query: select a.id, a.name, a.address, b.age, b.phone from my_table a join my_join_table b on a.id=b.id;
Left Join: The LEFT OUTER JOIN keyword returns each row of the first table that satisfies the join with the second table. It also returns any rows from the first table that have no matching rows in the second table; for those rows, the columns of the second table are returned as NULL.
Query: select a.id, a.name, a.address, b.age, b.phone from my_table a left join my_join_table b on a.id=b.id;
Right Join: The RIGHT JOIN keyword returns all rows from the second table, together with the matching rows from the first table. The columns of the first (left) table are NULL when there is no match.
Query: select a.id, a.name, a.address, b.age, b.phone from my_table a right join my_join_table b on a.id=b.id;
Full Outer Join: The FULL OUTER JOIN keyword returns all rows from the first table and from the second table, combining the results of both LEFT and RIGHT joins. Where a row has no match in the other table, that table's columns are NULL.
Query: select a.id, a.name, a.address, b.age, b.phone from my_table a full outer join my_join_table b on a.id=b.id;
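Outer joins leave NULLs wherever one side has no match, including in the id column itself. One way to tidy that up, as a rough sketch, is to wrap the columns in coalesce. The defaults 'unknown' and -1 below are arbitrary placeholders chosen for this example, not anything Hive requires:

```sql
-- Full outer join with a single id column and defaults for the missing side
select coalesce(a.id, b.id)        as id,
       coalesce(a.name, 'unknown') as name,
       coalesce(b.age, -1)         as age
from my_table a
full outer join my_join_table b on a.id = b.id;
```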
Sampling in Hive
Hive supports sampling by different parameters as follows:
Sampling by percentage:
Query: select id, name, address from my_table TABLESAMPLE(0.1 PERCENT);
Sampling by size:
Query: select id, name, address from my_table TABLESAMPLE(10M) s;
Sampling by number of records:
Query: select id, name, address from my_table TABLESAMPLE(2 ROWS) s;
Sampling by Bucketing:
Query: create table my_join_table_bucketed(id INT, age INT, phone STRING) comment 'A bucketed copy of my_join_table' clustered by(age) into 4 buckets;
Query: insert into table my_join_table_bucketed select id, age, phone from my_join_table;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 3 OUT OF 4 ON rand()) s;
Query: SELECT * FROM my_join_table_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON age);
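If all you need is a fixed number of random rows from a table that is not bucketed, a commonly used idiom is to order the rows by rand() and keep the first few. This is only a sketch of that idiom, not one of the TABLESAMPLE forms above:

```sql
-- Simple version: total ordering by a random value, so a single reducer does the sort
select id, name, address from my_table order by rand() limit 2;

-- Spreads the work over several reducers before the final limit is applied
select id, name, address from my_table distribute by rand() sort by rand() limit 2;
```

Both queries read the whole table, so on very large data the TABLESAMPLE variants are usually the cheaper choice.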
Calculating median
The percentile function, with the syntax shown below, can be used to calculate the median in Hive. It returns the exact pth percentile of an integer column in the group (it does not work with floating point types). To calculate the median, p should be 0.5; for other percentiles, p can be any value between 0 and 1.
percentile(BIGINT col, p)
We shall use this to calculate the median of the column 'age' in 'my_join_table':
Query: select percentile(cast(a.age as bigint), 0.5) AS age_median from my_join_table a;
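Two related forms are worth knowing, sketched here on the same column. percentile also accepts an array of p values, and percentile_approx gives an approximate answer that works on floating point columns too. The cast to double below is only there to illustrate the floating point case:

```sql
-- Several exact percentiles in one pass; returns an array of doubles
select percentile(cast(a.age as bigint), array(0.25, 0.5, 0.75)) AS age_quartiles from my_join_table a;

-- Approximate median, usable on floating point columns
select percentile_approx(cast(a.age as double), 0.5) AS age_median_approx from my_join_table a;
```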
Calculating mean
Calculating the mean of the values in a column is pretty straightforward with the help of the function 'avg(col)':
Query: select avg(a.age) AS age_mean from my_join_table a;
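avg sits comfortably next to the other aggregate functions, so a small summary of a column can be pulled in one query. A minimal sketch (the aliases are made up for this example):

```sql
-- Row count, smallest, largest and rounded mean age in one pass
select count(*)             AS num_rows,
       min(a.age)           AS age_min,
       max(a.age)           AS age_max,
       round(avg(a.age), 2) AS age_mean
from my_join_table a;
```

With the five sample rows, the mean works out to (18 + 19 + 17 + 23 + 20) / 5 = 19.4.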
Finding records from one table which don't exist in another
Now, this is the trickiest one. We intend to find those records of table A which do not exist in table B, sort of A-B. For this, the schemas of the two tables need not be the same; they only need a common column to join on. The concept is very simple: we left outer join the first table to the second and keep only those records where the join column of the second table IS NULL. This way we get those records which are in the first table and not in the second.
Add a row to my_table with an id that is not in my_join_table.
Query: insert into my_table values (7, 'Jayati', 'add_7');
Query: select a.id, a.name, a.address from my_table a left outer join my_join_table b on a.id=b.id where b.id IS NULL;
With the sample data, this returns the rows with ids 4 and 7.
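Another way to express A-B is a NOT EXISTS subquery, which some readers may find easier to follow than a join. This is a sketch under the assumption that you are on a Hive version with subquery support in the WHERE clause (0.13 or later):

```sql
-- Rows of my_table whose id has no counterpart in my_join_table
select a.id, a.name, a.address
from my_table a
where not exists (select 1 from my_join_table b where b.id = a.id);
```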
I hope this tutorial gives you a good start with Hive!