In this post, I will try to explain how to create partitioned tables in Hive and the logic behind it. Partitioned tables in Hive are commanly used to improve query response times and manage the data in a better view. When a Hive table is partitioned according to column that is used to filter the data, query will only reach a subset of a table in the hadoop file system and perform better. We will walk through an example to understand the dynamics behind the partitioning.
We have two tables : movie and movierating. What we want to do here is to find the average rating of the movies that are released in 2000. Basically, our hiveql looks like below.
hive> select mv.name,avg(mvr.rating) from movie mv join movierating mvr on mv.id=mvr.movieid and mv.year=2000 group by mv.name;
Our result set returns at 27 seconds. Now we are going to create a partitioned table in hive to see the performance improvement. Let’s create an empty partitioned table.
hive> CREATE TABLE partitioned_movies(id INT, name STRING) PARTITIONED BY (year int) STORED AS TEXTFILE;
As we all know, when we create a table in hive, it actually creates a folder under ‘/user/hive/warehouse/’ with the table name to store the data in hdfs file system. Let’s check out to see if we actually have the folder. Let’s type the command below.
hadoop fs -ls /user/hive/warehouse/
When we also ‘ls’ the movie folder, we can see the parts of the file created by map tasks. However, we can not see any files created under the partitioned_movies folder as we did not put any data in it yet.
Now, we are actullay going to put some data into our partitioned table to see how it works behind the scenes. But before we start to insert data to our partitioned table, i would like to change a parameter in my enviroment to enable dynamic partition creation. If I do not change this parameter, I would have to filter my inserted data according to the partition that i want to work with. This would require a lot of unnecessary work for us.
Let’s run the commands below to change settings and make the insert.
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE partitioned_movies PARTITION(year) SELECT id,name,year from movie;
After the insert process is complete, we need to see folders under the partitioned_movies folder with year_info. Let’s run the ‘ls’ command again to check.
So far we have created a partitioned hive table and seen how it works. Now, it is time to check if it would really give us any performance improvements. Let’s run the query again but this time using our partitioned movie table.
hive> select mv.name,avg(mvr.rating) from partitioned_movies mv join movierating mvr on mv.id=mvr.movieid and mv.year=2000 group by mv.name;
Result set returns almost five seconds faster than the previous query. My data is not big enough to have a dramatic change in time but the results still proved that partitioning in hive tables would allow you to increase performance in some queries. It also allows you to take advantage of some other utilities that comes with partitioning such as partition exchanging and changing the file format of a specific partition. We are not going go through them in this post.