Bucketing in Hive
With Bucketing in Hive, we can group similar kinds of data and write it to one single file. This allows better performance while reading data & when joining two tables. That is why bucketing is often used in conjunction with partitioning. Let us understand the details of Bucketing in Hive in this article.
What is Bucketing in Hive
Bucketing is easier to explain with an example. We have already learned about partitioning. Consider our table orders which we have partitioned on “year” and “month”. That means all of the orders of one month are placed in one folder. What if we have a huge number of orders every month? Our query performance will still be slow. And creating smaller partitions like on Day is not advisable as it will create a lot of small files.
This is where we can use bucketing. With bucketing, we can tell hive group data in few “Buckets”. Hive writes that data in a single file. And when we want to retrieve that data, hive knows which partition to check and in which bucket that data is.
For example, for our orders table, we have specified to keep data in 4 buckets and this data should be grouped on basis of order it then hive will create 4 files and use Hash Algorithm to separate orders in 4 groups and write them into 4 files.
One thing to note is, in bucketing data is written to files. Where as in partitioning, data is split in to directories.
Hive Facts
Creating Bucketed Table in Hive
Let us create a bucketed table in a hive. We can specify bucketing with CLUSTER BY clause. We can specify the number of buckets that we want and also on which column we want to create bucketing.
1 2 3 4 5 6 7 8 9 10 |
CREATE TABLE order_partitioned_bucketed ( order_id INT, order_date STRING, order_status STRING ) PARTITIONED BY ( year STRING, month STRING ) CLUSTERED BY (order_id ) INTO 4 BUCKETS ; |
You can notice CLUSTERED BY clause in above query. We have specified that we want 4 buckets which should be grouped by “order_id” column. Let us insert data in to this table.
1 2 3 4 5 6 |
INSERT OVERWRITE TABLE order_partitioned_bucketed PARTITION (year, month) SELECT order_id, order_date, order_status ,SUBSTR(order_date,1,4) as year ,SUBSTR(order_date, 6,2) as month from orders; |
Once this query is completed, let us check how data is written to HDFS.
1 2 3 4 5 6 |
hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07; Found 4 items -rwxrwxrwx 2 maheshmogal hdfs 14207 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000000_0 -rwxrwxrwx 2 maheshmogal hdfs 14248 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000001_0 -rwxrwxrwx 2 maheshmogal hdfs 14226 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000002_0 -rwxrwxrwx 2 maheshmogal hdfs 14265 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000003_0 |
And there are our 4 files in partitioned directory representing our 4 buckets. There are a few things we can observe when using bucketing.
- We will have data of each bucket in a separate file, unlike partitioning which only creates directories.
- If we insert new data into this table, the Hive will create 4 new files and add data to it.
- When we insert data into a bucketed table, the number of reducers will be in multiple of the number of buckets of that table.
- We do not need to specify anything different in query while inserting data into a bucketed table.
Using Only Bucketing for Hive Table
Though bucketing is used with partitioning most of the time, it is a standalone feature. We can have a table which is only bucketed. This works similarly to the partitioned bucketed table.
We can use a simple bucketed table when we do not have a good column to make partitions on. We can use a bucketed table to improve the performance of such a table with bucketing.
Conclusion
We have covered the basics of bucketing in this article, we will revisit it for advanced concepts of bucketing in a future article. With partitioning and bucketing, we can improve the performance of Hive queries. So I will urge you to understand these concepts thoroughly and use them while modeling data. See you in the next blog :).
Hi Mahesh, Just found your Blog & read some articles. You have explained everything in such a simple manner with code also. I have been searching for this kind of a blog from some time.
Actually, I am preparing for Big Data Interviews & there are some concepts/topics which I have searched on google/youtube but haven’t found satisfactory explanations.
I have a request, can you please write a post on –
a) How to handle Data Skewedness using Key Salting or other methods.
b) How to load 100 tables out of 500 tables using SQOOP.
Hello Inderjeet, Thanks for your kind words. This will help me write more useful blogs. I am working on spark blogs so you will soon see update on your request.