Bucketing in Hive

With Bucketing in Hive, we can group similar kinds of data and write it to one single file. This allows better performance while reading data & when joining two tables. That is why bucketing is often used in conjunction with partitioning. Let us understand the details of Bucketing in Hive in this article.

What is Bucketing in Hive

Bucketing is easier to explain with an example. We have already learned about partitioning. Consider our table orders which we have partitioned on “year” and “month”. That means all of the orders of one month are placed in one folder. What if we have a huge number of orders every month? Our query performance will still be slow. And creating smaller partitions like on Day is not advisable as it will create a lot of small files.

This is where we can use bucketing. With bucketing, we can tell hive group data in few “Buckets”. Hive writes that data in a single file. And when we want to retrieve that data, hive knows which partition to check and in which bucket that data is.

Hive Bucketing
Hive Bucketing Diagram

For example, for our orders table, we have specified to keep data in 4 buckets and this data should be grouped on basis of order it then hive will create 4 files and use Hash Algorithm to separate orders in 4 groups and write them into 4 files.

One thing to note is, in bucketing data is written to files. Where as in partitioning, data is split in to directories.

Hive Facts

Creating Bucketed Table in Hive

Let us create a bucketed table in a hive. We can specify bucketing with CLUSTER BY clause. We can specify the number of buckets that we want and also on which column we want to create bucketing.

You can notice CLUSTERED BY clause in above query. We have specified that we want 4 buckets which should be grouped by “order_id” column. Let us insert data in to this table.

Once this query is completed, let us check how data is written to HDFS.

And there are our 4 files in partitioned directory representing our 4 buckets. There are a few things we can observe when using bucketing.

  • We will have data of each bucket in a separate file, unlike partitioning which only creates directories.
  • If we insert new data into this table, the Hive will create 4 new files and add data to it.
  • When we insert data into a bucketed table, the number of reducers will be in multiple of the number of buckets of that table.
  • We do not need to specify anything different in query while inserting data into a bucketed table.

Using Only Bucketing for Hive Table

Though bucketing is used with partitioning most of the time, it is a standalone feature. We can have a table which is only bucketed. This works similarly to the partitioned bucketed table.

We can use a simple bucketed table when we do not have a good column to make partitions on. We can use a bucketed table to improve the performance of such a table with bucketing.

Conclusion

We have covered the basics of bucketing in this article, we will revisit it for advanced concepts of bucketing in a future article. With partitioning and bucketing, we can improve the performance of Hive queries. So I will urge you to understand these concepts thoroughly and use them while modeling data. See you in the next blog :).

Similar Posts

Leave a Reply

Your email address will not be published.

2 Comments

  1. Hi Mahesh, Just found your Blog & read some articles. You have explained everything in such a simple manner with code also. I have been searching for this kind of a blog from some time.
    Actually, I am preparing for Big Data Interviews & there are some concepts/topics which I have searched on google/youtube but haven’t found satisfactory explanations.
    I have a request, can you please write a post on –
    a) How to handle Data Skewedness using Key Salting or other methods.
    b) How to load 100 tables out of 500 tables using SQOOP.