Bucketing in Hive

Updated On February 12, 2021 | By Mahesh Mogal

With bucketing in Hive, we can group similar kinds of data and write each group to a single file. This gives better performance when reading data and when joining two tables, which is why bucketing is often used in conjunction with partitioning. Let us understand the details of bucketing in Hive in this article.

What is Bucketing in Hive

Bucketing is easier to explain with an example. We have already learned about partitioning. Consider our table orders, which we have partitioned on "year" and "month". That means all of the orders of one month are placed in one folder. What if we have a huge number of orders every month? Our query performance will still be slow. And creating smaller partitions, such as one per day, is not advisable as it will create a lot of small files.

This is where we can use bucketing. With bucketing, we can tell Hive to group data into a fixed number of "buckets", and Hive writes each bucket's data to a single file. When we want to retrieve that data, Hive knows which partition to check and in which bucket the data is.

(Figure: Hive bucketing diagram)

For example, for our orders table, we have specified that data should be kept in 4 buckets and grouped on the basis of order_id. Hive will then use a hash algorithm to separate the orders into 4 groups and write each group into its own file.
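Conceptually, the bucket assignment works like this (a simplified Python sketch of hash-modulo bucketing, not Hive's exact internal hash function):

```python
NUM_BUCKETS = 4

def bucket_for(order_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Hive hashes the bucketing column and takes the result modulo the
    # number of buckets; for integer columns the hash is essentially the
    # value itself. Masking keeps the result non-negative.
    return (order_id & 0x7FFFFFFF) % num_buckets

# Rows with the same order_id always land in the same bucket, so Hive
# knows exactly which file to scan when filtering or joining on order_id.
orders = [101, 102, 103, 104, 105]
assignments = {oid: bucket_for(oid) for oid in orders}
print(assignments)  # {101: 1, 102: 2, 103: 3, 104: 0, 105: 1}
```

The key property is determinism: the same key always maps to the same bucket, which is what lets Hive prune files at read time and line up buckets during joins.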

One thing to note is that in bucketing, data is written to files, whereas in partitioning, data is split into directories.


Creating Bucketed Table in Hive

Let us create a bucketed table in Hive. We can specify bucketing with the CLUSTERED BY clause, giving the number of buckets we want and the column on which we want to bucket the data.
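The DDL would look along these lines (the non-bucketing columns and storage clauses are assumed for illustration, not taken from a verified schema):

```sql
-- Sketch of a partitioned, bucketed orders table; column names other
-- than order_id and the partition columns are assumed.
CREATE TABLE orders (
    order_id INT,
    order_date STRING,
    order_status STRING
)
PARTITIONED BY (year STRING, month STRING)
CLUSTERED BY (order_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```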

You can notice the CLUSTERED BY clause in the above query. We have specified that we want 4 buckets, grouped by the "order_id" column. Let us insert data into this table.
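An insert into a bucketed table looks like any other insert; Hive handles the bucket assignment itself. A sketch, assuming a hypothetical staging table orders_staging with matching columns:

```sql
-- On older Hive versions you may need: SET hive.enforce.bucketing = true;
-- (it is enforced by default since Hive 2.x).
INSERT INTO TABLE orders PARTITION (year = '2021', month = '01')
SELECT order_id, order_date, order_status
FROM orders_staging;
```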

Once this query is completed, let us check how data is written to HDFS.
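We can list the partition directory with the HDFS CLI; the warehouse path below is the common default and may differ on your cluster:

```
# Each bucket is written as its own file, typically named 000000_0
# through 000003_0 for a 4-bucket table.
hdfs dfs -ls /user/hive/warehouse/orders/year=2021/month=01
```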

And there are our 4 files in the partition directory, representing our 4 buckets. There are a few things we can observe when using bucketing.

  • We will have the data of each bucket in a separate file, unlike partitioning, which only creates directories.
  • If we insert new data into this table, Hive will create 4 new files and add data to them.
  • When we insert data into a bucketed table, the number of reducers will be a multiple of the number of buckets of that table.
  • We do not need to specify anything different in the query while inserting data into a bucketed table.

Using Only Bucketing for Hive Table

Though bucketing is used with partitioning most of the time, it is a standalone feature. We can have a table that is only bucketed. This works similarly to a partitioned bucketed table.

A bucketed-only table is useful when we do not have a good column to make partitions on; bucketing alone can still improve the read performance of such a table.
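A bucketed-only table simply drops the PARTITIONED BY clause. A sketch, with an assumed customers table for illustration:

```sql
-- Bucketed but not partitioned; table and column names are assumed.
CREATE TABLE customers (
    customer_id INT,
    customer_name STRING
)
CLUSTERED BY (customer_id) INTO 4 BUCKETS;
```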

Conclusion

We have covered the basics of bucketing in this article; we will revisit it for advanced concepts of bucketing in a future article. With partitioning and bucketing, we can improve the performance of Hive queries, so I urge you to understand these concepts thoroughly and use them while modeling data. See you in the next blog :).


