Bucketing in Hive

Updated On August 11, 2020 | By Mahesh Mogal

With Bucketing in Hive, we can group similar kinds of data and write it to one single file. This allows better performance while reading data & when joining two tables. That is why bucketing is often used in conjunction with partitioning. Let us understand the details of Bucketing in Hive in this article.

What is Bucketing in Hive

Bucketing is easier to explain with an example. We have already learned about partitioning. Consider our table orders which we have partitioned on "year" and "month". That means all of the orders of one month are placed in one folder. What if we have a huge number of orders every month? Our query performance will still be slow. And creating smaller partitions like on Day is not advisable as it will create a lot of small files.

This is where we can use bucketing. With bucketing, we can tell hive group data in few "Buckets". Hive writes that data in a single file. And when we want to retrieve that data, hive knows which partition to check and in which bucket that data is.

Hive Bucketing
Hive Bucketing Diagram

For example, for our orders table, we have specified to keep data in 4 buckets and this data should be grouped on basis of order it then hive will create 4 files and use Hash Algorithm to separate orders in 4 groups and write them into 4 files.

One thing to note is, in bucketing data is written to files. Where as in partitioning, data is split in to directories.

Hive Facts

Creating Bucketed Table in Hive

Let us create a bucketed table in a hive. We can specify bucketing with CLUSTER BY clause. We can specify the number of buckets that we want and also on which column we want to create bucketing.

CREATE TABLE order_partitioned_bucketed (
   order_id INT,
    order_date STRING,
    order_status STRING
)
PARTITIONED BY (
    year STRING,
    month STRING
    )
CLUSTERED BY (order_id ) INTO 4 BUCKETS ;

You can notice CLUSTERED BY clause in above query. We have specified that we want 4 buckets which should be grouped by "order_id" column. Let us insert data in to this table.

INSERT OVERWRITE TABLE order_partitioned_bucketed
PARTITION (year, month)
SELECT order_id, order_date, order_status
,SUBSTR(order_date,1,4) as year
,SUBSTR(order_date, 6,2) as month
from orders;

Once this query is completed, let us check how data is written to HDFS.

hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07;
Found 4 items
-rwxrwxrwx   2 maheshmogal hdfs      14207 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000000_0
-rwxrwxrwx   2 maheshmogal hdfs      14248 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000001_0
-rwxrwxrwx   2 maheshmogal hdfs      14226 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000002_0
-rwxrwxrwx   2 maheshmogal hdfs      14265 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000003_0

And there are our 4 files in partitioned directory representing our 4 buckets. There are a few things we can observe when using bucketing.

  • We will have data of each bucket in a separate file, unlike partitioning which only creates directories.
  • If we insert new data into this table, the Hive will create 4 new files and add data to it.
  • When we insert data into a bucketed table, the number of reducers will be in multiple of the number of buckets of that table.
  • We do not need to specify anything different in query while inserting data into a bucketed table.

Using Only Bucketing for Hive Table

Though bucketing is used with partitioning most of the time, it is a standalone feature. We can have a table which is only bucketed. This works similarly to the partitioned bucketed table.

We can use a simple bucketed table when we do not have a good column to make partitions on. We can use a bucketed table to improve the performance of such a table with bucketing.

Conclusion

We have covered the basics of bucketing in this article, we will revisit it for advanced concepts of bucketing in a future article. With partitioning and bucketing, we can improve the performance of Hive queries. So I will urge you to understand these concepts thoroughly and use them while modeling data. See you in the next blog :).

bucketing in hive
.

Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.

Stay Updated with Latest Blogs

Get latest blogs delivered to your mail directly.

Recent Posts

Select Expr in Spark Dataframe

In this blog, we will learn how to use select and expr in the Spark data frame. We will learn multiple use cases along with selectExpr.

Read More
Add, Rename, Drop Columns in Spark Dataframe

We will go through common column operations like add, rename, list, select, and dropping a column from spark dataframe.

Read More
MSCK Repair - Fixing Partitions in Hive Table

We will learn how to add multiple partitions to hive table using msck repair table command in hive.

msck repair hive
Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram
Share via
Copy link
Powered by Social Snap