Bucketing in Hive

With Bucketing in Hive, we can group similar kinds of data and write it to one single file. This allows better performance while reading data & when joining two tables. That is why bucketing is often used in conjunction with partitioning. Let us understand the details of Bucketing in Hive in this article.

What is Bucketing in Hive

Bucketing is easier to explain with an example. We have already learned about partitioning. Consider our table orders which we have partitioned on “year” and “month”. That means all of the orders of one month are placed in one folder. What if we have a huge number of orders every month? Our query performance will still be slow. And creating smaller partitions like on Day is not advisable as it will create a lot of small files.

This is where we can use bucketing. With bucketing, we can tell hive group data in few “Buckets”. Hive writes that data in a single file. And when we want to retrieve that data, hive knows which partition to check and in which bucket that data is.

For example, for our orders table, we have specified to keep data in 4 buckets and this data should be grouped on basis of order it then hive will create 4 files and use Hash Algorithm to separate orders in 4 groups and write them into 4 files.

One thing to note is, in bucketing data is written to files. Where as in partitioning, data is split in to directories.
Hive Facts

Creating Bucketed Table in Hive

Let us create a bucketed table in a hive. We can specify bucketing with CLUSTER BY clause. We can specify the number of buckets that we want and also on which column we want to create bucketing.

CREATE TABLE order_partitioned_bucketed (
   order_id INT,
    order_date STRING,
    order_status STRING
)
PARTITIONED BY (
    year STRING,
    month STRING
    )
CLUSTERED BY (order_id ) INTO 4 BUCKETS ;

CREATE TABLE order_partitioned_bucketed (

order_id INT,

order_date STRING,

order_status STRING

)

PARTITIONED BY (

year STRING,

month STRING

)

CLUSTERED BY (order_id ) INTO 4 BUCKETS ;

You can notice CLUSTERED BY clause in above query. We have specified that we want 4 buckets which should be grouped by “order_id” column. Let us insert data in to this table.

INSERT OVERWRITE TABLE order_partitioned_bucketed
PARTITION (year, month)
SELECT order_id, order_date, order_status
,SUBSTR(order_date,1,4) as year
,SUBSTR(order_date, 6,2) as month
from orders;

INSERT OVERWRITE TABLE order_partitioned_bucketed

PARTITION (year, month)

SELECT order_id, order_date, order_status

,SUBSTR(order_date,1,4) as year

,SUBSTR(order_date, 6,2) as month

from orders;

Once this query is completed, let us check how data is written to HDFS.

hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07;
Found 4 items
-rwxrwxrwx   2 maheshmogal hdfs      14207 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000000_0
-rwxrwxrwx   2 maheshmogal hdfs      14248 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000001_0
-rwxrwxrwx   2 maheshmogal hdfs      14226 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000002_0
-rwxrwxrwx   2 maheshmogal hdfs      14265 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000003_0

hive (maheshmogal)> dfs -ls /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07;

Found 4 items

-rwxrwxrwx 2 maheshmogal hdfs 14207 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000000_0

-rwxrwxrwx 2 maheshmogal hdfs 14248 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000001_0

-rwxrwxrwx 2 maheshmogal hdfs 14226 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000002_0

-rwxrwxrwx 2 maheshmogal hdfs 14265 2020-06-17 01:16 /apps/hive/warehouse/maheshmogal.db/order_partitioned_bucketed/year=2013/month=07/000003_0

And there are our 4 files in partitioned directory representing our 4 buckets. There are a few things we can observe when using bucketing.

We will have data of each bucket in a separate file, unlike partitioning which only creates directories.
If we insert new data into this table, the Hive will create 4 new files and add data to it.
When we insert data into a bucketed table, the number of reducers will be in multiple of the number of buckets of that table.
We do not need to specify anything different in query while inserting data into a bucketed table.

Using Only Bucketing for Hive Table

Though bucketing is used with partitioning most of the time, it is a standalone feature. We can have a table which is only bucketed. This works similarly to the partitioned bucketed table.

We can use a simple bucketed table when we do not have a good column to make partitions on. We can use a bucketed table to improve the performance of such a table with bucketing.

Conclusion

We have covered the basics of bucketing in this article, we will revisit it for advanced concepts of bucketing in a future article. With partitioning and bucketing, we can improve the performance of Hive queries. So I will urge you to understand these concepts thoroughly and use them while modeling data. See you in the next blog :).

What is Bucketing in Hive

Creating Bucketed Table in Hive

Using Only Bucketing for Hive Table

Conclusion

Alter Table Partitions in Hive

Hive – What is the difference between Collect Set and Collect List

Create, Alter, Delete Tables in Hive

Inserting Data In Hive Partitioned tables

Hive – Advanced Aggregations with Grouping sets, Rollup and cube

Partitioning in Hive

What is Bucketing in Hive

Creating Bucketed Table in Hive

Using Only Bucketing for Hive Table

Conclusion

Similar Posts