MSCK Repair - Fixing Partitions in Hive Table

Updated On August 11, 2020 | By Mahesh Mogal

We know we can add extra partitions using ALTER TABLE command to the Hive table. But what if there is a need and we need to add 100s of partitions? What if we are pointing our external table to already partitioned data in HDFS? Do we add each partition manually using a query? Enter the MSCK REPAIR query. Using it we can fix broken partition in the Hive table. Let us learn how we can use it.

MSCK REPAIR - Adding All Partitions from Directory

Consider the below example. we have already partitioned data in year and month for orders. ( Or this could be placed where each day logs are getting dumped and you need to pint logs table here). Now we are creating an external table and pointing to this location. would we see partitions directly in our new table? No, we won't.

Partitioned Data in HDFS
Partitioned Data in HDFS
CREATE EXTERNAL TABLE order_partition_extrenal (
    order_id INT,
    order_date STRING,
    order_status STRING
)
PARTITIONED BY (
    year STRING,
    month STRING
    )
LOCATION '/apps/hive/warehouse/maheshmogal.db/order_partition';

hive (maheshmogal)> show partitions order_partition_extrenal;;
OK
Time taken: 0.501 seconds
hive (maheshmogal)> 

No partitions. You can say that its easy. we can add each partition using alter command right? You wont be wrong. But there is better way.

We can MSCK REPAIR command. This goes to the directory where the table is pointing to and then creates a tree of directories and subdirectories, check table metadata, and adds all missing partitions. Sounds like magic is not it? This command saves a lot of time as we do not need to add each partition manually. Let us see it in action.

hive (maheshmogal)> MSCK REPAIR TABLE order_partition_extrenal;
OK
Partitions not in metastore:    order_partition_extrenal:year=2013/month=07     
order_partition_extrenal:year=2013/month=08     order_partition_extrenal:year=2013/month=09     
order_partition_extrenal:year=2013/month=10     order_partition_extrenal:year=2013/month=11     
order_partition_extrenal:year=2013/month=12     order_partition_extrenal:year=2014/month=01     
order_partition_extrenal:year=2014/month=02     order_partition_extrenal:year=2014/month=03     
order_partition_extrenal:year=2014/month=04     order_partition_extrenal:year=2014/month=05     
order_partition_extrenal:year=2014/month=06     order_partition_extrenal:year=2014/month=07
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=07
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=08
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=09
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=10
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=11
Repair: Added partition to metastore order_partition_extrenal:year=2013/month=12
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=01
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=02
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=03
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=04
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=05
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=06
Repair: Added partition to metastore order_partition_extrenal:year=2014/month=07
Time taken: 1.462 seconds, Fetched: 14 row(s)

You can see that once we ran this query on our table, it has gone through all folders and added partitions to our table metadata. We can now check our partitions.

hive (maheshmogal)> show partitions order_partition_extrenal;;
OK
year=2013/month=07
year=2013/month=08
year=2013/month=09
year=2013/month=10
year=2013/month=11
year=2013/month=12
year=2014/month=01
year=2014/month=02
year=2014/month=03
year=2014/month=04
year=2014/month=05
year=2014/month=06
year=2014/month=07
Time taken: 0.466 seconds, Fetched: 13 row(s)

Yeyyy. we have all of our partitions showing up in our table. And all it took is one single command.

Adding New Partitions to Table

MSCK REPAIR can also add new partitions to already existing table. I have created new directory under this location with year=2019 and month=11. Let us run MSCK query and see if it adds that entry to our table.

dfs -mkdir -p /apps/hive/warehouse/maheshmogal.db/order_partition/year=2019/month=11;

MSCK REPAIR TABLE order_partition_extrenal;
OK
Partitions not in metastore:    order_partition_extrenal:year=2019/month=11
Repair: Added partition to metastore order_partition_extrenal:year=2019/month=11
Time taken: 0.698 seconds, Fetched: 2 row(s)

show partitions order_partition_extrenal;
OK
year=2013/month=07
year=2013/month=08
year=2013/month=09
year=2013/month=10
year=2013/month=11
year=2013/month=12
year=2014/month=01
year=2014/month=02
year=2014/month=03
year=2014/month=04
year=2014/month=05
year=2014/month=06
year=2014/month=07
year=2019/month=11
Time taken: 0.464 seconds, Fetched: 14 row(s)

Well yes it has added new partition to our table. So should we forget ALTER TABLE command and use MSCK query when we want to add single partitions as well?

No, MSCK REPAIR is a resource-intensive query. It needs to traverses all subdirectories. Maintain that structure and then check table metadata if that partition is already present or not and add an only new partition. This is overkill when we want to add an occasional one or two partitions to the table. We should use an ALTER TABLE query in such cases.

MSCK REPAIR is a resource-intensive query and using it to add single partition is not recommended especially when you huge number of partitions.

Hive Facts

Conclusion

MSCK REPAIR is a useful command and it had saved a lot of time for me. We can easily create tables on already partitioned data and use MSCK REPAIR to get all of its partitions metadata. Like most things in life, it is not a perfect thing and we should not use it when we need to add 1-2 partitions to the table. I hope This will help you. See you next article.

msck repair hive
.

Mahesh Mogal

I am passionate about Cloud, Data Analytics, Machine Learning, and Artificial Intelligence. I like to learn and try out new things. I have started blogging about my experience while learning these exciting technologies.

Stay Updated with Latest Blogs

Get latest blogs delivered to your mail directly.

Recent Posts

Select Expr in Spark Dataframe

In this blog, we will learn how to use select and expr in the Spark data frame. We will learn multiple use cases along with selectExpr.

Read More
Add, Rename, Drop Columns in Spark Dataframe

We will go through common column operations like add, rename, list, select, and dropping a column from spark dataframe.

Read More
MSCK Repair - Fixing Partitions in Hive Table

We will learn how to add multiple partitions to hive table using msck repair table command in hive.

msck repair hive
Read More

Leave a Reply

Your email address will not be published. Required fields are marked *

linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram
Share via
Copy link
Powered by Social Snap