MSCK Repair – Fixing Partitions in Hive Table

We know we can add extra partitions using ALTER TABLE command to the Hive table. But what if there is a need and we need to add 100s of partitions? What if we are pointing our external table to already partitioned data in HDFS? Do we add each partition manually using a query? Enter the MSCK REPAIR query. Using it we can fix broken partition in the Hive table. Let us learn how we can use it.

MSCK REPAIR – Adding All Partitions from Directory

Consider the below example. we have already partitioned data in year and month for orders. ( Or this could be placed where each day logs are getting dumped and you need to pint logs table here). Now we are creating an external table and pointing to this location. would we see partitions directly in our new table? No, we won’t.

Partitioned Data in HDFS
Partitioned Data in HDFS

No partitions. You can say that its easy. we can add each partition using alter command right? You wont be wrong. But there is better way.

We can MSCK REPAIR command. This goes to the directory where the table is pointing to and then creates a tree of directories and subdirectories, check table metadata, and adds all missing partitions. Sounds like magic is not it? This command saves a lot of time as we do not need to add each partition manually. Let us see it in action.

You can see that once we ran this query on our table, it has gone through all folders and added partitions to our table metadata. We can now check our partitions.

Yeyyy. we have all of our partitions showing up in our table. And all it took is one single command.

Adding New Partitions to Table

MSCK REPAIR can also add new partitions to already existing table. I have created new directory under this location with year=2019 and month=11. Let us run MSCK query and see if it adds that entry to our table.

Well yes it has added new partition to our table. So should we forget ALTER TABLE command and use MSCK query when we want to add single partitions as well?

No, MSCK REPAIR is a resource-intensive query. It needs to traverses all subdirectories. Maintain that structure and then check table metadata if that partition is already present or not and add an only new partition. This is overkill when we want to add an occasional one or two partitions to the table. We should use an ALTER TABLE query in such cases.

MSCK REPAIR is a resource-intensive query and using it to add single partition is not recommended especially when you huge number of partitions.

Hive Facts

Conclusion

MSCK REPAIR is a useful command and it had saved a lot of time for me. We can easily create tables on already partitioned data and use MSCK REPAIR to get all of its partitions metadata. Like most things in life, it is not a perfect thing and we should not use it when we need to add 1-2 partitions to the table. I hope This will help you. See you next article.

Similar Posts

Leave a Reply

Your email address will not be published.