MSCK Repair – Fixing Partitions in Hive Table
We know we can add extra partitions using ALTER TABLE command to the Hive table. But what if there is a need and we need to add 100s of partitions? What if we are pointing our external table to already partitioned data in HDFS? Do we add each partition manually using a query? Enter the MSCK REPAIR query. Using it we can fix broken partition in the Hive table. Let us learn how we can use it.
MSCK REPAIR – Adding All Partitions from Directory
Consider the below example. we have already partitioned data in year and month for orders. ( Or this could be placed where each day logs are getting dumped and you need to pint logs table here). Now we are creating an external table and pointing to this location. would we see partitions directly in our new table? No, we won’t.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CREATE EXTERNAL TABLE order_partition_extrenal ( order_id INT, order_date STRING, order_status STRING ) PARTITIONED BY ( year STRING, month STRING ) LOCATION '/apps/hive/warehouse/maheshmogal.db/order_partition'; hive (maheshmogal)> show partitions order_partition_extrenal;; OK Time taken: 0.501 seconds |
No partitions. You can say that its easy. we can add each partition using alter command right? You wont be wrong. But there is better way.
We can MSCK REPAIR command. This goes to the directory where the table is pointing to and then creates a tree of directories and subdirectories, check table metadata, and adds all missing partitions. Sounds like magic is not it? This command saves a lot of time as we do not need to add each partition manually. Let us see it in action.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
hive (maheshmogal)> MSCK REPAIR TABLE order_partition_extrenal; OK Partitions not in metastore: order_partition_extrenal:year=2013/month=07 order_partition_extrenal:year=2013/month=08 order_partition_extrenal:year=2013/month=09 order_partition_extrenal:year=2013/month=10 order_partition_extrenal:year=2013/month=11 order_partition_extrenal:year=2013/month=12 order_partition_extrenal:year=2014/month=01 order_partition_extrenal:year=2014/month=02 order_partition_extrenal:year=2014/month=03 order_partition_extrenal:year=2014/month=04 order_partition_extrenal:year=2014/month=05 order_partition_extrenal:year=2014/month=06 order_partition_extrenal:year=2014/month=07 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=07 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=08 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=09 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=10 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=11 Repair: Added partition to metastore order_partition_extrenal:year=2013/month=12 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=01 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=02 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=03 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=04 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=05 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=06 Repair: Added partition to metastore order_partition_extrenal:year=2014/month=07 Time taken: 1.462 seconds, Fetched: 14 row(s) |
You can see that once we ran this query on our table, it has gone through all folders and added partitions to our table metadata. We can now check our partitions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
hive (maheshmogal)> show partitions order_partition_extrenal;; OK year=2013/month=07 year=2013/month=08 year=2013/month=09 year=2013/month=10 year=2013/month=11 year=2013/month=12 year=2014/month=01 year=2014/month=02 year=2014/month=03 year=2014/month=04 year=2014/month=05 year=2014/month=06 year=2014/month=07 Time taken: 0.466 seconds, Fetched: 13 row(s) |
Yeyyy. we have all of our partitions showing up in our table. And all it took is one single command.
Adding New Partitions to Table
MSCK REPAIR can also add new partitions to already existing table. I have created new directory under this location with year=2019 and month=11. Let us run MSCK query and see if it adds that entry to our table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
dfs -mkdir -p /apps/hive/warehouse/maheshmogal.db/order_partition/year=2019/month=11; MSCK REPAIR TABLE order_partition_extrenal; OK Partitions not in metastore: order_partition_extrenal:year=2019/month=11 Repair: Added partition to metastore order_partition_extrenal:year=2019/month=11 Time taken: 0.698 seconds, Fetched: 2 row(s) show partitions order_partition_extrenal; OK year=2013/month=07 year=2013/month=08 year=2013/month=09 year=2013/month=10 year=2013/month=11 year=2013/month=12 year=2014/month=01 year=2014/month=02 year=2014/month=03 year=2014/month=04 year=2014/month=05 year=2014/month=06 year=2014/month=07 year=2019/month=11 Time taken: 0.464 seconds, Fetched: 14 row(s) |
Well yes it has added new partition to our table. So should we forget ALTER TABLE command and use MSCK query when we want to add single partitions as well?
No, MSCK REPAIR is a resource-intensive query. It needs to traverses all subdirectories. Maintain that structure and then check table metadata if that partition is already present or not and add an only new partition. This is overkill when we want to add an occasional one or two partitions to the table. We should use an ALTER TABLE query in such cases.
MSCK REPAIR is a resource-intensive query and using it to add single partition is not recommended especially when you huge number of partitions.
Hive Facts
Conclusion
MSCK REPAIR is a useful command and it had saved a lot of time for me. We can easily create tables on already partitioned data and use MSCK REPAIR to get all of its partitions metadata. Like most things in life, it is not a perfect thing and we should not use it when we need to add 1-2 partitions to the table. I hope This will help you. See you next article.