Static vs Dynamic Partitioning in Hive
There are two types of Partitions in Hive. Static and Dynamic. They both operate on the same basic principles of Partitioning. But they are suitable for particular use cases. In this tutorial, we are going to learn, static & dynamic partitioning, the difference between static and dynamic partition in Hive and when should we use one of them.
Static Partitioning in Hive
In Static Partitioning, we have to manually decide how many partitions tables will have and also value for those partitions. Consider we have employ table and we want to partition it based on department name. There are a limited number of departments, hence a limited number of partitions.
One of the major drawbacks of static partitioning is, when we are loading data to some partition, we have to make sure we are putting the right data in the right partition. In the below example, while loading HR data, it is our responsibility to make sure that the file only has HR data. (Learn more about Load hive command here).
1 2 3 4 5 6 7 8 9 10 |
CREATE EXTERNAL TABLE employee_dept ( emp_id INT, emp_name STRING ) PARTITIONED BY ( dept_name STRING ) location '/user/maheshmogal/employee_dept'; LOAD DATA LOCAL INPATH 'hr.txt' INTO TABLE employee_dept PARTITION (dept_name='HR'); |
Above queries will create table on partition on department name and then load data from our local system to this table in partitioned name “HR”.
Now you should see another drawback with this method. What if you have 1000 different departments or you have some other partitioning column like year and month? You have to create those manually and that is no FUN!! This is where dynamic partitions help us.
With all these drawbacks you might think, static partitions are not that useful. But when you have limited and well know a set of values (like departments or state names etc), you can use static partitioning. Static partitioning is faster for loading data as all of the information is already present in the query. Also, it will stop you from creating unnecessary partitions creating performance issues the very thing which partitions are supposed to help us with. (read more on performance issues with a huge number of partitions)
Dynamic Partitioning In Hive
In most cases, you will find yourself using Dynamic partitions. Dynamic partitions provide us with flexibility and create partitions automatically depending on the data that we are inserting into the table.
In last tutorial, we have created orders table. We can insert data in to that table with following query.
1 2 3 |
hive (maheshmogal)> insert overwrite table order_partition partition (year,month) > select order_id, order_date, order_status, substr(order_date,1,4) year, substr(order_date,5,2) month from orders; FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict |
Oh we got error? what is this about?
By default, Hive does not enable dynamic partition. This is to protect us, from creating from a huge number of partitions accidentally. In dynamic partition, we are telling hive which column to use for dynamic partition. If we select the wrong column (say order id) we can end up with millions of partitions.
So how do we create dynamic partitions? We have to enable hive dynamic partition first (which is disabled by default). But when it is enabled, it is in strict mode. This means there has to be at least one static partition present in this table. Then hive will let us create other partitions dynamically.
An easier way is, we can set Hive’s dynamic property mode to nonstrict using the following command. This will allow us to create dynamic partitions in the table without any static partition.
1 2 |
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; |
Now if you run the insert query, it will create all required dynamic partitions and insert correct data into each partition. This all happens in one single query and we do not have to manually check data and correct partition.
Static VS Dynamic Partitioning
In most cases, we can use dynamic partitioning. It provides us a lot of flexibility. The FOllowing table will help you understand the difference between Static and dynamic partitioning.
# | Static Partitioning | Dynamic Partitioning |
---|---|---|
Partition Creation | We need to manually create each partition before inserting data into a partition | Partitions will be created dynamically based on input data to the table. |
Suitable for | We need to know all partitions in advance. So it is suitable for use cases where partitions are defined well ahead and are small in number | Dynamic partitions are suitable when we have lot of partitions and we can not predict in advance new partitions ahead of time. |
Examples | Departments, State Names, etc | Date, city names etc |
Conclusion
Partitioning is one of the important topics in the Hive and it is used to improve the performance of queries. We have learned the basics of static and dynamic partition in this tutorial. In the next blog, we will learn different ways to insert data in the partition table and also learn about altering and dropping partitions.