Inserting Data In Hive Partitioned tables
We have created partitioned tables in Hive. We have also learned about static vs dynamic partitions in the last two blogs. One this missing is how do we put data to these partitioned tables. Let us learn how we can insert data in partitioned tables in different ways.
Inserting Data In Partitioned Table
We have seen bits and pieces of information about how to insert data in the Partitioned table. Say we have a file in our local system and we want to put that in the employee table. Then we can run the following query.
1 2 3 |
LOAD DATA LOCAL INPATH 'hr.txt' INTO TABLE employee_dept PARTITION (dept_name='HR'); |
Here we are inserting data to HR partition. We need to make sure that this data belongs to HR partition only.
Instead of a file, we may have an un-partitioned employee table. And we want to improve performance so we decide to put this data in a partitioned table. We can do this with a multi-table insert statement like below. This will take data from the base table in insert into partitions.
1 2 3 4 5 6 7 8 9 10 |
FROM staged_employees se INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'OR') SELECT * WHERE se.cnty = 'US' AND se.st = 'OR' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'CA') SELECT * WHERE se.cnty = 'US' AND se.st = 'CA' INSERT OVERWRITE TABLE employees PARTITION (country = 'US', state = 'IL') SELECT * WHERE se.cnty = 'US' AND se.st = 'IL'; |
Inserting Data In Dynamic Partitions
This all good. But we are using static partitioning here. If we have 100’s of partitions then it is not optimal way to write 100 clauses in query. We can use dynamic partitioning for this. Before Using dynamic partitioning we need to tell hive that we want to use dynamic partitioning.
There are two properties to let hive know that we want o use dynamic partitioning. First, we need to enable dynamic partitions. This enables partitioning in strict mode. That means we need to at least have one static partition in the table before creating other partitions dynamically. If we want to avoid creating static partitions altogether, then we can set this mode to non-strict. (use below hive properties to do so).
1 2 |
set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; |
Once we set these properties we can run below query to insert data dynamically to each partition.
1 2 3 |
insert overwrite table order_partition partition (year,month) select order_id, order_date, order_status, substr(order_date,1,4) ye, substr(order_date,5,2) mon from orders; |
This will insert data to year and month partitions for the order table. Hive takes partition values from the last two columns “ye” and “mon”. I have given different names than partitioned column names to emphasize that there is no column name relationship between data nad partitioned columns. Hive always takes last column/s as partitioned column information. That is why when inserting data in the partitioned table, we have to make sure partitioned columns are last in our select query.
When inserting data to partitioned table using select query, we need to make sure that partitioned columns are at last of select query. Hive will pick those values as partitioned columns directly
Hive Facts
Mixing Static and Dynamic Partitions in Insert Queries
While inserting data in partitioned tables, we can mix static and dynamic partition in one single query. Let us take a look at query below.
1 2 3 4 5 6 |
INSERT OVERWRITE TABLE order_partition partition (year='2019',month) SELECT order_id, order_date, order_status, substr(order_date,1,4) ye, substr(order_date,5,2) mon FROM orders WHERE substr(order_date,1,4) ="2019"; |
In this query, we have set year value to 2019 only. This will create all month partitions under 2019. We also need to make sure we select data for the 2019 year only. This is why we are using where clause. This is how we can combine using static and dynamic partitions in a single insert query.
Conclusion
We have learned different ways to insert data in dynamic partitioned tables. In static partitioning, we have to give partitioned values. With dynamic partitioning, hive picks partition values directly from the query. We can also mix static and dynamic partition while inserting data into the table. I hope you found this article helpful. See you in the next one.
very good one and useful for new techs