Inserting Data to Partitions in Hive Table

We have created partitioned tables in Hive. We have also learned about static vs dynamic partitions in the last two blogs. One this missing is how do we put data to these partitioned tables. Let us learn how we can insert data in partitioned tables in different ways.

Inserting Data In Partitioned Table

We have seen bits and pieces of information about how to insert data in the Partitioned table. Say we have a file in our local system and we want to put that in the employee table. Then we can run the following query.

LOAD DATA LOCAL INPATH 'hr.txt'
INTO TABLE employee_dept
PARTITION (dept_name='HR');

LOAD DATA LOCAL INPATH 'hr.txt'

INTO TABLE employee_dept

PARTITION (dept_name='HR');

Here we are inserting data to HR partition. We need to make sure that this data belongs to HR partition only.

Instead of a file, we may have an un-partitioned employee table. And we want to improve performance so we decide to put this data in a partitioned table. We can do this with a multi-table insert statement like below. This will take data from the base table in insert into partitions.

FROM staged_employees se
INSERT OVERWRITE TABLE employees
 PARTITION (country = 'US', state = 'OR')
 SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
 PARTITION (country = 'US', state = 'CA')
 SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
 PARTITION (country = 'US', state = 'IL')
 SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

FROM staged_employees se

INSERT OVERWRITE TABLE employees

PARTITION (country = 'US', state = 'OR')

SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'

INSERT OVERWRITE TABLE employees

PARTITION (country = 'US', state = 'CA')

SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'

INSERT OVERWRITE TABLE employees

PARTITION (country = 'US', state = 'IL')

SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

Inserting Data In Dynamic Partitions

This all good. But we are using static partitioning here. If we have 100’s of partitions then it is not optimal way to write 100 clauses in query. We can use dynamic partitioning for this. Before Using dynamic partitioning we need to tell hive that we want to use dynamic partitioning.

There are two properties to let hive know that we want o use dynamic partitioning. First, we need to enable dynamic partitions. This enables partitioning in strict mode. That means we need to at least have one static partition in the table before creating other partitions dynamically. If we want to avoid creating static partitions altogether, then we can set this mode to non-strict. (use below hive properties to do so).

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

1 2	set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;

Once we set these properties we can run below query to insert data dynamically to each partition.

insert overwrite table order_partition partition (year,month)
 select order_id, order_date, order_status, substr(order_date,1,4) ye,
 substr(order_date,5,2) mon from orders;

insert overwrite table order_partition partition (year,month)

select order_id, order_date, order_status, substr(order_date,1,4) ye,

substr(order_date,5,2) mon from orders;

This will insert data to year and month partitions for the order table. Hive takes partition values from the last two columns “ye” and “mon”. I have given different names than partitioned column names to emphasize that there is no column name relationship between data nad partitioned columns. Hive always takes last column/s as partitioned column information. That is why when inserting data in the partitioned table, we have to make sure partitioned columns are last in our select query.

When inserting data to partitioned table using select query, we need to make sure that partitioned columns are at last of select query. Hive will pick those values as partitioned columns directly
Hive Facts

Mixing Static and Dynamic Partitions in Insert Queries

While inserting data in partitioned tables, we can mix static and dynamic partition in one single query. Let us take a look at query below.

INSERT OVERWRITE TABLE order_partition 
partition (year='2019',month)
 SELECT order_id, order_date, order_status, 
substr(order_date,1,4) ye, substr(order_date,5,2) mon 
FROM orders
WHERE substr(order_date,1,4) ="2019";

INSERT OVERWRITE TABLE order_partition

partition (year='2019',month)

SELECT order_id, order_date, order_status,

substr(order_date,1,4) ye, substr(order_date,5,2) mon

FROM orders

WHERE substr(order_date,1,4) ="2019";

In this query, we have set year value to 2019 only. This will create all month partitions under 2019. We also need to make sure we select data for the 2019 year only. This is why we are using where clause. This is how we can combine using static and dynamic partitions in a single insert query.

Conclusion

We have learned different ways to insert data in dynamic partitioned tables. In static partitioning, we have to give partitioned values. With dynamic partitioning, hive picks partition values directly from the query. We can also mix static and dynamic partition while inserting data into the table. I hope you found this article helpful. See you in the next one.

Inserting Data In Hive Partitioned tables

Inserting Data In Partitioned Table

Inserting Data In Dynamic Partitions

Mixing Static and Dynamic Partitions in Insert Queries

Conclusion

Where and Filter in Spark Dataframes

Adding Custom Schema to Spark Dataframe

Hive – Advanced Aggregations with Grouping sets, Rollup and cube

Pivot rows to columns in Hive

Data Types in Hive

Adding White Spaces to Data in Spark Dataframe

One Comment

Inserting Data In Partitioned Table

Inserting Data In Dynamic Partitions

Mixing Static and Dynamic Partitions in Insert Queries

Conclusion

Similar Posts

One Comment