Loading Data to Hive Tables

By now, we have created different types of tables with a variety of data types in the Hive. But we have not seen how we can populate these tables with data either from files in our local system or some already existing data in HDFS. In today’s tutorial, we will learn about different ways in which we can fill data into Hive tables.

Loading Local System Data to HDFS Table

For this tutorial, consider simple table below which does not have any data in it now.

CREATE TABLE employee(id INT, name STRING, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.382 seconds

CREATE TABLE employee(id INT, name STRING, state STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Time taken: 0.382 seconds

Using HDFS command below, we can copy data from our local system into a Hive table. We can either specify OVERWRITE which clears existing data in table and loads new data or we can omit OVERWRITE which appends new data to existing table data. Let us see it in action below.

hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 0.959 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.228 seconds, Fetched: 3 row(s)

hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;

Loading data to table maheshmogal.employee

Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]

Time taken: 0.959 seconds

hive (maheshmogal)> select * from employee;

1 abc CA

2 xyz NY

3 pqr CA

Time taken: 0.228 seconds, Fetched: 3 row(s)

Here we can see that the employee table has been loaded with data from our local file system. We can also try the same same query without the OVERWRITE clause.

-- command without overwrite will append data to existing table. That is why we have duplicates in table
hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=2, numRows=0, totalSize=54, rawDataSize=0]
OK
Time taken: 1.203 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.677 seconds, Fetched: 6 row(s)
-- here we can see that when we use OVERWRITE old data is delayed and only new data is in table.
hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 1.144 seconds
hive (maheshmogal)> select * from employee;
OK
1       abc     CA
2       xyz     NY
3       pqr     CA
Time taken: 0.218 seconds, Fetched: 3 row(s)
hive (maheshmogal)>

-- command without overwrite will append data to existing table. That is why we have duplicates in table

hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' INTO TABLE employee;

Loading data to table maheshmogal.employee

Table maheshmogal.employee stats: [numFiles=2, numRows=0, totalSize=54, rawDataSize=0]

Time taken: 1.203 seconds

hive (maheshmogal)> select * from employee;

1 abc CA

2 xyz NY

3 pqr CA

1 abc CA

2 xyz NY

3 pqr CA

Time taken: 0.677 seconds, Fetched: 6 row(s)

-- here we can see that when we use OVERWRITE old data is delayed and only new data is in table.

hive (maheshmogal)> LOAD DATA LOCAL INPATH 'emp.txt' OVERWRITE INTO TABLE employee;

Loading data to table maheshmogal.employee

Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]

Time taken: 1.144 seconds

hive (maheshmogal)> select * from employee;

1 abc CA

2 xyz NY

3 pqr CA

Time taken: 0.218 seconds, Fetched: 3 row(s)

hive (maheshmogal)>

Loading Data From HDFS Location into Hive Table

We can use the same command as above to load data from HDFS location to Hive table. We only have to remove the “LOCAL” keyword from command. There is another thing to note while loading data from HDFS location to hive table.

When we load data from Local system to Hive table LOAD DATA command copies(copy+paste) file from system to HDFS. But when we use LOAD DATA with HDFS file location then it moves (cut + paste)data from source location to target location.
Hive Facts

hive (maheshmogal)> dfs -ls ;
Found 10 items
drwx------   - maheshmogal hdfs          0 2020-05-26 08:04 .Trash
drwx------   - maheshmogal hdfs          0 2020-05-26 02:24 .staging
-rw-r--r--   2 maheshmogal hdfs         27 2020-05-26 13:20 emp.txt
hive (maheshmogal)> LOAD DATA INPATH '/user/maheshmogal/emp.txt' OVERWRITE INTO TABLE employee;
Loading data to table maheshmogal.employee
Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]
OK
Time taken: 1.348 seconds
hive (maheshmogal)> dfs -ls ;
Found 9 items
drwx------   - maheshmogal hdfs          0 2020-05-26 08:04 .Trash
drwx------   - maheshmogal hdfs          0 2020-05-26 02:24 .staging
hive (maheshmogal)>

hive (maheshmogal)> dfs -ls ;

Found 10 items

drwx------ - maheshmogal hdfs 0 2020-05-26 08:04 .Trash

drwx------ - maheshmogal hdfs 0 2020-05-26 02:24 .staging

-rw-r--r-- 2 maheshmogal hdfs 27 2020-05-26 13:20 emp.txt

hive (maheshmogal)> LOAD DATA INPATH '/user/maheshmogal/emp.txt' OVERWRITE INTO TABLE employee;

Loading data to table maheshmogal.employee

Table maheshmogal.employee stats: [numFiles=1, numRows=0, totalSize=27, rawDataSize=0]

Time taken: 1.348 seconds

hive (maheshmogal)> dfs -ls ;

Found 9 items

drwx------ - maheshmogal hdfs 0 2020-05-26 08:04 .Trash

drwx------ - maheshmogal hdfs 0 2020-05-26 02:24 .staging

hive (maheshmogal)>

In the above example, we can see that the LOAD data command has copied the file from our HDFS location into the hive table and also deleted that file from the original location.

Loading Data From One Table to Another Table in Hive

We can also load data from one hive table to another. you may have one large table with all kinds of information and you need very specific data from it. In that case you can use the INSERT query like below to add data to your table.

hive (maheshmogal)> create table sample_emp (id int, name string);
OK
Time taken: 0.537 seconds
hive (maheshmogal)> INSERT OVERWRITE TABLE sample_emp
                  > SELECT id, name FROM employee;
Query ID = maheshmogal_20200526132900_e2a5c20f-bcce-4dc6-9247-fe7cfacb189c
Total jobs = 3
Launching Job 1 out of 3
-- map reduce output 
Time taken: 19.256 seconds
hive (maheshmogal)> select * from sample_emp;
OK
1       abc
2       xyz
3       pqr
Time taken: 0.248 seconds, Fetched: 3 row(s)

hive (maheshmogal)> create table sample_emp (id int, name string);

Time taken: 0.537 seconds

hive (maheshmogal)> INSERT OVERWRITE TABLE sample_emp

> SELECT id, name FROM employee;

Query ID = maheshmogal_20200526132900_e2a5c20f-bcce-4dc6-9247-fe7cfacb189c

Total jobs = 3

Launching Job 1 out of 3

-- map reduce output

Time taken: 19.256 seconds

hive (maheshmogal)> select * from sample_emp;

1 abc

2 xyz

3 pqr

Time taken: 0.248 seconds, Fetched: 3 row(s)

Insert command will trigger a map-reduce job and insert data from a source table to a destination table. Just like the LOAD DATA command we can use the INTO clause instead of OVERWRITE, which will append data to an existing table.

We do not need to create a table beforehand to insert data into it. We can use the hive’s Create Table as a syntax to create table on the fly and insert data into it.

hive (maheshmogal)> drop table sample_emp;
OK
Time taken: 0.711 seconds
hive (maheshmogal)> CREATE TABLE sample_emp AS
                  > SELECT id, name FROM employee;
Query ID = maheshmogal_20200526133635_621e51d9-b86f-4938-856d-13fc13a5bb22
Total jobs = 3
Launching Job 1 out of 3
-- MR logs
hive (maheshmogal)> select * from sample_emp;
OK
1       abc
2       xyz
3       pqr
Time taken: 0.233 seconds, Fetched: 3 row(s)
hive (maheshmogal)>

hive (maheshmogal)> drop table sample_emp;

Time taken: 0.711 seconds

hive (maheshmogal)> CREATE TABLE sample_emp AS

> SELECT id, name FROM employee;

Query ID = maheshmogal_20200526133635_621e51d9-b86f-4938-856d-13fc13a5bb22

Total jobs = 3

Launching Job 1 out of 3

-- MR logs

hive (maheshmogal)> select * from sample_emp;

1 abc

2 xyz

3 pqr

Time taken: 0.233 seconds, Fetched: 3 row(s)

hive (maheshmogal)>

Multi Insert Command

In hive we can use multiple insert commands in a single query. This is useful when we want to scan the entire table once and divide it into smaller set of tables in one single query.

Consider the example, you have large tables with employees and you want to create a separate table for each state which has all employees of that state. We can do it with where clause and adding employees to the respective table. However this will cause our large table to be scanned multiple times. Instead of that, we can use the query below to process this requirement efficiently.

hive (maheshmogal)> FROM employee e
                  > INSERT OVERWRIE TABLE emp_ca
                  > SELECT * WHERE e.state = 'CA'
                  > INSERT OVERWRITE TABLE emp_ny
                  > SELECT * WHERE e.state = 'NY';

hive (maheshmogal)> FROM employee e

> INSERT OVERWRIE TABLE emp_ca

> SELECT * WHERE e.state = 'CA'

> INSERT OVERWRITE TABLE emp_ny

> SELECT * WHERE e.state = 'NY';

This feature is quite powerful as we can populate multiple tables in a single query. When we will learn about Partitions in the Hive, this feature will be even more useful to dynamically create partitions and fill them with proper data.

Conclusion

In this article, we have learned how to insert data into hive tables using different ways. I hope you have found this useful. See you in the next blog.

Hive Data Manipulation – Loading Data to Hive Tables

Loading Local System Data to HDFS Table

Loading Data From HDFS Location into Hive Table

Loading Data From One Table to Another Table in Hive

Multi Insert Command

Conclusion

External Vs Internal(Managed) Tables in Hive

Bucketing in Hive

String Functions in Spark

Create, Alter, Delete Tables in Hive

Adding Custom Schema to Spark Dataframe

MSCK Repair – Fixing Partitions in Hive Table

Leave a Reply Cancel reply

Loading Local System Data to HDFS Table

Loading Data From HDFS Location into Hive Table

Loading Data From One Table to Another Table in Hive

Multi Insert Command

Conclusion

Similar Posts

Leave a Reply Cancel reply